archivebox.index

Submodules

Package Contents

Functions

merge_links

deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.

validate_links

archivable_links

remove chrome://, about://, or other schemed links that can’t be archived

fix_duplicate_links

ensures that all non-duplicate links have monotonically increasing timestamps

sorted_links

links_after_timestamp

lowest_uniq_timestamp

resolve duplicate timestamps by appending a decimal: 1234, 1234 -> 1234.1, 1234.2

timed_index_update

write_main_index

Writes the given list of links to the sqlite3 index file

load_main_index

parse and load existing index with any new links from import_path merged in

load_main_index_meta

parse_links_from_source

fix_duplicate_links_in_index

Given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB.

dedupe_links

Link validation happens at an earlier stage; this method focuses on actual deduplication and timestamp fixing.

write_link_details

load_link_details

check for an existing link archive in the given directory, and load+merge it into the given link dict

q_filter

search_filter

snapshot_filter

get_indexed_folders

indexed links without checking archive status or data directory validity

get_archived_folders

indexed links that are archived with a valid data directory

get_unarchived_folders

indexed links that are unarchived with no data directory or an empty data directory

get_present_folders

dirs that actually exist in the archive/ folder

get_valid_folders

dirs with a valid index matched to the main index and archived content

get_invalid_folders

dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized

get_duplicate_folders

dirs that conflict with other directories that have the same link URL or timestamp

get_orphaned_folders

dirs that contain a valid index but aren’t listed in the main index

get_corrupted_folders

dirs that don’t contain a valid index and aren’t listed in the main index

get_unrecognized_folders

dirs that don’t contain recognizable archive data and aren’t listed in the main index

is_valid

is_corrupt

is_archived

is_unarchived

fix_invalid_folder_locations

Data

LINK_FILTERS

API

archivebox.index.merge_links(…)[source]

deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.
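The merge rule above can be sketched in plain Python. The field names and the length-then-lexicographic tie-break below are illustrative assumptions, not ArchiveBox's actual implementation:

```python
def merge_field(a, b):
    """Pick the longer of two field values; on a tie, pick the
    lexicographically smaller one so the result never depends on
    argument order (hypothetical tie-break rule)."""
    if not a:
        return b
    if not b:
        return a
    if len(a) != len(b):
        return a if len(a) > len(b) else b
    return min(a, b)  # deterministic tie-break

def merge_links(link_a: dict, link_b: dict) -> dict:
    """Merge two link dicts field-by-field, favoring longer values."""
    assert link_a['url'] == link_b['url'], 'only merge links for the same URL'
    keys = set(link_a) | set(link_b)
    return {k: merge_field(link_a.get(k), link_b.get(k)) for k in keys}
```

Because the merge is deterministic, `merge_links(a, b)` and `merge_links(b, a)` always produce the same result, which matters when the same URL is re-imported from multiple sources.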

archivebox.index.archivable_links(…)[source]

remove chrome://, about://, or other schemed links that can’t be archived
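The scheme check can be illustrated with the standard library; the exact set of rejected schemes below is an assumption for the sketch:

```python
from urllib.parse import urlparse

# Schemes that can't be fetched over the network (illustrative list).
UNARCHIVABLE_SCHEMES = {'chrome', 'about', 'data', 'javascript', 'file'}

def archivable_links(urls):
    """Yield only the URLs whose scheme can plausibly be archived."""
    for url in urls:
        if urlparse(url).scheme not in UNARCHIVABLE_SCHEMES:
            yield url
```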

archivebox.index.fix_duplicate_links(…)[source]

ensures that all non-duplicate links have monotonically increasing timestamps

archivebox.index.lowest_uniq_timestamp(used_timestamps: collections.OrderedDict, timestamp: str) → str[source]

resolve duplicate timestamps by appending a decimal: 1234, 1234 -> 1234.1, 1234.2
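A sketch of the described behavior, assuming `used_timestamps` maps already-taken timestamp strings to their links:

```python
from collections import OrderedDict

def lowest_uniq_timestamp(used_timestamps: OrderedDict, timestamp: str) -> str:
    """Return `timestamp` unchanged if unused, otherwise the lowest
    `timestamp.N` suffix that is not yet in `used_timestamps`."""
    timestamp = timestamp.split('.')[0]  # strip any existing suffix
    if timestamp not in used_timestamps:
        return timestamp
    n = 1
    while f'{timestamp}.{n}' in used_timestamps:
        n += 1
    return f'{timestamp}.{n}'
```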

archivebox.index.timed_index_update(out_path: pathlib.Path)[source]
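Judging from the signature, `timed_index_update` wraps an index write and reports how long it took. A minimal sketch of that idea as a context manager (the exact logging format is an assumption):

```python
import time
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def timed_index_update(out_path: Path):
    """Time a block that rewrites an index file and report the duration,
    even if the write raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        print(f'> wrote {out_path} in {elapsed:.2f}s')
```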
archivebox.index.write_main_index(links: List[archivebox.index.schema.Link], out_dir: pathlib.Path = DATA_DIR, created_by_id: int | None = None) → None[source]

Writes the given list of links to the sqlite3 index file
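The shape of the operation can be sketched with the `sqlite3` module. ArchiveBox actually writes through the Django ORM into its own snapshot table; the schema and column names below are illustrative only:

```python
import sqlite3

def write_main_index(links, db_path=':memory:'):
    """Write a list of link dicts into a minimal sqlite3 snapshot table,
    replacing any existing row for the same URL."""
    con = sqlite3.connect(db_path)
    con.execute('''CREATE TABLE IF NOT EXISTS snapshot (
        url TEXT PRIMARY KEY,
        timestamp TEXT,
        title TEXT
    )''')
    con.executemany(
        'INSERT OR REPLACE INTO snapshot VALUES (:url, :timestamp, :title)',
        links,
    )
    con.commit()
    return con
```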

archivebox.index.load_main_index(out_dir: pathlib.Path | str = DATA_DIR, warn: bool = True) → List[archivebox.index.schema.Link][source]

parse and load existing index with any new links from import_path merged in

archivebox.index.load_main_index_meta(out_dir: pathlib.Path = DATA_DIR) → Optional[dict][source]

archivebox.index.fix_duplicate_links_in_index(…)[source]

Given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB.

archivebox.index.dedupe_links(…)[source]

Link validation happens at an earlier stage; this method focuses on actual deduplication and timestamp fixing.

archivebox.index.load_link_details(…)[source]

check for an existing link archive in the given directory, and load+merge it into the given link dict

archivebox.index.q_filter(snapshots: django.db.models.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.QuerySet[source]
archivebox.index.search_filter(snapshots: django.db.models.QuerySet, filter_patterns: List[str], filter_type: str = 'search') → django.db.models.QuerySet[source]
archivebox.index.snapshot_filter(snapshots: django.db.models.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.QuerySet[source]
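These filters narrow a Django QuerySet of Snapshots by URL patterns. The mapping from `filter_type` to a matching rule can be illustrated without Django; the `FILTER_TYPES` names and predicates below are assumptions standing in for the real Q-object lookups:

```python
import re
from fnmatch import fnmatch

# Hypothetical filter_type -> predicate mapping, standing in for the
# Django Q-object lookups the real functions build.
FILTER_TYPES = {
    'exact':     lambda url, pat: url == pat,
    'substring': lambda url, pat: pat in url,
    'regex':     lambda url, pat: re.search(pat, url) is not None,
    'domain':    lambda url, pat: fnmatch(url.split('/')[2], pat),
}

def snapshot_filter(urls, patterns, filter_type='exact'):
    """Keep URLs matching any of the given patterns."""
    match = FILTER_TYPES[filter_type]
    return [u for u in urls if any(match(u, p) for p in patterns)]
```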
archivebox.index.get_indexed_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

indexed links without checking archive status or data directory validity

archivebox.index.get_archived_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

indexed links that are archived with a valid data directory

archivebox.index.get_unarchived_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

indexed links that are unarchived with no data directory or an empty data directory

archivebox.index.get_present_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

dirs that actually exist in the archive/ folder

archivebox.index.get_valid_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

dirs with a valid index matched to the main index and archived content

archivebox.index.get_invalid_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized

archivebox.index.get_duplicate_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

dirs that conflict with other directories that have the same link URL or timestamp

archivebox.index.get_orphaned_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

dirs that contain a valid index but aren’t listed in the main index

archivebox.index.get_corrupted_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

dirs that don’t contain a valid index and aren’t listed in the main index

archivebox.index.get_unrecognized_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]][source]

dirs that don’t contain recognizable archive data and aren’t listed in the main index
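The folder statuses above form a decision tree over each data dir. A simplified sketch of that classification (the real helpers also compare URLs and detect duplicate timestamps; the `index.json` filename and category names here follow the docstrings, the rest is an assumption):

```python
import json
from pathlib import Path

def classify_folder(folder: Path, main_index_timestamps: set) -> str:
    """Classify one archive data dir, roughly mirroring the
    get_*_folders helpers: check for a parseable index, then check
    membership in the main index."""
    index_file = folder / 'index.json'
    if not index_file.exists():
        # No index at all: corrupted if it holds data, unrecognized if empty.
        return 'corrupted' if any(folder.iterdir()) else 'unrecognized'
    try:
        json.loads(index_file.read_text())
    except json.JSONDecodeError:
        return 'corrupted'
    if folder.name not in main_index_timestamps:
        return 'orphaned'  # valid index, but not listed in the main index
    return 'valid'
```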

archivebox.index.is_valid(link: archivebox.index.schema.Link) → bool[source]
archivebox.index.is_corrupt(link: archivebox.index.schema.Link) → bool[source]
archivebox.index.is_archived(link: archivebox.index.schema.Link) → bool[source]
archivebox.index.is_unarchived(link: archivebox.index.schema.Link) → bool[source]
archivebox.index.fix_invalid_folder_locations(out_dir: pathlib.Path = DATA_DIR) → Tuple[List[str], List[str]][source]