# archivebox.index

## Submodules

## Package Contents

### Functions
| Function | Description |
|---|---|
| `merge_links` | deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones |
| `archivable_links` | remove chrome://, about://, or other schemed links that can’t be archived |
| `fix_duplicate_links` | ensures that all non-duplicate links have monotonically increasing timestamps |
| `lowest_uniq_timestamp` | resolve duplicate timestamps by appending a decimal suffix: 1234, 1234 -> 1234.1, 1234.2 |
| `write_main_index` | writes the given list of links to the SQLite3 index file |
| `load_main_index` | parse and load the existing index, with any new links from import_path merged in |
| `fix_duplicate_links_in_index` | given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB |
| `dedupe_links` | deduplicate new links against existing Snapshots and fix their timestamps |
| `load_link_details` | check for an existing link archive in the given directory, and load+merge it into the given link dict |
| `get_indexed_folders` | indexed links, without checking archive status or data directory validity |
| `get_archived_folders` | indexed links that are archived, with a valid data directory |
| `get_unarchived_folders` | indexed links that are unarchived, with no data directory or an empty data directory |
| `get_present_folders` | dirs that actually exist in the archive/ folder |
| `get_valid_folders` | dirs with a valid index matched to the main index and archived content |
| `get_invalid_folders` | dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized |
| `get_duplicate_folders` | dirs that conflict with other directories that have the same link URL or timestamp |
| `get_orphaned_folders` | dirs that contain a valid index but aren’t listed in the main index |
| `get_corrupted_folders` | dirs that don’t contain a valid index and aren’t listed in the main index |
| `get_unrecognized_folders` | dirs that don’t contain recognizable archive data and aren’t listed in the main index |
### Data

## API
- archivebox.index.merge_links(a: archivebox.index.schema.Link, b: archivebox.index.schema.Link) → archivebox.index.schema.Link [source]
deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.
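A hypothetical illustration of the “favor longer” rule for a single field (not the actual implementation, which merges whole Link objects field by field):

```python
def prefer_longer(a, b):
    """Pick the longer (assumed richer) of two field values, treating None as empty."""
    return max(a or '', b or '', key=len) or None

# merging two records of the same URL keeps the more descriptive title:
assert prefer_longer('Example', 'Example Domain: Home') == 'Example Domain: Home'
```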
- archivebox.index.validate_links(links: Iterable[archivebox.index.schema.Link]) → List[archivebox.index.schema.Link] [source]
- archivebox.index.archivable_links(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link] [source]
remove chrome://, about://, or other schemed links that can’t be archived
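A minimal sketch of the kind of scheme check this describes; the exact set of excluded schemes is an assumption:

```python
from urllib.parse import urlparse

UNARCHIVABLE_SCHEMES = {'chrome', 'about', 'data', 'javascript'}  # assumed list

def is_archivable(url: str) -> bool:
    """True if the URL uses a scheme that an archiving tool could actually fetch."""
    return urlparse(url).scheme not in UNARCHIVABLE_SCHEMES

assert not is_archivable('chrome://settings')
assert is_archivable('https://example.com')
```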
- archivebox.index.fix_duplicate_links(sorted_links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link] [source]
ensures that all non-duplicate links have monotonically increasing timestamps
- archivebox.index.sorted_links(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link] [source]
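Taken together, these helpers form a natural cleanup pipeline. A usage sketch (`raw_links` is a hypothetical iterable of already-parsed Link objects; whether validate_links chains them exactly this way is an assumption):

```python
from archivebox.index import archivable_links, sorted_links, fix_duplicate_links

# drop unarchivable schemes, sort, then de-conflict timestamps:
links = list(fix_duplicate_links(sorted_links(archivable_links(raw_links))))
```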
- archivebox.index.links_after_timestamp(links: Iterable[archivebox.index.schema.Link], resume: Optional[float] = None) → Iterable[archivebox.index.schema.Link] [source]
- archivebox.index.lowest_uniq_timestamp(used_timestamps: collections.OrderedDict, timestamp: str) → str [source]
resolve duplicate timestamps by appending a decimal suffix: 1234, 1234 -> 1234.1, 1234.2
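An illustrative re-implementation of the decimal-suffix scheme (a sketch; the real function may handle existing suffixes differently):

```python
from collections import OrderedDict

def uniq_timestamp(used: OrderedDict, ts: str) -> str:
    """Return ts unchanged if unused, otherwise the lowest free ts.N suffix."""
    if ts not in used:
        return ts
    n = 1
    while f'{ts}.{n}' in used:
        n += 1
    return f'{ts}.{n}'

used = OrderedDict.fromkeys(['1234', '1234.1'])
assert uniq_timestamp(used, '1234') == '1234.2'
```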
- archivebox.index.write_main_index(links: List[archivebox.index.schema.Link], out_dir: pathlib.Path = DATA_DIR, created_by_id: int | None = None) → None [source]
writes the given list of links to the SQLite3 index file
- archivebox.index.load_main_index(out_dir: pathlib.Path | str = DATA_DIR, warn: bool = True) → List[archivebox.index.schema.Link] [source]
parse and load the existing index, with any new links from import_path merged in
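A round-trip usage sketch (the data directory path is hypothetical and must point at an initialized ArchiveBox collection):

```python
from pathlib import Path
from archivebox.index import load_main_index, write_main_index

data_dir = Path('/path/to/archivebox/data')  # hypothetical path
links = load_main_index(out_dir=data_dir)    # read all indexed links
write_main_index(links, out_dir=data_dir)    # write them back to the SQLite index
```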
- archivebox.index.parse_links_from_source(source_path: str, root_url: Optional[str] = None, parser: str = 'auto') → List[archivebox.index.schema.Link] [source]
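Usage sketch (the source path is hypothetical; `parser='auto'` presumably detects the input format from the file contents):

```python
from archivebox.index import parse_links_from_source

# parse a bookmarks export into Link objects:
new_links = parse_links_from_source('/path/to/bookmarks.html', parser='auto')
```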
- archivebox.index.fix_duplicate_links_in_index(snapshots: django.db.models.QuerySet, links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link] [source]
Given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB.
- archivebox.index.dedupe_links(snapshots: django.db.models.QuerySet, new_links: List[archivebox.index.schema.Link]) → List[archivebox.index.schema.Link] [source]
Link validation happens at an earlier stage; this function focuses on actual deduplication and timestamp fixing.
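Usage sketch for merging freshly parsed links against the database (the Snapshot import path is an assumption, Django must already be configured, and `new_links` is a hypothetical list of validated Link objects):

```python
from core.models import Snapshot  # assumed import path for ArchiveBox's Snapshot model
from archivebox.index import dedupe_links

deduped = dedupe_links(Snapshot.objects.all(), new_links)
```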
- archivebox.index.write_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, skip_sql_index: bool = False) → None [source]
- archivebox.index.load_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → archivebox.index.schema.Link [source]
check for an existing link archive in the given directory, and load+merge it into the given link dict
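Usage sketch for refreshing a single snapshot's per-link index (`link` is a hypothetical Link object; leaving `out_dir` unset presumably defaults to the link's own directory):

```python
from archivebox.index import load_link_details, write_link_details

link = load_link_details(link)  # merge in any details already on disk
write_link_details(link)        # write the merged result back out
```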
- archivebox.index.q_filter(snapshots: django.db.models.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.QuerySet [source]
- archivebox.index.search_filter(snapshots: django.db.models.QuerySet, filter_patterns: List[str], filter_type: str = 'search') → django.db.models.QuerySet [source]
- archivebox.index.snapshot_filter(snapshots: django.db.models.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.QuerySet [source]
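A usage sketch for the filter helpers, assuming a configured Django environment (the Snapshot import path is an assumption; 'exact' is the documented default filter type):

```python
from core.models import Snapshot  # assumed import path
from archivebox.index import snapshot_filter

# select snapshots whose URL matches the pattern exactly:
matches = snapshot_filter(Snapshot.objects.all(), ['https://example.com'], filter_type='exact')
```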
- archivebox.index.get_indexed_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
indexed links without checking archive status or data directory validity
- archivebox.index.get_archived_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
indexed links that are archived with a valid data directory
- archivebox.index.get_unarchived_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
indexed links that are unarchived with no data directory or an empty data directory
- archivebox.index.get_present_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
dirs that actually exist in the archive/ folder
- archivebox.index.get_valid_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
dirs with a valid index matched to the main index and archived content
- archivebox.index.get_invalid_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized
- archivebox.index.get_duplicate_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
dirs that conflict with other directories that have the same link URL or timestamp
- archivebox.index.get_orphaned_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
dirs that contain a valid index but aren’t listed in the main index
- archivebox.index.get_corrupted_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
dirs that don’t contain a valid index and aren’t listed in the main index
- archivebox.index.get_unrecognized_folders(snapshots, out_dir: pathlib.Path = DATA_DIR) → Dict[str, Optional[archivebox.index.schema.Link]] [source]
dirs that don’t contain recognizable archive data and aren’t listed in the main index
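The folder-status getters above share a signature, which makes it easy to report on archive health in one loop. A usage sketch (the data directory is hypothetical, the Snapshot import path is an assumption, and it is assumed the returned dict maps directory name to the matching Link or None):

```python
from pathlib import Path
from core.models import Snapshot  # assumed import path
from archivebox.index import get_valid_folders, get_orphaned_folders, get_corrupted_folders

data_dir = Path('/path/to/archivebox/data')  # hypothetical
snapshots = Snapshot.objects.all()

for name, getter in [('valid', get_valid_folders),
                     ('orphaned', get_orphaned_folders),
                     ('corrupted', get_corrupted_folders)]:
    print(f'{name}: {len(getter(snapshots, out_dir=data_dir))} dirs')
```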
- archivebox.index.is_valid(link: archivebox.index.schema.Link) → bool [source]
- archivebox.index.is_corrupt(link: archivebox.index.schema.Link) → bool [source]
- archivebox.index.is_archived(link: archivebox.index.schema.Link) → bool [source]
- archivebox.index.is_unarchived(link: archivebox.index.schema.Link) → bool [source]
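These predicates classify a single Link by the state of its data directory. A usage sketch that partitions the whole index by archive status (assumes a configured ArchiveBox environment):

```python
from archivebox.index import load_main_index, is_archived, is_unarchived

links = load_main_index()
done = [link for link in links if is_archived(link)]
pending = [link for link in links if is_unarchived(link)]
print(f'{len(done)} archived, {len(pending)} pending')
```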