archivebox.misc.legacy

Legacy archive import utilities.

These functions are used to import data from old ArchiveBox archive formats (JSON indexes, archive directory structures) into the new database.

This module is separate from the hooks-based parser system, which handles importing new URLs from bookmark files, RSS feeds, etc.

Module Contents

Classes

SnapshotDict

Dictionary type representing a snapshot/link, compatible with Snapshot model fields.

Functions

parse_json_main_index

Parse links from the main JSON index file (archive/index.json).

parse_json_links_details

Parse links from individual snapshot index.jsonl/index.json files in archive directories.

API

class archivebox.misc.legacy.SnapshotDict

Bases: typing.TypedDict

Dictionary type representing a snapshot/link, compatible with Snapshot model fields.

url: str

timestamp: str

title: str

tags: str

sources: list[str]
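
For illustration, a dictionary with the fields listed above might be built as follows. This is a minimal sketch: `SnapshotDictSketch` is a hypothetical local stand-in declared here so the example runs without ArchiveBox installed, all values are invented, and the comma-separated form of `tags` is an assumption rather than something this page states.

```python
from typing import TypedDict


class SnapshotDictSketch(TypedDict):
    """Hypothetical stand-in mirroring the documented SnapshotDict fields."""
    url: str
    timestamp: str
    title: str
    tags: str
    sources: list[str]


# Example values are invented for illustration only.
example: SnapshotDictSketch = {
    "url": "https://example.com",
    "timestamp": "1700000000.0",
    "title": "Example Domain",
    "tags": "docs,reference",  # assumption: tags kept as one comma-separated string
    "sources": ["bookmarks.html"],
}

# TypedDict instances are plain dicts at runtime, so normal dict access works.
print(example["url"], example["sources"])
```

Because a `TypedDict` only adds static type information, a type checker can flag a missing or misspelled key, while the running code handles the value like any other dictionary.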

archivebox.misc.legacy.parse_json_main_index(out_dir: pathlib.Path) → collections.abc.Iterator[archivebox.misc.legacy.SnapshotDict]

Parse links from the main JSON index file (archive/index.json).

This is used to recover links from old archive formats.
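
The general idea of lazily recovering links from a single JSON index can be sketched as below. This is not ArchiveBox's implementation: `iter_main_index_links` is a hypothetical helper, it takes the index file path directly rather than `out_dir`, and it assumes the file holds an object with a top-level `"links"` list, which may not match the real legacy layout.

```python
import json
import tempfile
from collections.abc import Iterator
from pathlib import Path


def iter_main_index_links(index_path: Path) -> Iterator[dict]:
    """Hypothetical helper: lazily yield link records from a main JSON index.

    Assumes a top-level "links" list; the real legacy layout may differ.
    """
    if not index_path.exists():
        return  # nothing to recover
    try:
        data = json.loads(index_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return  # skip an unreadable index instead of aborting the import
    if not isinstance(data, dict):
        return
    for link in data.get("links", []):
        # Entries without a URL cannot be turned into snapshots, so skip them.
        if isinstance(link, dict) and link.get("url"):
            yield {
                "url": link["url"],
                "timestamp": str(link.get("timestamp", "")),
                "title": link.get("title") or "",
                "tags": link.get("tags") or "",
                "sources": list(link.get("sources") or []),
            }


# Demo against a throwaway index file with invented contents.
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "index.json"
    index_file.write_text(json.dumps({"links": [
        {"url": "https://example.com", "timestamp": "1700000000.0", "title": "Example"},
        {"title": "entry without a URL is skipped"},
    ]}), encoding="utf-8")
    recovered = list(iter_main_index_links(index_file))

print(recovered)
```

Returning an iterator rather than a list lets a caller process a large legacy index one record at a time.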

archivebox.misc.legacy.parse_json_links_details

Parse links from individual snapshot index.jsonl/index.json files in archive directories.

Walks through archive/*/index.jsonl and archive/*/index.json files to discover orphaned snapshots. Prefers index.jsonl (new format) over index.json (legacy format).
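
The "prefer index.jsonl, fall back to index.json" walk described above could look roughly like this. It is a sketch under stated assumptions, not the library's code: `iter_snapshot_indexes` is a hypothetical helper, each `index.jsonl` is assumed to hold one JSON object per line with the snapshot record being the first object that has a `"url"` key, and each `index.json` is assumed to be a single object.

```python
import json
import tempfile
from collections.abc import Iterator
from pathlib import Path


def iter_snapshot_indexes(archive_dir: Path) -> Iterator[dict]:
    """Hypothetical helper: yield one record per snapshot subdirectory."""
    if not archive_dir.is_dir():
        return
    for snapshot_dir in sorted(archive_dir.iterdir()):
        if not snapshot_dir.is_dir():
            continue
        record = None
        jsonl_path = snapshot_dir / "index.jsonl"
        json_path = snapshot_dir / "index.json"
        if jsonl_path.exists():
            # Preferred: newer line-delimited format (record shape is an assumption).
            for line in jsonl_path.read_text(encoding="utf-8").splitlines():
                try:
                    candidate = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate blank or corrupt lines
                if isinstance(candidate, dict) and candidate.get("url"):
                    record = candidate
                    break
        if record is None and json_path.exists():
            # Fallback: legacy single-object format.
            try:
                candidate = json.loads(json_path.read_text(encoding="utf-8"))
            except json.JSONDecodeError:
                candidate = None
            if isinstance(candidate, dict) and candidate.get("url"):
                record = candidate
        if record is not None:
            yield record


# Demo with two invented snapshot directories.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive"
    both = archive / "1700000000.0"
    legacy_only = archive / "1700000001.0"
    both.mkdir(parents=True)
    legacy_only.mkdir(parents=True)
    (both / "index.jsonl").write_text(json.dumps({"url": "https://new.example"}) + "\n", encoding="utf-8")
    (both / "index.json").write_text(json.dumps({"url": "https://old.example"}), encoding="utf-8")
    (legacy_only / "index.json").write_text(json.dumps({"url": "https://legacy.example"}), encoding="utf-8")
    urls = [rec["url"] for rec in iter_snapshot_indexes(archive)]

print(urls)
```

In the demo, the first directory contains both files and the `index.jsonl` record is used, while the second has only the legacy `index.json` and is still picked up.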