archivebox.index package

Submodules

archivebox.index.csv module

archivebox.index.csv.to_csv(obj: Any, cols: List[str], separator: str = ',', ljust: int = 0) str[source]
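A hedged sketch of what a `to_csv`-style helper conceptually does, based only on the signature above: pull the named columns off an object and join them with `separator`, left-justifying each value to `ljust` characters (`ljust=0` means no padding). `Row` and `to_csv_sketch` are illustrative names, not ArchiveBox's implementation.

```python
from dataclasses import dataclass

@dataclass
class Row:  # stand-in object with attribute-style columns
    url: str
    title: str

def to_csv_sketch(obj, cols, separator=',', ljust=0):
    # join the requested attributes, padding each to a fixed width
    return separator.join(str(getattr(obj, col)).ljust(ljust) for col in cols)

row = Row(url='https://example.com', title='Example')
print(to_csv_sketch(row, cols=['url', 'title']))
```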

archivebox.index.html module

archivebox.index.html.parse_html_main_index(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Iterator[str][source]

parse an archive index html file and return the list of urls

archivebox.index.html.main_index_template(links: List[Link], template: str = 'static_index.html') str[source]

render the template for the entire main index

archivebox.index.html.render_django_template(template: str, context: Mapping[str, str]) str[source]

render a given html template string with the given template content
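A stdlib analogue of the idea (not Django's template engine): substituting a mapping of context variables into an html template string. `string.Template` stands in here for the real Django rendering that `render_django_template` performs.

```python
from string import Template

# substitute context values into a template string, similar in spirit
# to rendering a Django template with a context mapping
html = Template('<a href="$url">$title</a>')
print(html.substitute({'url': 'https://example.com', 'title': 'Example'}))
```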

archivebox.index.html.snapshot_icons(snapshot) str[source]

archivebox.index.json module

archivebox.index.json.parse_json_main_index(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Iterator[Link][source]

parse an archive index json file and return the list of links

write a json file with some info about the link

load the json link index from a given directory

read through all the archive data folders and return the parsed links

class archivebox.index.json.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: JSONEncoder

Extended json serializer that supports serializing several model fields and objects

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

archivebox.index.json.to_json(obj: ~typing.Any, indent: int | None = 4, sort_keys: bool = True, cls=<class 'archivebox.index.json.ExtendedEncoder'>) str[source]
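A minimal analogue of the `ExtendedEncoder` pattern described above: subclass `json.JSONEncoder`, handle an otherwise-unserializable type in `default()` (here, `datetime` as an ISO string), and defer to the base class so everything else still raises `TypeError`. `DatetimeEncoder` is a simplified stand-in, not the real `ExtendedEncoder`.

```python
import json
from datetime import datetime

class DatetimeEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()          # serialize datetimes as ISO strings
        return super().default(o)         # let the base class raise TypeError

print(json.dumps({'ts': datetime(2024, 1, 1)}, cls=DatetimeEncoder,
                 indent=4, sort_keys=True))
```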

archivebox.index.schema module

WARNING: THIS FILE IS ALL LEGACY CODE TO BE REMOVED.

DO NOT ADD ANY NEW FEATURES TO THIS FILE, NEW CODE GOES HERE: core/models.py

exception archivebox.index.schema.ArchiveError(message, hints=None)[source]

Bases: Exception

class archivebox.index.schema.ArchiveResult(cmd: List[str], pwd: str | None, cmd_version: str | None, output: str | Exception | NoneType, status: str, start_ts: datetime.datetime, end_ts: datetime.datetime, index_texts: List[str] | None = None, schema: str = 'ArchiveResult')[source]

Bases: object

cmd: List[str]
pwd: str | None
cmd_version: str | None
output: str | Exception | None
status: str
start_ts: datetime
end_ts: datetime
index_texts: List[str] | None = None
schema: str = 'ArchiveResult'
typecheck() None[source]
classmethod guess_ts(dict_info)[source]
classmethod from_json(json_info, guess=False)[source]
to_dict(*keys) dict[source]
to_json(indent=4, sort_keys=True) str[source]
to_csv(cols: List[str] | None = None, separator: str = ',', ljust: int = 0) str[source]
classmethod field_names()[source]
property duration: int

class archivebox.index.schema.Link(timestamp: str, url: str, title: str | None, tags: str | None, sources: List[str], history: Dict[str, List[ArchiveResult]], updated: datetime.datetime | None = None, schema: str = 'Link')[source]

Bases: object

timestamp: str
url: str
title: str | None
tags: str | None
sources: List[str]
history: Dict[str, List[ArchiveResult]]
updated: datetime | None = None
schema: str = 'Link'
overwrite(**kwargs)[source]

pure functional version of dict.update that returns a new instance

typecheck() None[source]
as_snapshot()[source]
classmethod from_json(json_info, guess=False)[source]
to_json(indent=4, sort_keys=True) str[source]
to_csv(cols: List[str] | None = None, separator: str = ',', ljust: int = 0) str[source]
snapshot_id
classmethod field_names()[source]
property archive_path: str
property archive_size: float
property url_hash
property scheme: str
property extension: str
property domain: str
property path: str
property basename: str
property base_url: str
property bookmarked_date: str | None
property updated_date: str | None
property archive_dates: List[datetime]
property oldest_archive_date: datetime | None
property newest_archive_date: datetime | None
property num_outputs: int
property num_failures: int
property is_static: bool
property is_archived: bool
latest_outputs(status: str = None) Dict[str, str | Exception | None][source]

get the latest output that each archive method produced for link

canonical_outputs() Dict[str, str | None][source]

predict the expected output paths that should be present after archiving

archivebox.index.sql module

archivebox.index.sql.parse_sql_main_index(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Iterator[Link][source]
archivebox.index.sql.remove_from_sql_main_index(snapshots: QuerySet, atomic: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) None[source]
archivebox.index.sql.write_sql_main_index(links: List[Link], out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) None[source]
archivebox.index.sql.list_migrations(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) List[Tuple[bool, str]][source]
archivebox.index.sql.apply_migrations(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) List[str][source]
archivebox.index.sql.get_admins(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) List[str][source]

Module contents

deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.

remove chrome://, about://, or other schemed links that can't be archived

ensures that all non-duplicate links have monotonically increasing timestamps

archivebox.index.lowest_uniq_timestamp(used_timestamps: OrderedDict, timestamp: str) str[source]

resolve duplicate timestamps by appending a decimal suffix: 1234, 1234 -> 1234.1, 1234.2
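A sketch of the documented behavior, assuming the same decimal-suffix scheme: if the timestamp is already taken, append an incrementing `.N` suffix until the result is unique. This is an illustration of the contract, not the real implementation.

```python
from collections import OrderedDict

def lowest_uniq_timestamp_sketch(used_timestamps, timestamp):
    # unused timestamps pass through unchanged
    if timestamp not in used_timestamps:
        return timestamp
    # otherwise find the lowest free decimal suffix
    suffix = 1
    while f'{timestamp}.{suffix}' in used_timestamps:
        suffix += 1
    return f'{timestamp}.{suffix}'

used = OrderedDict.fromkeys(['1234', '1234.1'])
print(lowest_uniq_timestamp_sketch(used, '1234'))
```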

archivebox.index.timed_index_update(out_path: Path)[source]
archivebox.index.write_main_index(links: List[Link], out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) None[source]

Writes links to sqlite3 file for a given list of links
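An illustration of the general idea of persisting link rows into a sqlite3 index, as `write_main_index` is documented to do. The table and column names here are made up for the sketch and are not ArchiveBox's actual schema.

```python
import sqlite3

# in-memory database standing in for the on-disk index.sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE snapshot (timestamp TEXT, url TEXT)')

links = [('1234', 'https://example.com')]
conn.executemany('INSERT INTO snapshot VALUES (?, ?)', links)
conn.commit()

print(conn.execute('SELECT url FROM snapshot').fetchone()[0])
```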

archivebox.index.load_main_index(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev'), warn: bool = True) List[Link][source]

parse and load existing index with any new links from import_path merged in

archivebox.index.load_main_index_meta(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) dict | None[source]

Given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB.

Link validation happens at an earlier stage; this method focuses on actual deduplication and timestamp fixing.

check for an existing link archive in the given directory, and load+merge it into the given link dict

archivebox.index.q_filter(snapshots: QuerySet, filter_patterns: List[str], filter_type: str = 'exact') QuerySet[source]
archivebox.index.search_filter(snapshots: QuerySet, filter_patterns: List[str], filter_type: str = 'search') QuerySet[source]
archivebox.index.snapshot_filter(snapshots: QuerySet, filter_patterns: List[str], filter_type: str = 'exact') QuerySet[source]
archivebox.index.get_indexed_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

indexed links without checking archive status or data directory validity

archivebox.index.get_archived_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

indexed links that are archived with a valid data directory

archivebox.index.get_unarchived_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

indexed links that are unarchived with no data directory or an empty data directory

archivebox.index.get_present_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

dirs that actually exist in the archive/ folder

archivebox.index.get_valid_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

dirs with a valid index matched to the main index and archived content

archivebox.index.get_invalid_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized

archivebox.index.get_duplicate_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

dirs that conflict with other directories that have the same link URL or timestamp

archivebox.index.get_orphaned_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

dirs that contain a valid index but aren’t listed in the main index

archivebox.index.get_corrupted_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

dirs that don’t contain a valid index and aren’t listed in the main index

archivebox.index.get_unrecognized_folders(snapshots, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Dict[str, Link | None][source]

dirs that don’t contain recognizable archive data and aren’t listed in the main index

archivebox.index.is_valid(link: Link) bool[source]
archivebox.index.is_corrupt(link: Link) bool[source]
archivebox.index.is_archived(link: Link) bool[source]
archivebox.index.is_unarchived(link: Link) bool[source]
archivebox.index.fix_invalid_folder_locations(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) Tuple[List[str], List[str]][source]