archivebox.index package¶

Submodules¶

archivebox.index.csv module¶

archivebox.index.csv.links_to_csv(links: List[archivebox.index.schema.Link], cols: Optional[List[str]] = None, header: bool = True, separator: str = ',', ljust: int = 0) → str[source]¶

archivebox.index.csv.to_csv(obj: Any, cols: List[str], separator: str = ',', ljust: int = 0) → str[source]¶

archivebox.index.html module¶

archivebox.index.html.join(*paths)¶

archivebox.index.html.parse_html_main_index(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Iterator[str][source]¶: parse an archive index html file and return the list of urls

archivebox.index.html.write_html_main_index(links: List[archivebox.index.schema.Link], out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs', finished: bool = False) → None[source]¶: write the html link index to a given path

archivebox.index.html.main_index_template(links: List[archivebox.index.schema.Link], finished: bool = True) → str[source]¶: render the template for the entire main index

archivebox.index.html.main_index_row_template(link: archivebox.index.schema.Link) → str[source]¶: render the template for an individual link row of the main index

archivebox.index.html.write_html_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → None[source]¶

archivebox.index.html.link_details_template(link: archivebox.index.schema.Link) → str[source]¶

archivebox.index.html.render_legacy_template(template_path: str, context: Mapping[str, str]) → str[source]¶: render a given html template string with the given template content

archivebox.index.json module¶

archivebox.index.json.parse_json_main_index(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Iterator[archivebox.index.schema.Link][source]¶: parse an archive index json file and return the list of links

archivebox.index.json.write_json_main_index(links: List[archivebox.index.schema.Link], out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → None[source]¶: write the json link index to a given path

archivebox.index.json.write_json_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → None[source]¶: write a json file with some info about the link

archivebox.index.json.parse_json_link_details(out_dir: str) → Optional[archivebox.index.schema.Link][source]¶: load the json link index from a given directory

archivebox.index.json.parse_json_links_details(out_dir: str) → Iterator[archivebox.index.schema.Link][source]¶: read through all the archive data folders and return the parsed links

class archivebox.index.json.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶

Bases: json.encoder.JSONEncoder

Extended json serializer that supports serializing several model fields and objects

default(obj)[source]¶

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

archivebox.index.json.to_json(obj: Any, indent: Optional[int] = 4, sort_keys: bool = True, cls=<class 'archivebox.index.json.ExtendedEncoder'>) → str[source]¶

archivebox.index.schema module¶

exception archivebox.index.schema.ArchiveError(message, hints=None)[source]¶: Bases: Exception

class archivebox.index.schema.ArchiveResult(cmd: List[str], pwd: Union[str, NoneType], cmd_version: Union[str, NoneType], output: Union[str, Exception, NoneType], status: str, start_ts: datetime.datetime, end_ts: datetime.datetime, schema: str = 'ArchiveResult')[source]¶

Bases: object

schema = 'ArchiveResult'¶

typecheck() → None[source]¶

classmethod from_json(json_info)[source]¶

to_dict(*keys) → dict[source]¶

to_json(indent=4, sort_keys=True) → str[source]¶

to_csv(cols: Optional[List[str]] = None, separator: str = ', ', ljust: int = 0) → str[source]¶

classmethod field_names()[source]¶

duration¶

class archivebox.index.schema.Link(timestamp: str, url: str, title: Union[str, NoneType], tags: Union[str, NoneType], sources: List[str], history: Dict[str, List[archivebox.index.schema.ArchiveResult]] = <factory>, updated: Union[datetime.datetime, NoneType] = None, schema: str = 'Link')[source]¶

Bases: object

updated = None¶

schema = 'Link'¶

overwrite(**kwargs)[source]¶: pure functional version of dict.update that returns a new instance

typecheck() → None[source]¶

classmethod from_json(json_info)[source]¶

to_json(indent=4, sort_keys=True) → str[source]¶

to_csv(cols: Optional[List[str]] = None, separator: str = ', ', ljust: int = 0) → str[source]¶

classmethod field_names()[source]¶

link_dir¶

archive_path¶

url_hash¶

scheme¶

extension¶

domain¶

path¶

basename¶

base_url¶

bookmarked_date¶

updated_date¶

archive_dates¶

oldest_archive_date¶

newest_archive_date¶

num_outputs¶

num_failures¶

is_static¶

is_archived¶

latest_outputs(status: str = None) → Dict[str, Union[str, Exception, None]][source]¶: get the latest output that each archive method produced for link

canonical_outputs() → Dict[str, Optional[str]][source]¶: predict the expected output paths that should be present after archiving

archivebox.index.sql module¶

archivebox.index.sql.parse_sql_main_index(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Iterator[archivebox.index.schema.Link][source]¶

archivebox.index.sql.write_sql_main_index(links: List[archivebox.index.schema.Link], out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → None[source]¶

archivebox.index.sql.list_migrations(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → List[Tuple[bool, str]][source]¶

archivebox.index.sql.apply_migrations(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → List[str][source]¶

archivebox.index.sql.get_admins(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → List[str][source]¶

Module contents¶

archivebox.index.merge_links(a: archivebox.index.schema.Link, b: archivebox.index.schema.Link) → archivebox.index.schema.Link[source]¶: deterministially merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.

archivebox.index.validate_links(links: Iterable[archivebox.index.schema.Link]) → List[archivebox.index.schema.Link][source]¶

archivebox.index.archivable_links(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link][source]¶: remove chrome://, about:// or other schemed links that cant be archived

archivebox.index.uniquefied_links(sorted_links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link][source]¶: ensures that all non-duplicate links have monotonically increasing timestamps

archivebox.index.sorted_links(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link][source]¶

archivebox.index.links_after_timestamp(links: Iterable[archivebox.index.schema.Link], resume: Optional[float] = None) → Iterable[archivebox.index.schema.Link][source]¶

archivebox.index.lowest_uniq_timestamp(used_timestamps: collections.OrderedDict, timestamp: str) → str[source]¶: resolve duplicate timestamps by appending a decimal 1234, 1234 -> 1234.1, 1234.2

archivebox.index.timed_index_update(out_path: str)[source]¶

archivebox.index.write_main_index(links: List[archivebox.index.schema.Link], out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs', finished: bool = False) → None[source]¶: create index.html file for a given list of links

archivebox.index.load_main_index(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs', warn: bool = True) → List[archivebox.index.schema.Link][source]¶: parse and load existing index with any new links from import_path merged in

archivebox.index.load_main_index_meta(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Optional[dict][source]¶

archivebox.index.import_new_links(existing_links: List[archivebox.index.schema.Link], import_path: str, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Tuple[List[archivebox.index.schema.Link], List[archivebox.index.schema.Link]][source]¶

archivebox.index.patch_main_index(link: archivebox.index.schema.Link, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → None[source]¶: hack to in-place update one row’s info in the generated index files

archivebox.index.write_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → None[source]¶

archivebox.index.load_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → archivebox.index.schema.Link[source]¶: check for an existing link archive in the given directory, and load+merge it into the given link dict

archivebox.index.link_matches_filter(link: archivebox.index.schema.Link, filter_patterns: List[str], filter_type: str = 'exact') → bool[source]¶

archivebox.index.get_indexed_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: indexed links without checking archive status or data directory validity

archivebox.index.get_archived_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: indexed links that are archived with a valid data directory

archivebox.index.get_unarchived_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: indexed links that are unarchived with no data directory or an empty data directory

archivebox.index.get_present_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: dirs that actually exist in the archive/ folder

archivebox.index.get_valid_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: dirs with a valid index matched to the main index and archived content

archivebox.index.get_invalid_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized

archivebox.index.get_duplicate_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: dirs that conflict with other directories that have the same link URL or timestamp

archivebox.index.get_orphaned_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: dirs that contain a valid index but aren’t listed in the main index

archivebox.index.get_corrupted_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: dirs that don’t contain a valid index and aren’t listed in the main index

archivebox.index.get_unrecognized_folders(links, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Dict[str, Optional[archivebox.index.schema.Link]][source]¶: dirs that don’t contain recognizable archive data and aren’t listed in the main index

archivebox.index.is_valid(link: archivebox.index.schema.Link) → bool[source]¶

archivebox.index.is_corrupt(link: archivebox.index.schema.Link) → bool[source]¶

archivebox.index.is_archived(link: archivebox.index.schema.Link) → bool[source]¶

archivebox.index.is_unarchived(link: archivebox.index.schema.Link) → bool[source]¶

archivebox.index.fix_invalid_folder_locations(out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → Tuple[List[str], List[str]][source]¶