archivebox.index package

Submodules

archivebox.index.csv module

archivebox.index.html module

archivebox.index.html.parse_html_main_index(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Iterator[str]
    Parse an archive index HTML file and return the list of URLs.

archivebox.index.html.generate_index_from_links(links: List[archivebox.index.schema.Link], with_headers: bool)

archivebox.index.html.main_index_template(links: List[archivebox.index.schema.Link], template: str = 'static_index.html') → str
    Render the template for the entire main index.

archivebox.index.html.write_html_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → None

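For orientation, a minimal rendering sketch for the functions above. It assumes Django settings are already configured (ArchiveBox's CLI entrypoints normally handle this), and that links is a hypothetical List[archivebox.index.schema.Link], e.g. loaded via archivebox.index.load_main_index() (see Module contents below):

    from archivebox.index.html import main_index_template

    # links: a List[Link] loaded elsewhere (hypothetical here)
    html = main_index_template(links, template='static_index.html')  # rendered HTML as a str
    with open('index.html', 'w') as f:
        f.write(html)
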
archivebox.index.json module

archivebox.index.json.generate_json_index_from_links(links: List[archivebox.index.schema.Link], with_headers: bool)

archivebox.index.json.parse_json_main_index(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Iterator[archivebox.index.schema.Link]
    Parse an archive index JSON file and return the list of links.

archivebox.index.json.write_json_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → None
    Write a JSON file with some info about the link.

archivebox.index.json.parse_json_link_details(out_dir: Union[pathlib.Path, str], guess: Optional[bool] = False) → Optional[archivebox.index.schema.Link]
    Load the JSON link index from a given directory.

archivebox.index.json.parse_json_links_details(out_dir: Union[pathlib.Path, str]) → Iterator[archivebox.index.schema.Link]
    Read through all the archive data folders and return the parsed links.

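A per-snapshot round-trip sketch using write_json_link_details() and parse_json_link_details(); both the link object and the archive/1612345678.0 directory are hypothetical:

    from archivebox.index.json import write_json_link_details, parse_json_link_details

    # link: a hypothetical archivebox.index.schema.Link instance
    write_json_link_details(link, out_dir='archive/1612345678.0')  # writes the per-link JSON index
    reloaded = parse_json_link_details('archive/1612345678.0')     # Optional[Link]; None if missing or unparseable
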
class archivebox.index.json.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
    Bases: json.encoder.JSONEncoder

    Extended JSON serializer that supports serializing several model fields and objects.

    default(obj)
        Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

        For example, to support arbitrary iterators, you could implement default like this:

        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)

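Because ExtendedEncoder subclasses the standard-library JSONEncoder, it can be passed to json.dumps() via the cls argument. A minimal sketch; the record contents are hypothetical, and datetime is assumed to be among the extra types this encoder handles:

    import json
    from datetime import datetime

    from archivebox.index.json import ExtendedEncoder

    record = {'url': 'https://example.com', 'added': datetime.now()}  # hypothetical data
    print(json.dumps(record, indent=4, cls=ExtendedEncoder))          # datetime support per the assumption above
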
archivebox.index.schema module

WARNING: THIS FILE IS ALL LEGACY CODE TO BE REMOVED.
DO NOT ADD ANY NEW FEATURES TO THIS FILE, NEW CODE GOES HERE: core/models.py

class archivebox.index.schema.ArchiveResult(cmd: List[str], pwd: Optional[str], cmd_version: Optional[str], output: Optional[Union[str, Exception]], status: str, start_ts: datetime.datetime, end_ts: datetime.datetime, index_texts: Optional[List[str]] = None, schema: str = 'ArchiveResult')
    Bases: object

    index_texts = None

    schema = 'ArchiveResult'

    duration

class archivebox.index.schema.Link(timestamp: str, url: str, title: Optional[str], tags: Optional[str], sources: List[str], history: Dict[str, List[archivebox.index.schema.ArchiveResult]] = <factory>, updated: Optional[datetime.datetime] = None, schema: str = 'Link')
    Bases: object

    updated = None

    schema = 'Link'

    snapshot_id

    link_dir

    archive_path

    archive_size

    url_hash

    scheme

    extension

    domain

    path

    basename

    base_url

    bookmarked_date

    updated_date

    archive_dates

    oldest_archive_date

    newest_archive_date

    num_outputs

    num_failures

    is_static

    is_archived

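A minimal construction sketch for the legacy Link dataclass; all field values are hypothetical, and per the warning above, new code should use core/models.py instead:

    from archivebox.index.schema import Link

    link = Link(
        timestamp='1612345678.0',        # unique string key for this snapshot
        url='https://example.com/page',
        title='Example Page',            # Optional[str]
        tags=None,                       # Optional[str]
        sources=['cli'],                 # hypothetical import source
    )
    print(link.domain, link.base_url, link.is_archived)
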
archivebox.index.sql module

archivebox.index.sql.parse_sql_main_index(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Iterator[archivebox.index.schema.Link]

archivebox.index.sql.remove_from_sql_main_index(snapshots: django.db.models.query.QuerySet, atomic: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → None

archivebox.index.sql.write_sql_main_index(links: List[archivebox.index.schema.Link], out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → None

archivebox.index.sql.write_sql_link_details(link: archivebox.index.schema.Link, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → None

archivebox.index.sql.list_migrations(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → List[Tuple[bool, str]]

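A quick status sketch for list_migrations(); the (bool, str) tuple is assumed to mean (applied, migration name), which the signature alone does not specify:

    from archivebox.index.sql import list_migrations

    for applied, migration in list_migrations():       # List[Tuple[bool, str]]
        print('[x]' if applied else '[ ]', migration)  # meaning of the bool is an assumption
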
Module contents

archivebox.index.merge_links(a: archivebox.index.schema.Link, b: archivebox.index.schema.Link) → archivebox.index.schema.Link
    Deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.

archivebox.index.validate_links(links: Iterable[archivebox.index.schema.Link]) → List[archivebox.index.schema.Link]

archivebox.index.archivable_links(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link]
    Remove chrome://, about://, and links with other schemes that can't be archived.

archivebox.index.fix_duplicate_links(sorted_links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link]
    Ensures that all non-duplicate links have monotonically increasing timestamps.

archivebox.index.sorted_links(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link]

archivebox.index.links_after_timestamp(links: Iterable[archivebox.index.schema.Link], resume: Optional[float] = None) → Iterable[archivebox.index.schema.Link]

archivebox.index.lowest_uniq_timestamp(used_timestamps: collections.OrderedDict, timestamp: str) → str
    Resolve duplicate timestamps by appending a decimal suffix: 1234, 1234 -> 1234.1, 1234.2.

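A usage sketch of the behavior described above; the OrderedDict contents are hypothetical, and the exact suffix returned is determined by the function:

    from collections import OrderedDict

    from archivebox.index import lowest_uniq_timestamp

    used = OrderedDict([('1234', None), ('1234.1', None)])  # timestamps already taken (values hypothetical)
    new_ts = lowest_uniq_timestamp(used, '1234')            # expected next free value, e.g. '1234.2'
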
archivebox.index.write_main_index(links: List[archivebox.index.schema.Link], out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → None
    Write the given list of links to the main SQLite3 index file.

archivebox.index.load_main_index(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs'), warn: bool = True) → List[archivebox.index.schema.Link]
    Parse and load the existing index, with any new links from import_path merged in.

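A round-trip sketch combining the two functions above, assuming an initialized ArchiveBox collection at out_dir and a configured Django environment (the CLI entrypoints normally provide both):

    from pathlib import Path

    from archivebox.index import load_main_index, write_main_index

    out_dir = Path('.')                        # hypothetical: an existing data directory
    links = load_main_index(out_dir=out_dir)   # List[Link] parsed from the existing index
    write_main_index(links, out_dir=out_dir)   # re-writes the SQLite3 main index
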
archivebox.index.load_main_index_meta(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Optional[dict]

archivebox.index.parse_links_from_source(source_path: str, root_url: Optional[str] = None, parser: str = 'auto') → Tuple[List[archivebox.index.schema.Link], List[archivebox.index.schema.Link]]

archivebox.index.fix_duplicate_links_in_index(snapshots: django.db.models.query.QuerySet, links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link]
    Given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB.

archivebox.index.dedupe_links(snapshots: django.db.models.query.QuerySet, new_links: List[archivebox.index.schema.Link]) → List[archivebox.index.schema.Link]
    Link validation happens at an earlier stage; this function focuses on actual deduplication and timestamp fixing.

archivebox.index.write_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, skip_sql_index: bool = False) → None

archivebox.index.load_link_details(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → archivebox.index.schema.Link
    Check for an existing link archive in the given directory, and load+merge it into the given link dict.

archivebox.index.q_filter(snapshots: django.db.models.query.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.query.QuerySet

archivebox.index.search_filter(snapshots: django.db.models.query.QuerySet, filter_patterns: List[str], filter_type: str = 'search') → django.db.models.query.QuerySet

archivebox.index.snapshot_filter(snapshots: django.db.models.query.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.query.QuerySet

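A filtering sketch using the documented filter_type defaults ('exact' and 'search'); core.models.Snapshot is assumed to be the Django model behind these QuerySets, per the core/models.py note above:

    from archivebox.index import search_filter, snapshot_filter
    from core.models import Snapshot  # assumption: the model backing the snapshots QuerySet

    exact = snapshot_filter(Snapshot.objects.all(), ['https://example.com/page'], filter_type='exact')
    hits = search_filter(Snapshot.objects.all(), ['example'], filter_type='search')
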
archivebox.index.get_indexed_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Indexed links, without checking archive status or data directory validity.

archivebox.index.get_archived_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Indexed links that are archived with a valid data directory.

archivebox.index.get_unarchived_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Indexed links that are unarchived, with no data directory or an empty data directory.

archivebox.index.get_present_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Dirs that actually exist in the archive/ folder.

archivebox.index.get_valid_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Dirs with a valid index matched to the main index and archived content.

archivebox.index.get_invalid_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized.

archivebox.index.get_duplicate_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Dirs that conflict with other directories that have the same link URL or timestamp.

archivebox.index.get_orphaned_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Dirs that contain a valid index but aren’t listed in the main index.

archivebox.index.get_corrupted_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Dirs that don’t contain a valid index and aren’t listed in the main index.

archivebox.index.get_unrecognized_folders(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.0/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
    Dirs that don’t contain recognizable archive data and aren’t listed in the main index.

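Finally, an audit sketch over the folder-status helpers above, with the same core.models.Snapshot assumption; out_dir is a hypothetical data directory root:

    from pathlib import Path

    from archivebox.index import get_orphaned_folders
    from core.models import Snapshot  # assumption, as above

    out_dir = Path('.')
    snapshots = Snapshot.objects.all()
    for folder, link in get_orphaned_folders(snapshots, out_dir=out_dir).items():
        print('orphaned:', folder, getattr(link, 'url', None))  # link may be None per the return type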