archivebox package

Subpackages

Submodules

archivebox.main module

archivebox.main.help(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Print the ArchiveBox help message and usage

archivebox.main.version(quiet: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Print the ArchiveBox version and dependency information

archivebox.main.run(subcommand: str, subcommand_args: List[str] | None, stdin: IO | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Run a given ArchiveBox subcommand with the given list of args

archivebox.main.init(force: bool = False, quick: bool = False, setup: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Initialize a new ArchiveBox collection in the current directory

archivebox.main.status(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Print out some info and statistics about the archive collection

archivebox.main.oneshot(url: str, extractors: str = '', out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev'))[source]: Create a single URL archive folder with an index.json and index.html, and all the archive method outputs. You can run this to archive single pages without needing to create a whole collection with archivebox init.

archivebox.main.add(urls: str | List[str], tag: str = '', depth: int = 0, update: bool = False, update_all: bool = False, index_only: bool = False, overwrite: bool = False, init: bool = False, extractors: str = '', parser: str = 'auto', out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → List[Link][source]: Add a new URL or list of URLs to your archive

archivebox.main.remove(filter_str: str | None = None, filter_patterns: List[str] | None = None, filter_type: str = 'exact', snapshots: QuerySet | None = None, after: float | None = None, before: float | None = None, yes: bool = False, delete: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → List[Link][source]: Remove the specified URLs from the archive

archivebox.main.update(resume: float | None = None, only_new: bool = True, index_only: bool = False, overwrite: bool = False, filter_patterns_str: str | None = None, filter_patterns: List[str] | None = None, filter_type: str | None = None, status: str | None = None, after: str | None = None, before: str | None = None, extractors: str = '', out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → List[Link][source]: Import any new links from subscriptions and retry any previously failed/skipped links

archivebox.main.list_all(filter_patterns_str: str | None = None, filter_patterns: List[str] | None = None, filter_type: str = 'exact', status: str | None = None, after: float | None = None, before: float | None = None, sort: str | None = None, csv: str | None = None, json: bool = False, html: bool = False, with_headers: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → Iterable[Link][source]: List, filter, and export information about archive entries

archivebox.main.list_links(snapshots: QuerySet | None = None, filter_patterns: List[str] | None = None, filter_type: str = 'exact', after: float | None = None, before: float | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → Iterable[Link][source]

archivebox.main.list_folders(links: List[Link], status: str, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → Dict[str, Link | None][source]

archivebox.main.setup(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Automatically install all ArchiveBox dependencies and extras

archivebox.main.config(config_options_str: str | None = None, config_options: List[str] | None = None, get: bool = False, set: bool = False, reset: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Get and set your ArchiveBox project configuration values

archivebox.main.schedule(add: bool = False, show: bool = False, clear: bool = False, foreground: bool = False, run_all: bool = False, quiet: bool = False, every: str | None = None, tag: str = '', depth: int = 0, overwrite: bool = False, update: bool = False, import_path: str | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev'))[source]: Set ArchiveBox to regularly import URLs at specific times using cron

archivebox.main.server(runserver_args: List[str] | None = None, reload: bool = False, debug: bool = False, init: bool = False, quick_init: bool = False, createsuperuser: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Run the ArchiveBox HTTP server

archivebox.main.manage(args: List[str] | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Run an ArchiveBox Django management command

archivebox.main.shell(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) → None[source]: Enter an interactive ArchiveBox Django shell

archivebox.manage module

archivebox.system module

archivebox.system.run(cmd, *args, input=None, capture_output=True, timeout=None, check=False, text=False, start_new_session=True, **kwargs)[source]: Patched version of subprocess.run that kills forked child subprocesses and fixes the blocking I/O that made timeout= ineffective. Mostly copied from https://github.com/python/cpython/blob/master/Lib/subprocess.py

archivebox.system.atomic_write(path: Path | str, contents: dict | str | bytes, overwrite: bool = True) → None[source]: Safe atomic write to filesystem by writing to temp file + atomic rename
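
The "temp file + atomic rename" pattern named above can be sketched as follows. This is a minimal illustration, not ArchiveBox's implementation: the name atomic_write_sketch, the JSON serialization of dict contents, and the FileExistsError raised when overwrite is False are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Union


def atomic_write_sketch(path: Union[Path, str], contents: Union[dict, str, bytes],
                        overwrite: bool = True) -> None:
    """Hypothetical stand-in for atomic_write; behavior details are assumed."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(path)

    # Assumption: dicts are stored as JSON; str is encoded as UTF-8; bytes pass through.
    if isinstance(contents, dict):
        data = json.dumps(contents, indent=4).encode("utf-8")
    elif isinstance(contents, str):
        data = contents.encode("utf-8")
    else:
        data = contents

    # The temp file lives next to the destination so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic swap: readers see the old or the new file
    except BaseException:
        # Best-effort cleanup of the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the data is fully written before os.replace runs, a crash midway leaves the previous file untouched rather than a half-written one.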

archivebox.system.chmod_file(path: str, cwd: str = '.') → None[source]: chmod -R <permissions> <cwd>/<path>

archivebox.system.copy_and_overwrite(from_path: str | Path, to_path: str | Path)[source]: copy a given file or directory to a given path, overwriting the destination

archivebox.system.get_dir_size(path: str | Path, recursive: bool = True, pattern: str | None = None) → Tuple[int, int, int][source]: get the total disk size of a given directory, optionally summing up recursively and limiting to a given filter list

archivebox.system.dedupe_cron_jobs(cron: CronTab) → CronTab[source]

class archivebox.system.suppress_output(stdout=True, stderr=True)[source]

Bases: object

A context manager for doing a “deep suppression” of stdout and stderr in Python, i.e. it suppresses all printed output, even if the print originates in a compiled C/Fortran sub-function.

This will not suppress raised exceptions, since exceptions are printed to stderr just before a script exits, and after the context manager has exited (at least, I think that is why it lets exceptions through).

with suppress_output():
    rogue_function()
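
One common way to get this kind of "deep suppression" is to redirect the process-level file descriptors rather than sys.stdout. The sketch below assumes that approach; the class name suppress_output_sketch and its internals are hypothetical and are not taken from ArchiveBox's source.

```python
import os
import sys


class suppress_output_sketch:
    """Hypothetical illustration of fd-level output suppression."""

    def __init__(self, stdout: bool = True, stderr: bool = True):
        # File descriptor 1 is stdout and 2 is stderr.
        self.targets = [fd for fd, enabled in ((1, stdout), (2, stderr)) if enabled]
        self.saved = {}

    def __enter__(self):
        sys.stdout.flush()
        sys.stderr.flush()
        self.devnull = os.open(os.devnull, os.O_WRONLY)
        for fd in self.targets:
            self.saved[fd] = os.dup(fd)  # keep a copy of the real descriptor
            os.dup2(self.devnull, fd)    # point the descriptor at the null device
        return self

    def __exit__(self, exc_type, exc, tb):
        sys.stdout.flush()
        sys.stderr.flush()
        for fd, saved in self.saved.items():
            os.dup2(saved, fd)           # restore the original descriptor
            os.close(saved)
        self.saved.clear()
        os.close(self.devnull)
        return False                     # exceptions propagate to the caller
```

Since the redirection happens below the Python layer, writes made by compiled extensions to descriptors 1 and 2 are discarded as well.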

archivebox.util module

archivebox.util.detect_encoding(rawdata)

archivebox.util.scheme(url)

archivebox.util.without_scheme(url)

archivebox.util.without_query(url)

archivebox.util.without_fragment(url)

archivebox.util.without_path(url)

archivebox.util.path(url)

archivebox.util.basename(url)

archivebox.util.domain(url)

archivebox.util.query(url)

archivebox.util.fragment(url)

archivebox.util.extension(url)

archivebox.util.base_url(url)

archivebox.util.without_www(url)

archivebox.util.without_trailing_slash(url)

archivebox.util.hashurl(url)

archivebox.util.urlencode(s)

archivebox.util.urldecode(s)

archivebox.util.htmlencode(s)

archivebox.util.htmldecode(s)

archivebox.util.short_ts(ts)

archivebox.util.ts_to_date_str(ts)

archivebox.util.ts_to_iso(ts)

archivebox.util.is_static_file(url: str)[source]
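
The URL helpers listed above (scheme, domain, path, and so on) carry no docstrings. As a rough guide to what their names suggest, here is a sketch built on urllib.parse. These are hypothetical re-creations written for illustration; the real helpers may differ in edge cases such as casing, ports, or trailing slashes.

```python
from urllib.parse import urlparse

# Hypothetical re-creations of a few helpers; not ArchiveBox's actual code.


def scheme(url: str) -> str:
    return urlparse(url).scheme.lower()


def domain(url: str) -> str:
    return urlparse(url).netloc


def path(url: str) -> str:
    return urlparse(url).path


def query(url: str) -> str:
    return urlparse(url).query


def fragment(url: str) -> str:
    return urlparse(url).fragment


def without_scheme(url: str) -> str:
    # Dropping the scheme leaves a leading "//", which is stripped here.
    return urlparse(url)._replace(scheme="").geturl().lstrip("/")


def without_query(url: str) -> str:
    return urlparse(url)._replace(query="").geturl()


def without_fragment(url: str) -> str:
    return urlparse(url)._replace(fragment="").geturl()


def basename(url: str) -> str:
    return urlparse(url).path.rsplit("/", 1)[-1]


def extension(url: str) -> str:
    name = basename(url)
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


url = "https://www.example.com/docs/page.html?x=1#top"
print(scheme(url), domain(url), path(url), extension(url))
```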

archivebox.util.enforce_types(func)[source]: Enforce function arg and kwarg types at runtime using its python3 type hints
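
Runtime enforcement of type hints can be sketched with inspect and typing as shown below. This is an assumption-laden illustration: enforce_types_sketch is a hypothetical name, it only checks hints that are plain classes, and it skips generics such as List[str], which the real decorator may handle differently.

```python
import inspect
from functools import wraps
from typing import get_origin, get_type_hints


def enforce_types_sketch(func):
    """Hypothetical decorator: check call arguments against plain-class hints."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            if signature.parameters[name].kind in variadic:
                continue  # *args and **kwargs are not checked in this sketch
            expected = hints.get(name)
            # Only plain classes are checked; parameterized generics are skipped.
            if expected is None or get_origin(expected) is not None:
                continue
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}() argument {name!r} must be "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@enforce_types_sketch
def repeat(text: str, times: int = 2) -> str:
    return text * times
```

Calling repeat("ab", "3") raises TypeError at call time instead of failing later inside the function body.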

archivebox.util.docstring(text: str | None)[source]: attach the given docstring to the decorated function

archivebox.util.str_between(string: str, start: str, end: str = None) → str[source]: (<abc>12345</def>, <abc>, </def>) -> 12345
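
The terse docstring above describes extracting the text between two markers. A minimal sketch of that behavior, assuming the first occurrence of start and the last occurrence of end are used (the function name str_between_sketch and that choice of occurrences are assumptions):

```python
from typing import Optional


def str_between_sketch(string: str, start: str, end: Optional[str] = None) -> str:
    """Hypothetical illustration: return the text after start and before end."""
    content = string.split(start, 1)[-1]      # keep everything after the first start
    if end is not None:
        content = content.rsplit(end, 1)[0]   # drop everything from the last end
    return content


print(str_between_sketch("<abc>12345</def>", "<abc>", "</def>"))
```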

archivebox.util.parse_date(date: Any) → datetime | None[source]: Parse unix timestamps, iso format, and human-readable strings

archivebox.util.download_url(url: str, timeout: int = None) → str[source]: Download the contents of a remote url and return the text

archivebox.util.get_headers(url: str, timeout: int = None) → str[source]: Download the contents of a remote url and return the headers

archivebox.util.chrome_args(**options) → List[str][source]: helper to build up a chrome shell command with arguments

archivebox.util.chrome_cleanup()[source]: Cleans up any state or runtime files that chrome leaves behind when killed by a timeout or other error

archivebox.util.ansi_to_html(text)[source]: Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html

class archivebox.util.AttributeDict(*args, **kwargs)[source]

Bases: dict

Helper to allow accessing dict values via Example.key or Example['key']
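
A dict subclass with attribute-style access can be sketched as below. AttributeDictSketch is a hypothetical name, and forwarding __getattr__ and __setattr__ to item access is just one plausible way to get this behavior; it is not necessarily how ArchiveBox implements it.

```python
class AttributeDictSketch(dict):
    """Hypothetical illustration of a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


example = AttributeDictSketch(key="value")
example.other = 1
print(example.key, example["key"], example["other"])
```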

class archivebox.util.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: JSONEncoder

Extended json serializer that supports serializing several model fields and objects

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
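
To illustrate how an extended encoder plugs into json.dumps, here is a sketch that overrides default for a few common non-JSON types. The class name ExtendedEncoderSketch and the choice of types (datetime, Path, set) are assumptions for the example; the source does not list which objects ExtendedEncoder actually supports.

```python
import json
from datetime import datetime
from pathlib import Path


class ExtendedEncoderSketch(json.JSONEncoder):
    """Hypothetical encoder; the supported types are chosen for illustration."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, Path):
            return str(obj)
        if isinstance(obj, (set, frozenset)):
            return sorted(obj)
        # Fall back to the base class, which raises TypeError for unknown types.
        return super().default(obj)


record = {"added": datetime(2024, 1, 2, 3, 4, 5), "dir": Path("archive"), "tags": {"b", "a"}}
print(json.dumps(record, cls=ExtendedEncoderSketch))
```

Passing the class through the cls argument keeps call sites unchanged apart from that one keyword.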