archivebox package

Subpackages

Submodules

archivebox.main module

archivebox.main.help(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Print the ArchiveBox help message and usage

archivebox.main.version(quiet: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Print the ArchiveBox version and dependency information

archivebox.main.run(subcommand: str, subcommand_args: Optional[List[str]], stdin: Optional[IO] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Run a given ArchiveBox subcommand with the given list of args

archivebox.main.init(force: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Initialize a new ArchiveBox collection in the current directory

archivebox.main.status(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Print out some info and statistics about the archive collection

archivebox.main.oneshot(url: str, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs'))[source]

Create a single URL archive folder with an index.json and index.html, and all the archive method outputs. You can run this to archive single pages without needing to create a whole collection with archivebox init.

archivebox.main.add(urls: Union[str, List[str]], depth: int = 0, update_all: bool = False, index_only: bool = False, overwrite: bool = False, init: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs'), extractors: str = '') → List[archivebox.index.schema.Link][source]

Add a new URL or list of URLs to your archive

archivebox.main.remove(filter_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', snapshots: Optional[django.db.models.query.QuerySet] = None, after: Optional[float] = None, before: Optional[float] = None, yes: bool = False, delete: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → List[archivebox.index.schema.Link][source]

Remove the specified URLs from the archive

archivebox.main.update(resume: Optional[float] = None, only_new: bool = True, index_only: bool = False, overwrite: bool = False, filter_patterns_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: Optional[str] = None, status: Optional[str] = None, after: Optional[str] = None, before: Optional[str] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → List[archivebox.index.schema.Link][source]

Import any new links from subscriptions and retry any previously failed/skipped links

archivebox.main.list_all(filter_patterns_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', status: Optional[str] = None, after: Optional[float] = None, before: Optional[float] = None, sort: Optional[str] = None, csv: Optional[str] = None, json: bool = False, html: bool = False, with_headers: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → Iterable[archivebox.index.schema.Link][source]

List, filter, and export information about archive entries

archivebox.main.list_folders(links: List[archivebox.index.schema.Link], status: str, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]

archivebox.main.config(config_options_str: Optional[str] = None, config_options: Optional[List[str]] = None, get: bool = False, set: bool = False, reset: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Get and set your ArchiveBox project configuration values

archivebox.main.schedule(add: bool = False, show: bool = False, clear: bool = False, foreground: bool = False, run_all: bool = False, quiet: bool = False, every: Optional[str] = None, depth: int = 0, import_path: Optional[str] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs'))[source]

Set ArchiveBox to regularly import URLs at specific times using cron

archivebox.main.server(runserver_args: Optional[List[str]] = None, reload: bool = False, debug: bool = False, init: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Run the ArchiveBox HTTP server

archivebox.main.manage(args: Optional[List[str]] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Run an ArchiveBox Django management command

archivebox.main.shell(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/latest/docs')) → None[source]

Enter an interactive ArchiveBox Django shell

archivebox.manage module

archivebox.system module

archivebox.system.run(*args, input=None, capture_output=True, text=False, **kwargs)[source]

Patched version of subprocess.run that fixes blocking I/O which made timeout= ineffective

archivebox.system.atomic_write(path: Union[pathlib.Path, str], contents: Union[dict, str, bytes], overwrite: bool = True) → None[source]

Safe atomic write to filesystem by writing to temp file + atomic rename
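The write-temp-then-rename pattern this describes can be sketched with only the standard library. This is an illustrative reimplementation, not ArchiveBox's actual code; in particular, the real function also accepts dict contents (presumably JSON-serializing them), which this sketch omits:

```python
import os
import tempfile
from pathlib import Path
from typing import Union

def atomic_write_sketch(path: Union[Path, str], contents: Union[str, bytes]) -> None:
    """Write to a temp file in the same directory, then atomically rename into place."""
    path = Path(path)
    mode = 'wb' if isinstance(contents, bytes) else 'w'
    # The temp file must live on the same filesystem as the target,
    # otherwise os.replace() is not atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, mode) as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())       # make sure bytes hit the disk before the rename
        os.replace(tmp_name, path)     # atomic on POSIX: readers see old or new, never partial
    except BaseException:
        os.unlink(tmp_name)
        raise
```

Because the rename is atomic, a crash mid-write leaves the old file intact rather than a half-written index.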

archivebox.system.chmod_file(path: str, cwd: str = '.', permissions: str = '755') → None[source]

chmod -R <permissions> <cwd>/<path>

archivebox.system.copy_and_overwrite(from_path: Union[str, pathlib.Path], to_path: Union[str, pathlib.Path])[source]

Copy a given file or directory to a given path, overwriting the destination

archivebox.system.get_dir_size(path: Union[str, pathlib.Path], recursive: bool = True, pattern: Optional[str] = None) → Tuple[int, int, int][source]

Get the total disk size of a given directory, optionally summing recursively and limiting to files matching a given pattern
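A stdlib sketch of what this computes. The `(size, num_dirs, num_files)` tuple shape and the fnmatch-on-filename pattern semantics are assumptions for illustration, not necessarily ArchiveBox's exact behavior:

```python
import os
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional, Tuple, Union

def get_dir_size_sketch(path: Union[str, Path],
                        recursive: bool = True,
                        pattern: Optional[str] = None) -> Tuple[int, int, int]:
    """Return (total_bytes, num_dirs, num_files) under path."""
    total_size = num_dirs = num_files = 0
    for root, dirs, files in os.walk(path):
        num_dirs += len(dirs)
        for name in files:
            # Assumed semantics: pattern filters on the bare file name
            if pattern and not fnmatch(name, pattern):
                continue
            num_files += 1
            total_size += os.path.getsize(os.path.join(root, name))
        if not recursive:
            break  # only count the top-level directory
    return total_size, num_dirs, num_files
```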

archivebox.system.dedupe_cron_jobs(cron: crontab.CronTab) → crontab.CronTab[source]

class archivebox.system.suppress_output(stdout=True, stderr=True)[source]

Bases: object

A context manager for doing a “deep suppression” of stdout and stderr in Python, i.e. it suppresses all output, even if the print originates in a compiled C/Fortran sub-function.

This will not suppress raised exceptions, since exceptions are printed to stderr just before a script exits, after the context manager has exited.

with suppress_output():
    rogue_function()
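The file-descriptor trick that makes this “deep” (covering C-level writes, not just Python's sys.stdout) can be sketched as follows. This is a simplified stand-in for suppress_output, not its exact code: duplicate the real descriptors, point fds 1 and 2 at os.devnull for the duration, then restore:

```python
import os

class deep_suppress:
    """Redirect the OS-level stdout/stderr file descriptors to /dev/null."""

    def __init__(self, stdout: bool = True, stderr: bool = True):
        self.fds = [fd for fd, enabled in ((1, stdout), (2, stderr)) if enabled]

    def __enter__(self):
        # Save copies of the current descriptors so we can restore them
        self.saved = {fd: os.dup(fd) for fd in self.fds}
        self.devnull = os.open(os.devnull, os.O_WRONLY)
        for fd in self.fds:
            os.dup2(self.devnull, fd)   # fd now points at /dev/null
        return self

    def __exit__(self, *exc):
        for fd, saved in self.saved.items():
            os.dup2(saved, fd)          # restore the original descriptor
            os.close(saved)
        os.close(self.devnull)
        return False                    # never swallow exceptions
```

Because the redirection happens at the descriptor level, output from child processes and compiled extensions that inherit fds 1/2 is suppressed too.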

archivebox.util module

archivebox.util.detect_encoding(rawdata)
archivebox.util.scheme(url)
archivebox.util.without_scheme(url)
archivebox.util.without_query(url)
archivebox.util.without_fragment(url)
archivebox.util.without_path(url)
archivebox.util.path(url)
archivebox.util.basename(url)
archivebox.util.domain(url)
archivebox.util.query(url)
archivebox.util.fragment(url)
archivebox.util.extension(url)
archivebox.util.base_url(url)
archivebox.util.without_www(url)
archivebox.util.without_trailing_slash(url)
archivebox.util.hashurl(url)
archivebox.util.urlencode(s)
archivebox.util.urldecode(s)
archivebox.util.htmlencode(s)
archivebox.util.htmldecode(s)
archivebox.util.short_ts(ts)
archivebox.util.ts_to_date(ts)
archivebox.util.ts_to_iso(ts)
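Most of the URL helpers above are thin wrappers around urllib.parse. An illustrative (not verbatim) implementation of a few of them; the exact normalization base_url performs in ArchiveBox is an assumption here:

```python
from urllib.parse import urlparse

def scheme(url: str) -> str:
    return urlparse(url).scheme

def domain(url: str) -> str:
    return urlparse(url).netloc

def path(url: str) -> str:
    return urlparse(url).path

def base_url(url: str) -> str:
    # Assumed behavior: the URL without its scheme, useful for
    # treating http:// and https:// variants of a page as the same entry
    return url.split('://', 1)[-1]

def without_trailing_slash(url: str) -> str:
    return url[:-1] if url.endswith('/') else url
```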
archivebox.util.is_static_file(url: str)[source]
archivebox.util.enforce_types(func)[source]

Enforce function arg and kwarg types at runtime using the function's Python 3 type hints
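A minimal sketch of such a runtime type-check decorator. This simplified version only validates annotations that are plain classes and skips typing constructs like Optional[...] or Union[...], which the real decorator presumably handles:

```python
import inspect
from functools import wraps

def enforce_types_sketch(func):
    """Raise TypeError when a call's arguments don't match plain-class annotations."""
    sig = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = func.__annotations__.get(name)
            # Only check plain classes; skip typing constructs like Optional[...]
            if isinstance(annotation, type) and not isinstance(value, annotation):
                raise TypeError(
                    f'{func.__name__}() expected {name}: {annotation.__name__}, '
                    f'got {type(value).__name__}'
                )
        return func(*args, **kwargs)
    return wrapper
```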

archivebox.util.docstring(text: Optional[str])[source]

Attach the given docstring to the decorated function

archivebox.util.str_between(string: str, start: str, end: str = None) → str[source]

(<abc>12345</def>, <abc>, </def>) -> 12345
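The behavior shown by that example can be sketched in a couple of lines with str.split (an assumed implementation, not necessarily the library's own):

```python
def str_between_sketch(string: str, start: str, end: str = None) -> str:
    """Return the substring of `string` between the first `start` and the following `end`."""
    result = string.split(start, 1)[-1]
    if end is not None:
        result = result.rsplit(end, 1)[0]
    return result
```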

archivebox.util.parse_date(date: Any) → Optional[datetime.datetime][source]

Parse unix timestamps, iso format, and human-readable strings
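A best-effort sketch of this kind of multi-format parser. The real function also accepts free-form human-readable strings (presumably via a date-parsing library); this stdlib-only version handles the unix-timestamp and ISO-8601 cases:

```python
from datetime import datetime
from typing import Any, Optional

def parse_date_sketch(date: Any) -> Optional[datetime]:
    """Parse datetimes, unix timestamps (int/float/str), and ISO-8601 strings."""
    if date is None:
        return None
    if isinstance(date, datetime):
        return date
    if isinstance(date, (int, float)):
        return datetime.fromtimestamp(float(date))
    if isinstance(date, str):
        s = date.strip()
        # A purely numeric string is treated as a unix timestamp
        if s.replace('.', '', 1).isdigit():
            return datetime.fromtimestamp(float(s))
        return datetime.fromisoformat(s)
    raise ValueError(f'Tried to parse invalid date! {date!r}')
```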

archivebox.util.download_url(url: str, timeout: int = None) → str[source]

Download the contents of a remote url and return the text

archivebox.util.get_headers(url: str, timeout: int = None) → str[source]

Download the contents of a remote url and return the headers

archivebox.util.chrome_args(**options) → List[str][source]

Helper to build up a Chrome shell command with arguments

archivebox.util.ansi_to_html(text)[source]

Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html

class archivebox.util.AttributeDict(*args, **kwargs)[source]

Bases: dict

Helper to allow accessing dict values via Example.key or Example[‘key’]
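The dual access pattern can be sketched by overriding attribute lookup on a dict subclass (an illustrative version, not the library's exact code):

```python
class AttributeDictSketch(dict):
    """A dict whose keys are also readable and writable as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            # Attribute access must raise AttributeError, not KeyError
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value
```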

class archivebox.util.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

Extended json serializer that supports serializing several model fields and objects

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
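An encoder in this style, following the subclassing contract above, might look like the following. The specific types handled here (datetime, bytes, set) are illustrative; ArchiveBox's actual ExtendedEncoder covers its own model fields:

```python
import json
from datetime import datetime

class SketchEncoder(json.JSONEncoder):
    """Serialize a few extra types that plain json.dumps rejects."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, bytes):
            return obj.decode(errors='replace')
        if isinstance(obj, set):
            return sorted(obj)
        # Fall back to the base class, which raises TypeError
        return super().default(obj)
```

Usage: `json.dumps({'when': datetime(2020, 1, 1)}, cls=SketchEncoder)`.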

Module contents