archivebox package
Subpackages
- archivebox.cli package
- Submodules
- archivebox.cli.archivebox_add module
- archivebox.cli.archivebox_config module
- archivebox.cli.archivebox_help module
- archivebox.cli.archivebox_status module
- archivebox.cli.archivebox_init module
- archivebox.cli.archivebox_list module
- archivebox.cli.archivebox_manage module
- archivebox.cli.archivebox_remove module
- archivebox.cli.archivebox_schedule module
- archivebox.cli.archivebox_server module
- archivebox.cli.archivebox_shell module
- archivebox.cli.archivebox_update module
- archivebox.cli.archivebox_version module
- Module contents
- archivebox.config package
- Submodules
- Module contents
get_real_name()
get_version()
get_commit_hash()
load_config_val()
load_config_file()
write_config_file()
load_config()
stdout()
stderr()
hint()
bin_version()
bin_path()
bin_hash()
find_chrome_binary()
find_chrome_data_dir()
wget_supports_compression()
get_code_locations()
get_external_locations()
get_data_locations()
get_dependency_info()
get_chrome_info()
load_all_config()
check_system_config()
check_dependencies()
check_data_folder()
check_migrations()
TERM_WIDTH()
setup_django()
- archivebox.config.stubs module
BaseConfig
ConfigDict
ConfigDict.IS_TTY
ConfigDict.USE_COLOR
ConfigDict.SHOW_PROGRESS
ConfigDict.IN_DOCKER
ConfigDict.PACKAGE_DIR
ConfigDict.OUTPUT_DIR
ConfigDict.CONFIG_FILE
ConfigDict.ONLY_NEW
ConfigDict.TIMEOUT
ConfigDict.MEDIA_TIMEOUT
ConfigDict.OUTPUT_PERMISSIONS
ConfigDict.RESTRICT_FILE_NAMES
ConfigDict.URL_DENYLIST
ConfigDict.SECRET_KEY
ConfigDict.BIND_ADDR
ConfigDict.ALLOWED_HOSTS
ConfigDict.DEBUG
ConfigDict.PUBLIC_INDEX
ConfigDict.PUBLIC_SNAPSHOTS
ConfigDict.FOOTER_INFO
ConfigDict.SAVE_TITLE
ConfigDict.SAVE_FAVICON
ConfigDict.SAVE_WGET
ConfigDict.SAVE_WGET_REQUISITES
ConfigDict.SAVE_SINGLEFILE
ConfigDict.SAVE_READABILITY
ConfigDict.SAVE_MERCURY
ConfigDict.SAVE_PDF
ConfigDict.SAVE_SCREENSHOT
ConfigDict.SAVE_DOM
ConfigDict.SAVE_WARC
ConfigDict.SAVE_GIT
ConfigDict.SAVE_MEDIA
ConfigDict.SAVE_ARCHIVE_DOT_ORG
ConfigDict.RESOLUTION
ConfigDict.GIT_DOMAINS
ConfigDict.CHECK_SSL_VALIDITY
ConfigDict.CURL_USER_AGENT
ConfigDict.WGET_USER_AGENT
ConfigDict.CHROME_USER_AGENT
ConfigDict.COOKIES_FILE
ConfigDict.CHROME_USER_DATA_DIR
ConfigDict.CHROME_TIMEOUT
ConfigDict.CHROME_HEADLESS
ConfigDict.CHROME_SANDBOX
ConfigDict.USE_CURL
ConfigDict.USE_WGET
ConfigDict.USE_SINGLEFILE
ConfigDict.USE_READABILITY
ConfigDict.USE_MERCURY
ConfigDict.USE_GIT
ConfigDict.USE_CHROME
ConfigDict.USE_YOUTUBEDL
ConfigDict.CURL_BINARY
ConfigDict.GIT_BINARY
ConfigDict.WGET_BINARY
ConfigDict.SINGLEFILE_BINARY
ConfigDict.READABILITY_BINARY
ConfigDict.MERCURY_BINARY
ConfigDict.YOUTUBEDL_BINARY
ConfigDict.CHROME_BINARY
ConfigDict.YOUTUBEDL_ARGS
ConfigDict.WGET_ARGS
ConfigDict.CURL_ARGS
ConfigDict.GIT_ARGS
ConfigDict.TAG_SEPARATOR_PATTERN
ConfigDefault
- archivebox.core package
- Subpackages
- Submodules
- archivebox.core.admin module
ArchiveResultInline
TagInline
AutocompleteTags
AutocompleteTagsAdminStub
SnapshotActionForm
SnapshotAdmin
SnapshotAdmin.list_display
SnapshotAdmin.sort_fields
SnapshotAdmin.readonly_fields
SnapshotAdmin.search_fields
SnapshotAdmin.fields
SnapshotAdmin.list_filter
SnapshotAdmin.ordering
SnapshotAdmin.actions
SnapshotAdmin.autocomplete_fields
SnapshotAdmin.inlines
SnapshotAdmin.list_per_page
SnapshotAdmin.action_form
SnapshotAdmin.get_urls()
SnapshotAdmin.get_queryset()
SnapshotAdmin.tag_list()
SnapshotAdmin.info()
SnapshotAdmin.title_str()
SnapshotAdmin.files()
SnapshotAdmin.size()
SnapshotAdmin.url_str()
SnapshotAdmin.grid_view()
SnapshotAdmin.update_snapshots()
SnapshotAdmin.update_titles()
SnapshotAdmin.resnapshot_snapshot()
SnapshotAdmin.overwrite_snapshots()
SnapshotAdmin.delete_snapshots()
SnapshotAdmin.add_tags()
SnapshotAdmin.remove_tags()
SnapshotAdmin.media
TagAdmin
ArchiveResultAdmin
ArchiveResultAdmin.list_display
ArchiveResultAdmin.sort_fields
ArchiveResultAdmin.readonly_fields
ArchiveResultAdmin.search_fields
ArchiveResultAdmin.fields
ArchiveResultAdmin.autocomplete_fields
ArchiveResultAdmin.list_filter
ArchiveResultAdmin.ordering
ArchiveResultAdmin.list_per_page
ArchiveResultAdmin.snapshot_str()
ArchiveResultAdmin.tags_str()
ArchiveResultAdmin.cmd_str()
ArchiveResultAdmin.output_str()
ArchiveResultAdmin.media
ArchiveBoxAdmin
- archivebox.core.apps module
- archivebox.core.settings module
- archivebox.core.tests module
- archivebox.core.urls module
- archivebox.core.views module
- archivebox.core.welcome_message module
- archivebox.core.wsgi module
- Module contents
- archivebox.extractors package
- Submodules
- archivebox.extractors.archive_org module
- archivebox.extractors.dom module
- archivebox.extractors.favicon module
- archivebox.extractors.git module
- archivebox.extractors.media module
- archivebox.extractors.pdf module
- archivebox.extractors.screenshot module
- archivebox.extractors.title module
- archivebox.extractors.wget module
- Module contents
- archivebox.index package
- Submodules
- archivebox.index.csv module
- archivebox.index.html module
- archivebox.index.json module
- archivebox.index.schema module
ArchiveError
ArchiveResult
ArchiveResult.cmd
ArchiveResult.pwd
ArchiveResult.cmd_version
ArchiveResult.output
ArchiveResult.status
ArchiveResult.start_ts
ArchiveResult.end_ts
ArchiveResult.index_texts
ArchiveResult.schema
ArchiveResult.typecheck()
ArchiveResult.guess_ts()
ArchiveResult.from_json()
ArchiveResult.to_dict()
ArchiveResult.to_json()
ArchiveResult.to_csv()
ArchiveResult.field_names()
ArchiveResult.duration
Link
Link.timestamp
Link.url
Link.title
Link.tags
Link.sources
Link.history
Link.updated
Link.schema
Link.overwrite()
Link.typecheck()
Link.as_snapshot()
Link.from_json()
Link.to_json()
Link.to_csv()
Link.snapshot_id
Link.field_names()
Link.link_dir
Link.archive_path
Link.archive_size
Link.url_hash
Link.scheme
Link.extension
Link.domain
Link.path
Link.basename
Link.base_url
Link.bookmarked_date
Link.updated_date
Link.archive_dates
Link.oldest_archive_date
Link.newest_archive_date
Link.num_outputs
Link.num_failures
Link.is_static
Link.is_archived
Link.latest_outputs()
Link.canonical_outputs()
- archivebox.index.sql module
- Module contents
merge_links()
validate_links()
archivable_links()
fix_duplicate_links()
sorted_links()
links_after_timestamp()
lowest_uniq_timestamp()
timed_index_update()
write_main_index()
load_main_index()
load_main_index_meta()
parse_links_from_source()
fix_duplicate_links_in_index()
dedupe_links()
write_link_details()
load_link_details()
q_filter()
search_filter()
snapshot_filter()
get_indexed_folders()
get_archived_folders()
get_unarchived_folders()
get_present_folders()
get_valid_folders()
get_invalid_folders()
get_duplicate_folders()
get_orphaned_folders()
get_corrupted_folders()
get_unrecognized_folders()
is_valid()
is_corrupt()
is_archived()
is_unarchived()
fix_invalid_folder_locations()
- archivebox.parsers package
- Submodules
- archivebox.parsers.generic_json module
- archivebox.parsers.generic_rss module
- archivebox.parsers.generic_txt module
- archivebox.parsers.medium_rss module
- archivebox.parsers.netscape_html module
- archivebox.parsers.pinboard_rss module
- archivebox.parsers.pocket_html module
- archivebox.parsers.shaarli_rss module
- Module contents
Submodules
archivebox.main module
- archivebox.main.help(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Print the ArchiveBox help message and usage
- archivebox.main.version(quiet: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Print the ArchiveBox version and dependency information
- archivebox.main.run(subcommand: str, subcommand_args: List[str] | None, stdin: IO | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Run a given ArchiveBox subcommand with the given list of args
- archivebox.main.init(force: bool = False, quick: bool = False, setup: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Initialize a new ArchiveBox collection in the current directory
- archivebox.main.status(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Print out some info and statistics about the archive collection
- archivebox.main.oneshot(url: str, extractors: str = '', out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev'))
Create a single URL archive folder with an index.json and index.html, and all the archive method outputs. You can run this to archive single pages without needing to create a whole collection with archivebox init.
- archivebox.main.add(urls: str | List[str], tag: str = '', depth: int = 0, update: bool = False, update_all: bool = False, index_only: bool = False, overwrite: bool = False, init: bool = False, extractors: str = '', parser: str = 'auto', out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> List[Link]
Add a new URL or list of URLs to your archive
- archivebox.main.remove(filter_str: str | None = None, filter_patterns: List[str] | None = None, filter_type: str = 'exact', snapshots: QuerySet | None = None, after: float | None = None, before: float | None = None, yes: bool = False, delete: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> List[Link]
Remove the specified URLs from the archive
- archivebox.main.update(resume: float | None = None, only_new: bool = True, index_only: bool = False, overwrite: bool = False, filter_patterns_str: str | None = None, filter_patterns: List[str] | None = None, filter_type: str | None = None, status: str | None = None, after: str | None = None, before: str | None = None, extractors: str = '', out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> List[Link]
Import any new links from subscriptions and retry any previously failed/skipped links
- archivebox.main.list_all(filter_patterns_str: str | None = None, filter_patterns: List[str] | None = None, filter_type: str = 'exact', status: str | None = None, after: float | None = None, before: float | None = None, sort: str | None = None, csv: str | None = None, json: bool = False, html: bool = False, with_headers: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> Iterable[Link]
List, filter, and export information about archive entries
- archivebox.main.list_links(snapshots: QuerySet | None = None, filter_patterns: List[str] | None = None, filter_type: str = 'exact', after: float | None = None, before: float | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> Iterable[Link]
- archivebox.main.list_folders(links: List[Link], status: str, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> Dict[str, Link | None]
- archivebox.main.setup(out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Automatically install all ArchiveBox dependencies and extras
- archivebox.main.config(config_options_str: str | None = None, config_options: List[str] | None = None, get: bool = False, set: bool = False, reset: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Get and set your ArchiveBox project configuration values
- archivebox.main.schedule(add: bool = False, show: bool = False, clear: bool = False, foreground: bool = False, run_all: bool = False, quiet: bool = False, every: str | None = None, depth: int = 0, overwrite: bool = False, update: bool = False, import_path: str | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev'))
Set ArchiveBox to regularly import URLs at specific times using cron
- archivebox.main.server(runserver_args: List[str] | None = None, reload: bool = False, debug: bool = False, init: bool = False, quick_init: bool = False, createsuperuser: bool = False, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/dev')) -> None
Run the ArchiveBox HTTP server
archivebox.manage module
archivebox.system module
- archivebox.system.run(cmd, *args, input=None, capture_output=True, timeout=None, check=False, text=False, start_new_session=True, **kwargs)
Patched version of subprocess.run that kills forked child subprocesses and fixes the blocking I/O that made timeout= ineffective. Mostly copied from https://github.com/python/cpython/blob/master/Lib/subprocess.py
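The timeout behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the ArchiveBox implementation: run_with_group_timeout is a hypothetical name, and the code assumes a POSIX system where os.killpg is available.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout=None):
    """Run cmd in its own session and kill the whole process group on timeout.

    Hypothetical helper for illustration only; POSIX only (uses os.killpg).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # the child becomes leader of a new process group
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the child and anything it forked, then reap it to avoid a zombie.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited on its own
        proc.communicate()
        raise
    return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
```

Because the child is started in a new session, its process group ID equals its PID, so a single os.killpg call also reaches grandchildren that would otherwise keep the output pipes open and block the parent.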
- archivebox.system.atomic_write(path: Path | str, contents: dict | str | bytes, overwrite: bool = True) -> None
Safe atomic write to filesystem by writing to temp file + atomic rename
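The temp file plus rename technique can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: atomic_write_sketch is a hypothetical name, dict contents are assumed to be serialized as JSON, and overwrite=False is assumed to raise when the target already exists.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_sketch(path, contents, overwrite=True):
    """Write contents to a temp file next to path, then rename it into place."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(path)  # assumed behaviour for overwrite=False
    if isinstance(contents, dict):
        contents = json.dumps(contents, indent=4, sort_keys=True)
    binary = isinstance(contents, bytes)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb" if binary else "w", encoding=None if binary else "utf-8") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk before the rename
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader that opens the target during the write sees either the previous contents or the complete new contents, which is the property an index file needs when a run can be interrupted.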
- archivebox.system.chmod_file(path: str, cwd: str = '.') -> None
chmod -R <permissions> <cwd>/<path>
- archivebox.system.copy_and_overwrite(from_path: str | Path, to_path: str | Path)
copy a given file or directory to a given path, overwriting the destination
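One way to get copy-with-overwrite semantics from the standard library is shown below. It is a sketch only: copy_and_overwrite_sketch is a hypothetical name, and removing the destination tree before copying a directory is an assumption about what "overwriting" means here.

```python
import shutil
from pathlib import Path


def copy_and_overwrite_sketch(from_path, to_path):
    """Copy a file or directory to to_path, replacing whatever is there."""
    from_path, to_path = Path(from_path), Path(to_path)
    if from_path.is_dir():
        # Remove any stale destination tree first so no old files linger.
        shutil.rmtree(to_path, ignore_errors=True)
        shutil.copytree(from_path, to_path)
    else:
        shutil.copyfile(from_path, to_path)  # replaces an existing file
```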
- archivebox.system.get_dir_size(path: str | Path, recursive: bool = True, pattern: str | None = None) -> Tuple[int, int, int]
get the total disk size of a given directory, optionally summing up recursively and limiting to a given filter list
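The signature returns three integers but the docstring does not say what each one is. The sketch below assumes the order (bytes, directories, files) and treats pattern as a substring matched against each entry's path; both are assumptions, and dir_size_sketch is a hypothetical name.

```python
import os


def dir_size_sketch(path, recursive=True, pattern=None):
    """Return (total_bytes, num_dirs, num_files) for the entries under path."""
    num_bytes, num_dirs, num_files = 0, 0, 0
    for entry in os.scandir(path):
        # Assumed filter semantics: skip entries whose path lacks the substring.
        if pattern is not None and pattern not in entry.path:
            continue
        if entry.is_dir(follow_symlinks=False):
            num_dirs += 1
            if recursive:
                sub_bytes, sub_dirs, sub_files = dir_size_sketch(entry.path, recursive, pattern)
                num_bytes += sub_bytes
                num_dirs += sub_dirs
                num_files += sub_files
        elif entry.is_file(follow_symlinks=False):
            num_files += 1
            num_bytes += entry.stat(follow_symlinks=False).st_size
    return num_bytes, num_dirs, num_files
```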
- class archivebox.system.suppress_output(stdout=True, stderr=True)
Bases: object
A context manager for doing a "deep suppression" of stdout and stderr in Python, i.e. it suppresses all printed output, even if the print originates in a compiled C/Fortran sub-function.
This will not suppress raised exceptions, since exceptions are printed to stderr just before a script exits, and after the context manager has exited (at least, I think that is why it lets exceptions through).
with suppress_output():
    rogue_function()
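"Deep" suppression works at the file-descriptor level rather than by swapping sys.stdout, which is why it also silences compiled extensions. A minimal sketch of that idea, assuming descriptors 1 and 2 are the process's stdout and stderr (suppress_fds is a hypothetical name, not the class documented above):

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def suppress_fds(stdout=True, stderr=True):
    """Temporarily point file descriptors 1 and/or 2 at os.devnull."""
    targets = [fd for fd, enabled in ((1, stdout), (2, stderr)) if enabled]
    sys.stdout.flush()
    sys.stderr.flush()
    saved = [os.dup(fd) for fd in targets]  # keep copies of the real descriptors
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        for fd in targets:
            os.dup2(devnull, fd)  # writes from C code now land in devnull
        yield
    finally:
        # Flush Python-level buffers while the redirect is still active.
        sys.stdout.flush()
        sys.stderr.flush()
        for fd, backup in zip(targets, saved):
            os.dup2(backup, fd)  # restore the original descriptor
            os.close(backup)
        os.close(devnull)
```

An exception raised inside the block still propagates: the finally clause restores the descriptors first, so the traceback is printed normally afterwards.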
archivebox.util module
- archivebox.util.detect_encoding(rawdata)
- archivebox.util.scheme(url)
- archivebox.util.without_scheme(url)
- archivebox.util.without_query(url)
- archivebox.util.without_fragment(url)
- archivebox.util.without_path(url)
- archivebox.util.path(url)
- archivebox.util.basename(url)
- archivebox.util.domain(url)
- archivebox.util.query(url)
- archivebox.util.fragment(url)
- archivebox.util.extension(url)
- archivebox.util.base_url(url)
- archivebox.util.without_www(url)
- archivebox.util.without_trailing_slash(url)
- archivebox.util.hashurl(url)
- archivebox.util.urlencode(s)
- archivebox.util.urldecode(s)
- archivebox.util.htmlencode(s)
- archivebox.util.htmldecode(s)
- archivebox.util.short_ts(ts)
- archivebox.util.ts_to_date_str(ts)
- archivebox.util.ts_to_iso(ts)
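The URL helpers listed above carry no docstrings. The sketch below shows how helpers with these names could be built on urllib.parse. The behaviour is inferred from the names alone and is an assumption, not a description of the ArchiveBox implementations.

```python
from urllib.parse import urlparse

# Stand-in implementations; behaviour is guessed from the helper names only.

def scheme(url):
    return urlparse(url).scheme.lower()

def domain(url):
    return urlparse(url).netloc

def path(url):
    return urlparse(url).path

def query(url):
    return urlparse(url).query

def fragment(url):
    return urlparse(url).fragment

def basename(url):
    return urlparse(url).path.rsplit("/", 1)[-1]

def without_query(url):
    return urlparse(url)._replace(query="").geturl()

def without_fragment(url):
    return urlparse(url)._replace(fragment="").geturl()


example = "https://example.com/a/b.html?x=1#top"
print(scheme(example), domain(example), path(example), basename(example))
print(without_query(example))
print(without_fragment(example))
```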
- archivebox.util.enforce_types(func)
Enforce function arg and kwarg types at runtime using its python3 type hints
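Runtime enforcement of type hints can be sketched with inspect.signature. The decorator below is a simplified illustration with a hypothetical name (enforce_simple_types): it only checks annotations that are plain classes and skips typing generics such as List[str], which may differ from what the documented decorator does.

```python
import inspect
from functools import wraps


def enforce_simple_types(func):
    """Check arguments against plain-class annotations on every call."""
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            param = signature.parameters[name]
            if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
                continue  # *args / **kwargs are not checked in this sketch
            expected = param.annotation
            # Skip unannotated parameters and anything that is not a plain class.
            if expected is param.empty or not isinstance(expected, type):
                continue
            if not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}() argument {name!r} expected "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@enforce_simple_types
def repeat(text: str, times: int = 2) -> str:
    return text * times
```

Calling repeat("ab", times="3") raises TypeError before the function body runs, while repeat("ab", times=3) passes through unchanged.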
- archivebox.util.docstring(text: str | None)
attach the given docstring to the decorated function
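A decorator of this shape is useful when the same help text is shared between a function and a command-line parser. A minimal sketch, assuming that a None or empty text leaves the existing docstring alone (with_docstring and help_command are hypothetical names):

```python
def with_docstring(text):
    """Return a decorator that sets func.__doc__ to text when text is given."""
    def decorator(func):
        if text:
            func.__doc__ = text
        return func
    return decorator


SHARED_HELP = "Print the ArchiveBox help message and usage"


@with_docstring(SHARED_HELP)
def help_command():
    pass


@with_docstring(None)
def untouched():
    """Original docstring."""
```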
- archivebox.util.str_between(string: str, start: str, end: str = None) -> str
(<abc>12345</def>, <abc>, </def>) -> 12345
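The one-line example above can be reproduced with str.split and str.rsplit. This is a sketch with a hypothetical name (str_between_sketch). How the documented function treats a missing start or end marker is not stated, so the edge-case behaviour here is an assumption.

```python
from typing import Optional


def str_between_sketch(string: str, start: str, end: Optional[str] = None) -> str:
    """Return the text after the first start marker and before the last end marker."""
    content = string.split(start, 1)[-1]
    if end is not None:
        content = content.rsplit(end, 1)[0]
    return content


print(str_between_sketch("<abc>12345</def>", "<abc>", "</def>"))  # 12345
```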
- archivebox.util.parse_date(date: Any) -> datetime | None
Parse unix timestamps, iso format, and human-readable strings
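Two of the three input kinds named above, unix timestamps and ISO format, can be handled with the standard library. The sketch below covers only those two and is an assumption about the dispatch order. Human-readable strings need a third-party parser and are left out, and parse_date_sketch is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def parse_date_sketch(date: Any) -> Optional[datetime]:
    """Parse None, datetime objects, unix timestamps, and ISO 8601 strings."""
    if date is None:
        return None
    if isinstance(date, datetime):
        return date
    if isinstance(date, (int, float)):
        return datetime.fromtimestamp(date, tz=timezone.utc)
    if isinstance(date, str):
        try:
            # Numeric strings are treated as unix timestamps.
            return datetime.fromtimestamp(float(date), tz=timezone.utc)
        except ValueError:
            # Anything else must be ISO format; raises ValueError otherwise.
            return datetime.fromisoformat(date)
    raise ValueError(f"Tried to parse invalid date: {date!r}")
```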
- archivebox.util.download_url(url: str, timeout: int = None) -> str
Download the contents of a remote url and return the text
- archivebox.util.get_headers(url: str, timeout: int = None) -> str
Download the contents of a remote url and return the headers
- archivebox.util.chrome_args(**options) -> List[str]
helper to build up a chrome shell command with arguments
- archivebox.util.chrome_cleanup()
Cleans up any state or runtime files that chrome leaves behind when killed by a timeout or other error
- archivebox.util.ansi_to_html(text)
Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html
- class archivebox.util.AttributeDict(*args, **kwargs)
Bases: dict
Helper to allow accessing dict values via Example.key or Example['key']
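Attribute-style access on a dict can be sketched by routing attribute lookups to item lookups. The class below is one common way to do it and is not necessarily how AttributeDict is written; AttrDict is a hypothetical name.

```python
class AttrDict(dict):
    """dict subclass whose keys are also readable and writable as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


example = AttrDict(url="https://example.com", depth=0)
example.depth = 1
print(example.url, example["depth"])
```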
- class archivebox.util.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
Bases: JSONEncoder
Extended json serializer that supports serializing several model fields and objects
- default(obj)
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
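Applied to an archive index, the same pattern would extend default to the non-JSON types that show up in link records. The sketch below is an assumption about which types matter (datetimes, paths, bytes, sets) and does not list what ExtendedEncoder actually supports; SketchEncoder is a hypothetical name.

```python
import json
from datetime import datetime
from pathlib import PurePath, PurePosixPath


class SketchEncoder(json.JSONEncoder):
    """JSONEncoder that also handles a few common non-JSON types."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, PurePath):
            return str(obj)
        if isinstance(obj, (bytes, bytearray)):
            return obj.decode("utf-8", errors="replace")
        if isinstance(obj, (set, frozenset)):
            return sorted(obj, key=repr)  # stable order for reproducible output
        # Let the base class raise TypeError for anything else.
        return super().default(obj)


record = {
    "when": datetime(2021, 1, 1),
    "where": PurePosixPath("archive") / "index.json",
    "tags": {"b", "a"},
}
encoded = json.dumps(record, cls=SketchEncoder, sort_keys=True)
```

Passing the class through cls= keeps call sites identical to plain json.dumps, so the extra types are handled in one place.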