archivebox package
Subpackages
- archivebox.cli package
- Submodules
- archivebox.cli.archivebox module
- archivebox.cli.archivebox_add module
- archivebox.cli.archivebox_config module
- archivebox.cli.archivebox_help module
- archivebox.cli.archivebox_info module
- archivebox.cli.archivebox_init module
- archivebox.cli.archivebox_list module
- archivebox.cli.archivebox_manage module
- archivebox.cli.archivebox_remove module
- archivebox.cli.archivebox_schedule module
- archivebox.cli.archivebox_server module
- archivebox.cli.archivebox_shell module
- archivebox.cli.archivebox_update module
- archivebox.cli.archivebox_version module
- archivebox.cli.logging module
- archivebox.cli.tests module
- Module contents
- archivebox.config package
- archivebox.core package
- Subpackages
- Submodules
- archivebox.core.admin module
- archivebox.core.apps module
- archivebox.core.models module
- archivebox.core.settings module
- archivebox.core.tests module
- archivebox.core.urls module
- archivebox.core.views module
- archivebox.core.welcome_message module
- archivebox.core.wsgi module
- Module contents
- archivebox.extractors package
- Submodules
- archivebox.extractors.archive_org module
- archivebox.extractors.dom module
- archivebox.extractors.favicon module
- archivebox.extractors.git module
- archivebox.extractors.media module
- archivebox.extractors.pdf module
- archivebox.extractors.screenshot module
- archivebox.extractors.title module
- archivebox.extractors.wget module
- Module contents
- archivebox.index package
- archivebox.parsers package
- Submodules
- archivebox.parsers.generic_json module
- archivebox.parsers.generic_rss module
- archivebox.parsers.generic_txt module
- archivebox.parsers.medium_rss module
- archivebox.parsers.netscape_html module
- archivebox.parsers.pinboard_rss module
- archivebox.parsers.pocket_html module
- archivebox.parsers.shaarli_rss module
- Module contents
Submodules
archivebox.main module
- archivebox.main.help(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → None
  Print the ArchiveBox help message and usage.
- archivebox.main.version(quiet: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → None
  Print the ArchiveBox version and dependency information.
- archivebox.main.run(subcommand: str, subcommand_args: Optional[List[str]], stdin: Optional[IO] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → None
  Run a given ArchiveBox subcommand with the given list of args.
- archivebox.main.init(force: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → None
  Initialize a new ArchiveBox collection in the current directory.
- archivebox.main.status(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → None
  Print out some info and statistics about the archive collection.
- archivebox.main.oneshot(url: str, extractors: str = '', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs'))
  Create a single-URL archive folder with an index.json, an index.html, and all the archive method outputs. Use this to archive single pages without needing to create a whole collection with archivebox init.
- archivebox.main.add(urls: Union[str, List[str]], depth: int = 0, update_all: bool = False, index_only: bool = False, overwrite: bool = False, init: bool = False, extractors: str = '', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → List[archivebox.index.schema.Link]
  Add a new URL or list of URLs to your archive.
- archivebox.main.remove(filter_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', snapshots: Optional[django.db.models.query.QuerySet] = None, after: Optional[float] = None, before: Optional[float] = None, yes: bool = False, delete: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → List[archivebox.index.schema.Link]
  Remove the specified URLs from the archive.
- archivebox.main.update(resume: Optional[float] = None, only_new: bool = True, index_only: bool = False, overwrite: bool = False, filter_patterns_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: Optional[str] = None, status: Optional[str] = None, after: Optional[str] = None, before: Optional[str] = None, extractors: str = '', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → List[archivebox.index.schema.Link]
  Import any new links from subscriptions and retry any previously failed/skipped links.
- archivebox.main.list_all(filter_patterns_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', status: Optional[str] = None, after: Optional[float] = None, before: Optional[float] = None, sort: Optional[str] = None, csv: Optional[str] = None, json: bool = False, html: bool = False, with_headers: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → Iterable[archivebox.index.schema.Link]
  List, filter, and export information about archive entries.
- archivebox.main.list_links(snapshots: Optional[django.db.models.query.QuerySet] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', after: Optional[float] = None, before: Optional[float] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → Iterable[archivebox.index.schema.Link]
- archivebox.main.list_folders(links: List[archivebox.index.schema.Link], status: str, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → Dict[str, Optional[archivebox.index.schema.Link]]
- archivebox.main.config(config_options_str: Optional[str] = None, config_options: Optional[List[str]] = None, get: bool = False, set: bool = False, reset: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → None
  Get and set your ArchiveBox project configuration values.
- archivebox.main.schedule(add: bool = False, show: bool = False, clear: bool = False, foreground: bool = False, run_all: bool = False, quiet: bool = False, every: Optional[str] = None, depth: int = 0, import_path: Optional[str] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs'))
  Set ArchiveBox to regularly import URLs at specific times using cron.
- archivebox.main.server(runserver_args: Optional[List[str]] = None, reload: bool = False, debug: bool = False, init: bool = False, createsuperuser: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.5.6/docs')) → None
  Run the ArchiveBox HTTP server.
archivebox.manage module
archivebox.system module
- archivebox.system.run(*args, input=None, capture_output=True, text=False, **kwargs)
  Patched version of subprocess.run that fixes blocking I/O, which made timeout= ineffective.
- archivebox.system.atomic_write(path: Union[pathlib.Path, str], contents: Union[dict, str, bytes], overwrite: bool = True) → None
  Safe atomic write to the filesystem: write to a temp file, then atomically rename it into place.
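The temp-file-plus-atomic-rename pattern described above can be sketched with the standard library. This is a simplified illustration, not ArchiveBox's exact implementation (the name `atomic_write_sketch` is invented, and the `overwrite` flag is omitted for brevity):

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Union


def atomic_write_sketch(path: Union[Path, str], contents: Union[dict, str, bytes]) -> None:
    """Write contents to a temp file in the same directory, then atomically rename."""
    path = Path(path)
    mode = 'wb' if isinstance(contents, bytes) else 'w'
    # The temp file must live on the same filesystem as the target
    # for os.replace() to be an atomic rename.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + '.tmp')
    try:
        with os.fdopen(fd, mode) as f:
            if isinstance(contents, dict):
                json.dump(contents, f, indent=4)
            else:
                f.write(contents)
        os.replace(tmp_name, path)  # atomic on POSIX: readers see old or new, never partial
    except BaseException:
        os.unlink(tmp_name)
        raise
```

Because the rename is atomic, a crash mid-write leaves the original file untouched rather than half-written.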
- archivebox.system.chmod_file(path: str, cwd: str = '.', permissions: str = '755') → None
  Equivalent to chmod -R <permissions> <cwd>/<path>.
- archivebox.system.copy_and_overwrite(from_path: Union[str, pathlib.Path], to_path: Union[str, pathlib.Path])
  Copy a given file or directory to a given path, overwriting the destination.
- archivebox.system.get_dir_size(path: Union[str, pathlib.Path], recursive: bool = True, pattern: Optional[str] = None) → Tuple[int, int, int]
  Get the total disk size of a given directory, optionally summing recursively and limiting to entries matching a given filter pattern.
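A recursive directory-size scan of this shape can be sketched with `os.scandir` (the name `get_dir_size_sketch` is invented, and the pattern filter here is a simple substring match, an assumption about the real behavior):

```python
import os
from pathlib import Path
from typing import Optional, Tuple, Union


def get_dir_size_sketch(path: Union[str, Path], recursive: bool = True,
                        pattern: Optional[str] = None) -> Tuple[int, int, int]:
    """Return (total_bytes, num_dirs, num_files), mirroring the documented signature."""
    num_bytes, num_dirs, num_files = 0, 0, 0
    for entry in os.scandir(path):
        # Skip entries that don't match the optional filter pattern.
        if pattern is not None and pattern not in entry.path:
            continue
        if entry.is_dir(follow_symlinks=False):
            num_dirs += 1
            if recursive:
                b, d, f = get_dir_size_sketch(entry.path, recursive=True, pattern=pattern)
                num_bytes, num_dirs, num_files = num_bytes + b, num_dirs + d, num_files + f
        elif entry.is_file(follow_symlinks=False):
            num_files += 1
            num_bytes += entry.stat().st_size
    return num_bytes, num_dirs, num_files
```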
-
class
archivebox.system.
suppress_output
(stdout=True, stderr=True)[source]¶ Bases:
object
A context manager for doing a “deep suppression” of stdout and stderr in Python, i.e. will suppress all print, even if the print originates in a compiled C/Fortran sub-function.
This will not suppress raised exceptions, since exceptions are printedto stderr just before a script exits, and after the context manager has exited (at least, I think that is why it lets exceptions through).
- with suppress_stdout_stderr():
- rogue_function()
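"Deep suppression" works by redirecting the OS-level file descriptors 1 and 2 to /dev/null, so even output written by compiled extensions (which bypass sys.stdout) is silenced. A minimal sketch of the technique, not ArchiveBox's exact code (the name `suppress_output_sketch` is invented):

```python
import os
import sys


class suppress_output_sketch:
    """Redirect OS-level stdout/stderr fds to /dev/null for the duration of the block."""

    def __init__(self, stdout=True, stderr=True):
        self.fds_to_silence = ([1] if stdout else []) + ([2] if stderr else [])

    def __enter__(self):
        # Save duplicates of the real fds so they can be restored on exit.
        self.saved = {fd: os.dup(fd) for fd in self.fds_to_silence}
        self.devnull = os.open(os.devnull, os.O_WRONLY)
        sys.stdout.flush()
        sys.stderr.flush()
        for fd in self.fds_to_silence:
            os.dup2(self.devnull, fd)  # fd now points at /dev/null
        return self

    def __exit__(self, *exc):
        for fd, saved in self.saved.items():
            os.dup2(saved, fd)  # restore the original destination
            os.close(saved)
        os.close(self.devnull)
        return False  # do not swallow exceptions
```

Because the suppression happens at the fd level rather than by swapping `sys.stdout`, it also catches output from C libraries that write directly to file descriptor 1 or 2.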
archivebox.util module
- archivebox.util.detect_encoding(rawdata)
- archivebox.util.scheme(url)
- archivebox.util.without_scheme(url)
- archivebox.util.without_query(url)
- archivebox.util.without_fragment(url)
- archivebox.util.without_path(url)
- archivebox.util.path(url)
- archivebox.util.basename(url)
- archivebox.util.domain(url)
- archivebox.util.query(url)
- archivebox.util.fragment(url)
- archivebox.util.extension(url)
- archivebox.util.base_url(url)
- archivebox.util.without_www(url)
- archivebox.util.without_trailing_slash(url)
- archivebox.util.hashurl(url)
- archivebox.util.urlencode(s)
- archivebox.util.urldecode(s)
- archivebox.util.htmlencode(s)
- archivebox.util.htmldecode(s)
- archivebox.util.short_ts(ts)
- archivebox.util.ts_to_date(ts)
- archivebox.util.ts_to_iso(ts)
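The URL helpers above are the kind of thin wrappers `urllib.parse` makes easy to write. A sketch of a few of them, under the assumption that they behave as their names suggest (the `_sketch` names are invented, not ArchiveBox's implementations):

```python
from urllib.parse import urlparse, urlunparse


def scheme_sketch(url: str) -> str:
    """Return the URL scheme, e.g. 'https'."""
    return urlparse(url).scheme


def domain_sketch(url: str) -> str:
    """Return the host portion of the URL."""
    return urlparse(url).netloc


def without_query_sketch(url: str) -> str:
    """Drop the ?query=... part of the URL."""
    return urlunparse(urlparse(url)._replace(query=''))


def without_fragment_sketch(url: str) -> str:
    """Drop the #fragment part of the URL."""
    return urlunparse(urlparse(url)._replace(fragment=''))


def without_www_sketch(url: str) -> str:
    """Strip a leading 'www.' from the host, if present."""
    p = urlparse(url)
    netloc = p.netloc[4:] if p.netloc.startswith('www.') else p.netloc
    return urlunparse(p._replace(netloc=netloc))
```

Normalizers like these are useful for deduplicating URLs that differ only in tracking parameters, fragments, or a `www.` prefix.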
- archivebox.util.enforce_types(func)
  Enforce function arg and kwarg types at runtime using its Python 3 type hints.
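A decorator that checks call arguments against type hints at runtime can be sketched with `inspect.signature` (a simplified illustration under the name `enforce_types_sketch`; it only validates plain classes, not typing generics like Optional[str], which need more machinery):

```python
import functools
import inspect


def enforce_types_sketch(func):
    """Raise TypeError if a call's arguments don't match the function's type hints."""
    sig = inspect.signature(func)
    hints = func.__annotations__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only check when the annotation is a concrete class.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f'{func.__name__}() argument {name!r} must be '
                    f'{expected.__name__}, got {type(value).__name__}'
                )
        return func(*args, **kwargs)
    return wrapper
```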
- archivebox.util.docstring(text: Optional[str])
  Attach the given docstring to the decorated function.
- archivebox.util.str_between(string: str, start: str, end: str = None) → str
  str_between('<abc>12345</def>', '<abc>', '</def>') -> '12345'
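The substring extraction shown in the docstring example can be sketched with two splits (the name `str_between_sketch` is invented; this is an illustration, not ArchiveBox's code):

```python
from typing import Optional


def str_between_sketch(string: str, start: str, end: Optional[str] = None) -> str:
    """Return the part of `string` after the first `start` and before the last `end`."""
    content = string.split(start, 1)[-1]   # drop everything up to and including `start`
    if end is not None:
        content = content.rsplit(end, 1)[0]  # drop everything from the last `end` onward
    return content
```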
- archivebox.util.parse_date(date: Any) → Optional[datetime.datetime]
  Parse unix timestamps, ISO-format dates, and human-readable strings.
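A permissive date parser of this shape can be sketched with the standard library (the name `parse_date_sketch` is invented; fully human-readable strings like "yesterday" would need a third-party parser and are out of scope here):

```python
from datetime import datetime
from typing import Any, Optional


def parse_date_sketch(date: Any) -> Optional[datetime]:
    """Accept None, datetimes, unix timestamps (numbers or numeric strings), and ISO strings."""
    if date is None:
        return None
    if isinstance(date, datetime):
        return date
    if isinstance(date, (int, float)):
        return datetime.fromtimestamp(float(date))
    if isinstance(date, str):
        s = date.strip()
        # A purely numeric string is treated as a unix timestamp.
        if s.replace('.', '', 1).isdigit():
            return datetime.fromtimestamp(float(s))
        return datetime.fromisoformat(s)
    raise ValueError(f'Unrecognized date value: {date!r}')
```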
- archivebox.util.download_url(url: str, timeout: int = None) → str
  Download the contents of a remote URL and return the text.
- archivebox.util.get_headers(url: str, timeout: int = None) → str
  Fetch a remote URL and return the response headers.
- archivebox.util.chrome_args(**options) → List[str]
  Helper to build up a Chrome shell command with arguments.
- archivebox.util.ansi_to_html(text)
  Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html
- class archivebox.util.AttributeDict(*args, **kwargs)
  Bases: dict
  Helper that allows accessing dict values via Example.key as well as Example['key'].
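The attribute-access-on-a-dict pattern takes only two dunder methods. A minimal sketch, assuming the simplest possible behavior (the name `AttributeDictSketch` is invented):

```python
class AttributeDictSketch(dict):
    """A dict whose values can also be read and written as attributes."""

    def __getattr__(self, key):
        # __getattr__ is only called when normal attribute lookup fails,
        # so real attributes and methods still work as usual.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value
```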
- class archivebox.util.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
  Bases: json.encoder.JSONEncoder
  Extended JSON serializer that supports serializing several model fields and objects.
  - default(obj)
    Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError). For example, to support arbitrary iterators, you could implement default like this:
        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)
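The default() hook described above is how JSONEncoder subclasses add support for extra types. A usage sketch handling datetimes and sets (an illustration of the pattern, not ArchiveBox's exact field support; `ExtendedEncoderSketch` is an invented name):

```python
import json
from datetime import datetime


class ExtendedEncoderSketch(json.JSONEncoder):
    """Serialize types the stock encoder rejects by overriding default()."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, (set, frozenset)):
            return sorted(obj)
        # Fall back to the base implementation, which raises TypeError.
        return super().default(obj)
```

Pass the subclass via the `cls` argument, e.g. `json.dumps(data, cls=ExtendedEncoderSketch)`; the encoder only calls default() for values it can't already serialize.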