archivebox.cli.archivebox_add

Module Contents

Functions

_collect_input_urls

add

Add a new URL or list of URLs to your archive.

main

Add a new URL or list of URLs to your archive.

Data

__command__

API

archivebox.cli.archivebox_add.__command__

'archivebox add'

archivebox.cli.archivebox_add._collect_input_urls(args: tuple[str, ...]) -> list[str]
archivebox.cli.archivebox_add.add(urls: str | list[str], depth: int | str = 0, max_urls: int = 0, max_size: int | str = 0, tag: str = '', url_allowlist: str = '', url_denylist: str = '', parser: str = 'auto', plugins: str = '', persona: str = 'Default', overwrite: bool = False, update: bool | None = None, index_only: bool = False, bg: bool = False, created_by_id: int | None = None) -> tuple[archivebox.crawls.models.Crawl, django.db.models.QuerySet[archivebox.core.models.Snapshot]]

Add a new URL or list of URLs to your archive.

The flow is:

  1. Save URLs to sources file

  2. Create Crawl with URLs and max_depth

  3. Crawl runner creates Snapshots from Crawl URLs (depth=0)

  4. Crawl runner runs parser extractors on root snapshots

  5. Parser extractors output to urls.jsonl

  6. URLs are added to Crawl.urls and child Snapshots are created

  7. Repeat until max_depth is reached
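
The depth-limited loop in steps 3 to 7 can be illustrated with a small, self-contained sketch. This is not ArchiveBox code: `crawl_sketch`, `fake_extract_links`, and the `LINKS` table are hypothetical names invented for illustration, and the real crawl runner, parser extractors, and `urls.jsonl` handling are replaced here by an in-memory lookup. It only models the idea that root URLs start at depth 0 and discovered URLs become children one level deeper until `max_depth` is reached.

```python
from collections import deque

# Hypothetical link table standing in for what parser extractors would discover.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
}


def fake_extract_links(url):
    """Stand-in for a parser extractor: return the URLs found on a page."""
    return LINKS.get(url, [])


def crawl_sketch(root_urls, extract_links, max_depth=0):
    """Toy model of the flow: return a dict mapping each URL to its depth."""
    snapshots = {}  # url -> depth at which it was first seen
    queue = deque((url, 0) for url in root_urls)  # root URLs start at depth 0
    while queue:
        url, depth = queue.popleft()
        if url in snapshots:
            continue  # already recorded, skip duplicates
        snapshots[url] = depth
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for child in extract_links(url):
            if child not in snapshots:
                queue.append((child, depth + 1))
    return snapshots


result = crawl_sketch(["https://example.com/"], fake_extract_links, max_depth=1)
for url, depth in result.items():
    print(depth, url)
```

With `max_depth=1` the sketch records the root page and its two direct links but does not follow `https://example.com/a` any further, so `https://example.com/c` is never added.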

archivebox.cli.archivebox_add.main(**kwargs)

Add a new URL or list of URLs to your archive.