`archivebox.cli.archivebox_add`

Module Contents

Functions

`_collect_input_urls`
`add`	Add a new URL or list of URLs to your archive.
`main`	Add a new URL or list of URLs to your archive.

Data

`__command__`

API

archivebox.cli.archivebox_add.__command__[source]: 'archivebox add'

archivebox.cli.archivebox_add._collect_input_urls(args: tuple[str, ...], *, parser: str = 'auto') → list[str][source]

archivebox.cli.archivebox_add.add(urls: str | list[str], snapshot_ids: list[str] | None = None, depth: int | str = 0, max_urls: int = 0, crawl_max_size: int | str = 0, crawl_timeout: int = 0, snapshot_max_size: int | str = 0, crawl_max_concurrent_snapshots: int | None = None, tag: str = '', url_allowlist: str = '', url_denylist: str = '', parser: str = 'auto', plugins: str = '', persona: str = 'Default', index_only: bool = False, bg: bool = False, created_by_id: int | None = None, config: dict[str, Any] | None = None) → tuple[archivebox.crawls.models.Crawl, django.db.models.QuerySet[archivebox.core.models.Snapshot]][source]

Add a new URL or list of URLs to your archive.

The flow is:

1. Save URLs to sources file
2. Create Crawl with URLs and max_depth
3. Crawl runner creates Snapshots from Crawl URLs (depth=0)
4. Crawl runner runs parser extractors on root snapshots
5. Parser extractors output to urls.jsonl
6. URLs are added to Crawl.urls and child Snapshots are created
7. Repeat until max_depth is reached
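
The flow above amounts to a depth-limited, level-by-level expansion of the crawl. The following is a minimal in-memory sketch of that idea only, not ArchiveBox's implementation: `crawl_sketch`, `fake_parser`, and `LINKS` are hypothetical names, the sources file and urls.jsonl are replaced by plain Python containers, and skipping already-seen URLs is an assumption made for the illustration.

```python
from typing import Callable, Dict, Iterable, List


def crawl_sketch(
    root_urls: Iterable[str],
    max_depth: int,
    extract_links: Callable[[str], List[str]],
) -> Dict[str, int]:
    """Return {url: depth} for every snapshot a depth-limited crawl would create.

    Hypothetical helper: stands in for the Crawl runner described above.
    """
    snapshots: Dict[str, int] = {}
    frontier: List[str] = []

    # Root snapshots are created from the crawl's initial URLs at depth 0.
    for url in root_urls:
        if url not in snapshots:
            snapshots[url] = 0
            frontier.append(url)

    depth = 0
    # Run the "parser" on the current level, then create child snapshots,
    # repeating until max_depth is reached or no new URLs turn up.
    while frontier and depth < max_depth:
        next_frontier: List[str] = []
        for url in frontier:
            for child in extract_links(url):
                if child not in snapshots:  # assumption: skip URLs already seen
                    snapshots[child] = depth + 1
                    next_frontier.append(child)
        frontier = next_frontier
        depth += 1

    return snapshots


# Hypothetical link graph standing in for parser extractor output.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
}


def fake_parser(url: str) -> List[str]:
    """Return the outgoing links recorded for a URL (empty if none)."""
    return LINKS.get(url, [])


result = crawl_sketch(["https://example.com/"], max_depth=1, extract_links=fake_parser)
```

With `max_depth=0` only the root snapshot is created, which matches the default `depth=0` in the signature of `add`; each increment of `max_depth` allows one more round of parsing and child creation.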

archivebox.cli.archivebox_add.main(**kwargs)[source]: Add a new URL or list of URLs to your archive.
