archivebox.cli.archivebox_add
Module Contents
Functions
Add a new URL or list of URLs to your archive.
Data
API
- archivebox.cli.archivebox_add.add(urls: str | list[str], depth: int | str = 0, max_urls: int = 0, max_size: int | str = 0, tag: str = '', url_allowlist: str = '', url_denylist: str = '', parser: str = 'auto', plugins: str = '', persona: str = 'Default', overwrite: bool = False, update: bool | None = None, index_only: bool = False, bg: bool = False, created_by_id: int | None = None) -> tuple[archivebox.crawls.models.Crawl, django.db.models.QuerySet[archivebox.core.models.Snapshot]]
Add a new URL or list of URLs to your archive.
The flow is:
1. Save URLs to sources file
2. Create Crawl with URLs and max_depth
3. Crawl runner creates Snapshots from Crawl URLs (depth=0)
4. Crawl runner runs parser extractors on root snapshots
5. Parser extractors output to urls.jsonl
6. URLs are added to Crawl.urls and child Snapshots are created
7. Repeat until max_depth is reached
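
The flow above amounts to a depth-limited, breadth-first expansion of the seed URLs. The following is a minimal sketch of that idea in plain Python, not ArchiveBox's implementation. The names crawl_sketch, extract_links, and FAKE_LINKS are hypothetical and exist only for illustration. The in-memory dict stands in for the Snapshot records, the extract_links callback stands in for the parser extractors and their urls.jsonl output, and skipping already-seen URLs is an assumption of this sketch.

```python
from typing import Callable, Iterable


def crawl_sketch(
    seed_urls: Iterable[str],
    extract_links: Callable[[str], Iterable[str]],
    max_depth: int,
) -> dict[str, int]:
    """Hypothetical illustration of the crawl flow; returns {url: depth}."""
    snapshots: dict[str, int] = {}  # stand-in for Snapshot records
    frontier: list[str] = []

    # Root snapshots are created from the crawl's seed URLs at depth 0.
    for url in seed_urls:
        if url not in snapshots:
            snapshots[url] = 0
            frontier.append(url)

    depth = 0
    # Repeat until max_depth is reached or no new URLs are discovered.
    while frontier and depth < max_depth:
        next_frontier: list[str] = []
        for url in frontier:
            # Stand-in for running parser extractors on a snapshot.
            for child in extract_links(url):
                if child not in snapshots:  # assumption: seen URLs are skipped
                    snapshots[child] = depth + 1
                    next_frontier.append(child)
        frontier = next_frontier
        depth += 1

    return snapshots


# Hypothetical link graph used only to exercise the sketch.
FAKE_LINKS = {
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/d"],
    "https://example.com/c": ["https://example.com/a"],
    "https://example.com/d": ["https://example.com/e"],
}

if __name__ == "__main__":
    result = crawl_sketch(
        ["https://example.com/a"],
        lambda url: FAKE_LINKS.get(url, []),
        max_depth=2,
    )
    for url, depth in result.items():
        print(depth, url)
```

With max_depth=0 only the seed URLs are recorded, which mirrors the default depth of 0 in the signature above.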