archivebox.crawls.modelsο
Module Contentsο
Classesο
APIο
- class archivebox.crawls.models.CrawlSchedule[source]ο
Bases:
archivebox.base_models.models.ModelWithUUID,archivebox.base_models.models.ModelWithNotes- class Meta[source]ο
Bases:
archivebox.base_models.models.ModelWithUUID.Meta,archivebox.base_models.models.ModelWithNotes.Meta
- enqueue(queued_at=None) archivebox.crawls.models.Crawl[source]ο
- class archivebox.crawls.models.Crawl[source]ο
Bases:
archivebox.base_models.models.ModelWithOutputDir,archivebox.base_models.models.ModelWithConfig,archivebox.base_models.models.ModelWithHealthStats,archivebox.workers.models.ModelWithStateMachine- sm: CrawlMachine[source]ο
None
- snapshot_set: django.db.models.Manager[archivebox.core.models.Snapshot][source]ο
None
- class Meta[source]ο
Bases:
archivebox.base_models.models.ModelWithOutputDir.Meta,archivebox.base_models.models.ModelWithConfig.Meta,archivebox.base_models.models.ModelWithHealthStats.Meta,archivebox.workers.models.ModelWithStateMachine.Meta
- static from_json(record: dict, overrides: dict | None = None)[source]ο
Create or get a Crawl from a JSON dict.
Args: record: Dict with βurlsβ (required), optional βmax_depthβ, βtags_strβ, βlabelβ overrides: Dict of field overrides (e.g., created_by_id)
Returns: Crawl instance or None if invalid
- property output_dir: pathlib.Path[source]ο
Construct output directory: archive/users/{username}/crawls/{YYYYMMDD}/{domain}/{crawl-id} Domain is extracted from the first URL in the crawl.
- get_urls_list() list[str][source]ο
Get list of URLs from urls field, filtering out comments and empty lines.
- count_urls_for_limit() int[source]ο
Count unique URLs already queued or snapshotted for this crawl.
max_urls is a crawl-wide cap on snapshots, so direct URL entries and recursively discovered snapshots both have to consume the same budget.
- add_url(entry: dict) bool[source]ο
Add a URL to the crawl queue if not already present.
Args: entry: dict with βurlβ, optional βdepthβ, βtitleβ, βtimestampβ, βtagsβ, βvia_snapshotβ, βpluginβ
Returns: True if URL was added, False if skipped (duplicate or depth exceeded)
- create_snapshots_from_urls() list[archivebox.core.models.Snapshot][source]ο
Create Snapshot objects for each URL in self.urls that doesnβt already exist.
Returns: List of newly created Snapshot objects
- create_discovered_snapshot(parent_snapshot, *, url: str, depth: int, title: str = '', tags: str = '', created_by_id: int | None = None)[source]ο
Create one child snapshot if it passes crawl filters and limits.
- install_declared_binaries(binary_names: set[str], machine=None) None[source]ο
Install crawl-declared Binary rows without violating the retry_at lock lifecycle.
Correct calling pattern:
Crawl hooks declare Binary records and queue them with retry_at <= now
Exactly one actor claims each Binary by moving retry_at into the future
Only that owner executes
.sm.tick()and performs install side effectsEveryone else waits for the claimed owner to finish instead of launching a second install against shared state such as the pip or npm trees
This helper follows that contract by claiming each Binary before ticking it, and by waiting when another worker already owns the row. That keeps synchronous crawl execution compatible with the shared background runner and avoids duplicate installs of the same dependency.
- run() Snapshot | None[source]ο
Execute this Crawl: run hooks, process JSONL, create snapshots.
Called by the state machine when entering the βstartedβ state.
Returns: The root Snapshot for this crawl, or None for system crawls that donβt create snapshots
- class archivebox.crawls.models.CrawlMachine(obj, *args, **kwargs)[source]ο
Bases:
archivebox.workers.models.BaseStateMachine- crawl: archivebox.crawls.models.Crawl[source]ο
None
State machine for managing Crawl lifecycle.
Hook Lifecycle: βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β QUEUED State β β β’ Waiting for crawl to be ready (has URLs) β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β tick() when can_start() βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β STARTED State β enter_started() β β 1. crawl.run() β β β’ discover_hooks(βCrawlβ) β finds all crawl hooks β β β’ For each hook: β β - run_hook(script, output_dir, β¦) β β - Parse JSONL from hook output β β - process_hook_records() β creates Snapshots β β β’ create_snapshots_from_urls() β from self.urls field β β β β 2. Snapshots process independently with their own β β state machines (see SnapshotMachine) β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β tick() when is_finished() βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β SEALED State β enter_sealed() β β β’ cleanup() β runs on_CrawlEnd hooks, kills background β β β’ Set retry_at=None (no more processing) β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ