archivebox.crawls.models

Module Contents

Classes

CrawlSchedule

Crawl

CrawlMachine

API

class archivebox.crawls.models.CrawlSchedule[source]

Bases: archivebox.base_models.models.ModelWithUUID, archivebox.base_models.models.ModelWithNotes

id[source]

‘UUIDField(…)’

created_at[source]

‘DateTimeField(…)’

created_by[source]

‘ForeignKey(…)’

modified_at[source]

‘DateTimeField(…)’

template: Crawl[source]

‘ForeignKey(…)’

schedule[source]

‘CharField(…)’

is_enabled[source]

‘BooleanField(…)’

label[source]

‘CharField(…)’

notes[source]

‘TextField(…)’

crawl_set: django.db.models.Manager[Crawl][source]

None

class Meta[source]

Bases: archivebox.base_models.models.ModelWithUUID.Meta, archivebox.base_models.models.ModelWithNotes.Meta

app_label[source]

‘crawls’

verbose_name[source]

‘Scheduled Crawl’

verbose_name_plural[source]

‘Scheduled Crawls’

__str__() → str[source]

property api_url: str[source]

save(*args, **kwargs)[source]

property last_run_at[source]

property next_run_at[source]

is_due(now=None) → bool[source]

enqueue(queued_at=None) → archivebox.crawls.models.Crawl[source]

class archivebox.crawls.models.Crawl[source]

Bases: archivebox.base_models.models.ModelWithOutputDir, archivebox.base_models.models.ModelWithConfig, archivebox.base_models.models.ModelWithHealthStats, archivebox.workers.models.ModelWithStateMachine

id[source]

‘UUIDField(…)’

created_at[source]

‘DateTimeField(…)’

created_by[source]

‘ForeignKey(…)’

modified_at[source]

‘DateTimeField(…)’

urls[source]

‘TextField(…)’

config[source]

‘JSONField(…)’

max_depth[source]

‘PositiveSmallIntegerField(…)’

max_urls[source]

‘IntegerField(…)’

max_size[source]

‘BigIntegerField(…)’

tags_str[source]

‘CharField(…)’

persona_id[source]

‘UUIDField(…)’

label[source]

‘CharField(…)’

notes[source]

‘TextField(…)’

schedule[source]

‘ForeignKey(…)’

status[source]

‘StatusField(…)’

retry_at[source]

‘RetryAtField(…)’

state_machine_name[source]

‘archivebox.crawls.models.CrawlMachine’

retry_at_field_name[source]

‘retry_at’

state_field_name[source]

‘status’

StatusChoices[source]

None

active_state[source]

None

schedule_id: uuid.UUID | None[source]

None

sm: CrawlMachine[source]

None

snapshot_set: django.db.models.Manager[archivebox.core.models.Snapshot][source]

None

class Meta[source]

Bases: archivebox.base_models.models.ModelWithOutputDir.Meta, archivebox.base_models.models.ModelWithConfig.Meta, archivebox.base_models.models.ModelWithHealthStats.Meta, archivebox.workers.models.ModelWithStateMachine.Meta

app_label[source]

‘crawls’

verbose_name[source]

‘Crawl’

verbose_name_plural[source]

‘Crawls’

__str__()[source]

save(*args, **kwargs)[source]

property api_url: str[source]

to_json() → dict[source]

Convert Crawl model instance to a JSON-serializable dict.

static from_json(record: dict, overrides: dict | None = None)[source]

Create or get a Crawl from a JSON dict.

Args:

  record: Dict with ‘urls’ (required), optional ‘max_depth’, ‘tags_str’, ‘label’

  overrides: Dict of field overrides (e.g., created_by_id)

Returns: Crawl instance or None if invalid

property output_dir: pathlib.Path[source]

Construct output directory: archive/users/{username}/crawls/{YYYYMMDD}/{domain}/{crawl-id}

Domain is extracted from the first URL in the crawl.
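The path template above can be illustrated with a small standalone sketch. This is not the ArchiveBox implementation: the function name `sketch_output_dir`, the use of `created_at` for the date segment, hostname extraction via `urllib.parse`, and the `"unknown"` fallback are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def sketch_output_dir(username: str, created_at: datetime,
                      first_url: str, crawl_id: str) -> Path:
    """Build archive/users/{username}/crawls/{YYYYMMDD}/{domain}/{crawl-id}."""
    # Assumption: the domain segment is the hostname of the first URL.
    domain = urlparse(first_url).hostname or "unknown"
    # Assumption: the date segment comes from the crawl's creation time.
    day = created_at.strftime("%Y%m%d")
    return Path("archive") / "users" / username / "crawls" / day / domain / crawl_id


example_dir = sketch_output_dir(
    "admin",
    datetime(2024, 1, 31, tzinfo=timezone.utc),
    "https://example.com/some/page",
    "abc123",
)
print(example_dir.as_posix())
```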

get_urls_list() → list[str][source]

Get list of URLs from urls field, filtering out comments and empty lines.
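As a rough illustration of this kind of filtering, the sketch below parses a newline-separated text field. The helper name `parse_url_lines` is hypothetical, and treating lines that start with `#` as comments is an assumption; the actual comment syntax is not stated here.

```python
def parse_url_lines(urls_text: str) -> list[str]:
    """Return the non-empty, non-comment lines of a newline-separated URL field."""
    urls = []
    for line in urls_text.splitlines():
        line = line.strip()
        # Assumption: a line beginning with '#' is a comment.
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


sample = """
# news sites
https://example.com/a

https://example.org/b
"""
print(parse_url_lines(sample))
```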

static normalize_domain(value: str) → str[source]

static split_filter_patterns(value) → list[str][source]

classmethod _pattern_matches_url(url: str, pattern: str) → bool[source]

get_url_allowlist(*, use_effective_config: bool = False, snapshot=None) → list[str][source]

get_url_denylist(*, use_effective_config: bool = False, snapshot=None) → list[str][source]

url_passes_filters(url: str, *, snapshot=None, use_effective_config: bool = True) → bool[source]

set_url_filters(allowlist, denylist) → None[source]

apply_crawl_config_filters() → dict[str, int][source]

_iter_url_lines() → list[tuple[str, str]][source]

count_urls_for_limit() → int[source]

Count unique URLs already queued or snapshotted for this crawl.

max_urls is a crawl-wide cap on snapshots, so direct URL entries and recursively discovered snapshots both have to consume the same budget.

remaining_url_capacity() → int | None[source]

has_remaining_url_capacity() → bool[source]

remaining_snapshot_capacity() → int | None[source]

has_remaining_snapshot_capacity() → bool[source]

prune_urls(predicate) → list[str][source]

prune_url(url: str) → int[source]

exclude_domain(domain: str) → dict[str, int | str | bool][source]

get_system_task() → str | None[source]

resolve_persona()[source]

add_url(entry: dict) → bool[source]

Add a URL to the crawl queue if not already present.

Args:

  entry: dict with ‘url’, optional ‘depth’, ‘title’, ‘timestamp’, ‘tags’, ‘via_snapshot’, ‘plugin’

Returns: True if URL was added, False if skipped (duplicate or depth exceeded)
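The return contract (added vs. skipped as a duplicate or for exceeding the depth limit) can be sketched with an in-memory queue. `UrlQueueSketch` is a hypothetical stand-in, and defaulting a missing ‘depth’ to 0 is an assumption.

```python
class UrlQueueSketch:
    """In-memory stand-in for a crawl's URL queue (not the real model)."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self.entries: list[dict] = []

    def add_url(self, entry: dict) -> bool:
        url = entry["url"]
        depth = entry.get("depth", 0)  # assumption: missing depth means 0
        if depth > self.max_depth:
            return False  # skipped: depth exceeded
        if any(existing["url"] == url for existing in self.entries):
            return False  # skipped: duplicate
        self.entries.append({"url": url, "depth": depth})
        return True


queue = UrlQueueSketch(max_depth=1)
results = [
    queue.add_url({"url": "https://example.com/"}),
    queue.add_url({"url": "https://example.com/"}),
    queue.add_url({"url": "https://example.com/deep", "depth": 2}),
    queue.add_url({"url": "https://example.com/child", "depth": 1}),
]
print(results)
```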

create_snapshots_from_urls() → list[archivebox.core.models.Snapshot][source]

Create Snapshot objects for each URL in self.urls that doesn’t already exist.

Returns: List of newly created Snapshot objects

create_discovered_snapshot(parent_snapshot, *, url: str, depth: int, title: str = '', tags: str = '', created_by_id: int | None = None)[source]

Create one child snapshot if it passes crawl filters and limits.

install_declared_binaries(binary_names: set[str], machine=None) → None[source]

Install crawl-declared Binary rows without violating the retry_at lock lifecycle.

Correct calling pattern:

  1. Crawl hooks declare Binary records and queue them with retry_at <= now

  2. Exactly one actor claims each Binary by moving retry_at into the future

  3. Only that owner executes .sm.tick() and performs install side effects

  4. Everyone else waits for the claimed owner to finish instead of launching a second install against shared state such as the pip or npm trees

This helper follows that contract by claiming each Binary before ticking it, and by waiting when another worker already owns the row. That keeps synchronous crawl execution compatible with the shared background runner and avoids duplicate installs of the same dependency.
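The claim-before-tick contract can be sketched without a database. In the example below, `RowStore` is a hypothetical in-memory stand-in for a table with a `retry_at` column, a `threading.Lock` plays the role of an atomic conditional update, and the 60-second lease is an arbitrary assumption. It is meant only to show why exactly one actor ends up performing the install.

```python
import threading
from datetime import datetime, timedelta, timezone


class RowStore:
    """In-memory stand-in for rows that carry a retry_at timestamp."""

    def __init__(self):
        self._lock = threading.Lock()
        self.retry_at: dict[str, datetime] = {}

    def queue(self, name: str, now: datetime) -> None:
        # Step 1: declare the row and make it due immediately (retry_at <= now).
        with self._lock:
            self.retry_at[name] = now

    def try_claim(self, name: str, now: datetime,
                  lease: timedelta = timedelta(seconds=60)) -> bool:
        # Step 2: claim by moving retry_at into the future; only one caller wins.
        with self._lock:
            due = self.retry_at.get(name)
            if due is None or due > now:
                return False
            self.retry_at[name] = now + lease
            return True


store = RowStore()
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
store.queue("wget", now)
installs: list[int] = []


def worker(worker_id: int) -> None:
    # Step 3: only the owner performs the install side effect.
    if store.try_claim("wget", now):
        installs.append(worker_id)
    # Step 4: everyone else would wait for the owner instead of installing.


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(installs))
```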

run() → Snapshot | None[source]

Execute this Crawl: run hooks, process JSONL, create snapshots.

Called by the state machine when entering the ‘started’ state.

Returns: The root Snapshot for this crawl, or None for system crawls that don’t create snapshots

is_finished() → bool[source]

Check if crawl is finished (all snapshots sealed or no snapshots exist).

cleanup()[source]

Clean up background hooks and run on_CrawlEnd hooks.

class archivebox.crawls.models.CrawlMachine(obj, *args, **kwargs)[source]

Bases: archivebox.workers.models.BaseStateMachine

crawl: archivebox.crawls.models.Crawl[source]

None

State machine for managing Crawl lifecycle.

Hook Lifecycle:

┌─────────────────────────────────────────────────────────────┐
│ QUEUED State                                                │
│  • Waiting for crawl to be ready (has URLs)                 │
└─────────────────────────────────────────────────────────────┘
                    ↓ tick() when can_start()
┌─────────────────────────────────────────────────────────────┐
│ STARTED State → enter_started()                             │
│  1. crawl.run()                                             │
│     • discover_hooks(‘Crawl’) → finds all crawl hooks       │
│     • For each hook:                                        │
│       - run_hook(script, output_dir, …)                     │
│       - Parse JSONL from hook output                        │
│       - process_hook_records() → creates Snapshots          │
│     • create_snapshots_from_urls() → from self.urls field   │
│                                                             │
│  2. Snapshots process independently with their own          │
│     state machines (see SnapshotMachine)                    │
└─────────────────────────────────────────────────────────────┘
                    ↓ tick() when is_finished()
┌─────────────────────────────────────────────────────────────┐
│ SEALED State → enter_sealed()                               │
│  • cleanup() → runs on_CrawlEnd hooks, kills background     │
│  • Set retry_at=None (no more processing)                   │
└─────────────────────────────────────────────────────────────┘
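The queued → started → sealed progression can be sketched in plain Python. `CrawlLifecycleSketch` is a hypothetical stand-in that does not use the real `BaseStateMachine`; the `log` list and the placeholder `retry_at` value are assumptions added so the transitions are observable.

```python
class CrawlLifecycleSketch:
    """Plain-Python stand-in for the queued -> started -> sealed lifecycle."""

    def __init__(self, urls: list[str], snapshot_states: list[str]):
        self.urls = urls
        self.snapshot_states = snapshot_states
        self.state = "queued"
        self.retry_at = 0  # placeholder meaning "due for processing"
        self.log: list[str] = []

    def can_start(self) -> bool:
        # QUEUED waits until the crawl has URLs.
        return bool(self.urls)

    def is_finished(self) -> bool:
        # True when every snapshot is sealed, or when there are none.
        return all(state == "sealed" for state in self.snapshot_states)

    def tick(self) -> str:
        if self.state == "queued" and self.can_start():
            self.state = "started"
            self.log.append("run")      # stands in for crawl.run()
        elif self.state == "started" and self.is_finished():
            self.state = "sealed"
            self.log.append("cleanup")  # stands in for cleanup()
            self.retry_at = None        # no more processing
        return self.state


crawl = CrawlLifecycleSketch(["https://example.com"], ["started"])
first = crawl.tick()                 # queued -> started
second = crawl.tick()                # stays started: a snapshot is unfinished
crawl.snapshot_states[0] = "sealed"
third = crawl.tick()                 # started -> sealed
print(first, second, third, crawl.retry_at)
```

A crawl with no URLs never leaves the queued state in this sketch, because `can_start()` stays false.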

model_attr_name[source]

‘crawl’

queued[source]

‘State(…)’

started[source]

‘State(…)’

sealed[source]

‘State(…)’

tick[source]

None

seal[source]

‘to(…)’

can_start() → bool[source]

is_finished() → bool[source]

Check if all Snapshots for this crawl are finished.

enter_started()[source]
enter_sealed()[source]