archivebox.machine.models

Module Contents

Classes

MachineManager

Machine

NetworkInterfaceManager

NetworkInterface

BinaryManager

Binary

Tracks a binary on a specific machine.

ProcessManager

Manager for Process model.

Process

Tracks a single OS process execution.

BinaryMachine

State machine for managing Binary installation lifecycle.

ProcessMachine

State machine for managing Process (OS subprocess) lifecycle.

Functions

_find_existing_binary_for_reference

_get_process_binary_env_keys

_sanitize_machine_config

Data

_psutil

_CURRENT_MACHINE

_CURRENT_INTERFACE

_CURRENT_BINARIES

_CURRENT_PROCESS

MACHINE_RECHECK_INTERVAL

NETWORK_INTERFACE_RECHECK_INTERVAL

BINARY_RECHECK_INTERVAL

PROCESS_RECHECK_INTERVAL

PID_REUSE_WINDOW

PROCESS_TIMEOUT_GRACE

START_TIME_TOLERANCE

LEGACY_MACHINE_CONFIG_KEYS

API

archivebox.machine.models._psutil: Any | None[source]

None

archivebox.machine.models._CURRENT_MACHINE: archivebox.machine.models.Machine | None[source]

None

archivebox.machine.models._CURRENT_INTERFACE: archivebox.machine.models.NetworkInterface | None[source]

None

archivebox.machine.models._CURRENT_BINARIES: dict[str, archivebox.machine.models.Binary][source]

None

archivebox.machine.models._CURRENT_PROCESS: archivebox.machine.models.Process | None[source]

None

archivebox.machine.models.MACHINE_RECHECK_INTERVAL[source]

None

archivebox.machine.models.NETWORK_INTERFACE_RECHECK_INTERVAL[source]

None

archivebox.machine.models.BINARY_RECHECK_INTERVAL[source]

None

archivebox.machine.models.PROCESS_RECHECK_INTERVAL[source]

60

archivebox.machine.models.PID_REUSE_WINDOW[source]

'timedelta(…)'

archivebox.machine.models.PROCESS_TIMEOUT_GRACE[source]

'timedelta(…)'

archivebox.machine.models.START_TIME_TOLERANCE[source]

5.0

archivebox.machine.models.LEGACY_MACHINE_CONFIG_KEYS[source]

'frozenset(…)'

archivebox.machine.models._find_existing_binary_for_reference(machine: Machine, reference: str) Binary | None[source]
archivebox.machine.models._get_process_binary_env_keys(plugin_name: str, hook_path: str, env: dict[str, Any] | None) list[str][source]
archivebox.machine.models._sanitize_machine_config(config: dict[str, Any] | None) dict[str, Any][source]
class archivebox.machine.models.MachineManager[source]

Bases: django.db.models.Manager

current() archivebox.machine.models.Machine[source]
class archivebox.machine.models.Machine[source]

Bases: archivebox.base_models.models.ModelWithHealthStats

id[source]

'UUIDField(…)'

created_at[source]

'DateTimeField(…)'

modified_at[source]

'DateTimeField(…)'

guid[source]

'CharField(…)'

hostname[source]

'CharField(…)'

hw_in_docker[source]

'BooleanField(…)'

hw_in_vm[source]

'BooleanField(…)'

hw_manufacturer[source]

'CharField(…)'

hw_product[source]

'CharField(…)'

hw_uuid[source]

'CharField(…)'

os_arch[source]

'CharField(…)'

os_family[source]

'CharField(…)'

os_platform[source]

'CharField(…)'

os_release[source]

'CharField(…)'

os_kernel[source]

'CharField(…)'

stats[source]

'JSONField(…)'

config[source]

'JSONField(…)'

num_uses_failed[source]

'PositiveIntegerField(…)'

num_uses_succeeded[source]

'PositiveIntegerField(…)'

objects[source]

'MachineManager(…)'

networkinterface_set: django.db.models.Manager[NetworkInterface][source]

None

class Meta[source]

Bases: archivebox.base_models.models.ModelWithHealthStats.Meta

app_label[source]

'machine'

classmethod current() archivebox.machine.models.Machine[source]
classmethod _sanitize_config(machine: archivebox.machine.models.Machine) archivebox.machine.models.Machine[source]
to_json() dict[source]

Convert Machine model instance to a JSON-serializable dict.

static from_json(record: dict[str, Any], overrides: dict[str, Any] | None = None)[source]

Update Machine config from JSON dict.

Args:
  record: JSON dict with 'config': {key: value} patch
  overrides: Not used

Returns: Machine instance or None

class archivebox.machine.models.NetworkInterfaceManager[source]

Bases: django.db.models.Manager

current() archivebox.machine.models.NetworkInterface[source]
class archivebox.machine.models.NetworkInterface[source]

Bases: archivebox.base_models.models.ModelWithHealthStats

id[source]

'UUIDField(…)'

created_at[source]

'DateTimeField(…)'

modified_at[source]

'DateTimeField(…)'

machine[source]

'ForeignKey(…)'

mac_address[source]

'CharField(…)'

ip_public[source]

'GenericIPAddressField(…)'

ip_local[source]

'GenericIPAddressField(…)'

dns_server[source]

'GenericIPAddressField(…)'

hostname[source]

'CharField(…)'

iface[source]

'CharField(…)'

isp[source]

'CharField(…)'

city[source]

'CharField(…)'

region[source]

'CharField(…)'

country[source]

'CharField(…)'

objects[source]

'NetworkInterfaceManager(…)'

machine_id: uuid.UUID[source]

None

class Meta[source]

Bases: archivebox.base_models.models.ModelWithHealthStats.Meta

app_label[source]

'machine'

unique_together[source]

(('machine', 'ip_public', 'ip_local', 'mac_address', 'dns_server'),)

classmethod current(refresh: bool = False) archivebox.machine.models.NetworkInterface[source]
class archivebox.machine.models.BinaryManager[source]

Bases: django.db.models.Manager

get_from_db_or_cache(name: str, abspath: str = '', version: str = '', sha256: str = '', binprovider: str = 'env') archivebox.machine.models.Binary[source]

Get or create a Binary record from the database or cache.

get_valid_binary(name: str, machine: archivebox.machine.models.Machine | None = None) archivebox.machine.models.Binary | None[source]

Get a valid Binary for the given name on the current machine, or None if not found.

class archivebox.machine.models.Binary[source]

Bases: archivebox.base_models.models.ModelWithHealthStats, archivebox.workers.models.ModelWithStateMachine

Tracks a binary on a specific machine.

Simple state machine with 2 states:

  • queued: Binary needs to be installed

  • installed: Binary installed successfully (abspath, version, sha256 populated)

Installation is synchronous during queued→installed transition. If installation fails, Binary stays in queued with retry_at set for later retry.

State machine calls run() which executes on_BinaryRequest__* hooks to install the binary using the specified providers.
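The queued→installed lifecycle described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not ArchiveBox's implementation: `BinarySketch`, `tick`, the `installer` callable, and the 60-second `RETRY_DELAY` are hypothetical stand-ins that do not appear in this module.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

# Hypothetical retry delay; the real value is not given in this document.
RETRY_DELAY = timedelta(seconds=60)


@dataclass
class BinarySketch:
    """Stand-in for a Binary record with only the fields the lifecycle needs."""
    name: str
    status: str = "queued"
    abspath: str = ""
    retry_at: Optional[datetime] = None


def tick(binary: BinarySketch,
         installer: Callable[[str], Optional[str]],
         now: Optional[datetime] = None) -> str:
    """Attempt the queued -> installed transition once, synchronously."""
    now = now or datetime.now(timezone.utc)
    if binary.status != "queued":
        return binary.status
    # The installer returns a path on success and None on failure.
    abspath = installer(binary.name)
    if abspath:
        binary.abspath = abspath
        binary.status = "installed"
        binary.retry_at = None
    else:
        # Failure: stay queued and schedule a later retry.
        binary.retry_at = now + RETRY_DELAY
    return binary.status
```

A failed install leaves the record in `queued` with `retry_at` pushed into the future, so a later tick can try again.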

class StatusChoices[source]

Bases: django.db.models.TextChoices

QUEUED[source]

('queued', 'Queued')

INSTALLED[source]

('installed', 'Installed')

id[source]

'UUIDField(…)'

created_at[source]

'DateTimeField(…)'

modified_at[source]

'DateTimeField(…)'

machine[source]

'ForeignKey(…)'

name[source]

'CharField(…)'

binproviders[source]

'CharField(…)'

overrides[source]

'JSONField(…)'

binprovider[source]

'CharField(…)'

abspath[source]

'CharField(…)'

version[source]

'CharField(…)'

sha256[source]

'CharField(…)'

status[source]

'StatusField(…)'

retry_at[source]

'RetryAtField(…)'

num_uses_failed[source]

'PositiveIntegerField(…)'

num_uses_succeeded[source]

'PositiveIntegerField(…)'

machine_id: uuid.UUID[source]

None

state_machine_name: str | None[source]

'archivebox.machine.models.BinaryMachine'

active_state: str[source]

None

objects[source]

'BinaryManager(…)'

class Meta[source]

Bases: archivebox.base_models.models.ModelWithHealthStats.Meta, archivebox.workers.models.ModelWithStateMachine.Meta

app_label[source]

'machine'

verbose_name[source]

'Binary'

verbose_name_plural[source]

'Binaries'

unique_together[source]

(('machine', 'name', 'abspath', 'version', 'sha256'),)

__str__() str[source]
property is_valid: bool[source]

A binary is valid if it has a resolved path and is marked installed.

binary_info() dict[source]

Return info about the binary.

property output_dir: pathlib.Path[source]

Get output directory for this binary's hook logs. Path: data/machines/{machine_uuid}/binaries/{binary_name}/{binary_uuid}

to_json() dict[source]

Convert Binary model instance to a JSON-serializable dict.

static from_json(record: dict[str, Any], overrides: dict[str, Any] | None = None)[source]

Create/update Binary from JSON dict.

Handles two cases:

  1. From binaries.json: creates queued binary with name, binproviders, overrides

  2. From hook output: updates binary with abspath, version, sha256, binprovider

Args:
  record: JSON dict with 'name' and either:
    - 'binproviders', 'overrides' (from binaries.json)
    - 'abspath', 'version', 'sha256', 'binprovider' (from hook output)
  overrides: Not used

Returns: Binary instance or None

update_and_requeue(**kwargs) bool[source]

Update binary fields and requeue for worker state machine.

Sets modified_at to ensure workers pick up changes. Always saves the model after updating.

_allowed_binproviders() set[str] | None[source]

Return the allowed binproviders for this binary, or None for wildcard.

run()[source]

Execute binary installation by running on_BinaryRequest__* hooks.

Called by BinaryMachine when entering 'started' state. Runs ALL on_BinaryRequest__* hooks - each hook checks binproviders and decides if it can handle this binary. First hook to succeed wins. Updates status to SUCCEEDED or FAILED based on hook output.

cleanup()[source]

Clean up background binary installation hooks.

Called by state machine if needed (not typically used for binaries since installations are foreground, but included for consistency).

Symlink this binary into LIB_BIN_DIR for unified PATH management.

After a binary is installed by any binprovider (pip, npm, brew, apt, etc), we symlink it into LIB_BIN_DIR so that:

  1. All binaries can be found in a single directory

  2. PATH only needs LIB_BIN_DIR prepended (not multiple provider-specific paths)

  3. Binary priorities are clear (symlink points to the canonical install location)

Args:
  lib_bin_dir: Path to LIB_BIN_DIR (e.g., /data/lib/arm64-darwin/bin)

Returns: Path to the created symlink, or None if symlinking failed

Example:
  >>> binary = Binary.objects.get(name='yt-dlp')
  >>> binary.symlink_to_lib_bin('/data/lib/arm64-darwin/bin')
  Path('/data/lib/arm64-darwin/bin/yt-dlp')

class archivebox.machine.models.ProcessManager[source]

Bases: django.db.models.Manager

Manager for Process model.

current() archivebox.machine.models.Process[source]

Get the Process record for the current OS process.

get_by_pid(pid: int, machine: archivebox.machine.models.Machine | None = None) archivebox.machine.models.Process | None[source]

Find a Process by PID with proper validation against PID reuse.

IMPORTANT: PIDs are reused by the OS! This method:

  1. Filters by machine (required - PIDs are only unique per machine)

  2. Filters by time window (processes older than 24h are stale)

  3. Validates via psutil that start times match

Args:
  pid: OS process ID
  machine: Machine instance (defaults to current machine)

Returns: Process if found and validated, None otherwise
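The time-window and start-time checks above can be sketched with the standard library alone. This is an illustrative sketch, not the module's code: `pid_record_is_valid` is a hypothetical helper, the 24-hour window is taken from the description above, and the 5.0 tolerance mirrors `START_TIME_TOLERANCE` on the assumption that it is measured in seconds. In the real method the OS start time comes from psutil; here it is passed in as a plain argument.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed values: 24h is quoted in the text above; 5.0 mirrors
# START_TIME_TOLERANCE, assuming the unit is seconds.
REUSE_WINDOW = timedelta(hours=24)
TOLERANCE_SECONDS = 5.0


def pid_record_is_valid(recorded_start: datetime,
                        os_start: Optional[datetime],
                        now: datetime) -> bool:
    """Return True only if a DB record plausibly describes the live OS process."""
    if os_start is None:
        # No OS process currently holds this PID.
        return False
    if now - recorded_start > REUSE_WINDOW:
        # Record is too old; the PID may have been recycled since.
        return False
    # Same PID, but is it the same process? Compare start times.
    drift = abs((os_start - recorded_start).total_seconds())
    return drift <= TOLERANCE_SECONDS
```

A record whose start time differs from the OS process by more than the tolerance is treated as belonging to a different process that happens to reuse the PID.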

create_for_archiveresult(archiveresult, **kwargs)[source]

Create a Process record for an ArchiveResult.

Called during migration and when creating new ArchiveResults.

class archivebox.machine.models.Process[source]

Bases: django.db.models.Model

Tracks a single OS process execution.

Process represents the actual subprocess spawned to execute a hook. One Process can optionally be associated with an ArchiveResult (via OneToOne), but Process can also exist standalone for internal operations.

Follows the unified state machine pattern:

  • queued: Process ready to launch

  • running: Process actively executing

  • exited: Process completed (check exit_code for success/failure)

State machine calls launch() to spawn the process and monitors its lifecycle.

class StatusChoices[source]

Bases: django.db.models.TextChoices

QUEUED[source]

('queued', 'Queued')

RUNNING[source]

('running', 'Running')

EXITED[source]

('exited', 'Exited')

class TypeChoices[source]

Bases: django.db.models.TextChoices

SUPERVISORD[source]

('supervisord', 'Supervisord')

ORCHESTRATOR[source]

('orchestrator', 'Orchestrator')

WORKER[source]

('worker', 'Worker')

CLI[source]

('cli', 'CLI')

HOOK[source]

('hook', 'Hook')

BINARY[source]

('binary', 'Binary')

id[source]

'UUIDField(…)'

created_at[source]

'DateTimeField(…)'

modified_at[source]

'DateTimeField(…)'

machine[source]

'ForeignKey(…)'

parent[source]

'ForeignKey(…)'

process_type[source]

'CharField(…)'

worker_type[source]

'CharField(…)'

pwd[source]

'CharField(…)'

cmd[source]

'JSONField(…)'

env[source]

'JSONField(…)'

timeout[source]

'IntegerField(…)'

pid[source]

'IntegerField(…)'

exit_code[source]

'IntegerField(…)'

stdout[source]

'TextField(…)'

stderr[source]

'TextField(…)'

started_at[source]

'DateTimeField(…)'

ended_at[source]

'DateTimeField(…)'

binary[source]

'ForeignKey(…)'

iface[source]

'ForeignKey(…)'

url[source]

'URLField(…)'

status[source]

'CharField(…)'

retry_at[source]

'DateTimeField(…)'

machine_id: uuid.UUID[source]

None

parent_id: uuid.UUID | None[source]

None

binary_id: uuid.UUID | None[source]

None

children: django.db.models.Manager[archivebox.machine.models.Process][source]

None

archiveresult: archivebox.core.models.ArchiveResult[source]

None

state_machine_name: str[source]

'archivebox.machine.models.ProcessMachine'

objects[source]

'ProcessManager(…)'

class Meta[source]

Bases: django_stubs_ext.db.models.TypedModelMeta

app_label[source]

'machine'

verbose_name[source]

'Process'

verbose_name_plural[source]

'Processes'

indexes[source]

None

__str__() str[source]
property cmd_version: str[source]

Get version from associated binary.

property bin_abspath: str[source]

Get absolute path from associated binary.

property plugin: str[source]

Get plugin name from associated ArchiveResult (if any).

property hook_name: str[source]

Get hook name from associated ArchiveResult (if any).

to_json() dict[source]

Convert Process model instance to a JSON-serializable dict.

hydrate_binary_from_context(*, plugin_name: str = '', hook_path: str = '') archivebox.machine.models.Binary | None[source]
classmethod parse_records_from_text(text: str) list[dict][source]

Parse JSONL records from raw text using the shared JSONL parser.
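The shared JSONL parser itself is not documented here, so the following is only a guess at the general technique: scan the text line by line, keep lines that decode to JSON objects, and skip everything else (for example plain log output interleaved with records). `parse_jsonl` is a hypothetical name, and the skip-on-error behaviour is an assumption.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Collect every line of `text` that decodes to a JSON object."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            # Not a JSON object; treat as ordinary log output.
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: malformed lines are ignored rather than raised.
            continue
        if isinstance(obj, dict):
            records.append(obj)
    return records
```

Tolerating non-JSON lines matters when records are read from a process's stdout, where hooks may print progress messages alongside structured output.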

get_records() list[dict][source]

Parse JSONL records from this process's stdout.

static from_json(record: dict[str, Any], overrides: dict[str, Any] | None = None)[source]

Create/update Process from JSON dict.

Args:
  record: JSON dict with 'id' or process details
  overrides: Optional dict of field overrides

Returns: Process instance or None

update_and_requeue(**kwargs) bool[source]

Update process fields and requeue for worker state machine. Sets modified_at to ensure workers pick up changes.

classmethod current() archivebox.machine.models.Process[source]

Get or create the Process record for the current OS process.

Similar to Machine.current(), this:

  1. Checks cache for existing Process with matching PID

  2. Validates the cached Process is still valid (PID not reused)

  3. Creates new Process if needed

IMPORTANT: Uses psutil to validate that the PID hasn't been reused. PIDs are recycled by the OS, so we compare start times.

classmethod _find_parent_process(machine: archivebox.machine.models.Machine | None = None) archivebox.machine.models.Process | None[source]

Find the parent Process record by looking up PPID.

IMPORTANT: Validates against PID reuse by checking:

  1. Same machine (PIDs are only unique per machine)

  2. Start time matches OS process start time

  3. Process is still RUNNING and recent

Returns None if parent is not an ArchiveBox process.

classmethod _detect_process_type() str[source]

Detect the type of the current process from sys.argv.

classmethod cleanup_stale_running(machine: archivebox.machine.models.Machine | None = None) int[source]

Mark stale RUNNING processes as EXITED in the DB.

Processes are stale if:

  • Status is RUNNING but OS process no longer exists

  • Status is RUNNING but exceeded its timeout plus a small grace margin

  • Status is RUNNING but started_at is older than PID_REUSE_WINDOW

Returns count of processes cleaned up.
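The three staleness conditions can be written as a single predicate. This sketch is illustrative only: `is_stale` is a hypothetical helper, and both constants are assumed values, since the real `PID_REUSE_WINDOW` and `PROCESS_TIMEOUT_GRACE` are elided in this document (24 hours is borrowed from the `get_by_pid` description, and 30 seconds is invented for the example).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed values; the real constants are shown only as timedelta(…) above.
ASSUMED_PID_REUSE_WINDOW = timedelta(hours=24)
ASSUMED_TIMEOUT_GRACE = timedelta(seconds=30)


def is_stale(started_at: datetime,
             timeout_seconds: Optional[int],
             os_process_exists: bool,
             now: datetime) -> bool:
    """Decide whether a record marked RUNNING should be flipped to EXITED."""
    if not os_process_exists:
        # Condition 1: the OS process is gone.
        return True
    age = now - started_at
    if timeout_seconds is not None:
        # Condition 2: ran past its timeout plus a small grace margin.
        if age > timedelta(seconds=timeout_seconds) + ASSUMED_TIMEOUT_GRACE:
            return True
    # Condition 3: older than the PID reuse window.
    return age > ASSUMED_PID_REUSE_WINDOW
```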

property root: archivebox.machine.models.Process[source]

Get the root process (CLI command) of this hierarchy.

property ancestors: list[archivebox.machine.models.Process][source]

Get all ancestor processes from parent to root.

property depth: int[source]

Get depth in the process tree (0 = root).
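The relationship between `root`, `ancestors`, and `depth` follows from walking the parent link. The sketch below uses a hypothetical in-memory `Node` class in place of the Django model, so it shows only the traversal logic, not how the real properties query the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a Process row with a parent foreign key."""
    name: str
    parent: Optional["Node"] = None

    @property
    def ancestors(self) -> list["Node"]:
        # Walk upward: nearest parent first, root last.
        chain = []
        current = self.parent
        while current is not None:
            chain.append(current)
            current = current.parent
        return chain

    @property
    def root(self) -> "Node":
        # A node with no ancestors is its own root.
        chain = self.ancestors
        return chain[-1] if chain else self

    @property
    def depth(self) -> int:
        # Root has depth 0; each parent hop adds one.
        return len(self.ancestors)
```

For a chain such as cli → orchestrator → worker, the worker has depth 2 and its root is the cli node.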

get_descendants(include_self: bool = False)[source]

Get all descendant processes recursively.

property proc: psutil.Process | None[source]

Get validated psutil.Process for this record.

Returns psutil.Process ONLY if:

  1. Process with this PID exists in OS

  2. OS process start time matches our started_at (within tolerance)

  3. Process is on current machine

Returns None if:

  • PID doesn't exist (process exited)

  • PID was reused by a different process (start times don't match)

  • We're on a different machine than where process ran

  • psutil is not available

This prevents accidentally matching a stale/recycled PID.

property is_running: bool[source]

Check if process is currently running via psutil.

More reliable than checking status field since it validates the actual OS process exists and matches our record.

is_alive() bool[source]

Alias for is_running, for compatibility with subprocess.Popen API.

get_memory_info() dict | None[source]

Get memory usage if process is running.

get_cpu_percent() float | None[source]

Get CPU usage percentage if process is running.

get_children_pids() list[int][source]

Get PIDs of child processes from OS (not DB).

property pid_file: pathlib.Path | None[source]

Path to PID file for this process.

property cmd_file: pathlib.Path | None[source]

Path to cmd.sh script for this process.

property stdout_file: pathlib.Path | None[source]

Path to stdout log.

property stderr_file: pathlib.Path | None[source]

Path to stderr log.

property hook_script_name: str | None[source]

Best-effort hook filename extracted from the process command.

property runtime_dir: pathlib.Path | None[source]

Directory where this process stores runtime logs/pid/cmd metadata.

tail_stdout(lines: int = 50, follow: bool = False)[source]

Tail stdout log file (like tail or tail -f).

Args:
  lines: Number of lines to show (default 50)
  follow: If True, follow the file and yield new lines as they appear

Yields: Lines from stdout
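The non-following case of tailing a log file can be sketched with a bounded deque. `tail_lines` is a hypothetical helper and covers only `follow=False`; how the real method reads the file and implements following is not described here.

```python
from collections import deque
from pathlib import Path
from typing import Iterator


def tail_lines(path: Path, lines: int = 50) -> Iterator[str]:
    """Yield the last `lines` lines of a text file, without trailing newlines."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        # A deque with maxlen keeps only the most recent N lines while
        # the file is streamed, so memory stays bounded.
        for line in deque(fh, maxlen=lines):
            yield line.rstrip("\n")
```

Streaming through a bounded deque avoids loading the whole log into memory, at the cost of reading the file from the start.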

tail_stderr(lines: int = 50, follow: bool = False)[source]

Tail stderr log file (like tail or tail -f).

Args:
  lines: Number of lines to show (default 50)
  follow: If True, follow the file and yield new lines as they appear

Yields: Lines from stderr

pipe_stdout(lines: int = 10, follow: bool = True)[source]

Pipe stdout to sys.stdout.

Args:
  lines: Number of initial lines to show
  follow: If True, follow the file and print new lines as they appear

pipe_stderr(lines: int = 10, follow: bool = True)[source]

Pipe stderr to sys.stderr.

Args:
  lines: Number of initial lines to show
  follow: If True, follow the file and print new lines as they appear

_write_pid_file() None[source]

Write PID file with mtime set to process start time.
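Setting a PID file's mtime to the process start time lets a later reader cross-check the file against the process it claims to describe. The sketch below shows that idea with `os.utime`; the helper names, the file contents (PID followed by a newline), and the 5-second tolerance are assumptions rather than details taken from this module.

```python
import os
from pathlib import Path


def write_pid_file(pid_file: Path, pid: int, started_at: float) -> None:
    """Write the PID, then stamp the file's mtime with the start time (epoch seconds)."""
    pid_file.write_text(f"{pid}\n")
    # os.utime takes (atime, mtime); both are set to the start time here.
    os.utime(pid_file, (started_at, started_at))


def pid_file_matches(pid_file: Path, started_at: float, tolerance: float = 5.0) -> bool:
    """Check that the file's mtime agrees with an expected start time."""
    return abs(pid_file.stat().st_mtime - started_at) <= tolerance
```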

_write_cmd_file() None[source]

Write cmd.sh script for debugging/validation.

ensure_log_files() None[source]

Ensure stdout/stderr log files exist for this process.

_build_env() dict[source]

Build environment dict for subprocess, merging stored env with system.
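One plausible way to merge a stored env with the system environment is to start from `os.environ` and overlay the stored values. The precedence (stored values win), the stringification, and the skipping of `None` values are all assumptions made for this sketch; `build_env` and its `base` parameter are hypothetical.

```python
import os
from typing import Any, Mapping, Optional


def build_env(stored: Optional[Mapping[str, Any]],
              base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Overlay stored env values on top of a base environment."""
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base is None else base)
    for key, value in (stored or {}).items():
        if value is None:
            # Assumption: None means "no override", so the base value stays.
            continue
        # Subprocess environments must map str -> str.
        env[key] = str(value)
    return env
```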

launch(background: bool = False, cwd: str | None = None) archivebox.machine.models.Process[source]

Spawn the subprocess and update this Process record.

Args:
  background: If True, don't wait for completion (for daemons/bg hooks)
  cwd: Working directory for the subprocess (defaults to self.pwd)

Returns: self (updated with pid, started_at, etc.)

kill(signal_num: int = 15) bool[source]

Kill this process and update status.

Uses self.proc for safe killing - only kills if PID matches our recorded process (prevents killing recycled PIDs).

Args:
  signal_num: Signal to send (default SIGTERM=15)

Returns: True if killed successfully, False otherwise

poll() int | None[source]

Check if process has exited and update status if so.

Cleanup when process exits:

  • Copy stdout/stderr to DB (keep files for debugging)

  • Delete PID file

Returns: exit_code if exited, None if still running

wait(timeout: int | None = None) int[source]

Wait for process to exit, polling periodically.

Args:
  timeout: Max seconds to wait (None = use self.timeout)

Returns: exit_code

Raises: TimeoutError if process doesn't exit in time

terminate(graceful_timeout: float = 5.0) bool[source]

Gracefully terminate process: SIGTERM → wait → SIGKILL.

This consolidates the scattered SIGTERM/SIGKILL logic from:

  • crawls/models.py Crawl.cleanup()

  • workers/pid_utils.py stop_worker()

  • supervisord_util.py stop_existing_supervisord_process()

Args:
  graceful_timeout: Seconds to wait after SIGTERM before SIGKILL

Returns: True if process was terminated, False if already dead
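The terminate-wait-kill escalation can be shown with `subprocess.Popen` from the standard library. This is a sketch of the general pattern, not the method's implementation: the real method works from a stored PID validated through psutil, whereas `terminate_gracefully` below is a hypothetical helper that operates on a `Popen` handle.

```python
import subprocess


def terminate_gracefully(proc: subprocess.Popen, graceful_timeout: float = 5.0) -> bool:
    """Ask the process to stop, then force it if it does not exit in time."""
    if proc.poll() is not None:
        # Already exited; nothing to terminate.
        return False
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=graceful_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # Reap the process so it does not linger as a zombie.
    return True
```

Sending SIGTERM first gives the child a chance to flush output and clean up; SIGKILL is the fallback for processes that ignore or block the first signal.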

kill_tree(graceful_timeout: float = 2.0) int[source]

Kill this process and all its children (OS children, not DB children) in parallel.

Uses parallel polling approach - sends SIGTERM to all processes at once, then polls all simultaneously with individual deadline tracking.

This consolidates the scattered child-killing logic from:

  • crawls/models.py Crawl.cleanup() os.killpg()

  • supervisord_util.py stop_existing_supervisord_process()

Args:
  graceful_timeout: Seconds to wait after SIGTERM before SIGKILL

Returns: Number of processes killed (including self)

kill_children_db() int[source]

Kill all DB-tracked child processes (via parent FK).

Different from kill_tree() which uses OS children. This kills processes created via Process.create(parent=self).

Returns: Number of child Process records killed

classmethod get_running(process_type: str | None = None, machine: archivebox.machine.models.Machine | None = None) django.db.models.QuerySet[archivebox.machine.models.Process][source]

Get all running processes, optionally filtered by type.

Replaces:

  • workers/pid_utils.py get_all_worker_pids()

  • workers/orchestrator.py get_total_worker_count()

Args:
  process_type: Filter by TypeChoices (e.g., 'worker', 'hook')
  machine: Filter by machine (defaults to current)

Returns: QuerySet of running Process records

classmethod get_running_count(process_type: str | None = None, machine: archivebox.machine.models.Machine | None = None) int[source]

Get count of running processes.

Replaces:

  • workers/pid_utils.py get_running_worker_count()

classmethod stop_all(process_type: str | None = None, machine: archivebox.machine.models.Machine | None = None, graceful: bool = True) int[source]

Stop all running processes of a given type.

Args:
  process_type: Filter by TypeChoices
  machine: Filter by machine
  graceful: If True, use terminate() (SIGTERM→SIGKILL), else kill()

Returns: Number of processes stopped

classmethod get_next_worker_id(process_type: str = 'worker', machine: archivebox.machine.models.Machine | None = None) int[source]

Get the next available worker ID for spawning new workers.

Replaces workers/pid_utils.py get_next_worker_id(). Simply returns count of running workers of this type.

Args:
  process_type: Worker type to count
  machine: Machine to scope query

Returns: Next available worker ID (0-indexed)

classmethod cleanup_orphaned_chrome() int[source]

Kill orphaned Chrome processes using chrome_utils.js killZombieChrome.

Scans DATA_DIR for chrome/*.pid files from stale crawls (>5 min old) and kills any orphaned Chrome processes.

Called by:

  • Orchestrator on startup (cleanup from previous crashes)

  • Orchestrator periodically (every N minutes)

Returns: Number of zombie Chrome processes killed

classmethod cleanup_orphaned_workers() int[source]

Mark orphaned worker/hook processes as EXITED in the DB.

Orphaned if:

  • Root (orchestrator/cli) is not running, or

  • No orchestrator/cli ancestor exists.

Standalone worker runs (archivebox run --snapshot-id) are allowed.

class archivebox.machine.models.BinaryMachine(obj, *args, **kwargs)[source]

Bases: archivebox.workers.models.BaseStateMachine

State machine for managing Binary installation lifecycle.

Simple 2-state machine:

┌───────────────────────────────────────────────────────┐
│ QUEUED State                                          │
│ • Binary needs to be installed                        │
└───────────────────────────────────────────────────────┘
        ↓ tick() when can_install()
        ↓ Synchronous installation during transition
┌───────────────────────────────────────────────────────┐
│ INSTALLED State                                       │
│ • Binary installed (abspath, version, sha256 set)     │
│ • Health stats incremented                            │
└───────────────────────────────────────────────────────┘

If installation fails, Binary stays in QUEUED with retry_at bumped.

Initialization

model_attr_name[source]

'binary'

binary: archivebox.machine.models.Binary[source]

None

queued[source]

'State(…)'

installed[source]

'State(…)'

tick[source]

None

can_install() bool[source]

Check if binary installation can start.

enter_queued()[source]

Binary is queued for installation.

on_install()[source]

Called during queued→installed transition. Runs installation synchronously.

enter_installed()[source]

Binary installed successfully.

class archivebox.machine.models.ProcessMachine(obj, *args, **kwargs)[source]

Bases: archivebox.workers.models.BaseStateMachine

State machine for managing Process (OS subprocess) lifecycle.

Process Lifecycle:

┌───────────────────────────────────────────────────────┐
│ QUEUED State                                          │
│ • Process ready to launch, waiting for resources      │
└───────────────────────────────────────────────────────┘
        ↓ tick() when can_start()
┌───────────────────────────────────────────────────────┐
│ RUNNING State → enter_running()                       │
│ 1. process.launch()                                   │
│    • Spawn subprocess with cmd, pwd, env, timeout     │
│    • Set pid, started_at                              │
│    • Process runs in background or foreground         │
│ 2. Monitor process completion                         │
│    • Check exit code when process completes           │
└───────────────────────────────────────────────────────┘
        ↓ tick() checks is_exited()
┌───────────────────────────────────────────────────────┐
│ EXITED State                                          │
│ • Process completed (exit_code set)                   │
│ • Health stats incremented                            │
│ • stdout/stderr captured                              │
└───────────────────────────────────────────────────────┘

Note: This is a simpler state machine than ArchiveResult. Process is just about execution lifecycle. ArchiveResult handles the archival-specific logic (status, output parsing, etc.).
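The queued → running → exited progression can be imitated with a small class that advances one step per `tick()`. This is a sketch under stated assumptions, not the real state machine: `ProcessSketch` is hypothetical, it uses `subprocess.Popen` directly, and it reads stdout from a pipe only after exit, which is fine for short outputs but could block a child that fills the pipe buffer.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessSketch:
    """Hypothetical stand-in for a Process record driven by tick()."""
    cmd: list[str]
    status: str = "queued"
    pid: Optional[int] = None
    exit_code: Optional[int] = None
    stdout: str = ""
    _popen: Optional[subprocess.Popen] = field(default=None, repr=False)

    def tick(self) -> str:
        if self.status == "queued" and self.cmd:
            # queued -> running: spawn the subprocess and record its PID.
            self._popen = subprocess.Popen(self.cmd, stdout=subprocess.PIPE, text=True)
            self.pid = self._popen.pid
            self.status = "running"
        elif self.status == "running" and self._popen is not None:
            # running -> exited: only once the OS reports an exit code.
            code = self._popen.poll()
            if code is not None:
                self.exit_code = code
                self.stdout = self._popen.stdout.read()
                self._popen.stdout.close()
                self.status = "exited"
        return self.status
```

Calling `tick()` repeatedly is harmless: a running process stays in `running` until it actually exits, and an exited one stays in `exited`.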

Initialization

model_attr_name[source]

'process'

process: archivebox.machine.models.Process[source]

None

queued[source]

'State(…)'

running[source]

'State(…)'

exited[source]

'State(…)'

tick[source]

None

launch[source]

'to(…)'

kill[source]

'to(…)'

can_start() bool[source]

Check if process can start (has cmd and machine).

is_exited() bool[source]

Check if process has exited (exit_code is set).

enter_queued()[source]

Process is queued for execution.

enter_running()[source]

Start process execution.

enter_exited()[source]

Process has exited.