archivebox.plugins.hooks

Hook discovery and execution helpers for ArchiveBox plugins.

ArchiveBox no longer drives plugin execution itself during normal crawls. abx-dl owns the live runtime and emits typed bus events; ArchiveBox mainly:

  • discovers hook files for inspection / docs / legacy direct execution helpers

  • executes individual hook scripts when explicitly requested

  • parses hook stdout JSONL records into ArchiveBox models when needed

Hook-backed event families are discovered from filenames like:

    on_CrawlSetup__*
    on_Snapshot__*

Internal bus event names are normalized to the corresponding on_{EventFamily}__* prefix by a simple string transform. If no scripts exist for that prefix, discovery returns [].

Directory structure:

    abx_plugins/plugins/<plugin_name>/on_<hook_name>.     (built-in package)
    data/custom_plugins/<plugin_name>/on_<hook_name>.     (user)

Hook contract:

    Input:   --url= (and other --key=value args)
    Output:  JSONL records to stdout, files to $PWD
    Exit:    0 = success, non-zero = failure
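The contract above can be sketched as a small Python hook. This is only an illustration under stated assumptions: the record fields (`type`, `url`), the record type name `ExampleRecord`, the output filename, and the helper names are hypothetical and not a documented ArchiveBox schema.

```python
import json
import sys
from pathlib import Path


def parse_hook_args(argv):
    """Collect --key=value arguments into a dict; anything else is ignored."""
    args = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            args[key] = value
    return args


def main(argv, output_dir=None):
    """Run the hook and return its exit code (0 = success, non-zero = failure)."""
    args = parse_hook_args(argv)
    url = args.get("url")
    if not url:
        print("missing required --url= argument", file=sys.stderr)
        return 1

    # Output files go in the working directory ($PWD) unless a directory is given.
    out_dir = Path(output_dir) if output_dir else Path.cwd()
    (out_dir / "example_output.txt").write_text(url + "\n", encoding="utf-8")

    # Emit one JSON object per line on stdout. The field names are hypothetical.
    print(json.dumps({"type": "ExampleRecord", "url": url}))
    return 0


# A real hook script would end with: sys.exit(main(sys.argv[1:]))
```

Keeping `main` free of `sys.exit` makes the exit-code logic easy to test without spawning a process.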

Execution order:

  • Hooks are named with two-digit prefixes (00-99) and sorted lexicographically by filename

  • Foreground hooks run sequentially in that order

  • Background hooks (.bg suffix) run concurrently and do not block foreground progress

  • After all foreground hooks complete, background hooks receive SIGTERM and must finalize
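The ordering rules can be illustrated with a short sketch that sorts hook filenames and separates foreground from background hooks. `plan_execution` is a hypothetical helper, not part of this module; it assumes all names belong to one event family and uses the documented rule that background hooks carry `.bg.` before the extension.

```python
def plan_execution(hook_names):
    """Split hook filenames into (foreground, background), each in run order."""
    # Plain lexicographic sort: the two-digit prefix decides the order.
    ordered = sorted(hook_names)
    foreground = [name for name in ordered if ".bg." not in name]
    background = [name for name in ordered if ".bg." in name]
    return foreground, background


hooks = [
    "on_Snapshot__50_wget.py",
    "on_Snapshot__10_chrome_tab.daemon.bg.js",
    "on_Snapshot__10_title.py",
    "on_Snapshot__63_media.finite.bg.py",
]
foreground, background = plan_execution(hooks)
```

A runner built on this would start the `background` hooks without waiting, run the `foreground` hooks one at a time, and then signal the background hooks to finalize.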

Hook naming convention: on_{EventFamily}__{run_order}_{description}[.finite.bg|.daemon.bg].{ext}
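As a rough illustration of the naming convention, the pattern can be expressed as a regular expression. The `HookName` tuple, `HOOK_NAME_RE`, and `parse_hook_name` are hypothetical names, and the allowed character sets (letters, digits, underscores) are an assumption; the real module may accept more.

```python
import re
from typing import NamedTuple, Optional


class HookName(NamedTuple):
    event_family: str
    run_order: int
    description: str
    background_kind: Optional[str]  # "finite", "daemon", or None for foreground
    ext: str


# on_{EventFamily}__{run_order}_{description}[.finite.bg|.daemon.bg].{ext}
HOOK_NAME_RE = re.compile(
    r"^on_(?P<family>[A-Za-z0-9]+)"
    r"__(?P<order>\d{2})"
    r"_(?P<description>[A-Za-z0-9_]+)"
    r"(?:\.(?P<kind>finite|daemon)\.bg)?"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_hook_name(filename: str) -> Optional[HookName]:
    """Split a hook filename into its parts, or return None if it does not match."""
    match = HOOK_NAME_RE.match(filename)
    if match is None:
        return None
    return HookName(
        event_family=match["family"],
        run_order=int(match["order"]),
        description=match["description"],
        background_kind=match["kind"],  # None when the optional group is absent
        ext=match["ext"],
    )
```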

API:

    discover_hooks(event) -> List[Path]     Find hook scripts for a hook-backed event family
    run_hook(script, …) -> Process          Execute a hook script directly
    is_background_hook(name) -> bool        Check if hook is background (.bg suffix)

Module Contents

Classes

ConfigDump

Functions

_has_config_dump

_config_to_overrides

is_background_hook

Check if a hook is a background hook (doesn’t block foreground progression).

is_finite_background_hook

Check if a background hook is finite-lived and should be awaited.

normalize_hook_event_name

Normalize a hook event family or event class name to its on_* prefix.

_model_output_dir_from_child_path

Infer the model output dir from a model dir or one of its plugin subdirs.

discover_hooks

Find all hook scripts for an event family.

run_hook

Execute a hook script with the given arguments using Process model.

extract_records_from_process

Extract JSONL records from a Process’s stdout.

collect_urls_from_plugins

Collect all urls.jsonl entries from parser plugin output subdirectories.

process_hook_records

Process JSONL records emitted by hook stdout.

API

class archivebox.plugins.hooks.ConfigDump

Bases: typing.Protocol

as_dict() -> dict[str, Any]

archivebox.plugins.hooks._has_config_dump(config: object) -> TypeGuard[archivebox.plugins.hooks.ConfigDump]

archivebox.plugins.hooks._config_to_overrides(config: archivebox.plugins.discovery.ConfigLookup | collections.abc.Mapping[str, Any] | None) -> dict[str, Any]

archivebox.plugins.hooks.is_background_hook(hook_name: str) -> bool

Check if a hook is a background hook (doesn’t block foreground progression).

Background hooks have ‘.bg.’ in their filename before the extension.

Args: hook_name: Hook filename (e.g., ‘on_Snapshot__10_chrome_tab.daemon.bg.js’)

Returns: True if background hook, False if foreground.

Examples:

    is_background_hook('on_Snapshot__10_chrome_tab.daemon.bg.js') -> True
    is_background_hook('on_Snapshot__50_wget.py') -> False
    is_background_hook('on_Snapshot__63_media.finite.bg.py') -> True

archivebox.plugins.hooks.is_finite_background_hook(hook_name: str) -> bool

Check if a background hook is finite-lived and should be awaited.

archivebox.plugins.hooks.normalize_hook_event_name(event_name: str) -> str | None

Normalize a hook event family or event class name to its on_* prefix.

Examples:

    CrawlSetupEvent -> CrawlSetup
    SnapshotEvent -> Snapshot
    BinaryEvent -> Binary
    CrawlCleanupEvent -> CrawlCleanup
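The transform can be sketched in two steps: strip a trailing `Event`, then build the `on_{EventFamily}__` filename prefix. Both helper names are hypothetical, and returning None for an empty result is an assumption about when the real function returns None.

```python
from typing import Optional


def event_family(event_name: str) -> Optional[str]:
    """Strip a trailing 'Event' suffix; return None if nothing is left."""
    name = event_name.strip()
    if name.endswith("Event"):
        name = name[: -len("Event")]
    return name or None


def hook_prefix(event_name: str) -> Optional[str]:
    """Build the on_{EventFamily}__ filename prefix used to match hook scripts."""
    family = event_family(event_name)
    return f"on_{family}__" if family else None
```

With these helpers, `hook_prefix("CrawlSetupEvent")` and `hook_prefix("CrawlSetup")` produce the same prefix, which is why either form can be passed to discovery.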

archivebox.plugins.hooks._model_output_dir_from_child_path(path: pathlib.Path, marker: str) -> pathlib.Path | None

Infer the model output dir from a model dir or one of its plugin subdirs.

Current ArchiveBox snapshot/crawl dirs are: …/{snapshots,crawls}/YYYYMMDD/domain/uuid[/plugin]
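One way to read this layout is sketched below. It assumes `marker` is the collection directory name (`snapshots` or `crawls`) and that the model dir is always exactly three levels below it (date, domain, uuid); neither assumption is confirmed by the documentation, and `model_dir_from_child` is a hypothetical stand-in for the private helper.

```python
from pathlib import PurePath
from typing import Optional


def model_dir_from_child(path: PurePath, marker: str) -> Optional[PurePath]:
    """Return .../<marker>/<date>/<domain>/<uuid> for that dir or any path below it."""
    parts = path.parts
    if marker not in parts:
        return None
    # Use the last occurrence of the marker in case it also appears higher up.
    marker_index = len(parts) - 1 - parts[::-1].index(marker)
    wanted_length = marker_index + 1 + 3  # marker + date + domain + uuid
    if len(parts) < wanted_length:
        return None  # path stops above the model dir
    result = path
    for _ in range(len(parts) - wanted_length):
        result = result.parent
    return result
```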

archivebox.plugins.hooks.discover_hooks(event_name: str, filter_disabled: bool = True, config: archivebox.plugins.discovery.ConfigLookup | None = None, **config_kwargs: Any) -> list[pathlib.Path]

Find all hook scripts for an event family.

Searches both built-in and user plugin directories. Filters out hooks from disabled plugins by default (respects USE_/SAVE_ flags). Returns scripts sorted alphabetically by filename for deterministic execution order.

Hook naming convention uses numeric prefixes to control order:

    on_Snapshot__10_title.py          # runs first
    on_Snapshot__15_singlefile.py     # runs second
    on_Snapshot__26_readability.py    # runs later (depends on singlefile)

Args:

    event_name: Hook event family or event class name. Examples: 'CrawlSetupEvent', 'Snapshot'. Event names are normalized by stripping a trailing Event. If no matching on_{EventFamily}__* scripts exist, returns [].
    filter_disabled: If True, skip hooks from disabled plugins (default: True)
    config: Optional pre-merged config dict from get_config().
    **config_kwargs: Scope/override args forwarded to get_config() when config is not supplied.

Returns: Sorted list of hook script paths from enabled plugins only.

Examples:

    # With proper config context (recommended):
    from archivebox.config.common import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    discover_hooks('Snapshot', config=config)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...] (wget excluded if SAVE_WGET=False)

    # Without config (uses global defaults):
    discover_hooks('Snapshot')
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...]

    # Show all plugins regardless of enabled status:
    discover_hooks('Snapshot', filter_disabled=False)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ..., Path('.../on_Snapshot__50_wget.py')]

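The discovery step itself can be approximated with `pathlib` globbing. This is a simplified sketch, not the module's implementation: `find_hooks` and `demo` are hypothetical names, the plugin roots are passed in explicitly, and the disabled-plugin filtering described above is left out.

```python
import tempfile
from pathlib import Path


def find_hooks(plugin_roots, event_family):
    """Return hook scripts named on_{event_family}__* under each <root>/<plugin>/ dir."""
    hooks = []
    for root in plugin_roots:
        root = Path(root)
        if root.is_dir():
            hooks.extend(p for p in root.glob(f"*/on_{event_family}__*") if p.is_file())
    # Sort by filename only, so the numeric prefix decides order across plugins.
    return sorted(hooks, key=lambda p: p.name)


def demo():
    """Build a throwaway plugin tree and list its Snapshot hooks in run order."""
    with tempfile.TemporaryDirectory() as tmp:
        builtin, user = Path(tmp, "builtin"), Path(tmp, "custom")
        for root, plugin, filename in [
            (builtin, "wget", "on_Snapshot__50_wget.py"),
            (builtin, "title", "on_Snapshot__10_title.py"),
            (user, "mine", "on_Snapshot__15_mine.sh"),
            (user, "mine", "on_CrawlSetup__00_mine.sh"),
        ]:
            (root / plugin).mkdir(parents=True, exist_ok=True)
            (root / plugin / filename).touch()
        missing = Path(tmp, "missing")  # non-existent roots are skipped
        return [p.name for p in find_hooks([builtin, user, missing], "Snapshot")]
```

Sorting on `p.name` rather than the full path is what lets a user plugin's `15_` hook run between built-in `10_` and `50_` hooks.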
archivebox.plugins.hooks.run_hook(script: pathlib.Path, output_dir: pathlib.Path, config: archivebox.plugins.discovery.ConfigLookup | collections.abc.Mapping[str, Any] | None = None, timeout: int | None = None, parent: Optional[archivebox.machine.models.Process] = None, **kwargs: Any) -> archivebox.machine.models.Process

Execute a hook script with the given arguments using Process model.

This is the low-level hook executor that creates a Process record and uses Process.launch() for subprocess management.

Config is passed to hooks via environment variables. Crawl/snapshot callers should pass the runtime config produced by for_crawl_runtime().

Args:

    script: Path to the hook script (.sh, .py, or .js)
    output_dir: Working directory for the script (where output files go)
    config: Optional runtime config dict from for_crawl_runtime(). If omitted, pass scope/override args using kwargs prefixed with config_.
    timeout: Maximum execution time in seconds. If None, auto-detects from PLUGINNAME_TIMEOUT config (fallback to TIMEOUT, default 300)
    parent: Optional parent Process (for tracking worker->hook hierarchy)
    **kwargs: Arguments passed to the script as --key=value

Returns: Process model instance (use process.exit_code, process.stdout, process.get_records())

Example:

    from archivebox.config.common import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot).for_crawl_runtime(crawl=my_crawl, snapshot=my_snapshot)
    process = run_hook(hook_path, output_dir, config=config, url=url, snapshot_id=id)
    if process.status == 'exited':
        records = process.get_records()  # Get parsed JSONL output
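For intuition, the argument, environment, and timeout handling described above can be sketched with the standard `subprocess` module. This is not how run_hook is implemented (it uses the Process model and Process.launch()); all three helpers are hypothetical, the sketch only handles `.py` hooks, and taking the plugin name from the script's parent directory is an assumption based on the documented directory structure.

```python
import os
import subprocess
import sys
from pathlib import Path


def resolve_timeout(config, plugin_name, timeout=None):
    """Explicit timeout wins, then PLUGINNAME_TIMEOUT, then TIMEOUT, then 300."""
    if timeout is not None:
        return timeout
    for key in (f"{plugin_name.upper()}_TIMEOUT", "TIMEOUT"):
        if config.get(key) is not None:
            return int(config[key])
    return 300


def build_hook_argv(script, **kwargs):
    """Turn keyword arguments into --key=value command-line arguments."""
    return [str(script)] + [f"--{key}={value}" for key, value in kwargs.items()]


def run_python_hook(script, output_dir, config=None, timeout=None, **kwargs):
    """Run a .py hook with output_dir as its working directory."""
    config = dict(config or {})
    script = Path(script).resolve()  # cwd changes below, so use an absolute path
    # Config reaches the hook through environment variables (values must be str).
    env = {**os.environ, **{key: str(value) for key, value in config.items()}}
    return subprocess.run(
        [sys.executable, *build_hook_argv(script, **kwargs)],
        cwd=str(output_dir),
        env=env,
        capture_output=True,
        text=True,
        timeout=resolve_timeout(config, script.parent.name, timeout),
    )
```

The returned `subprocess.CompletedProcess` exposes `returncode` and `stdout`, which play roughly the role of `process.exit_code` and `process.stdout` on the Process model.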

archivebox.plugins.hooks.extract_records_from_process(process: archivebox.machine.models.Process) -> list[dict[str, Any]]

Extract JSONL records from a Process’s stdout.

Adds plugin metadata to each record.

Args: process: Process model instance with stdout captured

Returns: List of parsed JSONL records with plugin metadata
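A minimal sketch of this kind of parsing is shown below, working on a plain stdout string instead of a Process instance. The `plugin` metadata key, the choice to skip non-JSON lines silently, and the function name `parse_jsonl_stdout` are assumptions made for illustration.

```python
import json


def parse_jsonl_stdout(stdout, plugin_name):
    """Parse JSON-object lines from hook stdout and tag each with its plugin."""
    records = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ordinary log output, not a record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skipped in this sketch
        if isinstance(record, dict):
            record.setdefault("plugin", plugin_name)  # hypothetical metadata key
            records.append(record)
    return records
```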

archivebox.plugins.hooks.collect_urls_from_plugins(snapshot_dir: pathlib.Path) -> list[dict[str, Any]]

Collect all urls.jsonl entries from parser plugin output subdirectories.

Each parser plugin outputs urls.jsonl to its own subdir:

    snapshot_dir/parse_rss_urls/urls.jsonl
    snapshot_dir/parse_html_urls/urls.jsonl
    etc.

This is not special handling - urls.jsonl is just a normal output file. This utility collects them all for the crawl system.
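Since urls.jsonl is an ordinary output file, collecting the entries amounts to a glob over the plugin subdirectories. The sketch below assumes one JSON object per line; `collect_urls`, `demo`, and the `plugin` provenance key are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def collect_urls(snapshot_dir):
    """Read every <snapshot_dir>/<plugin>/urls.jsonl and return the parsed entries."""
    entries = []
    for urls_file in sorted(Path(snapshot_dir).glob("*/urls.jsonl")):
        for line in urls_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            entry.setdefault("plugin", urls_file.parent.name)  # hypothetical key
            entries.append(entry)
    return entries


def demo():
    """Write two parser outputs into a temporary snapshot dir and collect them."""
    with tempfile.TemporaryDirectory() as tmp:
        for plugin, url in [
            ("parse_rss_urls", "https://example.com/feed-item"),
            ("parse_html_urls", "https://example.com/link"),
        ]:
            subdir = Path(tmp, plugin)
            subdir.mkdir()
            (subdir / "urls.jsonl").write_text(json.dumps({"url": url}) + "\n", encoding="utf-8")
        return collect_urls(tmp)
```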

archivebox.plugins.hooks.process_hook_records(records: list[dict[str, Any]], overrides: dict[str, Any] | None = None) -> dict[str, int]

Process JSONL records emitted by hook stdout.

This handles hook-emitted record types such as Snapshot, Tag, and Binary. It does not process internal bus lifecycle events, since those are not emitted as JSONL records by hook subprocesses.

Args:

    records: List of JSONL record dicts from result['records']
    overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

Returns: Dict with counts by record type
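The return shape can be illustrated with a small dispatch loop that counts handled records per type. It assumes each record names its type under a `type` key and that unknown types are skipped; the `handlers` mapping and `process_records` are hypothetical stand-ins for whatever creates or updates ArchiveBox models.

```python
def process_records(records, handlers):
    """Dispatch each record to the handler for its type; return counts by type."""
    counts = {}
    for record in records:
        record_type = record.get("type")
        handler = handlers.get(record_type)
        if handler is None:
            continue  # unknown or missing type: skipped in this sketch
        handler(record)
        counts[record_type] = counts.get(record_type, 0) + 1
    return counts


seen_urls = []
handlers = {
    "Snapshot": lambda record: seen_urls.append(record["url"]),
    "Tag": lambda record: None,
}
counts = process_records(
    [
        {"type": "Snapshot", "url": "https://example.com/a"},
        {"type": "Tag", "name": "news"},
        {"type": "Snapshot", "url": "https://example.com/b"},
        {"type": "SomethingElse"},
    ],
    handlers,
)
```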