archivebox.plugins.hooks
Hook discovery and execution helpers for ArchiveBox plugins.
ArchiveBox no longer drives plugin execution itself during normal crawls.
abx-dl owns the live runtime and emits typed bus events; ArchiveBox mainly:
- discovers hook files for inspection / docs / legacy direct execution helpers
- executes individual hook scripts when explicitly requested
- parses hook stdout JSONL records into ArchiveBox models when needed
Hook-backed event families are discovered from filenames like:
    on_CrawlSetup__*
    on_Snapshot__*
Internal bus event names are normalized to the corresponding
on_{EventFamily}__* prefix by a simple string transform. If no scripts exist
for that prefix, discovery returns [].
Directory structure:
abx_plugins/plugins/<plugin_name>/on_
Hook contract:
Input: --url=
Execution order:
- Hooks are named with two-digit prefixes (00-99) and sorted lexicographically by filename
- Foreground hooks run sequentially in that order
- Background hooks (.bg suffix) run concurrently and do not block foreground progress
- After all foreground hooks complete, background hooks receive SIGTERM and must finalize
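The ordering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the ArchiveBox or abx-dl scheduler: `plan_hook_order` is a hypothetical helper, and it assumes that a `.bg.` marker in the filename is all that distinguishes background hooks.

```python
def plan_hook_order(filenames):
    """Split hook filenames into (foreground, background), each in run order."""
    # Two-digit prefixes make plain lexicographic order match the intended run order.
    ordered = sorted(filenames)
    # Assumption: a '.bg.' marker before the extension means "background hook".
    foreground = [name for name in ordered if ".bg." not in name]
    background = [name for name in ordered if ".bg." in name]
    return foreground, background


hooks = [
    "on_Snapshot__50_wget.py",
    "on_Snapshot__10_chrome_tab.daemon.bg.js",
    "on_Snapshot__10_title.py",
    "on_Snapshot__63_media.finite.bg.py",
]
foreground, background = plan_hook_order(hooks)
print(foreground)  # foreground hooks, to be run one after another
print(background)  # background hooks, to be started without blocking
```

A real runner would additionally send SIGTERM to the background processes once the foreground list is exhausted; that step is omitted here.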
Hook naming convention: on_{EventFamily}__{run_order}_{description}[.finite.bg|.daemon.bg].{ext}
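As a rough illustration of the naming convention, the sketch below splits a hook filename into its parts with a regular expression. The pattern and the `parse_hook_name` helper are assumptions written for this example and are not part of archivebox.plugins.hooks; the real module may accept names this pattern rejects.

```python
import re

# Hypothetical pattern for on_{EventFamily}__{run_order}_{description}[.finite.bg|.daemon.bg].{ext}
HOOK_NAME_RE = re.compile(
    r"^on_(?P<family>[A-Za-z]+)__"
    r"(?P<order>\d{2})_"
    r"(?P<description>[A-Za-z0-9_]+)"
    r"(?:\.(?P<lifetime>finite|daemon)\.bg)?"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_hook_name(filename):
    """Return the parts of a hook filename as a dict, or None if it does not match."""
    match = HOOK_NAME_RE.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["order"] = int(parts["order"])  # '10' -> 10
    return parts


print(parse_hook_name("on_Snapshot__10_chrome_tab.daemon.bg.js"))
print(parse_hook_name("on_Snapshot__50_wget.py"))
```

For foreground hooks the `lifetime` entry is None, since the optional `.finite.bg` / `.daemon.bg` segment is absent.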
API:
    discover_hooks(event) -> List[Path]    Find hook scripts for a hook-backed event family
    run_hook(script, ...) -> Process       Execute a hook script directly
    is_background_hook(name) -> bool       Check if hook is background (.bg suffix)
Module Contents
Classes
Functions
- is_background_hook: Check if a hook is a background hook (doesn't block foreground progression).
- is_finite_background_hook: Check if a background hook is finite-lived and should be awaited.
- normalize_hook_event_name: Normalize a hook event family or event class name to its on_* prefix.
- _model_output_dir_from_child_path: Infer the model output dir from a model dir or one of its plugin subdirs.
- discover_hooks: Find all hook scripts for an event family.
- run_hook: Execute a hook script with the given arguments using Process model.
- extract_records_from_process: Extract JSONL records from a Process's stdout.
- collect_urls_from_plugins: Collect all urls.jsonl entries from parser plugin output subdirectories.
- process_hook_records: Process JSONL records emitted by hook stdout.
API
- archivebox.plugins.hooks._has_config_dump(config: object) -> TypeGuard[archivebox.plugins.hooks.ConfigDump]
- archivebox.plugins.hooks._config_to_overrides(config: archivebox.plugins.discovery.ConfigLookup | collections.abc.Mapping[str, Any] | None) -> dict[str, Any]
- archivebox.plugins.hooks.is_background_hook(hook_name: str) -> bool
Check if a hook is a background hook (doesn’t block foreground progression).
Background hooks have ‘.bg.’ in their filename before the extension.
Args:
    hook_name: Hook filename (e.g., 'on_Snapshot__10_chrome_tab.daemon.bg.js')
Returns: True if background hook, False if foreground.
Examples:
    is_background_hook('on_Snapshot__10_chrome_tab.daemon.bg.js') -> True
    is_background_hook('on_Snapshot__50_wget.py') -> False
    is_background_hook('on_Snapshot__63_media.finite.bg.py') -> True
- archivebox.plugins.hooks.is_finite_background_hook(hook_name: str) -> bool
Check if a background hook is finite-lived and should be awaited.
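The two filename checks can be pictured as a three-way classification. The sketch below is an assumption-laden illustration: `classify_hook` is a hypothetical helper, and the `.finite.bg.` test is inferred from the naming convention rather than taken from the actual implementation of is_finite_background_hook.

```python
def classify_hook(hook_name: str) -> str:
    """Classify a hook filename as foreground, finite background, or daemon background."""
    if ".bg." not in hook_name:
        return "foreground"          # blocks foreground progression
    if ".finite.bg." in hook_name:
        return "finite-background"   # assumption: finite-lived, should be awaited
    if ".daemon.bg." in hook_name:
        return "daemon-background"   # assumption: long-lived, stopped via SIGTERM
    return "background"              # '.bg.' present but no lifetime marker


for name in (
    "on_Snapshot__10_chrome_tab.daemon.bg.js",
    "on_Snapshot__50_wget.py",
    "on_Snapshot__63_media.finite.bg.py",
):
    print(name, "->", classify_hook(name))
```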
- archivebox.plugins.hooks.normalize_hook_event_name(event_name: str) -> str | None
Normalize a hook event family or event class name to its on_* prefix.
Examples:
    CrawlSetupEvent -> CrawlSetup
    SnapshotEvent -> Snapshot
    BinaryEvent -> Binary
    CrawlCleanupEvent -> CrawlCleanup
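A possible shape for this string transform is sketched below. `hook_prefix_for` is a hypothetical stand-in, not the real normalize_hook_event_name: it strips a trailing Event and builds the on_{EventFamily}__ filename prefix, and the condition under which it returns None (nothing left after stripping) is an assumption.

```python
from typing import Optional


def hook_prefix_for(event_name: str) -> Optional[str]:
    """Map 'CrawlSetupEvent' or 'CrawlSetup' to the filename prefix 'on_CrawlSetup__'."""
    name = event_name.strip()
    if name.endswith("Event"):
        name = name[: -len("Event")]  # 'SnapshotEvent' -> 'Snapshot'
    if not name:
        return None  # assumption: an empty family cannot name any hook files
    return f"on_{name}__"


print(hook_prefix_for("CrawlSetupEvent"))
print(hook_prefix_for("Snapshot"))
```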
- archivebox.plugins.hooks._model_output_dir_from_child_path(path: pathlib.Path, marker: str) -> pathlib.Path | None
Infer the model output dir from a model dir or one of its plugin subdirs.
Current ArchiveBox snapshot/crawl dirs are: …/{snapshots,crawls}/YYYYMMDD/domain/uuid[/plugin]
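One way to picture this inference, given the layout above, is to locate the marker directory and keep the three components that follow it. The sketch assumes that `marker` is the directory name (`snapshots` or `crawls`); the real helper may interpret it differently, and `model_dir_from_child` is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Optional


def model_dir_from_child(path: PurePosixPath, marker: str) -> Optional[PurePosixPath]:
    """Trim a path down to .../<marker>/YYYYMMDD/domain/uuid, or return None."""
    parts = path.parts
    if marker not in parts:
        return None
    end = parts.index(marker) + 4  # marker + date + domain + uuid
    if len(parts) < end:
        return None  # path stops before the uuid component
    return PurePosixPath(*parts[:end])


plugin_dir = PurePosixPath("/data/snapshots/20240101/example.com/abc123/wget")
print(model_dir_from_child(plugin_dir, "snapshots"))
```

The example path is made up for illustration; passing the model dir itself returns it unchanged, and a path without the marker returns None.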
- archivebox.plugins.hooks.discover_hooks(event_name: str, filter_disabled: bool = True, config: archivebox.plugins.discovery.ConfigLookup | None = None, **config_kwargs: Any) -> list[pathlib.Path]
Find all hook scripts for an event family.
Searches both built-in and user plugin directories. Filters out hooks from disabled plugins by default (respects USE_/SAVE_ flags). Returns scripts sorted alphabetically by filename for deterministic execution order.
Hook naming convention uses numeric prefixes to control order:
    on_Snapshot__10_title.py         # runs first
    on_Snapshot__15_singlefile.py    # runs second
    on_Snapshot__26_readability.py   # runs later (depends on singlefile)
Args:
    event_name: Hook event family or event class name. Examples: 'CrawlSetupEvent', 'Snapshot'. Event names are normalized by stripping a trailing Event. If no matching on_{EventFamily}__* scripts exist, returns [].
    filter_disabled: If True, skip hooks from disabled plugins (default: True)
    config: Optional pre-merged config dict from get_config().
    **config_kwargs: Scope/override args forwarded to get_config() when config is not supplied.
Returns: Sorted list of hook script paths from enabled plugins only.
Examples:
    # With proper config context (recommended):
    from archivebox.config.common import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    discover_hooks('Snapshot', config=config)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...] (wget excluded if SAVE_WGET=False)

    # Without config (uses global defaults):
    discover_hooks('Snapshot')
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...]

    # Show all plugins regardless of enabled status:
    discover_hooks('Snapshot', filter_disabled=False)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ..., Path('.../on_Snapshot__50_wget.py')]
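The core of discovery (glob by prefix, sort by filename) can be approximated with pathlib alone. The sketch below is not discover_hooks itself: `find_hooks` is a hypothetical helper, it ignores enabled/disabled filtering and user plugin directories, and the plugin tree is a throwaway one built in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_hooks(plugins_root: Path, family: str) -> list:
    """Return hook scripts under <plugins_root>/<plugin>/ for one event family."""
    candidates = [p for p in plugins_root.glob(f"*/on_{family}__*") if p.is_file()]
    # Sort by filename (not full path) so numeric prefixes decide the order.
    return sorted(candidates, key=lambda p: p.name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Made-up plugin layout, used only for this demonstration.
    for plugin, filename in [
        ("wget", "on_Snapshot__50_wget.py"),
        ("title", "on_Snapshot__10_title.py"),
        ("singlefile", "on_Snapshot__15_singlefile.py"),
        ("example", "on_CrawlSetup__00_example.py"),
    ]:
        (root / plugin).mkdir()
        (root / plugin / filename).write_text("")
    found = [p.name for p in find_hooks(root, "Snapshot")]
    missing = find_hooks(root, "Binary")

print(found)    # Snapshot hooks in run order
print(missing)  # no scripts for this family, so an empty list
```

Sorting on `p.name` rather than on the full path matters because hooks from different plugin directories are interleaved by their numeric prefix.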
- archivebox.plugins.hooks.run_hook(script: pathlib.Path, output_dir: pathlib.Path, config: archivebox.plugins.discovery.ConfigLookup | collections.abc.Mapping[str, Any] | None = None, timeout: int | None = None, parent: Optional[archivebox.machine.models.Process] = None, **kwargs: Any) -> archivebox.machine.models.Process
Execute a hook script with the given arguments using Process model.
This is the low-level hook executor that creates a Process record and uses Process.launch() for subprocess management.
Config is passed to hooks via environment variables. Crawl/snapshot callers should pass the runtime config produced by for_crawl_runtime().
Args:
    script: Path to the hook script (.sh, .py, or .js)
    output_dir: Working directory for the script (where output files go)
    config: Optional runtime config dict from for_crawl_runtime(). If omitted, pass scope/override args using kwargs prefixed with config_.
    timeout: Maximum execution time in seconds. If None, auto-detects from PLUGINNAME_TIMEOUT config (fallback to TIMEOUT, default 300)
    parent: Optional parent Process (for tracking worker->hook hierarchy)
    **kwargs: Arguments passed to the script as --key=value
Returns: Process model instance (use process.exit_code, process.stdout, process.get_records())
Example:
    from archivebox.config.common import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot).for_crawl_runtime(crawl=my_crawl, snapshot=my_snapshot)
    process = run_hook(hook_path, output_dir, config=config, url=url, snapshot_id=id)
    if process.status == 'exited':
        records = process.get_records()  # Get parsed JSONL output
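Two details of the argument handling described above lend themselves to a short sketch: turning keyword arguments into --key=value flags, and the timeout fallback chain. Both helpers are hypothetical and only illustrate the documented behavior; in particular, deriving the plugin-specific key by upper-casing the plugin name is an assumption.

```python
def hook_cli_args(**kwargs) -> list:
    """Render keyword arguments as --key=value command-line flags."""
    # Assumption: keys are passed through unchanged (no underscore/dash rewriting).
    return [f"--{key}={value}" for key, value in kwargs.items()]


def resolve_timeout(config: dict, plugin_name: str, default: int = 300) -> int:
    """Pick <PLUGINNAME>_TIMEOUT, then TIMEOUT, then the default."""
    plugin_key = f"{plugin_name.upper()}_TIMEOUT"  # assumed key derivation
    for key in (plugin_key, "TIMEOUT"):
        value = config.get(key)
        if value is not None:
            return int(value)
    return default


print(hook_cli_args(url="https://example.com", snapshot_id="abc"))
print(resolve_timeout({"TIMEOUT": 120}, "wget"))
```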
- archivebox.plugins.hooks.extract_records_from_process(process: archivebox.machine.models.Process) -> list[dict[str, Any]]
Extract JSONL records from a Process’s stdout.
Adds plugin metadata to each record.
Args: process: Process model instance with stdout captured
Returns: List of parsed JSONL records with plugin metadata
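The general technique (scan stdout line by line, keep the lines that parse as JSON objects, tag each with its plugin) might look like the sketch below. It works on a plain string instead of a Process instance, and both the `parse_jsonl_records` name and the `plugin` metadata key are assumptions made for the example.

```python
import json


def parse_jsonl_records(stdout: str, plugin: str) -> list:
    """Parse JSON-object lines from hook stdout, skipping log noise."""
    records = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ordinary log output, not a record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed JSON is ignored in this sketch
        if isinstance(record, dict):
            record.setdefault("plugin", plugin)  # hypothetical metadata field
            records.append(record)
    return records


output = 'starting wget\n{"type": "Snapshot", "url": "https://example.com"}\n{broken\n'
print(parse_jsonl_records(output, plugin="wget"))
```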
- archivebox.plugins.hooks.collect_urls_from_plugins(snapshot_dir: pathlib.Path) -> list[dict[str, Any]]
Collect all urls.jsonl entries from parser plugin output subdirectories.
Each parser plugin outputs urls.jsonl to its own subdir:
    snapshot_dir/parse_rss_urls/urls.jsonl
    snapshot_dir/parse_html_urls/urls.jsonl
    etc.
This is not special handling - urls.jsonl is just a normal output file. This utility collects them all for the crawl system.
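Because urls.jsonl is an ordinary file in each plugin subdirectory, collecting the entries amounts to a one-level glob plus JSONL parsing. The sketch below shows that idea under stated assumptions: `collect_urls` is a hypothetical helper, the record contents are invented, and the real function may add metadata or handle malformed lines differently.

```python
import json
import tempfile
from pathlib import Path


def collect_urls(snapshot_dir: Path) -> list:
    """Read every <snapshot_dir>/<plugin>/urls.jsonl into one list of dicts."""
    entries = []
    for urls_file in sorted(snapshot_dir.glob("*/urls.jsonl")):
        for line in urls_file.read_text().splitlines():
            if line.strip():  # skip blank lines
                entries.append(json.loads(line))
    return entries


with tempfile.TemporaryDirectory() as tmp:
    snapshot_dir = Path(tmp)
    # Invented sample output from two parser plugins.
    for plugin, url in [
        ("parse_rss_urls", "https://example.com/feed-item"),
        ("parse_html_urls", "https://example.com/page"),
    ]:
        (snapshot_dir / plugin).mkdir()
        (snapshot_dir / plugin / "urls.jsonl").write_text(json.dumps({"url": url}) + "\n")
    collected = collect_urls(snapshot_dir)

print(collected)
```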
- archivebox.plugins.hooks.process_hook_records(records: list[dict[str, Any]], overrides: dict[str, Any] | None = None) -> dict[str, int]
Process JSONL records emitted by hook stdout.
This handles hook-emitted record types such as Snapshot, Tag, and Binary. It does not process internal bus lifecycle events, since those are not emitted as JSONL records by hook subprocesses.
Args:
    records: List of JSONL record dicts from result['records']
    overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.
Returns: Dict with counts by record type
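The return value is a mapping from record type to count. The sketch below shows only that tallying step, assuming each record carries its type in a `type` field; it does not create or update any ArchiveBox models, and `count_records_by_type` is a hypothetical name.

```python
from collections import Counter


def count_records_by_type(records: list) -> dict:
    """Count records per type, e.g. {'Snapshot': 2, 'Tag': 1}."""
    # Assumption: the record type lives under the 'type' key.
    counts = Counter(record.get("type", "unknown") for record in records)
    return dict(counts)


sample = [
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "Tag", "name": "news"},
    {"type": "Snapshot", "url": "https://example.org"},
    {"type": "Binary", "name": "wget"},
]
print(count_records_by_type(sample))
```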