archivebox.hooks
Hook discovery and execution helpers for ArchiveBox plugins.
ArchiveBox no longer drives plugin execution itself during normal crawls.
abx-dl owns the live runtime and emits typed bus events; ArchiveBox mainly:
- discovers hook files for inspection / docs / legacy direct execution helpers
- executes individual hook scripts when explicitly requested
- parses hook stdout JSONL records into ArchiveBox models when needed
Hook-backed event families are discovered from filenames like:
    on_BinaryRequest__*
    on_CrawlSetup__*
    on_Snapshot__*
Internal bus event names are normalized to the corresponding
on_{EventFamily}__* prefix by a simple string transform. If no scripts exist
for that prefix, discovery returns [].
Directory structure:
abx_plugins/plugins/<plugin_name>/on_
Hook contract:
Input: --url=
Execution order:
- Hooks are named with two-digit prefixes (00-99) and sorted lexicographically by filename
- Foreground hooks run sequentially in that order
- Background hooks (.bg suffix) run concurrently and do not block foreground progress
- After all foreground hooks complete, background hooks receive SIGTERM and must finalize
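The ordering rules can be illustrated with a small standalone sketch. It is not the ArchiveBox implementation: `plan_execution` is a hypothetical helper, and it only assumes what is stated here, namely that hooks sort lexicographically by filename and that background hooks carry `.bg.` in their name.

```python
def plan_execution(hook_names: list[str]) -> tuple[list[str], list[str]]:
    """Split hook filenames into (foreground, background), each in run order.

    Hypothetical helper: sorts lexicographically, then separates names
    containing '.bg.' (background) from the rest (foreground).
    """
    ordered = sorted(hook_names)
    foreground = [name for name in ordered if ".bg." not in name]
    background = [name for name in ordered if ".bg." in name]
    return foreground, background


hooks = [
    "on_Snapshot__50_wget.py",
    "on_Snapshot__10_chrome_tab.daemon.bg.js",
    "on_Snapshot__63_media.finite.bg.py",
    "on_Snapshot__10_title.py",
]
foreground, background = plan_execution(hooks)
print(foreground)
print(background)
```

Because the two-digit prefix is zero-padded, plain string sorting matches numeric order without any parsing.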
Hook naming convention: on_{EventFamily}__{run_order}_{description}[.finite.bg|.daemon.bg].{ext}
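As a rough illustration of the naming convention, the sketch below parses a hook filename with a regular expression. The pattern and the `parse_hook_name` helper are assumptions written for this example; the module itself may parse names differently.

```python
import re
from typing import Optional

# Assumed pattern for on_{EventFamily}__{run_order}_{description}[.finite.bg|.daemon.bg].{ext}
HOOK_NAME_RE = re.compile(
    r"^on_(?P<family>[A-Za-z]+)__"
    r"(?P<order>\d{2})_"
    r"(?P<description>.+?)"
    r"(?:\.(?P<lifetime>finite|daemon)\.bg)?"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_hook_name(filename: str) -> Optional[dict]:
    """Return the parts of a hook filename, or None if it does not match."""
    match = HOOK_NAME_RE.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["order"] = int(parts["order"])  # two-digit prefix as an integer
    return parts


media = parse_hook_name("on_Snapshot__63_media.finite.bg.py")
wget = parse_hook_name("on_Snapshot__50_wget.py")
print(media)
print(wget)
```

A `lifetime` of `None` indicates a foreground hook in this sketch.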
API:
    discover_hooks(event) -> List[Path]    Find hook scripts for a hook-backed event family
    run_hook(script, …) -> Process         Execute a hook script directly
    is_background_hook(name) -> bool       Check if hook is background (.bg suffix)
Module Contents
Classes
HookResult | Raw result from run_hook().
Functions
is_background_hook | Check if a hook is a background hook (doesn’t block foreground progression).
is_finite_background_hook | Check if a background hook is finite-lived and should be awaited.
iter_plugin_dirs | Iterate over all built-in and user plugin directories.
normalize_hook_event_name | Normalize a hook event family or event class name to its on_* prefix.
discover_hooks | Find all hook scripts for an event family.
run_hook | Execute a hook script with the given arguments using Process model.
extract_records_from_process | Extract JSONL records from a Process’s stdout.
collect_urls_from_plugins | Collect all urls.jsonl entries from parser plugin output subdirectories.
get_plugins | Get list of available plugins by discovering plugin directories.
get_plugin_name | Get the base plugin name without numeric prefix.
get_enabled_plugins | Get the list of enabled plugins based on config and available hooks.
discover_plugins_that_provide_interface | Discover plugins that provide a specific Python module with required interface.
get_search_backends | Discover all available search backend plugins.
discover_plugin_configs | Discover all plugin config.json schemas.
get_config_defaults_from_plugins | Get default values for all plugin config options.
get_plugin_special_config | Extract special config keys for a plugin following naming conventions.
get_plugin_template | Get a plugin template by plugin name and template type.
get_plugin_icon | Get the icon for a plugin from its icon.html template.
process_hook_records | Process JSONL records emitted by hook stdout.
Data
API
- archivebox.hooks.is_background_hook(hook_name: str) -> bool
Check if a hook is a background hook (doesn’t block foreground progression).
Background hooks have ‘.bg.’ in their filename before the extension.
Args: hook_name: Hook filename (e.g., ‘on_Snapshot__10_chrome_tab.daemon.bg.js’)
Returns: True if background hook, False if foreground.
Examples:
    is_background_hook('on_Snapshot__10_chrome_tab.daemon.bg.js') -> True
    is_background_hook('on_Snapshot__50_wget.py') -> False
    is_background_hook('on_Snapshot__63_media.finite.bg.py') -> True
- archivebox.hooks.is_finite_background_hook(hook_name: str) -> bool
Check if a background hook is finite-lived and should be awaited.
- archivebox.hooks.iter_plugin_dirs() -> list[pathlib.Path]
Iterate over all built-in and user plugin directories.
- archivebox.hooks.normalize_hook_event_name(event_name: str) -> str | None
Normalize a hook event family or event class name to its on_* prefix.
Examples:
    BinaryRequestEvent -> BinaryRequest
    CrawlSetupEvent -> CrawlSetup
    SnapshotEvent -> Snapshot
    BinaryEvent -> Binary
    CrawlCleanupEvent -> CrawlCleanup
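The "simple string transform" can be sketched as follows. Both function names are hypothetical, and returning None for an empty result is an assumption made only to mirror the `str | None` return type; the real function's conditions for returning None are not documented here.

```python
from typing import Optional


def normalize_event_family(event_name: str) -> Optional[str]:
    """Strip a trailing 'Event' from an event class name (assumed behavior)."""
    name = event_name.strip()
    if name.endswith("Event"):
        name = name[: -len("Event")]
    return name or None  # assumption: nothing left means no hook family


def hook_prefix(event_name: str) -> Optional[str]:
    """Build the on_{EventFamily}__ filename prefix for an event name."""
    family = normalize_event_family(event_name)
    return f"on_{family}__" if family else None


print(normalize_event_family("BinaryRequestEvent"))
print(hook_prefix("CrawlSetupEvent"))
```

Names that already lack the suffix, such as `Snapshot`, pass through unchanged.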
- class archivebox.hooks.HookResult
Bases: typing.TypedDict
Raw result from run_hook().
- archivebox.hooks.discover_hooks(event_name: str, filter_disabled: bool = True, config: dict[str, Any] | None = None) -> list[pathlib.Path]
Find all hook scripts for an event family.
Searches both built-in and user plugin directories. Filters out hooks from disabled plugins by default (respects USE_/SAVE_ flags). Returns scripts sorted alphabetically by filename for deterministic execution order.
Hook naming convention uses numeric prefixes to control order:
    on_Snapshot__10_title.py          # runs first
    on_Snapshot__15_singlefile.py     # runs second
    on_Snapshot__26_readability.py    # runs later (depends on singlefile)
Args:
    event_name: Hook event family or event class name. Examples: 'BinaryRequestEvent', 'Snapshot'. Event names are normalized by stripping a trailing Event. If no matching on_{EventFamily}__* scripts exist, returns [].
    filter_disabled: If True, skip hooks from disabled plugins (default: True)
    config: Optional config dict from get_config() (merges file, env, machine, crawl, snapshot). If None, will call get_config() with global scope.
Returns: Sorted list of hook script paths from enabled plugins only.
Examples:
    # With proper config context (recommended):
    from archivebox.config.configset import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    discover_hooks('Snapshot', config=config)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...] (wget excluded if SAVE_WGET=False)

    # Without config (uses global defaults):
    discover_hooks('Snapshot')
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...]

    # Show all plugins regardless of enabled status:
    discover_hooks('Snapshot', filter_disabled=False)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ..., Path('.../on_Snapshot__50_wget.py')]
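The discovery step can be approximated without ArchiveBox installed. The sketch below globs a temporary directory tree for `on_{EventFamily}__*` files and sorts by filename. `find_hooks` is a hypothetical stand-in that skips the enabled/disabled filtering, and `on_CrawlSetup__10_example.py` is an invented filename used only to show that other families are ignored.

```python
import tempfile
from pathlib import Path


def find_hooks(plugin_dirs: list[Path], family: str) -> list[Path]:
    """Collect on_{family}__* files from each plugin dir, sorted by filename."""
    matches: list[Path] = []
    for plugin_dir in plugin_dirs:
        matches.extend(p for p in plugin_dir.glob(f"on_{family}__*") if p.is_file())
    return sorted(matches, key=lambda p: p.name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = [
        ("wget", "on_Snapshot__50_wget.py"),
        ("title", "on_Snapshot__10_title.py"),
        ("example", "on_CrawlSetup__10_example.py"),  # hypothetical hook
    ]
    for plugin, hook in layout:
        (root / plugin).mkdir()
        (root / plugin / hook).touch()

    plugin_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    snapshot_hooks = [p.name for p in find_hooks(plugin_dirs, "Snapshot")]
    missing_hooks = find_hooks(plugin_dirs, "Nonexistent")

print(snapshot_hooks)
print(missing_hooks)
```

Sorting on `p.name` rather than the full path keeps the order independent of which plugin directory a hook lives in.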
- archivebox.hooks.run_hook(script: pathlib.Path, output_dir: pathlib.Path, config: dict[str, Any], timeout: int | None = None, parent: Optional[archivebox.machine.models.Process] = None, **kwargs: Any) -> archivebox.machine.models.Process
Execute a hook script with the given arguments using Process model.
This is the low-level hook executor that creates a Process record and uses Process.launch() for subprocess management.
Config is passed to hooks via environment variables. Caller MUST use get_config() to merge all sources (file, env, machine, crawl, snapshot).
Args:
    script: Path to the hook script (.sh, .py, or .js)
    output_dir: Working directory for the script (where output files go)
    config: Merged config dict from get_config(crawl=..., snapshot=...) - REQUIRED
    timeout: Maximum execution time in seconds. If None, auto-detects from PLUGINNAME_TIMEOUT config (fallback to TIMEOUT, default 300)
    parent: Optional parent Process (for tracking worker->hook hierarchy)
    **kwargs: Arguments passed to the script as --key=value
Returns: Process model instance (use process.exit_code, process.stdout, process.get_records())
Example:
    from archivebox.config.configset import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    process = run_hook(hook_path, output_dir, config=config, url=url, snapshot_id=id)
    if process.status == 'exited':
        records = process.get_records()  # Get parsed JSONL output
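To make the `--key=value` convention concrete, here is a minimal sketch of turning keyword arguments into command-line arguments. `build_hook_args` is a hypothetical helper; passing key names through unchanged and skipping None values are both assumptions, since this page does not say whether run_hook rewrites keys (for example underscores to hyphens) or how it treats None.

```python
from typing import Any


def build_hook_args(**kwargs: Any) -> list[str]:
    """Render keyword arguments as --key=value strings (assumed format)."""
    return [
        f"--{key}={value}"
        for key, value in kwargs.items()
        if value is not None  # assumption: unset values are omitted
    ]


args = build_hook_args(url="https://example.com", snapshot_id="abc123", depth=None)
print(args)
```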
- archivebox.hooks.extract_records_from_process(process: archivebox.machine.models.Process) -> list[dict[str, Any]]
Extract JSONL records from a Process’s stdout.
Adds plugin metadata to each record.
Args: process: Process model instance with stdout captured
Returns: List of parsed JSONL records with plugin metadata
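A rough sketch of JSONL extraction from captured stdout is shown below. It works on a plain string rather than a Process instance, and the `plugin` key used for the added metadata is an assumption; the real field names are not listed on this page.

```python
import json
from typing import Any


def parse_jsonl(stdout: str, plugin: str) -> list[dict[str, Any]]:
    """Parse JSON-object lines from hook stdout, ignoring everything else."""
    records: list[dict[str, Any]] = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain log output, not a record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line, skip it
        if isinstance(record, dict):
            record.setdefault("plugin", plugin)  # assumed metadata key
            records.append(record)
    return records


sample_stdout = (
    "starting wget\n"
    '{"type": "Snapshot", "url": "https://example.com"}\n'
    "{not json}\n"
    '{"type": "Tag", "name": "news"}\n'
)
records = parse_jsonl(sample_stdout, plugin="wget")
print(records)
```

Skipping unparseable lines lets hooks mix human-readable log output with records on the same stream.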
- archivebox.hooks.collect_urls_from_plugins(snapshot_dir: pathlib.Path) -> list[dict[str, Any]]
Collect all urls.jsonl entries from parser plugin output subdirectories.
Each parser plugin outputs urls.jsonl to its own subdir:
    snapshot_dir/parse_rss_urls/urls.jsonl
    snapshot_dir/parse_html_urls/urls.jsonl
    etc.
This is not special handling - urls.jsonl is just a normal output file. This utility collects them all for the crawl system.
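The collection step amounts to reading every `*/urls.jsonl` file one level below the snapshot directory. The sketch below shows this with a temporary directory; `collect_urls` is a hypothetical stand-in, and the `url` field in the sample entries is an assumed record shape.

```python
import json
import tempfile
from pathlib import Path


def collect_urls(snapshot_dir: Path) -> list[dict]:
    """Read every <subdir>/urls.jsonl under snapshot_dir into one list."""
    entries: list[dict] = []
    for urls_file in sorted(snapshot_dir.glob("*/urls.jsonl")):
        for line in urls_file.read_text().splitlines():
            if line.strip():  # ignore blank lines
                entries.append(json.loads(line))
    return entries


with tempfile.TemporaryDirectory() as tmp:
    snapshot_dir = Path(tmp)
    (snapshot_dir / "parse_rss_urls").mkdir()
    (snapshot_dir / "parse_rss_urls" / "urls.jsonl").write_text(
        '{"url": "https://example.com/feed-item"}\n'
    )
    (snapshot_dir / "parse_html_urls").mkdir()
    (snapshot_dir / "parse_html_urls" / "urls.jsonl").write_text(
        '{"url": "https://example.com/a"}\n\n{"url": "https://example.com/b"}\n'
    )
    collected = [entry["url"] for entry in collect_urls(snapshot_dir)]

print(collected)
```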
- archivebox.hooks.get_plugins() -> list[str]
Get list of available plugins by discovering plugin directories.
Returns plugin directory names for any plugin that exposes hooks, config.json, or a standardized templates/icon.html asset. This includes non-extractor plugins such as binary providers and shared base plugins.
- archivebox.hooks.get_plugin_name(plugin: str) -> str
Get the base plugin name without numeric prefix.
Examples:
    '10_title' -> 'title'
    '26_readability' -> 'readability'
    '50_parse_html_urls' -> 'parse_html_urls'
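The prefix stripping shown in these examples can be reproduced with a single regular expression. `strip_numeric_prefix` is a hypothetical equivalent written for illustration, not the function's actual body.

```python
import re


def strip_numeric_prefix(plugin: str) -> str:
    """Drop a leading '<digits>_' from a plugin name, if present."""
    return re.sub(r"^\d+_", "", plugin)


for name in ("10_title", "26_readability", "50_parse_html_urls", "screenshot"):
    print(name, "->", strip_numeric_prefix(name))
```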
- archivebox.hooks.get_enabled_plugins(config: dict[str, Any] | None = None) -> list[str]
Get the list of enabled plugins based on config and available hooks.
Filters plugins by USE_/SAVE_ flags. Only returns plugins that are enabled.
Args: config: Merged config dict from get_config() - if None, uses global config
Returns: Plugin names sorted alphabetically (numeric prefix controls order).
Example:
    from archivebox.config.configset import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    enabled = get_enabled_plugins(config)  # ['wget', 'media', 'chrome', ...]
- archivebox.hooks.discover_plugins_that_provide_interface(module_name: str, required_attrs: list[str], plugin_prefix: str | None = None) -> dict[str, Any]
Discover plugins that provide a specific Python module with required interface.
This enables dynamic plugin discovery for features like search backends, storage backends, etc. without hardcoding imports.
Args:
    module_name: Name of the module to look for (e.g., 'search')
    required_attrs: List of attributes the module must have (e.g., ['search', 'flush'])
    plugin_prefix: Optional prefix to filter plugins (e.g., 'search_backend_')
Returns: Dict mapping backend names to imported modules. Backend name is derived from plugin directory name minus the prefix, e.g., search_backend_sqlite -> 'sqlite'.
Example:
    backends = discover_plugins_that_provide_interface(
        module_name='search',
        required_attrs=['search', 'flush'],
        plugin_prefix='search_backend_',
    )
    # Returns: {'sqlite': <module>, 'sonic': <module>, 'ripgrep': <module>}
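The two checks described here, required attributes and prefix stripping, can be sketched in isolation. Both helpers are hypothetical, and the in-memory modules stand in for plugin modules that the real function would import from disk.

```python
from types import ModuleType
from typing import Optional


def provides_interface(module: ModuleType, required_attrs: list[str]) -> bool:
    """True if the module exposes every required attribute."""
    return all(hasattr(module, attr) for attr in required_attrs)


def backend_name(plugin_dir_name: str, plugin_prefix: Optional[str]) -> str:
    """Derive the backend name by removing the plugin prefix, if any."""
    if plugin_prefix and plugin_dir_name.startswith(plugin_prefix):
        return plugin_dir_name[len(plugin_prefix):]
    return plugin_dir_name


# Stand-in modules built in memory for illustration only.
complete = ModuleType("search")
complete.search = lambda query: []
complete.flush = lambda snapshot_ids: None

incomplete = ModuleType("search")
incomplete.search = lambda query: []

print(provides_interface(complete, ["search", "flush"]))
print(provides_interface(incomplete, ["search", "flush"]))
print(backend_name("search_backend_sqlite", "search_backend_"))
```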
- archivebox.hooks.get_search_backends() -> dict[str, Any]
Discover all available search backend plugins.
Search backends must provide a search.py module with:
- search(query: str) -> List[str] (returns snapshot IDs)
- flush(snapshot_ids: Iterable[str]) -> None
Returns: Dict mapping backend names to their modules, e.g., {'sqlite': <module>, 'sonic': <module>, 'ripgrep': <module>}
- archivebox.hooks.discover_plugin_configs() -> dict[str, dict[str, Any]]
Discover all plugin config.json schemas.
Each plugin can define a config.json file with JSONSchema defining its configuration options. This function discovers and loads all such schemas.
The config.json files use JSONSchema draft-07 with custom extensions:
- x-fallback: Global config key to use as fallback
- x-aliases: List of old/alternative config key names
Returns: Dict mapping plugin names to their parsed JSONSchema configs, e.g., {'wget': {...schema...}, 'chrome': {...schema...}}
Example config.json:
    {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "SAVE_WGET": {"type": "boolean", "default": true},
            "WGET_TIMEOUT": {"type": "integer", "default": 60, "x-fallback": "TIMEOUT"}
        }
    }
- archivebox.hooks.get_config_defaults_from_plugins() -> dict[str, Any]
Get default values for all plugin config options.
Returns: Dict mapping config keys to their default values. e.g., {‘SAVE_WGET’: True, ‘WGET_TIMEOUT’: 60, …}
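Given schemas shaped like the example config.json, defaults can be gathered by walking each schema's `properties`. The sketch below assumes that shape and uses a hypothetical `defaults_from_schemas` helper; how the real function treats x-fallback or x-aliases is not covered.

```python
from typing import Any


def defaults_from_schemas(schemas: dict[str, dict]) -> dict[str, Any]:
    """Map each config key to its declared default across all plugin schemas."""
    defaults: dict[str, Any] = {}
    for schema in schemas.values():
        for key, prop in schema.get("properties", {}).items():
            if "default" in prop:  # keys without a default are skipped
                defaults[key] = prop["default"]
    return defaults


schemas = {
    "wget": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "SAVE_WGET": {"type": "boolean", "default": True},
            "WGET_TIMEOUT": {"type": "integer", "default": 60, "x-fallback": "TIMEOUT"},
        },
    }
}
plugin_defaults = defaults_from_schemas(schemas)
print(plugin_defaults)
```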
- archivebox.hooks.get_plugin_special_config(plugin_name: str, config: dict[str, Any], _visited: set[str] | None = None) -> dict[str, Any]
Extract special config keys for a plugin following naming conventions.
ArchiveBox recognizes 3 special config key patterns per plugin:
- {PLUGIN}_ENABLED: Enable/disable toggle (default True)
- {PLUGIN}_TIMEOUT: Plugin-specific timeout (fallback to TIMEOUT, default 300)
- {PLUGIN}_BINARY: Primary binary path (default to plugin_name)
These allow ArchiveBox to:
- Skip disabled plugins (optimization)
- Enforce plugin-specific timeouts automatically
- Discover plugin binaries for validation
Args:
    plugin_name: Plugin name (e.g., 'wget', 'media', 'chrome')
    config: Merged config dict from get_config() (properly merges file, env, machine, crawl, snapshot)
Returns: Dict with standardized keys:
    {
        'enabled': True,     # bool
        'timeout': 60,       # int, seconds
        'binary': 'wget',    # str, path or name
    }
Examples:
    >>> from archivebox.config.configset import get_config
    >>> config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    >>> get_plugin_special_config('wget', config)
    {'enabled': True, 'timeout': 120, 'binary': '/usr/bin/wget'}
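The fallback chain for the three key patterns can be sketched with plain dictionary lookups. `special_config` is a hypothetical simplification: it assumes the key prefix is the upper-cased plugin name and ignores whatever the `_visited` parameter is used for in the real function.

```python
from typing import Any


def special_config(plugin_name: str, config: dict[str, Any]) -> dict[str, Any]:
    """Resolve enabled/timeout/binary for a plugin using the documented defaults."""
    prefix = plugin_name.upper()  # assumption: 'wget' -> 'WGET'
    return {
        "enabled": bool(config.get(f"{prefix}_ENABLED", True)),
        # plugin-specific timeout, then global TIMEOUT, then 300
        "timeout": int(config.get(f"{prefix}_TIMEOUT", config.get("TIMEOUT", 300))),
        "binary": str(config.get(f"{prefix}_BINARY", plugin_name)),
    }


wget_cfg = special_config("wget", {"WGET_TIMEOUT": 120, "WGET_BINARY": "/usr/bin/wget"})
media_cfg = special_config("media", {"TIMEOUT": 60, "MEDIA_ENABLED": False})
chrome_cfg = special_config("chrome", {})
print(wget_cfg)
print(media_cfg)
print(chrome_cfg)
```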
- archivebox.hooks.get_plugin_template(plugin: str, template_name: str, fallback: bool = True) -> str | None
Get a plugin template by plugin name and template type.
Args:
    plugin: Plugin name (e.g., 'screenshot', '15_singlefile')
    template_name: One of 'icon', 'card', 'full'
    fallback: If True, return default template if plugin template not found
Returns: Template content as string, or None if not found and fallback=False.
- archivebox.hooks.get_plugin_icon(plugin: str) -> str
Get the icon for a plugin from its icon.html template.
Args: plugin: Plugin name (e.g., ‘screenshot’, ‘15_singlefile’)
Returns: Icon HTML/emoji string.
- archivebox.hooks.process_hook_records(records: list[dict[str, Any]], overrides: dict[str, Any] | None = None) -> dict[str, int]
Process JSONL records emitted by hook stdout.
This handles hook-emitted record types such as Snapshot, Tag, BinaryRequest, and Binary. It does not process internal bus lifecycle events, since those are not emitted as JSONL records by hook subprocesses.
Args:
    records: List of JSONL record dicts from result['records']
    overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.
Returns: Dict with counts by record type
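The shape of the return value can be illustrated by counting records per type. This sketch assumes each record names its type under a `type` key, which is not confirmed on this page, and the hypothetical `count_by_type` does none of the model creation the real function performs.

```python
from collections import Counter
from typing import Any


def count_by_type(records: list[dict[str, Any]]) -> dict[str, int]:
    """Count records by their 'type' field (assumed key), skipping untyped ones."""
    return dict(Counter(r["type"] for r in records if "type" in r))


sample_records = [
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "Tag", "name": "news"},
    {"type": "Tag", "name": "archive"},
    {"note": "no type field"},
]
counts = count_by_type(sample_records)
print(counts)
```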