archivebox.hooks

Hook discovery and execution helpers for ArchiveBox plugins.

ArchiveBox no longer drives plugin execution itself during normal crawls. abx-dl owns the live runtime and emits typed bus events; ArchiveBox mainly:

  • discovers hook files for inspection / docs / legacy direct execution helpers

  • executes individual hook scripts when explicitly requested

  • parses hook stdout JSONL records into ArchiveBox models when needed

Hook-backed event families are discovered from filenames like:
    on_BinaryRequest__*
    on_CrawlSetup__*
    on_Snapshot__*

Internal bus event names are normalized to the corresponding on_{EventFamily}__* prefix by a simple string transform: a trailing Event is stripped, so SnapshotEvent maps to on_Snapshot__*. If no scripts exist for that prefix, discovery returns [].

Directory structure:
    abx_plugins/plugins/<plugin_name>/on_<hook_name>.    (built-in package)
    data/custom_plugins/<plugin_name>/on_<hook_name>.    (user)

Hook contract:
    Input:  --url= (and other --key=value args)
    Output: JSONL records to stdout, files to $PWD
    Exit:   0 = success, non-zero = failure
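The contract above can be illustrated with a toy hook body. This is a minimal sketch, not a hook shipped with ArchiveBox: the function name main and the record shape {"type": "Example", ...} are hypothetical placeholders, and a real script would finish with sys.exit(main(sys.argv[1:])).

```python
import json
import sys


def main(argv: list[str]) -> int:
    """Toy hook: parse --key=value args, emit one JSONL record, return an exit code."""
    args = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            args[key] = value

    url = args.get("url")
    if not url:
        print("missing --url", file=sys.stderr)
        return 1  # non-zero = failure

    # Hypothetical record shape; real hooks emit the record types ArchiveBox expects.
    print(json.dumps({"type": "Example", "url": url}))
    return 0  # 0 = success
```

Any files the hook produces would simply be written relative to the current working directory.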

Execution order:
    - Hooks are named with two-digit prefixes (00-99) and sorted lexicographically by filename
    - Foreground hooks run sequentially in that order
    - Background hooks (.bg suffix) run concurrently and do not block foreground progress
    - After all foreground hooks complete, background hooks receive SIGTERM and must finalize
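The ordering rules can be sketched as a small planning step. The helper plan_execution is hypothetical and only splits names into the two groups; it does not launch anything, and the '.bg.' substring test is borrowed from the is_background_hook description.

```python
def plan_execution(hook_names: list[str]) -> tuple[list[str], list[str]]:
    """Sort hook filenames lexicographically, then split foreground from background."""
    ordered = sorted(hook_names)
    foreground = [name for name in ordered if ".bg." not in name]
    background = [name for name in ordered if ".bg." in name]
    return foreground, background


foreground, background = plan_execution([
    "on_Snapshot__50_wget.py",
    "on_Snapshot__10_chrome_tab.daemon.bg.js",
    "on_Snapshot__10_title.py",
    "on_Snapshot__63_media.finite.bg.py",
])
```

Here the foreground list would be run one script at a time, while the background list would be started without blocking it.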

Hook naming convention:
    on_{EventFamily}__{run_order}_{description}[.finite.bg|.daemon.bg].{ext}
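One way to read the naming convention is as a regular expression. The pattern and the parse_hook_name helper below are illustrative assumptions, not part of archivebox.hooks; in particular, the character classes chosen for the family and extension are guesses.

```python
import re
from typing import Optional

# Hypothetical pattern derived from the naming convention above.
HOOK_NAME_RE = re.compile(
    r"^on_(?P<family>[A-Za-z]+)__"
    r"(?P<order>\d{2})_"
    r"(?P<description>.+?)"
    r"(?:\.(?P<lifetime>finite|daemon)\.bg)?"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_hook_name(name: str) -> Optional[dict]:
    """Split a hook filename into its named parts, or return None if it does not match."""
    match = HOOK_NAME_RE.match(name)
    return match.groupdict() if match else None


media = parse_hook_name("on_Snapshot__63_media.finite.bg.py")
wget = parse_hook_name("on_Snapshot__50_wget.py")
```

A foreground hook such as the wget example simply has no lifetime component.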

API:
    discover_hooks(event) -> List[Path]    Find hook scripts for a hook-backed event family
    run_hook(script, …) -> Process         Execute a hook script directly
    is_background_hook(name) -> bool       Check if hook is background (.bg suffix)

Module Contents

Classes

HookResult

Raw result from run_hook().

Functions

is_background_hook

Check if a hook is a background hook (doesn’t block foreground progression).

is_finite_background_hook

Check if a background hook is finite-lived and should be awaited.

iter_plugin_dirs

Iterate over all built-in and user plugin directories.

normalize_hook_event_name

Normalize a hook event family or event class name to its on_* prefix.

discover_hooks

Find all hook scripts for an event family.

run_hook

Execute a hook script with the given arguments using Process model.

extract_records_from_process

Extract JSONL records from a Process’s stdout.

collect_urls_from_plugins

Collect all urls.jsonl entries from parser plugin output subdirectories.

get_plugins

Get list of available plugins by discovering plugin directories.

get_plugin_name

Get the base plugin name without numeric prefix.

get_enabled_plugins

Get the list of enabled plugins based on config and available hooks.

discover_plugins_that_provide_interface

Discover plugins that provide a specific Python module with required interface.

get_search_backends

Discover all available search backend plugins.

discover_plugin_configs

Discover all plugin config.json schemas.

get_config_defaults_from_plugins

Get default values for all plugin config options.

get_plugin_special_config

Extract special config keys for a plugin following naming conventions.

get_plugin_template

Get a plugin template by plugin name and template type.

get_plugin_icon

Get the icon for a plugin from its icon.html template.

process_hook_records

Process JSONL records emitted by hook stdout.

Data

BUILTIN_PLUGINS_DIR

USER_PLUGINS_DIR

DEFAULT_TEMPLATES

API

archivebox.hooks.BUILTIN_PLUGINS_DIR

‘resolve(…)’

archivebox.hooks.USER_PLUGINS_DIR

‘expanduser(…)’

archivebox.hooks.is_background_hook(hook_name: str) -> bool

Check if a hook is a background hook (doesn’t block foreground progression).

Background hooks have ‘.bg.’ in their filename before the extension.

Args:
    hook_name: Hook filename (e.g., 'on_Snapshot__10_chrome_tab.daemon.bg.js')

Returns:
    True if background hook, False if foreground.

Examples:
    is_background_hook('on_Snapshot__10_chrome_tab.daemon.bg.js') -> True
    is_background_hook('on_Snapshot__50_wget.py') -> False
    is_background_hook('on_Snapshot__63_media.finite.bg.py') -> True

archivebox.hooks.is_finite_background_hook(hook_name: str) -> bool

Check if a background hook is finite-lived and should be awaited.

archivebox.hooks.iter_plugin_dirs() -> list[pathlib.Path]

Iterate over all built-in and user plugin directories.

archivebox.hooks.normalize_hook_event_name(event_name: str) -> str | None

Normalize a hook event family or event class name to its on_* prefix.

Examples:
    BinaryRequestEvent -> BinaryRequest
    CrawlSetupEvent -> CrawlSetup
    SnapshotEvent -> Snapshot
    BinaryEvent -> Binary
    CrawlCleanupEvent -> CrawlCleanup
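The normalization can be sketched as a suffix strip followed by prefix formatting. The helper to_hook_prefix is hypothetical, and returning None for an empty family is an assumption about when the real function yields None.

```python
from typing import Optional


def to_hook_prefix(event_name: str) -> Optional[str]:
    """Strip a trailing 'Event' and build the on_{EventFamily}__ filename prefix."""
    family = event_name.removesuffix("Event")  # requires Python 3.9+
    if not family:
        return None  # assumption: nothing left to build a prefix from
    return f"on_{family}__"
```

Both the class name and the bare family name map to the same prefix, which is why either form can be passed to discovery.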

class archivebox.hooks.HookResult

Bases: typing.TypedDict

Raw result from run_hook().

Fields:
    returncode: int
    stdout: str
    stderr: str
    output_json: dict[str, Any] | None
    output_files: list[dict[str, Any]]
    duration_ms: int
    hook: str
    plugin: str
    hook_name: str
    records: list[dict[str, Any]]

archivebox.hooks.discover_hooks(event_name: str, filter_disabled: bool = True, config: dict[str, Any] | None = None) -> list[pathlib.Path]

Find all hook scripts for an event family.

Searches both built-in and user plugin directories. Filters out hooks from disabled plugins by default (respects USE_/SAVE_ flags). Returns scripts sorted alphabetically by filename for deterministic execution order.

Hook naming convention uses numeric prefixes to control order:
    on_Snapshot__10_title.py         # runs first
    on_Snapshot__15_singlefile.py    # runs second
    on_Snapshot__26_readability.py   # runs later (depends on singlefile)

Args:
    event_name: Hook event family or event class name. Examples: 'BinaryRequestEvent', 'Snapshot'. Event names are normalized by stripping a trailing Event. If no matching on_{EventFamily}__* scripts exist, returns [].
    filter_disabled: If True, skip hooks from disabled plugins (default: True)
    config: Optional config dict from get_config() (merges file, env, machine, crawl, snapshot). If None, will call get_config() with global scope

Returns:
    Sorted list of hook script paths from enabled plugins only.

Examples:
    # With proper config context (recommended):
    from archivebox.config.configset import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    discover_hooks('Snapshot', config=config)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...] (wget excluded if SAVE_WGET=False)

    # Without config (uses global defaults):
    discover_hooks('Snapshot')
    # Returns: [Path('.../on_Snapshot__10_title.py'), ...]

    # Show all plugins regardless of enabled status:
    discover_hooks('Snapshot', filter_disabled=False)
    # Returns: [Path('.../on_Snapshot__10_title.py'), ..., Path('.../on_Snapshot__50_wget.py')]
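The discovery step itself can be approximated with a glob over plugin directories. This is a standalone sketch under stated assumptions: find_hooks is a hypothetical helper, it ignores the enabled/disabled filtering, and it uses a temporary directory in place of the real plugin roots.

```python
import tempfile
from pathlib import Path


def find_hooks(plugin_roots: list[Path], family: str) -> list[Path]:
    """Glob <root>/<plugin>/on_<family>__* and sort by filename, not by full path."""
    matches: list[Path] = []
    for root in plugin_roots:
        matches.extend(p for p in root.glob(f"*/on_{family}__*") if p.is_file())
    return sorted(matches, key=lambda p: p.name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for plugin, hook in [("wget", "on_Snapshot__50_wget.py"), ("title", "on_Snapshot__10_title.py")]:
        (root / plugin).mkdir()
        (root / plugin / hook).touch()
    found = [p.name for p in find_hooks([root], "Snapshot")]
    missing = find_hooks([root], "CrawlSetup")
```

Sorting on the filename rather than the full path is what lets the numeric prefix decide the order across plugins.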

archivebox.hooks.run_hook(script: pathlib.Path, output_dir: pathlib.Path, config: dict[str, Any], timeout: int | None = None, parent: Optional[archivebox.machine.models.Process] = None, **kwargs: Any) -> archivebox.machine.models.Process

Execute a hook script with the given arguments using Process model.

This is the low-level hook executor that creates a Process record and uses Process.launch() for subprocess management.

Config is passed to hooks via environment variables. Caller MUST use get_config() to merge all sources (file, env, machine, crawl, snapshot).

Args:
    script: Path to the hook script (.sh, .py, or .js)
    output_dir: Working directory for the script (where output files go)
    config: Merged config dict from get_config(crawl=…, snapshot=…) - REQUIRED
    timeout: Maximum execution time in seconds. If None, auto-detects from PLUGINNAME_TIMEOUT config (fallback to TIMEOUT, default 300)
    parent: Optional parent Process (for tracking worker->hook hierarchy)
    **kwargs: Arguments passed to the script as --key=value

Returns:
    Process model instance (use process.exit_code, process.stdout, process.get_records())

Example:
    from archivebox.config.configset import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    process = run_hook(hook_path, output_dir, config=config, url=url, snapshot_id=id)
    if process.status == 'exited':
        records = process.get_records()  # Get parsed JSONL output
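The translation of keyword arguments into command-line arguments might look like the sketch below. build_hook_args is a hypothetical helper, and it assumes keys are passed through unchanged and None values are dropped; the real run_hook may rewrite key names or serialize values differently.

```python
def build_hook_args(**kwargs: object) -> list[str]:
    """Render keyword arguments as --key=value strings, skipping None values."""
    # Assumption: keys are used verbatim (no underscore/hyphen rewriting).
    return [f"--{key}={value}" for key, value in kwargs.items() if value is not None]


cli_args = build_hook_args(url="https://example.com", snapshot_id="abc123", depth=None)
```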

archivebox.hooks.extract_records_from_process(process: archivebox.machine.models.Process) -> list[dict[str, Any]]

Extract JSONL records from a Process’s stdout.

Adds plugin metadata to each record.

Args: process: Process model instance with stdout captured

Returns: List of parsed JSONL records with plugin metadata
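Parsing JSONL out of captured stdout can be sketched without a Process model. The helper parse_jsonl works on a plain string, and the metadata key "plugin" is an assumption about what "plugin metadata" means here.

```python
import json
from typing import Any


def parse_jsonl(stdout: str, plugin: str) -> list[dict[str, Any]]:
    """Keep lines that parse as JSON objects and tag each with the plugin name."""
    records: list[dict[str, Any]] = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and plain log output
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that only look like JSON
        if isinstance(record, dict):
            record.setdefault("plugin", plugin)  # hypothetical metadata key
            records.append(record)
    return records


sample_stdout = 'starting wget\n{"type": "Snapshot", "url": "https://example.com"}\n{broken\n'
parsed = parse_jsonl(sample_stdout, plugin="wget")
```

Tolerating non-JSON lines matters because hooks may mix log output with records on stdout.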

archivebox.hooks.collect_urls_from_plugins(snapshot_dir: pathlib.Path) -> list[dict[str, Any]]

Collect all urls.jsonl entries from parser plugin output subdirectories.

Each parser plugin outputs urls.jsonl to its own subdir:
    snapshot_dir/parse_rss_urls/urls.jsonl
    snapshot_dir/parse_html_urls/urls.jsonl
    etc.

This is not special handling - urls.jsonl is just a normal output file. This utility collects them all for the crawl system.
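Collecting those files amounts to a one-level glob. The sketch below is a standalone approximation: collect_urls is a hypothetical name, and the {"url": ...} entry shape is only an example of what a parser plugin might write.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def collect_urls(snapshot_dir: Path) -> list[dict[str, Any]]:
    """Read every <snapshot_dir>/<plugin>/urls.jsonl and return the parsed lines."""
    entries: list[dict[str, Any]] = []
    for urls_file in sorted(snapshot_dir.glob("*/urls.jsonl")):
        for line in urls_file.read_text().splitlines():
            if line.strip():
                entries.append(json.loads(line))
    return entries


with tempfile.TemporaryDirectory() as tmp:
    snapshot_dir = Path(tmp)
    samples = [
        ("parse_rss_urls", "https://example.com/feed"),
        ("parse_html_urls", "https://example.com/page"),
    ]
    for plugin, url in samples:
        (snapshot_dir / plugin).mkdir()
        (snapshot_dir / plugin / "urls.jsonl").write_text(json.dumps({"url": url}) + "\n")
    collected = collect_urls(snapshot_dir)
```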

archivebox.hooks.get_plugins() -> list[str]

Get list of available plugins by discovering plugin directories.

Returns plugin directory names for any plugin that exposes hooks, config.json, or a standardized templates/icon.html asset. This includes non-extractor plugins such as binary providers and shared base plugins.

archivebox.hooks.get_plugin_name(plugin: str) -> str

Get the base plugin name without numeric prefix.

Examples:
    '10_title' -> 'title'
    '26_readability' -> 'readability'
    '50_parse_html_urls' -> 'parse_html_urls'
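The prefix stripping shown in these examples can be reproduced with a single substitution. strip_order_prefix is a hypothetical stand-in, and leaving unprefixed names untouched is an assumption.

```python
import re


def strip_order_prefix(plugin: str) -> str:
    """Remove a leading '<digits>_' ordering prefix, if present."""
    return re.sub(r"^\d+_", "", plugin)
```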

archivebox.hooks.get_enabled_plugins(config: dict[str, Any] | None = None) -> list[str]

Get the list of enabled plugins based on config and available hooks.

Filters plugins by USE_/SAVE_ flags. Only returns plugins that are enabled.

Args: config: Merged config dict from get_config() - if None, uses global config

Returns: Plugin names sorted alphabetically (numeric prefix controls order).

Example:
    from archivebox.config.configset import get_config
    config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    enabled = get_enabled_plugins(config)  # ['wget', 'media', 'chrome', ...]

archivebox.hooks.discover_plugins_that_provide_interface(module_name: str, required_attrs: list[str], plugin_prefix: str | None = None) -> dict[str, Any]

Discover plugins that provide a specific Python module with required interface.

This enables dynamic plugin discovery for features like search backends, storage backends, etc. without hardcoding imports.

Args:
    module_name: Name of the module to look for (e.g., 'search')
    required_attrs: List of attributes the module must have (e.g., ['search', 'flush'])
    plugin_prefix: Optional prefix to filter plugins (e.g., 'search_backend_')

Returns:
    Dict mapping backend names to imported modules. Backend name is derived from plugin directory name minus the prefix.
    e.g., search_backend_sqlite -> 'sqlite'

Example:
    backends = discover_plugins_that_provide_interface(
        module_name='search',
        required_attrs=['search', 'flush'],
        plugin_prefix='search_backend_',
    )
    # Returns: {'sqlite': , 'sonic': , 'ripgrep': }
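The filtering logic (prefix match, attribute check, name derivation) can be sketched independently of any importing. select_providers is a hypothetical helper that takes already-loaded objects; SimpleNamespace instances stand in for imported plugin modules.

```python
from types import SimpleNamespace
from typing import Any, Optional


def select_providers(
    candidates: dict[str, Any],
    required_attrs: list[str],
    plugin_prefix: Optional[str] = None,
) -> dict[str, Any]:
    """Keep candidates that match the prefix and expose every required attribute."""
    providers: dict[str, Any] = {}
    for plugin_name, module in candidates.items():
        if plugin_prefix and not plugin_name.startswith(plugin_prefix):
            continue
        if not all(hasattr(module, attr) for attr in required_attrs):
            continue
        backend = plugin_name[len(plugin_prefix):] if plugin_prefix else plugin_name
        providers[backend] = module
    return providers


# Stand-ins for imported modules (hypothetical).
complete = SimpleNamespace(search=lambda query: [], flush=lambda ids: None)
incomplete = SimpleNamespace(search=lambda query: [])
selected = select_providers(
    {"search_backend_sqlite": complete, "search_backend_broken": incomplete, "wget": complete},
    required_attrs=["search", "flush"],
    plugin_prefix="search_backend_",
)
```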

archivebox.hooks.get_search_backends() -> dict[str, Any]

Discover all available search backend plugins.

Search backends must provide a search.py module with:
    - search(query: str) -> List[str] (returns snapshot IDs)
    - flush(snapshot_ids: Iterable[str]) -> None

Returns: Dict mapping backend names to their modules. e.g., {‘sqlite’: , ‘sonic’: , ‘ripgrep’: }

archivebox.hooks.discover_plugin_configs() -> dict[str, dict[str, Any]]

Discover all plugin config.json schemas.

Each plugin can define a config.json file with JSONSchema defining its configuration options. This function discovers and loads all such schemas.

The config.json files use JSONSchema draft-07 with custom extensions:
    - x-fallback: Global config key to use as fallback
    - x-aliases: List of old/alternative config key names

Returns:
    Dict mapping plugin names to their parsed JSONSchema configs.
    e.g., {'wget': {…schema…}, 'chrome': {…schema…}}

Example config.json:
    {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "SAVE_WGET": {"type": "boolean", "default": true},
            "WGET_TIMEOUT": {"type": "integer", "default": 60, "x-fallback": "TIMEOUT"}
        }
    }

archivebox.hooks.get_config_defaults_from_plugins() -> dict[str, Any]

Get default values for all plugin config options.

Returns: Dict mapping config keys to their default values. e.g., {‘SAVE_WGET’: True, ‘WGET_TIMEOUT’: 60, …}
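Flattening schema defaults into one dict can be sketched as a walk over each schema's properties. defaults_from_schemas is a hypothetical helper that takes a dict shaped like the discover_plugin_configs() result; it ignores x-fallback and x-aliases entirely.

```python
from typing import Any


def defaults_from_schemas(schemas: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Collect the 'default' of every property across all plugin schemas."""
    defaults: dict[str, Any] = {}
    for schema in schemas.values():
        for key, spec in schema.get("properties", {}).items():
            if "default" in spec:
                defaults[key] = spec["default"]
    return defaults


example_schemas = {
    "wget": {
        "type": "object",
        "properties": {
            "SAVE_WGET": {"type": "boolean", "default": True},
            "WGET_TIMEOUT": {"type": "integer", "default": 60, "x-fallback": "TIMEOUT"},
            "WGET_ARGS": {"type": "array"},  # no default, so it is skipped
        },
    },
}
plugin_defaults = defaults_from_schemas(example_schemas)
```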

archivebox.hooks.get_plugin_special_config(plugin_name: str, config: dict[str, Any], _visited: set[str] | None = None) -> dict[str, Any]

Extract special config keys for a plugin following naming conventions.

ArchiveBox recognizes 3 special config key patterns per plugin:
    - {PLUGIN}_ENABLED: Enable/disable toggle (default True)
    - {PLUGIN}_TIMEOUT: Plugin-specific timeout (fallback to TIMEOUT, default 300)
    - {PLUGIN}_BINARY: Primary binary path (default to plugin_name)

These allow ArchiveBox to:
    - Skip disabled plugins (optimization)
    - Enforce plugin-specific timeouts automatically
    - Discover plugin binaries for validation

Args:
    plugin_name: Plugin name (e.g., 'wget', 'media', 'chrome')
    config: Merged config dict from get_config() (properly merges file, env, machine, crawl, snapshot)

Returns:
    Dict with standardized keys:
    {
        'enabled': True,     # bool
        'timeout': 60,       # int, seconds
        'binary': 'wget',    # str, path or name
    }

Examples:
    >>> from archivebox.config.configset import get_config
    >>> config = get_config(crawl=my_crawl, snapshot=my_snapshot)
    >>> get_plugin_special_config('wget', config)
    {'enabled': True, 'timeout': 120, 'binary': '/usr/bin/wget'}
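The three key patterns and their fallbacks can be sketched as plain dict lookups. special_config is a hypothetical simplification: it assumes config values are already typed (a real bool, int, str) and does not model whatever the _visited parameter is used for.

```python
from typing import Any


def special_config(plugin_name: str, config: dict[str, Any]) -> dict[str, Any]:
    """Resolve enabled/timeout/binary using the fallbacks described above."""
    prefix = plugin_name.upper()
    return {
        "enabled": bool(config.get(f"{prefix}_ENABLED", True)),
        "timeout": int(config.get(f"{prefix}_TIMEOUT", config.get("TIMEOUT", 300))),
        "binary": str(config.get(f"{prefix}_BINARY", plugin_name)),
    }


from_global = special_config("wget", {"TIMEOUT": 120})
from_plugin = special_config(
    "wget",
    {"WGET_ENABLED": False, "WGET_TIMEOUT": 60, "WGET_BINARY": "/usr/bin/wget", "TIMEOUT": 120},
)
```

A plugin-specific key always wins over the global TIMEOUT, which in turn wins over the built-in default of 300.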

archivebox.hooks.DEFAULT_TEMPLATES

None

archivebox.hooks.get_plugin_template(plugin: str, template_name: str, fallback: bool = True) -> str | None

Get a plugin template by plugin name and template type.

Args:
    plugin: Plugin name (e.g., 'screenshot', '15_singlefile')
    template_name: One of 'icon', 'card', 'full'
    fallback: If True, return default template if plugin template not found

Returns: Template content as string, or None if not found and fallback=False.

archivebox.hooks.get_plugin_icon(plugin: str) -> str

Get the icon for a plugin from its icon.html template.

Args: plugin: Plugin name (e.g., ‘screenshot’, ‘15_singlefile’)

Returns: Icon HTML/emoji string.

archivebox.hooks.process_hook_records(records: list[dict[str, Any]], overrides: dict[str, Any] | None = None) -> dict[str, int]

Process JSONL records emitted by hook stdout.

This handles hook-emitted record types such as Snapshot, Tag, BinaryRequest, and Binary. It does not process internal bus lifecycle events, since those are not emitted as JSONL records by hook subprocesses.

Args:
    records: List of JSONL record dicts from result['records']
    overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

Returns: Dict with counts by record type
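The shape of the return value can be illustrated with a simple tally. count_by_type is a hypothetical helper that only counts; it assumes each record carries its type under a "type" key and performs none of the model creation the real function is responsible for.

```python
from collections import Counter
from typing import Any


def count_by_type(records: list[dict[str, Any]]) -> dict[str, int]:
    """Tally records by their 'type' field, ignoring records without one."""
    return dict(Counter(record["type"] for record in records if "type" in record))


tally = count_by_type([
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "Tag", "name": "news"},
    {"type": "Snapshot", "url": "https://example.org"},
    {"note": "no type field"},
])
```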