archivebox.cli.archivebox_extract
archivebox extract [snapshot_ids…] [--plugins=NAMES]
Run plugins on Snapshots. Accepts snapshot IDs as arguments, from stdin, or via JSONL.
Input formats:
- Snapshot UUIDs (one per line)
- JSONL: {"type": "Snapshot", "id": "…", "url": "…"}
- JSONL: {"type": "ArchiveResult", "snapshot_id": "…", "plugin": "…"}
Output (JSONL): {"type": "ArchiveResult", "id": "…", "snapshot_id": "…", "plugin": "…", "status": "…"}
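The accepted input formats can be told apart line by line. The following is a minimal sketch of one way to do that, not ArchiveBox's own parser: the function name `parse_input_line` is hypothetical, and treating every bare UUID as a Snapshot ID is an assumption based on the list of input formats.

```python
from __future__ import annotations

import json
import uuid


def parse_input_line(line: str) -> dict | None:
    """Turn one input line into a record dict, or None if blank or unrecognized.

    Hypothetical helper for illustration only; not part of ArchiveBox.
    """
    text = line.strip()
    if not text:
        return None

    # JSONL records start with an opening brace.
    if text.startswith("{"):
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            return None
        if isinstance(record, dict) and record.get("type") in ("Snapshot", "ArchiveResult"):
            return record
        return None

    # Anything else is assumed to be a bare Snapshot UUID.
    try:
        return {"type": "Snapshot", "id": str(uuid.UUID(text))}
    except ValueError:
        return None
```

Lines that are neither valid JSONL of a known type nor a well-formed UUID are dropped here; a real implementation might report them instead.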
Examples:
# Extract a specific snapshot
archivebox extract 01234567-89ab-cdef-0123-456789abcdef
# Pipe from snapshot command
archivebox snapshot https://example.com | archivebox extract
# Run specific plugins only
archivebox extract --plugins=screenshot,singlefile 01234567-89ab-cdef-0123-456789abcdef
# Chain commands
archivebox crawl https://example.com | archivebox snapshot | archivebox extract
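Because the command writes ArchiveResult records as JSONL, a downstream script can consume its output. The sketch below tallies results per plugin and status; the function name `tally_statuses` is hypothetical, and the status strings in the sample are made up for illustration since the source does not list the real values.

```python
import json
from collections import Counter


def tally_statuses(lines):
    """Count (plugin, status) pairs across ArchiveResult JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Ignore any record that is not an ArchiveResult.
        if record.get("type") != "ArchiveResult":
            continue
        counts[(record.get("plugin"), record.get("status"))] += 1
    return counts


# Sample data only: the ids and status values below are assumptions.
sample = [
    '{"type": "ArchiveResult", "id": "r1", "snapshot_id": "s1", "plugin": "screenshot", "status": "succeeded"}',
    '{"type": "ArchiveResult", "id": "r2", "snapshot_id": "s1", "plugin": "singlefile", "status": "failed"}',
]
for (plugin, status), count in tally_statuses(sample).items():
    print(plugin, status, count)
```

To use it on real output, pass `sys.stdin` as `lines` and pipe `archivebox extract` into the script.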
Module Contents
Functions
- Re-run extraction for a single ArchiveResult by ID.
- Run plugins on Snapshots from input.
- Check if value looks like an ArchiveResult UUID.
- Run plugins on Snapshots, or process existing ArchiveResults by ID.
Data
API
- archivebox.cli.archivebox_extract.process_archiveresult_by_id(archiveresult_id: str) -> int
Re-run extraction for a single ArchiveResult by ID.
ArchiveResults are projected status rows, not queued work items. Re-running a single result means resetting that row and queueing its parent snapshot through the shared crawl runner with the corresponding plugin selected.
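The reset-then-requeue idea described above can be illustrated with a toy in-memory model. This is not ArchiveBox's implementation: `ResultRow`, `ToyRunner`, `rerun_result`, and the `"queued"` status value are all hypothetical stand-ins chosen for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ResultRow:
    """Hypothetical stand-in for an ArchiveResult status row."""
    id: str
    snapshot_id: str
    plugin: str
    status: str


@dataclass
class ToyRunner:
    """Hypothetical stand-in for the shared crawl runner's queue."""
    queue: list = field(default_factory=list)

    def enqueue_snapshot(self, snapshot_id: str, plugins: list[str]) -> None:
        # Work is queued per snapshot, with a plugin selection attached.
        self.queue.append((snapshot_id, tuple(plugins)))


def rerun_result(row: ResultRow, runner: ToyRunner) -> None:
    # Step 1: reset the row so it no longer reports the old outcome.
    row.status = "queued"  # assumed placeholder value, not a documented status
    # Step 2: queue the parent snapshot, restricted to this row's plugin.
    runner.enqueue_snapshot(row.snapshot_id, [row.plugin])
```

The point of the model is that the row itself is never placed on the queue; only its parent snapshot is, with the plugin selection narrowing what gets re-run.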
- archivebox.cli.archivebox_extract.run_plugins(args: tuple, records: list[dict] | None = None, plugins: str = '', wait: bool = True, emit_results: bool = True) -> int
Run plugins on Snapshots from input.
Reads Snapshot IDs or JSONL from args/stdin, runs plugins, outputs JSONL.
Exit codes: 0 on success, 1 on failure.
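A calling script can branch on the exit code. The sketch below wraps the CLI with `subprocess`, assuming an `archivebox` executable is on `PATH`; the helper names `build_extract_command` and `extract_succeeded` are hypothetical, and only the `--plugins=` flag shown in the examples is used.

```python
import subprocess


def build_extract_command(snapshot_ids, plugins=()):
    """Build the argv list for `archivebox extract` (hypothetical helper)."""
    argv = ["archivebox", "extract"]
    if plugins:
        # Plugin names are comma-separated, as in the CLI examples.
        argv.append("--plugins=" + ",".join(plugins))
    argv.extend(snapshot_ids)
    return argv


def extract_succeeded(snapshot_ids, plugins=()):
    """Run the command and report whether it exited with code 0."""
    completed = subprocess.run(build_extract_command(snapshot_ids, plugins))
    return completed.returncode == 0
```

For example, `extract_succeeded(["01234567-89ab-cdef-0123-456789abcdef"], ["screenshot"])` would return `True` only when the command exits with 0.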