archivebox.misc.jsonl
JSONL (JSON Lines) utilities for ArchiveBox.
Provides functions for reading, writing, and processing typed JSONL records. All CLI commands that accept stdin can read both plain URLs and typed JSONL.
CLI Pipeline:
    archivebox crawl URL  -> {"type": "Crawl", "id": "…", "urls": "…", …}
    archivebox snapshot   -> {"type": "Snapshot", "id": "…", "url": "…", …}
    archivebox extract    -> {"type": "ArchiveResult", "id": "…", "snapshot_id": "…", …}
Typed JSONL Format:
    {"type": "Crawl", "id": "…", "urls": "…", "max_depth": 0, …}
    {"type": "Snapshot", "id": "…", "url": "https://example.com", "title": "…", …}
    {"type": "ArchiveResult", "id": "…", "snapshot_id": "…", "plugin": "…", …}
    {"type": "Tag", "name": "…"}
Plain URLs (also supported):
    https://example.com
    https://foo.com
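To make the typed format concrete, the sketch below serializes a few records to JSONL with the standard `json` module and decodes them again. The field names come from the format above; every value (`crawl-1`, `snap-1`, `docs`, and so on) is a made-up placeholder, and the code does not call ArchiveBox itself.

```python
import json

# Hypothetical records: field names follow the typed format above,
# but all values are invented for illustration.
records = [
    {"type": "Crawl", "id": "crawl-1", "urls": "https://example.com", "max_depth": 0},
    {"type": "Snapshot", "id": "snap-1", "url": "https://example.com", "title": "Example"},
    {"type": "Tag", "name": "docs"},
]

# JSONL is one JSON object per line, each terminated by a newline.
jsonl_text = "".join(json.dumps(record) + "\n" for record in records)

# Each line can be decoded on its own, independent of its neighbours.
decoded = [json.loads(line) for line in jsonl_text.splitlines()]

print(jsonl_text, end="")
```

Because each line is a complete JSON document, a consumer can process records one at a time as they arrive on a pipe rather than waiting for the whole stream.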
Module Contents
Functions
Parse a single line of input as either JSONL or plain URL.
Read JSONL or plain URLs from stdin.
Read JSONL or plain URLs from a file.
Read from CLI arguments if provided, otherwise from stdin.
Write a single JSONL record to stdout (or provided stream).
Write multiple JSONL records to stdout (or provided stream).
Data
API
- archivebox.misc.jsonl.parse_line(line: str) -> dict[str, Any] | None
Parse a single line of input as either JSONL or plain URL.
Returns a dict containing at least {'type': '…', 'url': '…'}, or None if the line is invalid.
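The following is a minimal sketch of how a line parser with this contract might behave. It is not the ArchiveBox implementation: the name `parse_line_sketch` is hypothetical, and the choices to skip blank or `#` lines, to require a `type` key, and to treat a bare URL as a `Snapshot` record are all assumptions made for illustration.

```python
import json
from typing import Any, Optional


def parse_line_sketch(line: str) -> Optional[dict[str, Any]]:
    """Hypothetical stand-in for parse_line; not the ArchiveBox implementation."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # assumption: blank lines and comments are ignored

    if line.startswith("{"):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None  # malformed JSON counts as invalid
        # assumption: only JSON objects with a "type" field are accepted
        if isinstance(record, dict) and "type" in record:
            return record
        return None

    if line.startswith(("http://", "https://")):
        # assumption: a plain URL is wrapped as a Snapshot record
        return {"type": "Snapshot", "url": line}

    return None


print(parse_line_sketch("https://example.com"))
print(parse_line_sketch('{"type": "Tag", "name": "docs"}'))
print(parse_line_sketch("{not json"))
```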
- archivebox.misc.jsonl.read_stdin(stream: TextIO | None = None) -> collections.abc.Iterator[dict[str, Any]]
Read JSONL or plain URLs from stdin.
Yields parsed records as dicts. Supports both JSONL format and plain URLs (one per line).
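A stream reader of this shape can be sketched as a generator that falls back to `sys.stdin` when no stream is given. The name `read_stream_sketch` is hypothetical, and the handling of malformed lines (silently skipped) and of non-JSON lines (assumed to be URLs) is an assumption rather than documented behavior.

```python
import io
import json
import sys
from collections.abc import Iterator
from typing import Any, Optional, TextIO


def read_stream_sketch(stream: Optional[TextIO] = None) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for read_stdin; yields one dict per usable line."""
    stream = stream if stream is not None else sys.stdin
    for raw in stream:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # assumption: malformed JSON lines are skipped
            if isinstance(record, dict):
                yield record
        else:
            # assumption: any other non-empty line is a plain URL
            yield {"type": "Snapshot", "url": line}


# An in-memory stream stands in for piped stdin, mixing both input styles.
piped = io.StringIO('https://example.com\n\n{"type": "Tag", "name": "docs"}\n')
records = list(read_stream_sketch(piped))
print(records)
```

Passing an explicit stream, as the `stream` parameter in the signature allows, keeps the function testable without touching the real stdin.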
- archivebox.misc.jsonl.read_file(path: pathlib.Path) -> collections.abc.Iterator[dict[str, Any]]
Read JSONL or plain URLs from a file.
Yields parsed records as dicts.
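Reading from a file follows the same per-line pattern. The sketch below uses a hypothetical `read_file_sketch` and a temporary file named `urls.txt`; both names, and the rule that non-JSON lines are treated as `Snapshot` URLs, are assumptions for illustration only.

```python
import json
import tempfile
from collections.abc import Iterator
from pathlib import Path
from typing import Any


def read_file_sketch(path: Path) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for read_file; not the ArchiveBox implementation."""
    with path.open("r", encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("{"):
                # Sketch only: a malformed JSON line would raise here.
                yield json.loads(line)
            else:
                # assumption: a non-JSON line is a plain URL
                yield {"type": "Snapshot", "url": line}


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "urls.txt"
    source.write_text(
        'https://foo.com\n{"type": "Snapshot", "url": "https://example.com"}\n',
        encoding="utf-8",
    )
    # Consume the generator while the temporary file still exists.
    file_records = list(read_file_sketch(source))

print(file_records)
```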
- archivebox.misc.jsonl.read_args_or_stdin(args: collections.abc.Iterable[str], stream: TextIO | None = None) -> collections.abc.Iterator[dict[str, Any]]
Read from CLI arguments if provided, otherwise from stdin.
Handles both URLs and JSONL from either source.
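The "arguments first, stdin as fallback" rule can be sketched as follows. The helper names `_parse` and `read_args_or_stream_sketch` are hypothetical, as is the assumption that an empty argument list is what triggers the fallback; the real function may decide differently.

```python
import io
import json
import sys
from collections.abc import Iterable, Iterator
from typing import Any, Optional, TextIO


def _parse(text: str) -> Optional[dict[str, Any]]:
    """Hypothetical per-item parser shared by both input sources."""
    text = text.strip()
    if not text:
        return None
    if text.startswith("{"):
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            return None
        return record if isinstance(record, dict) else None
    return {"type": "Snapshot", "url": text}  # assumption: bare URL


def read_args_or_stream_sketch(
    args: Iterable[str], stream: Optional[TextIO] = None
) -> Iterator[dict[str, Any]]:
    """Prefer CLI arguments; read the stream only when no arguments were given."""
    items = list(args)
    source: Iterable[str] = items if items else (stream if stream is not None else sys.stdin)
    for item in source:
        record = _parse(item)
        if record is not None:
            yield record


# Arguments win: the stream is never read when args are present.
from_args = list(
    read_args_or_stream_sketch(["https://example.com"], io.StringIO("https://foo.com\n"))
)
# With no arguments, the stream is used instead.
from_stream = list(read_args_or_stream_sketch([], io.StringIO("https://foo.com\n")))
print(from_args, from_stream)
```

Routing both sources through one parser is what lets arguments and piped input accept the same mix of URLs and JSONL.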