archivebox.misc.jsonl

JSONL (JSON Lines) utilities for ArchiveBox.

Provides functions for reading, writing, and processing typed JSONL records. All CLI commands that accept stdin can read both plain URLs and typed JSONL.

CLI Pipeline:

    archivebox crawl URL -> {"type": "Crawl", "id": "…", "urls": "…", …}
    archivebox snapshot  -> {"type": "Snapshot", "id": "…", "url": "…", …}
    archivebox extract   -> {"type": "ArchiveResult", "id": "…", "snapshot_id": "…", …}

Typed JSONL Format:

    {"type": "Crawl", "id": "…", "urls": "…", "max_depth": 0, …}
    {"type": "Snapshot", "id": "…", "url": "https://example.com", "title": "…", …}
    {"type": "ArchiveResult", "id": "…", "snapshot_id": "…", "plugin": "…", …}
    {"type": "Tag", "name": "…"}

Plain URLs (also supported):

    https://example.com
    https://foo.com

Module Contents

Functions

parse_line

Parse a single line of input as either JSONL or plain URL.

read_stdin

Read JSONL or plain URLs from stdin.

read_file

Read JSONL or plain URLs from a file.

read_args_or_stdin

Read from CLI arguments if provided, otherwise from stdin.

write_record

Write a single JSONL record to stdout (or provided stream).

write_records

Write multiple JSONL records to stdout (or provided stream).

Data

TYPE_SNAPSHOT

TYPE_ARCHIVERESULT

TYPE_TAG

TYPE_CRAWL

TYPE_BINARYREQUEST

TYPE_BINARY

TYPE_PROCESS

TYPE_MACHINE

VALID_TYPES

API

archivebox.misc.jsonl.TYPE_SNAPSHOT

'Snapshot'

archivebox.misc.jsonl.TYPE_ARCHIVERESULT

'ArchiveResult'

archivebox.misc.jsonl.TYPE_TAG

'Tag'

archivebox.misc.jsonl.TYPE_CRAWL

'Crawl'

archivebox.misc.jsonl.TYPE_BINARYREQUEST

'BinaryRequest'

archivebox.misc.jsonl.TYPE_BINARY

'Binary'

archivebox.misc.jsonl.TYPE_PROCESS

'Process'

archivebox.misc.jsonl.TYPE_MACHINE

'Machine'

archivebox.misc.jsonl.VALID_TYPES

None

archivebox.misc.jsonl.parse_line(line: str) -> dict[str, Any] | None

Parse a single line of input as either JSONL or plain URL.

Returns a dict with at minimum {'type': '…', 'url': '…'}, or None if the line is invalid.
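The behaviour described above can be illustrated with a minimal sketch. It is not ArchiveBox's implementation: `sketch_parse_line` is a hypothetical name, and mapping a plain URL to a `Snapshot` record is an assumption made for illustration, as is the rule that only `http://` and `https://` lines count as URLs.

```python
import json
from typing import Any, Optional


def sketch_parse_line(line: str) -> Optional[dict[str, Any]]:
    """Hypothetical stand-in for parse_line; the real rules may differ."""
    line = line.strip()
    if not line:
        # Blank lines carry no record.
        return None

    if line.startswith("{"):
        # Looks like JSONL: accept it only if it decodes to a JSON object.
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        return record if isinstance(record, dict) else None

    if line.startswith(("http://", "https://")):
        # Assumption: a plain URL is wrapped as a Snapshot record.
        return {"type": "Snapshot", "url": line}

    # Anything else is treated as invalid.
    return None
```

Under these assumptions, a typed record passes through unchanged, a plain URL is wrapped in a dict, and malformed input yields `None` rather than raising.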

archivebox.misc.jsonl.read_stdin(stream: TextIO | None = None) -> collections.abc.Iterator[dict[str, Any]]

Read JSONL or plain URLs from stdin.

Yields parsed records as dicts. Supports both JSONL format and plain URLs (one per line).

archivebox.misc.jsonl.read_file(path: pathlib.Path) -> collections.abc.Iterator[dict[str, Any]]

Read JSONL or plain URLs from a file.

Yields parsed records as dicts.

archivebox.misc.jsonl.read_args_or_stdin(args: collections.abc.Iterable[str], stream: TextIO | None = None) -> collections.abc.Iterator[dict[str, Any]]

Read from CLI arguments if provided, otherwise from stdin.

Handles both URLs and JSONL from either source.
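The "arguments first, stdin as fallback" rule might look roughly like the sketch below. The names `_sketch_parse` and `sketch_read_args_or_stdin` are hypothetical, the parsing rules are simplified assumptions, and the real function may handle details (such as interactive terminals) that are not shown here.

```python
import io
import json
import sys
from collections.abc import Iterable, Iterator
from typing import Any, Optional, TextIO


def _sketch_parse(line: str) -> Optional[dict[str, Any]]:
    """Simplified, assumed parsing rules: JSON object or http(s) URL."""
    line = line.strip()
    if line.startswith("{"):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        return record if isinstance(record, dict) else None
    if line.startswith(("http://", "https://")):
        # Assumption: plain URLs become Snapshot records.
        return {"type": "Snapshot", "url": line}
    return None


def sketch_read_args_or_stdin(
    args: Iterable[str], stream: Optional[TextIO] = None
) -> Iterator[dict[str, Any]]:
    """Yield records from args if any were given, else from the stream."""
    arg_list = list(args)
    if arg_list:
        lines: Iterable[str] = arg_list
    else:
        # Fall back to the provided stream, or real stdin if none was given.
        lines = stream if stream is not None else sys.stdin
    for line in lines:
        record = _sketch_parse(line)
        if record is not None:
            yield record


# Usage: an in-memory stream stands in for piped stdin.
fake_stdin = io.StringIO('https://foo.com\n\n{"type": "Tag", "name": "x"}\n')
records = list(sketch_read_args_or_stdin([], stream=fake_stdin))
```

Passing a stream explicitly, as the documented `stream` parameter allows, makes this kind of function easy to exercise without touching the real stdin.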

archivebox.misc.jsonl.write_record(record: dict[str, Any], stream: TextIO | None = None) -> None

Write a single JSONL record to stdout (or provided stream).

archivebox.misc.jsonl.write_records(records: collections.abc.Iterator[dict[str, Any]], stream: TextIO | None = None) -> int

Write multiple JSONL records to stdout (or provided stream).

Returns count of records written.
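As a rough illustration of the writing side, the sketch below serializes one JSON object per line and returns how many were written. `sketch_write_record` and `sketch_write_records` are hypothetical names, and the use of `json.dumps` with default settings is an assumption; the real functions may format or flush output differently.

```python
import io
import json
import sys
from collections.abc import Iterable
from typing import Any, Optional, TextIO


def sketch_write_record(record: dict[str, Any], stream: Optional[TextIO] = None) -> None:
    """Write one record as a single JSON line to stream (default: stdout)."""
    out = stream if stream is not None else sys.stdout
    out.write(json.dumps(record) + "\n")


def sketch_write_records(
    records: Iterable[dict[str, Any]], stream: Optional[TextIO] = None
) -> int:
    """Write each record on its own line and return the number written."""
    count = 0
    for record in records:
        sketch_write_record(record, stream)
        count += 1
    return count


# Usage: capture output in memory instead of printing to stdout.
buffer = io.StringIO()
written = sketch_write_records(
    iter([
        {"type": "Snapshot", "url": "https://example.com"},
        {"type": "Tag", "name": "news"},
    ]),
    stream=buffer,
)
```

Because each record occupies exactly one line, the output of one command can be piped straight into the next stage of the pipeline and read back line by line.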