archivebox.misc.db

Database utility functions for ArchiveBox.

This is a post-bootstrap module: it requires archivebox.config constants and imports Django lazily (from django.db import ... inside functions). It is not safe to import before bootstrap has run.

Module Contents

Functions

run_db_analyze_batch

Advance one step of a batched SQLite ANALYZE sweep.

compact_command

sqlite_lock_holders

log_sqlite_lock_holders

sqlite_lock_error

retry_sqlite_locks

migration_lock

migration_state

Cheaply compare migration files to django_migrations without invoking migrate.

pending_migrations

Return migration files on disk that have not been applied yet.

apply_migrations

Apply pending Django migrations.

Data

HISTORICAL_GHOST_MIGRATIONS

API

archivebox.misc.db.run_db_analyze_batch(remaining: list[str] | None, *, max_seconds_per_table: float = 120.0) → list[str]

Advance one step of a batched SQLite ANALYZE sweep.

Without periodic ANALYZE, the optimizer’s table stats go stale as the snapshot/archiveresult tables grow. The optimizer then starts large joins from auth_user instead of using the indexed url column, which inflates snapshot detail page render time from ~50ms to ~500ms+.

The whole sweep is spread across many calls instead of running as one blocking ANALYZE. Pass None to start a fresh sweep: that call enumerates user tables and runs ANALYZE on the first one. Pass the returned list back on each subsequent call to advance by one more table. An empty return value means the sweep is complete (or has been aborted), and the next caller should pass None again.

The caller is responsible for throttling new sweeps (the orchestrator starts at most one per 24 hours while idle) and for enforcing a hard upper bound on total sweep wall time.

Safety guarantees:

  • Never raises: every database call is wrapped; on any failure the function returns [] (abandoning the rest of the sweep) so the orchestrator never crashes on maintenance errors.

  • Bounded per-call wall time: a SQLite progress handler aborts the current ANALYZE statement once max_seconds_per_table is exceeded, so a single pathological table cannot wedge the call.

  • Never leaves the db locked: each ANALYZE runs as a single-statement transaction that auto-commits (or rolls back on abort/error). The cursor and progress handler are always cleaned up in finally blocks, even if Python raises mid-call.

  • Silent no-op on non-SQLite backends.

WAL journal mode (set in Django settings) keeps readers fully unblocked throughout; the writer lock is only held for the brief sqlite_stat* flush after each table completes.
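The calling contract and the per-table time limit described above can be sketched with the standard library alone. This is a simplified illustration, not the ArchiveBox implementation: analyze_one_table and run_sweep are hypothetical names, the sketch talks to a plain sqlite3 connection rather than Django, it only catches sqlite3.Error, and the toy tables exist purely for the demo.

```python
import sqlite3
import time


def analyze_one_table(conn, remaining, max_seconds_per_table=120.0):
    """Hypothetical stand-in for one sweep step (not ArchiveBox's code)."""
    try:
        if remaining is None:
            # Fresh sweep: enumerate user tables, skipping SQLite's internal ones.
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            ).fetchall()
            remaining = [name for (name,) in rows if not name.startswith("sqlite_")]
        if not remaining:
            return []
        table, rest = remaining[0], remaining[1:]
        deadline = time.monotonic() + max_seconds_per_table
        # A non-zero return from the progress handler aborts the running statement.
        conn.set_progress_handler(
            lambda: 1 if time.monotonic() > deadline else 0, 1000
        )
        try:
            quoted = table.replace('"', '""')
            conn.execute(f'ANALYZE "{quoted}"')
        finally:
            # Always remove the handler, even if ANALYZE was aborted.
            conn.set_progress_handler(None, 0)
        return rest
    except sqlite3.Error:
        # Any database failure abandons the rest of the sweep.
        return []


def run_sweep(conn, max_total_seconds=600.0):
    """Drive a whole sweep, enforcing the caller-side bound on total wall time."""
    started = time.monotonic()
    remaining = analyze_one_table(conn, None)  # None starts a fresh sweep
    steps = 1
    while remaining and time.monotonic() - started < max_total_seconds:
        remaining = analyze_one_table(conn, remaining)
        steps += 1
    return steps


# Toy database for the demo; the table names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY, url TEXT)")
conn.execute("CREATE INDEX snapshot_url ON snapshot (url)")
conn.execute("CREATE TABLE archiveresult (id INTEGER PRIMARY KEY, snapshot_id INTEGER)")
conn.execute("CREATE INDEX archiveresult_snapshot ON archiveresult (snapshot_id)")
conn.executemany(
    "INSERT INTO snapshot (url) VALUES (?)",
    [(f"https://example.com/{i}",) for i in range(50)],
)
conn.executemany(
    "INSERT INTO archiveresult (snapshot_id) VALUES (?)",
    [(i,) for i in range(50)],
)
conn.commit()

steps = run_sweep(conn)
stat_rows = conn.execute("SELECT COUNT(*) FROM sqlite_stat1").fetchone()[0]
```

In a real orchestrator the loop body would presumably run once per idle tick rather than back to back, which is what keeps each individual call short.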

archivebox.misc.db.compact_command(cmdline: list[str] | None, fallback: str = '') → str
archivebox.misc.db.sqlite_lock_holders(db_path: pathlib.Path = CONSTANTS.DATABASE_FILE) → list[str]
archivebox.misc.db.log_sqlite_lock_holders(console: Any, *, db_path: pathlib.Path = CONSTANTS.DATABASE_FILE, limit: int = 8) → None
archivebox.misc.db.sqlite_lock_error(error: BaseException) → bool
archivebox.misc.db.retry_sqlite_locks(action: collections.abc.Callable[[], Any], *, label: str, stderr: TextIO | None = None) → Any
archivebox.misc.db.migration_lock(stdout: TextIO | None = None)
archivebox.misc.db.HISTORICAL_GHOST_MIGRATIONS: frozenset[tuple[str, str]]

‘frozenset(…)’

archivebox.misc.db.migration_state(out_dir: pathlib.Path = CONSTANTS.DATA_DIR) → tuple[list[str], list[str], dict[str, str]]

Cheaply compare migration files to django_migrations without invoking migrate.

archivebox.misc.db.pending_migrations(out_dir: pathlib.Path = CONSTANTS.DATA_DIR) → list[str]

Return migration files on disk that have not been applied yet.
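The general idea of comparing migration files on disk against the django_migrations table, without running migrate, might look roughly like the sketch below. It is an illustration under stated assumptions, not the module's actual logic: pending_from_disk is a hypothetical helper, the `<app>/migrations/NNNN_name.py` layout and the "app.name" output format are assumed, and the directory name is assumed to equal the Django app label.

```python
import sqlite3
import tempfile
from pathlib import Path


def pending_from_disk(apps_root: Path, db: sqlite3.Connection) -> list[str]:
    """Hypothetical helper: list migration files with no row in django_migrations."""
    # django_migrations stores one (app, name) row per applied migration.
    applied = set(db.execute("SELECT app, name FROM django_migrations"))
    # Assumed layout: <apps_root>/<app>/migrations/0001_something.py
    on_disk = sorted(
        (path.parent.parent.name, path.stem)
        for path in apps_root.glob("*/migrations/[0-9]*.py")
    )
    return [f"{app}.{name}" for app, name in on_disk if (app, name) not in applied]


# Demo with a throwaway directory and an in-memory database.
with tempfile.TemporaryDirectory() as tmp:
    migrations_dir = Path(tmp) / "core" / "migrations"
    migrations_dir.mkdir(parents=True)
    for filename in ("__init__.py", "0001_initial.py", "0002_add_field.py"):
        (migrations_dir / filename).touch()

    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE django_migrations "
        "(id INTEGER PRIMARY KEY, app TEXT, name TEXT, applied TEXT)"
    )
    db.execute(
        "INSERT INTO django_migrations (app, name, applied) "
        "VALUES ('core', '0001_initial', '2024-01-01')"
    )
    pending = pending_from_disk(Path(tmp), db)
```

A plain file listing plus one SELECT avoids loading Django's migration graph, which is presumably what makes this kind of check cheap.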

archivebox.misc.db.apply_migrations(out_dir: pathlib.Path = CONSTANTS.DATA_DIR, stdout: TextIO | None = None, stderr: TextIO | None = None, verbosity: int = 1) → list[str]

Apply pending Django migrations.