Configuration

ArchiveBox can be configured with the archivebox config command, by editing the ArchiveBox.conf file in the data folder, or by setting environment variables. All three methods work the same way under Docker.

Equivalent ways to set the same configuration option:

archivebox config --set CHROME_BINARY=google-chrome-stable
# OR
echo "CHROME_BINARY=google-chrome-stable" >> ArchiveBox.conf
# OR
env CHROME_BINARY=google-chrome-stable archivebox add ~/Downloads/bookmarks_export.html

Environment variables take precedence over the config file, which is useful if you only want to use a certain option temporarily during a single run. For more examples see Usage: Configuration…
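The precedence rule above can be pictured as a layered lookup. The following is a minimal Python sketch, not ArchiveBox's actual config loader; the helper name `resolve_option` and the sample dictionaries are made up for illustration.

```python
from collections import ChainMap

def resolve_option(name, environ, config_file, defaults):
    """Return the effective value of `name`.

    Lookup order follows the rule described above:
    environment variable first, then config file, then built-in default.
    (Hypothetical helper, not part of ArchiveBox.)
    """
    layers = ChainMap(environ, config_file, defaults)
    return layers[name]

defaults = {"CHROME_BINARY": "chromium", "TIMEOUT": "60"}
config_file = {"CHROME_BINARY": "google-chrome-stable"}  # e.g. read from ArchiveBox.conf
environ = {"TIMEOUT": "120"}                             # e.g. `env TIMEOUT=120 archivebox add ...`

print(resolve_option("CHROME_BINARY", environ, config_file, defaults))
print(resolve_option("TIMEOUT", environ, config_file, defaults))
```

Because the environment layer is consulted first, an override lasts only as long as the variable is set, which is what makes it suitable for one-off runs.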


Available Configuration Options:



In case this document is ever out of date, check the source code for config definitions: archivebox/config/common.py ➡️

General Settings

General options around the archiving process, output format, and timing.


ONLY_NEW

Possible Values: [True]/False Toggle whether or not to attempt rechecking old links when adding new ones, or leave old incomplete links alone and only archive the new links.

By default, ArchiveBox will only archive new links on each import. If you want it to go back through all links in the index and download any missing files on every run, set this to False.

Note: Regardless of how this is set, ArchiveBox will never re-download sites that have already succeeded previously. When this is False, it only attempts to fix previous pages that have missing archive extractor outputs; it does not re-archive pages that have already been successfully archived.


OVERWRITE

Possible Values: [False]/True When set to True, ArchiveBox will re-archive URLs even if they have already been successfully archived before, overwriting any existing output.


TIMEOUT

Possible Values: [60]/120/… Maximum allowed download time per archive method for each link in seconds. If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.

Note: Do not set this to anything less than 5 seconds as it will cause Chrome to hang indefinitely and many sites to fail completely.


MAX_URL_ATTEMPTS

Possible Values: [50]/100/… Maximum number of times ArchiveBox will attempt to archive a URL before giving up. Useful for handling transient failures.


RESOLUTION

Possible Values: [1440,2000]/1024,768/… Default screenshot/PDF resolution in pixels width,height. Used as the fallback for SCREENSHOT_RESOLUTION, PDF_RESOLUTION, and CHROME_RESOLUTION.
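As a rough illustration of the width,height format and the fallback described above, here is a small Python sketch. The function names are hypothetical and the real parsing in ArchiveBox may differ.

```python
def parse_resolution(value):
    """Parse a 'width,height' string such as '1440,2000' into two ints."""
    width, height = (int(part.strip()) for part in value.split(","))
    return width, height

def effective_resolution(config, specific_key):
    """Use the specific option if it is set, otherwise fall back to RESOLUTION."""
    raw = config.get(specific_key) or config["RESOLUTION"]
    return parse_resolution(raw)

config = {"RESOLUTION": "1440,2000", "PDF_RESOLUTION": "1024,768"}
print(effective_resolution(config, "PDF_RESOLUTION"))         # explicitly set
print(effective_resolution(config, "SCREENSHOT_RESOLUTION"))  # falls back to RESOLUTION
```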


CHECK_SSL_VALIDITY

Possible Values: [True]/False Whether to enforce HTTPS certificate and HSTS chain of trust when archiving sites. Set this to False if you want to archive pages even if they have expired or invalid certificates. Be aware that when False you cannot guarantee that you have not been man-in-the-middle’d while archiving content, so the content cannot be verified to be what’s on the original site.


USER_AGENT

Possible Values: [Mozilla/5.0 ... ArchiveBox/{VERSION} ...]/"Mozilla/5.0 ..."/… The default user agent string used during archiving. Individual extractors (wget, Chrome, curl, etc.) can override this with their own *_USER_AGENT settings, or fall back to this value.


COOKIES_FILE

Possible Values: [None]//path/to/cookies.txt/…

Cookies file to pass to wget, curl, yt-dlp, and other extractors that don’t use Chrome (with its CHROME_USER_DATA_DIR) for authentication. To capture sites that require a login, point this option at a Netscape-format cookies.txt file containing all the cookies you want to use during archiving.

You can generate this cookies.txt file by using a number of different browser extensions that can export your cookies in this format, or by using wget on the command line with --save-cookies + --user=... --password=....

Alternatively, you can create a persona and import cookies directly from your browser profile:

archivebox persona create --import=chrome personal

[!WARNING] Make sure you use separate burner credentials dedicated to archiving, e.g. don’t re-use your normal daily Facebook/Instagram/Youtube/etc. account cookies as server responses often contain your name/email/PII, session tokens, etc. which then get preserved in your snapshots!

Related options: CHROME_USER_DATA_DIR, DEFAULT_PERSONA


DEFAULT_PERSONA

Possible Values: [Default]/personal/work/… The persona profile to use by default when archiving. Personas allow you to have separate sets of cookies, Chrome profiles, and user agent strings for different archiving contexts.


URL_DENYLIST

Possible Values: [\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$]/.+\.exe$/…

A regular expression used to exclude certain URLs from archiving.

Related options: URL_ALLOWLIST, SAVE_ALLOWLIST, SAVE_DENYLIST
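To see which URLs the default URL_DENYLIST pattern would catch, it can be tried out with Python's `re` module. This sketch assumes the pattern is applied with search-style matching anywhere in the URL; `is_denied` is an illustrative helper, not an ArchiveBox function.

```python
import re

# Default URL_DENYLIST pattern as listed above.
URL_DENYLIST = re.compile(
    r"\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$"
)

def is_denied(url):
    """True if the URL matches the denylist (assumes search-style matching)."""
    return URL_DENYLIST.search(url) is not None

for url in [
    "https://example.com/static/site.css",
    "https://example.com/app.js?v=3",
    "https://example.com/data.json",
    "https://example.com/article.html",
]:
    print(url, "->", "skip" if is_denied(url) else "archive")
```

The trailing `(\?.*)?$` lets the pattern match assets with a query string, while the `$` anchor keeps `.json` from being mistaken for `.js`.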


URL_ALLOWLIST

Possible Values: [None]/^http(s)?:\/\/(.+)?example\.com\/?.*$/…

A regular expression that restricts archiving to matching URLs; any URL that doesn’t match the pattern is excluded. Useful for recursive crawling within a single domain.
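It can help to test an allowlist pattern before relying on it. The sketch below checks the example value from above with Python's `re` module, again assuming search-style matching. The `STRICTER` pattern is only a suggestion for comparison and does not come from ArchiveBox.

```python
import re

# Example URL_ALLOWLIST value from above.
URL_ALLOWLIST = re.compile(r"^http(s)?:\/\/(.+)?example\.com\/?.*$")

# A tighter alternative (an assumption for illustration): only example.com
# itself or its subdomains, followed by an optional path.
STRICTER = re.compile(r"^https?://([^/]+\.)?example\.com(/.*)?$")

def is_allowed(url, pattern=URL_ALLOWLIST):
    """True if the URL matches the given allowlist pattern."""
    return pattern.search(url) is not None

print(is_allowed("https://example.com/about"))       # intended match
print(is_allowed("http://blog.example.com/post/1"))  # subdomain
print(is_allowed("https://other.org/"))              # different domain
# The optional (.+)? group is loose: any host ending in "example.com" passes.
print(is_allowed("https://notexample.com/"))
print(is_allowed("https://notexample.com/", STRICTER))
```

If a crawl must stay strictly within one domain, the looser example pattern may let through more than expected, so a short check like this is worthwhile.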


SAVE_ALLOWLIST

Possible Values: [{}]/{".*example\\.com.*": ["screenshot", "pdf"]}/… A JSON dictionary mapping URL regex patterns to lists of archive methods. Only the specified methods will be used for URLs matching each pattern.


SAVE_DENYLIST

Possible Values: [{}]/{".*\\.pdf$": ["screenshot", "dom"]}/… A JSON dictionary mapping URL regex patterns to lists of archive methods to skip.
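The two dictionaries can be read as per-URL filters over the list of archive methods. The following Python sketch uses the example values from SAVE_ALLOWLIST and SAVE_DENYLIST above. How ArchiveBox combines several matching patterns is not specified here, so the intersection/subtraction logic, the `methods_for` helper, and the `ALL_METHODS` list are all assumptions.

```python
import json
import re

# Example values from above, written as JSON text.
SAVE_ALLOWLIST = json.loads(r'{".*example\\.com.*": ["screenshot", "pdf"]}')
SAVE_DENYLIST = json.loads(r'{".*\\.pdf$": ["screenshot", "dom"]}')

ALL_METHODS = ["title", "screenshot", "pdf", "dom", "wget"]  # illustrative subset

def methods_for(url):
    """Pick archive methods for a URL (assumed semantics, see the note above)."""
    methods = list(ALL_METHODS)
    for pattern, allowed in SAVE_ALLOWLIST.items():
        if re.search(pattern, url):
            methods = [m for m in methods if m in allowed]
    for pattern, denied in SAVE_DENYLIST.items():
        if re.search(pattern, url):
            methods = [m for m in methods if m not in denied]
    return methods

print(methods_for("https://example.com/page"))
print(methods_for("https://files.org/report.pdf"))
print(methods_for("https://other.org/"))
```

Note the doubled backslash in the JSON text: `\\.` is needed so that the decoded regex contains `\.`.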


TAG_SEPARATOR_PATTERN

Possible Values: [[,]]/[,;]/… Regex pattern used to split tag strings into individual tags.
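The effect of the separator pattern can be previewed with `re.split`. This is a small Python sketch; `split_tags` is a hypothetical helper, and the whitespace trimming is an assumption rather than documented behavior.

```python
import re

def split_tags(raw, pattern=r"[,]"):
    """Split a tag string on the separator pattern and drop empty entries."""
    return [tag.strip() for tag in re.split(pattern, raw) if tag.strip()]

print(split_tags("news, python,archiving"))                   # default: commas only
print(split_tags("news; python,archiving"))                   # ';' is not a separator by default
print(split_tags("news; python,archiving", pattern=r"[,;]"))  # commas and semicolons
```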


Server Settings

Options for the web UI, authentication, and reverse proxy configuration.


ADMIN_USERNAME / ADMIN_PASSWORD

Possible Values: [None]/"admin"/…

Only used on first run / initial setup in Docker. ArchiveBox will create an admin user with the specified username and password when these options are found in the environment.


PUBLIC_INDEX / PUBLIC_SNAPSHOTS / PUBLIC_ADD_VIEW

Possible Values: [True]/False Configure whether or not login is required to use each area of ArchiveBox.

archivebox config --set PUBLIC_INDEX=True        # True = snapshot list is viewable without login
archivebox config --set PUBLIC_SNAPSHOTS=True    # True = snapshot content is viewable without login
archivebox config --set PUBLIC_ADD_VIEW=False    # False = login is required to submit new URLs

SECRET_KEY

Possible Values: auto-generated random string Django’s secret key for cryptographic signing (sessions, CSRF tokens, etc.). Automatically generated on first run.


BIND_ADDR

Possible Values: [127.0.0.1:8000]/0.0.0.0:8000/… Address and port for the ArchiveBox web server to listen on.


LISTEN_HOST

Possible Values: [archivebox.localhost:8000]/archive.example.com:443/… The public hostname and port that ArchiveBox is accessible at.


ALLOWED_HOSTS

Possible Values: [*]/archive.example.com,localhost/… Comma-separated list of allowed HTTP Host header values. Set this to your domain name(s) in production.


CSRF_TRUSTED_ORIGINS

Possible Values: [http://admin.archivebox.localhost:8000]/https://archive.example.com/… Comma-separated list of trusted origins for CSRF validation. Must include the scheme (http/https).
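A common mistake is to list a bare hostname where an origin with a scheme is required. The check below is a minimal Python sketch of that rule; `check_origins` is an illustrative helper, not something ArchiveBox or Django provides.

```python
from urllib.parse import urlsplit

def check_origins(raw):
    """Split a comma-separated origin list and reject entries without http/https."""
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    for origin in origins:
        parts = urlsplit(origin)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"origin must include a scheme: {origin!r}")
    return origins

print(check_origins("https://archive.example.com, http://admin.archivebox.localhost:8000"))

try:
    check_origins("archive.example.com")
except ValueError as err:
    print("rejected:", err)
```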


ADMIN_BASE_URL

Possible Values: [""]//admin//… Base URL path for the Django admin interface.


ARCHIVE_BASE_URL

Possible Values: [""]//archive//… Base URL path for serving archived content.


SNAPSHOTS_PER_PAGE

Possible Values: [40]/100/… Maximum number of Snapshots to show per page on Snapshot list pages.


PREVIEW_ORIGINALS

Possible Values: [True]/False Whether to show inline previews of the original URL on snapshot detail pages.



CUSTOM_TEMPLATES_DIR

Possible Values: [data/custom_templates]//path/to/custom_templates/… Path to a directory containing custom html/css/images for overriding the default UI styling.


REVERSE_PROXY_USER_HEADER

Possible Values: [Remote-User]/X-Remote-User/… HTTP header containing user name from authenticated proxy.

Related options: REVERSE_PROXY_WHITELIST, LOGOUT_REDIRECT_URL


REVERSE_PROXY_WHITELIST

Possible Values: [<empty string>]/172.16.0.0/16/… Comma separated list of IP CIDRs which are allowed to use reverse proxy authentication.
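To reason about which proxies a given whitelist would trust, the CIDR matching can be reproduced with Python's standard `ipaddress` module. This is a sketch under the assumption that the value is a plain comma-separated list; `is_trusted_proxy` is a made-up name.

```python
import ipaddress

def is_trusted_proxy(client_ip, whitelist):
    """True if client_ip falls inside any CIDR in the comma-separated whitelist."""
    address = ipaddress.ip_address(client_ip)
    networks = [
        ipaddress.ip_network(cidr.strip())
        for cidr in whitelist.split(",")
        if cidr.strip()
    ]
    return any(address in network for network in networks)

print(is_trusted_proxy("172.16.4.20", "172.16.0.0/16"))
print(is_trusted_proxy("10.0.0.5", "172.16.0.0/16, 192.168.1.0/24"))
print(is_trusted_proxy("10.0.0.5", ""))  # empty whitelist trusts nobody
```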


LOGOUT_REDIRECT_URL

Possible Values: [/]/https://example.com/some/other/app/… URL to redirect users back to on logout when using reverse proxy authentication.


LDAP Settings

Options for LDAP/Active Directory authentication. Requires pip install archivebox[ldap].


LDAP_ENABLED

Possible Values: [False]/True Whether to use an external LDAP server for authentication.

pip install archivebox[ldap]

Then set these configuration values:

LDAP_ENABLED: True
LDAP_SERVER_URI: "ldap://ldap.example.com:3389"
LDAP_BIND_DN: "ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_BIND_PASSWORD: "secret-bind-user-password"
LDAP_USER_BASE: "ou=users,ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_USER_FILTER: "(uid=%(user)s)"
LDAP_USERNAME_ATTR: "username"
LDAP_FIRSTNAME_ATTR: "givenName"
LDAP_LASTNAME_ATTR: "sn"
LDAP_EMAIL_ATTR: "mail"
LDAP_CREATE_SUPERUSER: False


LDAP_SERVER_URI

Default: [None] LDAP server URI (e.g. ldap://ldap.example.com:389).


LDAP_BIND_DN

Default: [None] DN to bind for searching.


LDAP_BIND_PASSWORD

Default: [None] Password for bind DN.


LDAP_USER_BASE

Default: [None] Base DN for user searches.


LDAP_USER_FILTER

Default: [(uid=%(user)s)] LDAP search filter for users.
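The `%(user)s` placeholder in the filter uses Python's named string formatting. A minimal sketch of the substitution follows; `render_filter` is a hypothetical helper, and a real LDAP client should also escape special characters in the username, which this example skips.

```python
LDAP_USER_FILTER = "(uid=%(user)s)"

def render_filter(template, username):
    """Substitute the login name into the %(user)s placeholder."""
    return template % {"user": username}

print(render_filter(LDAP_USER_FILTER, "alice"))
```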


LDAP_USERNAME_ATTR

Default: [username] LDAP attribute for username.


LDAP_FIRSTNAME_ATTR

Default: [givenName] LDAP attribute for first name.


LDAP_LASTNAME_ATTR

Default: [sn] LDAP attribute for last name.


LDAP_EMAIL_ATTR

Default: [mail] LDAP attribute for email.


LDAP_CREATE_SUPERUSER

Default: [False] Auto-create superuser accounts for LDAP users.


Storage Settings

Options for file layout, permissions, and temp/lib directories.


OUTPUT_PERMISSIONS

Possible Values: [644]/755/… Permissions to set output files to.

Related options: PUID / PGID
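The permission values are octal file modes. As a quick way to see what a value such as 644 or 755 grants, the Python sketch below renders it in `ls -l` style; `describe_mode` is an illustrative helper only.

```python
import stat

def describe_mode(value):
    """Interpret a permission string such as '644' as octal and render it ls-style."""
    mode = int(str(value), 8)
    return stat.filemode(stat.S_IFREG | mode)

print(describe_mode("644"))  # owner read/write, group and others read-only
print(describe_mode("755"))  # additionally executable for everyone
```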


PUID / PGID

Possible Values: [911]/1000/… Note: Only applicable for Docker users, settable via environment variables only. User and Group ID that the data directory should be owned by.


RESTRICT_FILE_NAMES

Possible Values: [windows]/unix/ascii/… Restrict output filenames to be compatible with the given filesystem type.


ENFORCE_ATOMIC_WRITES

Possible Values: [True]/False Whether to use atomic writes when saving files.
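For readers unfamiliar with the term, an atomic write usually means writing to a temporary file and then renaming it over the target, so a crash never leaves a half-written file behind. The Python sketch below shows that general technique; it is not ArchiveBox's implementation, and `atomic_write` is a made-up name.

```python
import os
import tempfile

def atomic_write(path, data):
    """Write text to `path` so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "index.json")
    atomic_write(target, '{"status": "ok"}')
    with open(target, encoding="utf-8") as handle:
        print(handle.read())
```

Some network or FUSE filesystems do not support this rename pattern well, which is the usual reason for an option that turns it off.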


TMP_DIR

Possible Values: [data/tmp/<machine_id>]//tmp/archivebox/abc5d851/… Path for temporary files, unix sockets, and supervisor config. Must be a local, fast, short-path directory.


LIB_DIR

Possible Values: [data/lib/<arch>-<os>]//usr/local/share/archivebox/abc5/… Path for installed binary dependencies.


LIB_BIN_DIR

Possible Values: [LIB_DIR/bin] Path where installed binaries are symlinked for easy PATH management.


Search Settings

Options for full-text search backend configuration.


USE_INDEXING_BACKEND

Possible Values: [True]/False Enable the search indexing backend.


USE_SEARCHING_BACKEND

Possible Values: [True]/False Enable the search querying backend.


SEARCH_BACKEND_ENGINE

Possible Values: [ripgrep]/sqlite/sonic Which search backend engine to use. ripgrep (default) requires no setup. sqlite uses FTS5. sonic requires a running Sonic instance.


SEARCH_PROCESS_HTML

Possible Values: [True]/False Whether to strip HTML tags before indexing content for search.


Shell Options

Options around the format of the CLI output.


DEBUG

Possible Values: [False]/True Enable debug mode. Automatically set to True if --debug is passed on the command line.


IS_TTY

Possible Values: auto-detected Whether stdout is a TTY (interactive terminal).


USE_COLOR

Possible Values: [True]/False Colorize console output. Defaults to True if stdin is a TTY.


SHOW_PROGRESS

Possible Values: [True]/False Show real-time progress bar in console output. Defaults to True if stdin is a TTY.


IN_DOCKER

Possible Values: [False]/True Whether ArchiveBox is running inside a Docker container.


IN_QEMU

Possible Values: [False]/True Whether ArchiveBox is running inside QEMU emulation.


Plugin Settings

ArchiveBox uses a plugin system where each extractor defines its own configuration via config.json files. All plugin config options can be set the same way as core options: via environment variables, ArchiveBox.conf, or archivebox config --set.

archivebox config                              # see all available config options
archivebox config --set SCREENSHOT_TIMEOUT=120  # set a plugin option

For the full list of plugins and their config schemas, see the abx-plugins repository.

Title Settings

TITLE_ENABLED

Default: [True] Enable title extraction


TITLE_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for title extraction in seconds

Favicon Settings

FAVICON_ENABLED

Default: [True] Enable favicon downloading


FAVICON_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for favicon fetch in seconds


FAVICON_USER_AGENT

Default: [""] (falls back to USER_AGENT) User agent string

Wget Settings

WGET_ARGS

Default: [see defaults] Default wget arguments


WGET_ARGS_EXTRA

Default: [[]] Extra arguments to append to wget command


WGET_BINARY

Default: [wget] Path to wget binary


WGET_CHECK_SSL_VALIDITY

Default: [True] (falls back to CHECK_SSL_VALIDITY) Whether to verify SSL certificates


WGET_COOKIES_FILE

Default: [""] (falls back to COOKIES_FILE) Path to cookies file


WGET_ENABLED

Default: [True] Enable wget archiving


WGET_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for wget in seconds


WGET_USER_AGENT

Default: [""] (falls back to USER_AGENT) User agent string for wget


WGET_WARC_ENABLED

Default: [True] Save WARC archive file

Screenshot Settings

SCREENSHOT_ENABLED

Default: [True] Enable screenshot capture


SCREENSHOT_RESOLUTION

Default: [1440,2000] (falls back to RESOLUTION) Screenshot resolution (width,height)


SCREENSHOT_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for screenshot capture in seconds

PDF Settings

PDF_ENABLED

Default: [True] Enable PDF generation


PDF_RESOLUTION

Default: [1440,2000] (falls back to RESOLUTION) PDF page resolution (width,height)


PDF_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for PDF generation in seconds

DOM Settings

DOM_ENABLED

Default: [True] Enable DOM capture


DOM_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for DOM capture in seconds

SingleFile Settings

SINGLEFILE_ARGS

Default: [['--browser-headless']] Default single-file arguments


SINGLEFILE_ARGS_EXTRA

Default: [[]] Extra arguments to append to single-file command


SINGLEFILE_BINARY

Default: [single-file] Path to single-file binary


SINGLEFILE_CHECK_SSL_VALIDITY

Default: [True] (falls back to CHECK_SSL_VALIDITY) Whether to verify SSL certificates


SINGLEFILE_CHROME_ARGS

Default: [[]] (falls back to CHROME_ARGS) Chrome command-line arguments for SingleFile

SINGLEFILE_COOKIES_FILE

Default: [""] (falls back to COOKIES_FILE) Path to cookies file


SINGLEFILE_ENABLED

Default: [True] Enable SingleFile archiving

SINGLEFILE_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for SingleFile in seconds


SINGLEFILE_USER_AGENT

Default: [""] (falls back to USER_AGENT) User agent string

Readability Settings

READABILITY_ARGS

Default: [[]] Default Readability arguments


READABILITY_ARGS_EXTRA

Default: [[]] Extra arguments to append to Readability command


READABILITY_BINARY

Default: [readability-extractor] Path to readability-extractor binary


READABILITY_ENABLED

Default: [True] Enable Readability text extraction


READABILITY_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for Readability in seconds

Mercury Settings

MERCURY_ARGS

Default: [[]] Default Mercury parser arguments


MERCURY_ARGS_EXTRA

Default: [[]] Extra arguments to append to Mercury parser command


MERCURY_BINARY

Default: [postlight-parser] Path to Mercury/Postlight parser binary


MERCURY_ENABLED

Default: [True] Enable Mercury text extraction


MERCURY_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for Mercury in seconds

Defuddle Settings

DEFUDDLE_ARGS

Default: [[]] Default Defuddle arguments


DEFUDDLE_ARGS_EXTRA

Default: [[]] Extra arguments to append to Defuddle command


DEFUDDLE_BINARY

Default: [defuddle] Path to defuddle binary


DEFUDDLE_ENABLED

Default: [True] Enable Defuddle text extraction


DEFUDDLE_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for Defuddle in seconds

HTML to Text Settings

HTMLTOTEXT_ENABLED

Default: [True] Enable HTML to text conversion


HTMLTOTEXT_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for HTML to text conversion in seconds

Trafilatura Settings

TRAFILATURA_BINARY

Default: [trafilatura] Path to trafilatura binary


TRAFILATURA_ENABLED

Default: [True] Enable Trafilatura extraction


TRAFILATURA_OUTPUT_CSV

Default: [False] Write CSV output (content.csv)


TRAFILATURA_OUTPUT_HTML

Default: [True] Write HTML output (content.html)


TRAFILATURA_OUTPUT_JSON

Default: [False] Write JSON output (content.json)


TRAFILATURA_OUTPUT_MARKDOWN

Default: [True] Write markdown output (content.md)


TRAFILATURA_OUTPUT_TXT

Default: [True] Write plain text output (content.txt)


TRAFILATURA_OUTPUT_XML

Default: [False] Write XML output (content.xml)


TRAFILATURA_OUTPUT_XMLTEI

Default: [False] Write XML TEI output (content.xmltei)


TRAFILATURA_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for Trafilatura in seconds

Git Settings

GIT_ARGS

Default: [['clone', '--depth=1', '--recursive']] Default git arguments


GIT_ARGS_EXTRA

Default: [[]] Extra arguments to append to git command


GIT_BINARY

Default: [git] Path to git binary


GIT_DOMAINS

Default: [see defaults] Comma-separated list of domains to treat as git repositories


GIT_ENABLED

Default: [True] Enable git repository cloning


GIT_TIMEOUT

Default: [120] (falls back to TIMEOUT) Timeout for git operations in seconds

yt-dlp Settings

YTDLP_ARGS

Default: [see defaults] Default yt-dlp arguments


YTDLP_ARGS_EXTRA

Default: [[]] Extra arguments to append to yt-dlp command


YTDLP_BINARY

Default: [yt-dlp] Path to yt-dlp binary


YTDLP_CHECK_SSL_VALIDITY

Default: [True] (falls back to CHECK_SSL_VALIDITY) Whether to verify SSL certificates


YTDLP_COOKIES_FILE

Default: [""] (falls back to COOKIES_FILE) Path to cookies file


YTDLP_ENABLED

Default: [True] Enable video/audio downloading with yt-dlp


YTDLP_MAX_SIZE

Default: [750m] Maximum file size for yt-dlp downloads

YTDLP_TIMEOUT

Default: [3600] (falls back to TIMEOUT) Timeout for yt-dlp downloads in seconds

forum-dl Settings

FORUMDL_ARGS

Default: [[]] Default forum-dl arguments


FORUMDL_ARGS_EXTRA

Default: [[]] Extra arguments to append to forum-dl command


FORUMDL_BINARY

Default: [forum-dl] Path to forum-dl binary


FORUMDL_ENABLED

Default: [True] Enable forum downloading with forum-dl


FORUMDL_OUTPUT_FORMAT

Default: [jsonl] Output format for forum downloads


FORUMDL_TIMEOUT

Default: [3600] (falls back to TIMEOUT) Timeout for forum downloads in seconds

papers-dl Settings

PAPERSDL_ARGS

Default: [['fetch']] Default papers-dl arguments


PAPERSDL_ARGS_EXTRA

Default: [[]] Extra arguments to append to papers-dl command


PAPERSDL_BINARY

Default: [papers-dl] Path to papers-dl binary


PAPERSDL_ENABLED

Default: [True] Enable paper downloading with papers-dl


PAPERSDL_TIMEOUT

Default: [300] (falls back to TIMEOUT) Timeout for paper downloads in seconds

Archive.org Settings

ARCHIVEDOTORG_ENABLED

Default: [True] Submit URLs to archive.org Wayback Machine


ARCHIVEDOTORG_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for archive.org submission in seconds


ARCHIVEDOTORG_USER_AGENT

Default: [""] (falls back to USER_AGENT) User agent string

Chrome Settings

CHROME_ARGS

Default: [see defaults] Default Chrome command-line arguments (static flags only, dynamic args like --user-data-dir are added at runtime)


CHROME_ARGS_EXTRA

Default: [[]] Extra arguments to append to Chrome command (for user customization)


CHROME_BINARY

Default: [chromium] Path to Chromium binary


CHROME_CHECK_SSL_VALIDITY

Default: [True] (falls back to CHECK_SSL_VALIDITY) Whether to verify SSL certificates (disable for self-signed certs)


CHROME_DELAY_AFTER_LOAD

Default: [0] Extra delay in seconds after page load completes before archiving (useful for JS-heavy SPAs)


CHROME_ENABLED

Default: [True] Enable Chromium browser integration for archiving


CHROME_HEADLESS

Default: [True] Run Chrome in headless mode

CHROME_PAGELOAD_TIMEOUT

Default: [60] (falls back to CHROME_TIMEOUT) Timeout for page navigation/load in seconds


CHROME_RESOLUTION

Default: [1440,2000] (falls back to RESOLUTION) Browser viewport resolution (width,height)


CHROME_SANDBOX

Default: [True] Enable Chrome sandbox (disable in Docker with --no-sandbox)


CHROME_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for Chrome operations in seconds
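The Chrome timeouts form a two-step fallback chain: CHROME_PAGELOAD_TIMEOUT falls back to CHROME_TIMEOUT, which falls back to TIMEOUT. The Python sketch below illustrates that chain under the assumption that a fallback applies only when the more specific option is unset; `FALLBACKS` and `resolve_timeout` are names invented for this example.

```python
# Fallback relationships as described in this section.
FALLBACKS = {
    "CHROME_PAGELOAD_TIMEOUT": "CHROME_TIMEOUT",
    "CHROME_TIMEOUT": "TIMEOUT",
}

def resolve_timeout(name, config):
    """Follow the fallback chain until an explicitly set value is found."""
    while name not in config:
        name = FALLBACKS[name]  # raises KeyError if the chain ends without a value
    return int(config[name])

print(resolve_timeout("CHROME_PAGELOAD_TIMEOUT", {"TIMEOUT": 90}))
print(resolve_timeout("CHROME_PAGELOAD_TIMEOUT", {"TIMEOUT": 90, "CHROME_TIMEOUT": 30}))
```

In practice this means raising TIMEOUT alone is often enough, and the more specific options are only needed to make one extractor stricter or more lenient than the rest.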


CHROME_USER_AGENT

Default: [""] (falls back to USER_AGENT) User agent string for Chrome


CHROME_USER_DATA_DIR

Default: [""] Path to Chrome user data directory for persistent sessions (derived from ACTIVE_PERSONA if not set)


CHROME_WAIT_FOR

Default: [networkidle2] Page load completion condition (domcontentloaded, load, networkidle0, networkidle2)

DNS Settings

DNS_ENABLED

Default: [True] Enable DNS traffic recording during page load


DNS_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for DNS recording in seconds

SSL Settings

SSL_ENABLED

Default: [True] Enable SSL certificate capture


SSL_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for SSL capture in seconds

Headers Settings

HEADERS_ENABLED

Default: [True] Enable HTTP headers capture


HEADERS_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for headers capture in seconds

Redirects Settings

REDIRECTS_ENABLED

Default: [True] Enable redirect chain capture


REDIRECTS_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for redirect capture in seconds

Responses Settings

RESPONSES_ENABLED

Default: [True] Enable HTTP response capture


RESPONSES_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for response capture in seconds

Console Log Settings

CONSOLELOG_ENABLED

Default: [True] Enable console log capture


CONSOLELOG_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for console log capture in seconds

Accessibility Settings

ACCESSIBILITY_ENABLED

Default: [True] Enable accessibility tree capture


ACCESSIBILITY_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for accessibility capture in seconds

SEO Settings

SEO_ENABLED

Default: [True] Enable SEO metadata capture


SEO_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for SEO capture in seconds

Hashes Settings

HASHES_ENABLED

Default: [True] Enable merkle tree hash generation


HASHES_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for merkle tree generation in seconds

Static File Settings

STATICFILE_ENABLED

Default: [True] Enable static file detection


STATICFILE_TIMEOUT

Default: [30] (falls back to TIMEOUT) Timeout for static file detection in seconds

uBlock Origin Settings

UBLOCK_ENABLED

Default: [True] Enable uBlock Origin browser extension for ad blocking

I Still Don’t Care About Cookies Settings

ISTILLDONTCAREABOUTCOOKIES_ENABLED

Default: [True] Enable I Still Don’t Care About Cookies browser extension

2captcha Settings

TWOCAPTCHA_API_KEY

Default: [""] 2captcha API key for CAPTCHA solving service (get from https://2captcha.com)


TWOCAPTCHA_AUTO_SUBMIT

Default: [False] Automatically submit forms after CAPTCHA is solved


TWOCAPTCHA_ENABLED

Default: [True] Enable 2captcha browser extension for automatic CAPTCHA solving


TWOCAPTCHA_RETRY_COUNT

Default: [3] Number of times to retry CAPTCHA solving on error


TWOCAPTCHA_RETRY_DELAY

Default: [5] Delay in seconds between CAPTCHA solving retries


TWOCAPTCHA_TIMEOUT

Default: [60] (falls back to TIMEOUT) Timeout for CAPTCHA solving in seconds

Infinite Scroll Settings

INFINISCROLL_ENABLED

Default: [True] Enable infinite scroll page expansion


INFINISCROLL_EXPAND_DETAILS

Default: [True] Expand <details> elements and click ‘load more’ buttons for comments


INFINISCROLL_MIN_HEIGHT

Default: [16000] Minimum page height to scroll to in pixels


INFINISCROLL_SCROLL_DELAY

Default: [2000] Delay between scrolls in milliseconds


INFINISCROLL_SCROLL_DISTANCE

Default: [1600] Distance to scroll per step in pixels


INFINISCROLL_SCROLL_LIMIT

Default: [10] Maximum number of scroll steps


INFINISCROLL_TIMEOUT

Default: [120] (falls back to TIMEOUT) Maximum timeout for scrolling in seconds
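The scroll defaults are easier to judge with a little arithmetic. The Python sketch below only multiplies the documented defaults; how the plugin actually combines these options (for example, whether it stops at the minimum height or at the scroll limit) is not specified here, and `scroll_plan` is a made-up helper.

```python
def scroll_plan(distance_px=1600, limit=10, delay_ms=2000):
    """Return (total pixels scrolled, seconds spent waiting) for a full run of steps."""
    total_pixels = distance_px * limit
    total_wait_seconds = (delay_ms * limit) / 1000
    return total_pixels, total_wait_seconds

pixels, seconds = scroll_plan()
print(pixels)   # compare with INFINISCROLL_MIN_HEIGHT
print(seconds)  # compare with INFINISCROLL_TIMEOUT
```

With the defaults, ten steps of 1600 pixels cover 16000 pixels, which equals the default INFINISCROLL_MIN_HEIGHT, and the 20 seconds of scroll delay fit well inside the 120-second timeout.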

HTML URL Parser Settings

PARSE_HTML_URLS_ENABLED

Default: [True] Enable HTML URL parsing

JSONL URL Parser Settings

PARSE_JSONL_URLS_ENABLED

Default: [True] Enable JSON Lines URL parsing

Netscape URL Parser Settings

PARSE_NETSCAPE_URLS_ENABLED

Default: [True] Enable Netscape bookmarks HTML URL parsing

Text URL Parser Settings

PARSE_TXT_URLS_ENABLED

Default: [True] Enable plain text URL parsing

RSS URL Parser Settings

PARSE_RSS_URLS_ENABLED

Default: [True] Enable RSS/Atom feed URL parsing

Claude Code Settings

ANTHROPIC_API_KEY

Default: [""] Anthropic API key for Claude Code authentication


CLAUDECODE_BINARY

Default: [claude] Path to Claude Code CLI binary


CLAUDECODE_ENABLED

Default: [False] Enable Claude Code AI agent integration. Controls whether the claudecode plugin participates in crawl-time extraction; child plugins still need the claudecode plugin installed and a working Claude binary.


CLAUDECODE_MAX_TURNS

Default: [10] Maximum number of agentic turns per invocation


CLAUDECODE_MODEL

Default: [sonnet] Claude model to use (e.g. sonnet, opus, haiku)


CLAUDECODE_TIMEOUT

Default: [120] (falls back to TIMEOUT) Timeout for Claude Code operations in seconds

Claude Chrome Settings

CLAUDECHROME_ENABLED

Default: [False] Enable Claude for Chrome browser extension for AI-driven page interaction


CLAUDECHROME_MAX_ACTIONS

Default: [15] Maximum number of agentic loop iterations (screenshots + actions) per page


CLAUDECHROME_MODEL

Default: [sonnet] Claude model to use (e.g. sonnet, opus, haiku). Availability depends on your plan.


CLAUDECHROME_PROMPT

Default: [see defaults] Prompt for Claude to execute on the page. Claude can click buttons, fill forms, download files, and interact with any page element.


CLAUDECHROME_TIMEOUT

Default: [120] (falls back to TIMEOUT) Timeout for Claude for Chrome operations in seconds

Claude Code Extract Settings

CLAUDECODEEXTRACT_ENABLED

Default: [False] Enable Claude Code AI extraction


CLAUDECODEEXTRACT_MAX_TURNS

Default: [10] (falls back to CLAUDECODE_MAX_TURNS) Maximum number of agentic turns for extraction


CLAUDECODEEXTRACT_MODEL

Default: [sonnet] (falls back to CLAUDECODE_MODEL) Claude model to use for extraction (e.g. sonnet, opus, haiku)


CLAUDECODEEXTRACT_PROMPT

Default: [see defaults] Custom prompt for Claude Code extraction. Use this to define what Claude should extract or generate from the snapshot.


CLAUDECODEEXTRACT_TIMEOUT

Default: [120] (falls back to CLAUDECODE_TIMEOUT) Timeout for Claude Code extraction in seconds

Claude Code Cleanup Settings

CLAUDECODECLEANUP_ENABLED

Default: [False] Enable Claude Code AI cleanup of snapshot files


CLAUDECODECLEANUP_MAX_TURNS

Default: [15] (falls back to CLAUDECODE_MAX_TURNS) Maximum number of agentic turns for cleanup


CLAUDECODECLEANUP_MODEL

Default: [sonnet] (falls back to CLAUDECODE_MODEL) Claude model to use for cleanup (e.g. sonnet, opus, haiku)


CLAUDECODECLEANUP_PROMPT

Default: [see defaults] Custom prompt for Claude Code cleanup. Defines what Claude should clean up and how to determine which duplicates to keep.


CLAUDECODECLEANUP_TIMEOUT

Default: [120] (falls back to CLAUDECODE_TIMEOUT) Timeout for Claude Code cleanup in seconds

Ripgrep Search Settings

RIPGREP_ARGS

Default: [['--files-with-matches', '--no-messages', '--ignore-case']] Default ripgrep arguments


RIPGREP_ARGS_EXTRA

Default: [[]] Extra arguments to append to ripgrep command


RIPGREP_BINARY

Default: [rg] Path to ripgrep binary


RIPGREP_TIMEOUT

Default: [90] (falls back to TIMEOUT) Search timeout in seconds

Sonic Search Settings

SEARCH_BACKEND_SONIC_BUCKET

Default: [snapshots] Sonic bucket name


SEARCH_BACKEND_SONIC_COLLECTION

Default: [archivebox] Sonic collection name


SEARCH_BACKEND_SONIC_HOST_NAME

Default: [127.0.0.1] Sonic server hostname


SEARCH_BACKEND_SONIC_PASSWORD

Default: [SecretPassword] Sonic server password


SEARCH_BACKEND_SONIC_PORT

Default: [1491] Sonic server port

SQLite FTS Search Settings

SEARCH_BACKEND_SQLITE_DB

Default: [search.sqlite3] SQLite FTS database filename


SEARCH_BACKEND_SQLITE_SEPARATE_DATABASE

Default: [True] Use separate database file for FTS index


SEARCH_BACKEND_SQLITE_TOKENIZERS

Default: [porter unicode61 remove_diacritics 2] FTS5 tokenizer configuration
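To get a feel for what the default tokenizer string does, it can be tried against an in-memory SQLite database. The Python sketch below assumes the local SQLite build includes FTS5 and supports `remove_diacritics 2`; the table and column names are invented and do not reflect ArchiveBox's schema.

```python
import sqlite3

TOKENIZERS = "porter unicode61 remove_diacritics 2"

conn = sqlite3.connect(":memory:")
# Table and column names are made up for this sketch.
conn.execute(
    f"CREATE VIRTUAL TABLE snapshot_fts USING fts5(body, tokenize = '{TOKENIZERS}')"
)
conn.executemany(
    "INSERT INTO snapshot_fts (body) VALUES (?)",
    [("Archived pages from a café blog",), ("Unrelated text",)],
)

def search(term):
    """Return the bodies of all rows matching a full-text query."""
    rows = conn.execute(
        "SELECT body FROM snapshot_fts WHERE snapshot_fts MATCH ?", (term,)
    )
    return [row[0] for row in rows]

print(search("archiving"))  # porter stemming relates "archiving" to "Archived"
print(search("cafe"))       # diacritics are removed, so "cafe" finds "café"
```

Here `porter` adds English stemming on top of `unicode61`, which handles Unicode case folding and, with `remove_diacritics 2`, strips accents before indexing.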