Configuration

▶️ The default ArchiveBox config file can be found here: etc/ArchiveBox.conf.default.

Configuration is done through environment variables. You can pass in settings using all the usual environment variable methods: e.g. by using the env command, exporting variables in your shell profile, or sourcing a .env file before running the command.

Example of passing configuration using the env command:

env CHROME_BINARY=google-chrome-stable RESOLUTION=1440,900 FETCH_PDF=False ./archive ~/Downloads/bookmarks_export.html
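
The other two methods mentioned above work the same way. A minimal sketch (the archivebox.env filename is arbitrary; note the variables must be exported so the ./archive process can see them):

# Option 1: export variables in your shell profile (e.g. ~/.bashrc)
export CHROME_BINARY=google-chrome-stable
export FETCH_PDF=False

# Option 2: keep settings in a .env file and source it before archiving
cat > archivebox.env <<'EOF'
export CHROME_BINARY=google-chrome-stable
export FETCH_PDF=False
EOF
source archivebox.env
./archive ~/Downloads/bookmarks_export.html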

Available Configuration Options


All the available config options are described below, but can also be found along with examples in etc/ArchiveBox.conf.default. The code that loads the config is in archivebox/config.py, but don’t modify the defaults in config.py directly, as your changes there will be erased whenever you update ArchiveBox.

To create a persistent config file, see the Creating a Config File section. To see details on how to configure ArchiveBox when using Docker, see the [[Docker]] page.


General Settings

General options around the archiving process, output format, and timing.


OUTPUT_DIR

Possible Values: [$REPO_DIR/output]//srv/www/bookmarks/…

Path to an output folder to store the archive in.

Defaults to output/ in the root directory of the repository folder.

Note: ArchiveBox will create this folder if missing. If it already exists, make sure ArchiveBox has permission to write to it.


OUTPUT_PERMISSIONS

Possible Values: [755]/644/…

Permissions to set the output directory and file contents to.

This is useful when running ArchiveBox inside Docker as root, where you need to explicitly set the permissions to something that users on the host can access.


ONLY_NEW

Possible Values: [False]/True

Download files for only newly added links when running the ./archive command.

By default, ArchiveBox will go through all links in the index and download any missing files on every run. Set this to True to only archive the most recently added batch of links, without attempting to also update older archived links.

Note: Regardless of how this is set, ArchiveBox will never re-download sites that have already been archived successfully. When this is False, it only attempts to fix pages that have missing archive files; it does not re-archive pages that have already been archived. Set it to True only if you wish to skip repairing missing older archives on every run.


TIMEOUT

Possible Values: [60]/120/…

Maximum allowed download time per archive method for each link, in seconds. If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.

Note: Do not set this to anything less than 15 seconds as it will cause Chrome to hang indefinitely and many sites to fail completely.


MEDIA_TIMEOUT

Possible Values: [3600]/120/…

Maximum allowed download time for fetching media when FETCH_MEDIA=True, in seconds. This timeout is separate from and usually much longer than TIMEOUT, because media downloaded with youtube-dl can often be quite large and take many minutes/hours to download. Tweak this setting based on your network speed and the maximum media file size you plan on downloading.

Note: Do not set this to anything less than 10 seconds as it can often take 5-10 seconds for youtube-dl just to parse the page before it starts downloading media files.
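
For example, a sketch of enabling media downloads with a longer two-hour timeout for large videos:

env FETCH_MEDIA=True MEDIA_TIMEOUT=7200 ./archive ~/Downloads/bookmarks_export.html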

Related options: FETCH_MEDIA


TEMPLATES_DIR

Possible Values: [$REPO_DIR/archivebox/templates]//path/to/custom/templates/…

Path to a directory containing custom index html templates for theming your archive output. The folder at the specified path must contain the following files:

  • static/
  • index.html
  • link_index.html
  • index_row.html

You can copy the files in archivebox/templates into your own directory to start developing a custom theme, then edit TEMPLATES_DIR to point to your new custom templates directory.
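
For example, a sketch assuming the repository is checked out at $HOME/ArchiveBox and your theme lives at $HOME/my-archive-theme (both paths are placeholders):

cp -r $HOME/ArchiveBox/archivebox/templates $HOME/my-archive-theme
# edit index.html, link_index.html, and index_row.html in the copy to taste
env TEMPLATES_DIR=$HOME/my-archive-theme ./archive ~/Downloads/bookmarks_export.html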

Related options: FOOTER_INFO



URL_BLACKLIST

Possible Values: [None]/.+\.exe$/http(s)?:\/\/(.+)?example.com\/.*/…

A regular expression used to exclude certain URLs from the archive. You can use this if there are certain domains, extensions, or other URL patterns that you want to ignore whenever they get imported. Blacklisted URLs won’t be included in the index, and their page content won’t be archived.

When building your blacklist, you can check whether a given URL matches your regex like so:

>>> import re
>>> URL_BLACKLIST = r'http(s)?:\/\/(.+)?(youtube\.com|amazon\.com)\/.*'  # replace this with your regex to test
>>> test_url = 'https://test.youtube.com/example.php?abc=123'
>>> bool(re.compile(URL_BLACKLIST, re.IGNORECASE).match(test_url))
True
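
Once you’re happy with the pattern, pass it in like any other option; single quotes keep the shell from mangling the regex:

env URL_BLACKLIST='http(s)?:\/\/(.+)?(youtube\.com|amazon\.com)\/.*' ./archive ~/Downloads/bookmarks_export.html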

Related options: FETCH_MEDIA, FETCH_GIT, GIT_DOMAINS


Archive Method Toggles

High-level on/off switches for all the various methods used to archive URLs.


FETCH_TITLE

Possible Values: [True]/False

By default ArchiveBox uses the title provided by the import file, but not all types of imports provide titles (e.g. plain-text lists of URLs). When this is True, ArchiveBox downloads the page (following all redirects) and attempts to parse the link’s title from the first <title></title> tag found in the response. This may be buggy or fail for certain sites that use JS to set the title; disabling it will cause links imported without a title to show their URL as the title in the UI.

Related options: ONLY_NEW, CHECK_SSL_VALIDITY


FETCH_FAVICON

Possible Values: [True]/False

Fetch and save a favicon for the URL from Google’s public favicon service: https://www.google.com/s2/favicons?domain={domain}. Set this to False if you don’t need favicons.

Related options: TEMPLATES_DIR, CHECK_SSL_VALIDITY, CURL_BINARY


FETCH_WGET

Possible Values: [True]/False

Fetch page with wget, and save responses into folders for each domain, e.g. example.com/index.html, with .html appended if not present. For a full list of options used during the wget download process, see the archivebox/archive_methods.py:fetch_wget(...) function.

Related options: TIMEOUT, FETCH_WGET_REQUISITES, CHECK_SSL_VALIDITY, COOKIES_FILE, WGET_USER_AGENT, FETCH_WARC, WGET_BINARY


FETCH_WARC

Possible Values: [True]/False

Save a timestamped WARC archive of all the page requests and responses during the wget archive process.

Related options: TIMEOUT, FETCH_WGET_REQUISITES, CHECK_SSL_VALIDITY, COOKIES_FILE, WGET_USER_AGENT, FETCH_WGET, WGET_BINARY


FETCH_PDF

Possible Values: [True]/False

Print page as PDF.

Related options: TIMEOUT, CHECK_SSL_VALIDITY, CHROME_USER_DATA_DIR, CHROME_BINARY


FETCH_SCREENSHOT

Possible Values: [True]/False

Fetch a screenshot of the page.

Related options: RESOLUTION, TIMEOUT, CHECK_SSL_VALIDITY, CHROME_USER_DATA_DIR, CHROME_BINARY


FETCH_DOM

Possible Values: [True]/False

Fetch a DOM dump of the page.

Related options: TIMEOUT, CHECK_SSL_VALIDITY, CHROME_USER_DATA_DIR, CHROME_BINARY


FETCH_GIT

Possible Values: [True]/False

Fetch any git repositories on the page.

Related options: TIMEOUT, GIT_DOMAINS, CHECK_SSL_VALIDITY, GIT_BINARY


FETCH_MEDIA

Possible Values: [True]/False

Fetch all audio, video, annotations, and media metadata on the page using youtube-dl. Warning: this can use up a lot of storage very quickly.

Related options: MEDIA_TIMEOUT, CHECK_SSL_VALIDITY, YOUTUBEDL_BINARY


SUBMIT_ARCHIVE_DOT_ORG

Possible Values: [True]/False

Submit the page’s URL to be archived on Archive.org (the Internet Archive).

Related options: TIMEOUT, CHECK_SSL_VALIDITY, CURL_BINARY


Archive Method Options

Specific options for individual archive methods above. Some of these are shared between multiple archive methods, others are specific to a single method.


CHECK_SSL_VALIDITY

Possible Values: [True]/False

Whether to enforce HTTPS certificate and HSTS chain of trust when archiving sites. Set this to False if you want to archive pages even if they have expired or invalid certificates. Be aware that when this is False, you cannot guarantee you haven’t been man-in-the-middled while archiving content, so the content cannot be verified to match what’s on the original site.


FETCH_WGET_REQUISITES

Possible Values: [True]/False

Fetch images/css/js with wget. (True is highly recommended; otherwise you won’t download many of the critical assets needed to render the page, like images, JS, and CSS.)

Related options: TIMEOUT, FETCH_WGET, FETCH_WARC, WGET_BINARY


RESOLUTION

Possible Values: [1440,900]/1024,768/…

Screenshot resolution in pixels, as width,height.

Related options: FETCH_SCREENSHOT


WGET_USER_AGENT

Possible Values: [Wget/1.19.1]/"Mozilla/5.0 ..."/…

This is the user agent to use during wget archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you’re getting blocked by servers for having an unknown/blacklisted user agent.
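
For example, to masquerade as a desktop Chrome browser (the UA string below is just an illustration; any valid user agent works, and remember to quote it since it contains spaces):

env WGET_USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" ./archive ~/Downloads/bookmarks_export.html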

Related options: FETCH_WGET, FETCH_WARC, CHECK_SSL_VALIDITY, WGET_BINARY, CHROME_USER_AGENT


CHROME_USER_AGENT

Possible Values: ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/73.0.3683.75 Safari/537.36"]/"Mozilla/5.0 ..."/…

This is the user agent to use during Chrome headless archiving. If you find yourself being blocked by many sites, you can set this to hide the Headless string that reveals to servers that you’re using a headless browser.

Related options: FETCH_PDF, FETCH_SCREENSHOT, FETCH_DOM, CHECK_SSL_VALIDITY, CHROME_USER_DATA_DIR, CHROME_HEADLESS, CHROME_BINARY, WGET_USER_AGENT


GIT_DOMAINS

Possible Values: [github.com,bitbucket.org,gitlab.com]/git.example.com/…

Domains to attempt downloading git repositories from using git clone.

Related options: FETCH_GIT, CHECK_SSL_VALIDITY


COOKIES_FILE

Possible Values: [None]//path/to/cookies.txt/…

Cookies file to pass to wget. To capture sites that require a user to be logged in, you can specify a path to a Netscape-format cookies.txt file for wget to use. You can generate this file by using a browser extension to export your cookies in this format, or by using wget with --save-cookies, as sketched below.
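
For example, a sketch of generating a cookies file with wget itself (the login URL and form fields are placeholders for whatever the site actually expects):

wget --save-cookies=/path/to/cookies.txt --keep-session-cookies \
     --post-data='user=me&password=secret' https://example.com/login
env COOKIES_FILE=/path/to/cookies.txt ./archive ~/Downloads/bookmarks_export.html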

Related options: FETCH_WGET, FETCH_WARC, CHECK_SSL_VALIDITY, WGET_BINARY


CHROME_USER_DATA_DIR

Possible Values: [~/.config/google-chrome]//tmp/chrome-profile/…

Path to a Chrome user profile directory. To capture sites that require a user to be logged in, you can specify a path to a Chrome user profile (which loads the cookies needed for the user to be logged in). If you don’t have an existing Chrome profile, create one with chromium-browser --user-data-dir=/tmp/chrome-profile, and log into the sites you need. Then set CHROME_USER_DATA_DIR=/tmp/chrome-profile to make ArchiveBox use that profile.

Note: Make sure the path does not have Default at the end (it should be the parent folder of Default), e.g. set it to CHROME_USER_DATA_DIR=~/.config/chromium and not CHROME_USER_DATA_DIR=~/.config/chromium/Default.

By default, when set to None, ArchiveBox tries all the standard Chrome/Chromium user data dir paths in order, as listed here: https://chromium.googlesource.com/chromium/src/+/HEAD/docs/user_data_dir.md
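
Putting the steps above together, a minimal sketch of the whole flow:

# 1. create a fresh profile, log into the sites you need in the browser window, then quit
chromium-browser --user-data-dir=/tmp/chrome-profile

# 2. archive using the cookies/sessions saved in that profile
env CHROME_USER_DATA_DIR=/tmp/chrome-profile ./archive ~/Downloads/bookmarks_export.html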

Related options: FETCH_PDF, FETCH_SCREENSHOT, FETCH_DOM, CHECK_SSL_VALIDITY, CHROME_HEADLESS, CHROME_BINARY


CHROME_HEADLESS

Possible Values: [True]/False

Whether or not to use Chrome/Chromium in --headless mode (no browser UI displayed). When set to False, the full Chrome UI will be launched each time it’s used to archive a page, which greatly slows down the process but allows you to watch in real-time as it saves each page.

Related options: FETCH_PDF, FETCH_SCREENSHOT, FETCH_DOM, CHROME_USER_DATA_DIR, CHROME_BINARY


CHROME_SANDBOX

Possible Values: [True]/False

Whether or not to use the Chrome sandbox when archiving.

If you see an error message like this, it means you are trying to run ArchiveBox as root:

:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180

Note: Do not run ArchiveBox as root! The solution to this error is not to override it by setting CHROME_SANDBOX=False, it’s to create another user (e.g. www-data) and run ArchiveBox under that new, less-privileged user. This is a security-critical setting; only set it to False if you’re running ArchiveBox inside a container or VM where it doesn’t have access to the rest of your system!
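
For example, a sketch of running under a dedicated low-privilege user (the user name is arbitrary):

# create a dedicated user and give it ownership of the output folder
sudo useradd --create-home archivebox-user
sudo chown -R archivebox-user output/
# run the archiver as that user instead of root
sudo -u archivebox-user ./archive ~/Downloads/bookmarks_export.html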

Related options: FETCH_PDF, FETCH_SCREENSHOT, FETCH_DOM, CHECK_SSL_VALIDITY, CHROME_USER_DATA_DIR, CHROME_HEADLESS, CHROME_BINARY


Shell Options

Options around the format of the CLI output.


USE_COLOR

Possible Values: [True]/False

Colorize console output. Defaults to True if stdin is a TTY (interactive session), otherwise False (e.g. if run in a script or piped into a file).


SHOW_PROGRESS

Possible Values: [True]/False

Show real-time progress bar in console output. Defaults to True if stdin is a TTY (interactive session), otherwise False (e.g. if run in a script or piped into a file).


Dependency Options

Options for defining which binaries to use for the various archive method dependencies.


CHROME_BINARY

Possible Values: [chromium-browser]//usr/local/bin/google-chrome/…

Path or name of the Google Chrome / Chromium binary to use for all the headless browser archive methods.

Without setting this environment variable, ArchiveBox will by default look for the following binaries in $PATH, in this order:

  • chromium-browser
  • chromium
  • google-chrome
  • google-chrome-stable
  • google-chrome-unstable
  • google-chrome-beta
  • google-chrome-canary
  • google-chrome-dev

You can override this default search behavior by setting the environment variable to your preferred Chrome binary name or path.
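
For example, to force a specific binary rather than relying on the $PATH search:

env CHROME_BINARY=/usr/local/bin/google-chrome ./archive ~/Downloads/bookmarks_export.html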

The chrome/chromium dependency is optional and only required for the screenshot, PDF, and DOM dump outputs; it can be safely ignored if those three methods are disabled.

Related options: FETCH_PDF, FETCH_SCREENSHOT, FETCH_DOM, CHROME_USER_DATA_DIR, CHROME_HEADLESS, CHROME_SANDBOX


WGET_BINARY

Possible Values: [wget]//usr/local/bin/wget/…

Path or name of the wget binary to use.

Related options: FETCH_WGET, FETCH_WARC


YOUTUBEDL_BINARY

Possible Values: [youtube-dl]//usr/local/bin/youtube-dl/…

Path or name of the youtube-dl binary to use.

Related options: FETCH_MEDIA


GIT_BINARY

Possible Values: [git]//usr/local/bin/git/…

Path or name of the git binary to use.

Related options: FETCH_GIT


CURL_BINARY

Possible Values: [curl]//usr/local/bin/curl/…

Path or name of the curl binary to use.

Related options: FETCH_FAVICON, SUBMIT_ARCHIVE_DOT_ORG