ArchiveBox¶
“The open-source self-hosted internet archive.”
Website | Github | Source | Bug Tracker
mkdir my-archive; cd my-archive/
pip install archivebox
archivebox init
archivebox add https://example.com
archivebox status
Documentation¶
Intro¶

ArchiveBox
Open-source self-hosted web archiving.
▶️ Quickstart | Demo | Github | Documentation | Info & Motivation | Community | Roadmap
"Your own personal internet archive" (网站存档 / 爬虫)
ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.
You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows.
You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.
It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list.
The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.
📦 Install ArchiveBox with Docker Compose (recommended), Docker, apt, brew, or pip (see below).
No matter which setup method you choose, they all follow this basic process and provide the same CLI, Web UI, and on-disk data layout.
- Once you’ve installed ArchiveBox, run this in a new empty folder to get started
archivebox init --setup # creates a new collection in the current directory
- Add some URLs you want to archive
archivebox add 'https://example.com' # add URLs one at a time via args / piped stdin
archivebox schedule --every=day --depth=1 https://example.com/rss.xml # or have it import URLs on a schedule
- Then view your archived pages
archivebox server 0.0.0.0:8000 # use the interactive web UI
archivebox list 'https://example.com' # use the CLI commands (--help for more)
ls ./archive/*/index.json # or browse directly via the filesystem
⤵️ See the Quickstart below for more…




Key Features¶
- Free & open source, doesn’t require signing up for anything, stores all data locally
- Powerful, intuitive command line interface with modular optional dependencies
- Comprehensive documentation, active development, and rich community
- Extracts a wide variety of content out-of-the-box: media (youtube-dl), articles (readability), code (git), etc.
- Supports scheduled/realtime importing from many types of sources
- Uses standard, durable, long-term formats like HTML, JSON, PDF, PNG, and WARC
- Usable as a oneshot CLI, self-hosted web UI, Python API (BETA), REST API (ALPHA), or desktop app (ALPHA)
- Saves all pages to archive.org as well by default for redundancy (can be disabled for local-only mode)
- Planned: support for archiving content requiring a login/paywall/cookies (working, but ill-advised until some pending fixes are released)
- Planned: support for running JS during archiving to adblock, autoscroll, modal-hide, thread-expand…


Quickstart¶
🖥 Supported OSs: Linux/BSD, macOS, Windows (Docker/WSL) 👾 CPUs: amd64, x86, arm8, arm7 (raspi>=3)
⬇️ Initial Setup¶
(click to expand your preferred ► distribution below for full setup instructions)
Get ArchiveBox with docker-compose on macOS/Linux/Windows ✨ (highly recommended)
First make sure you have Docker installed: https://docs.docker.com/get-docker/
Download the docker-compose.yml file.
curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'
Start the server.
docker-compose run archivebox init --setup
docker-compose up
Open http://127.0.0.1:8000.
# you can also add links and manage your archive via the CLI:
docker-compose run archivebox add 'https://example.com'
echo 'https://example.com' | docker-compose run -T archivebox add
docker-compose run archivebox status
docker-compose run archivebox help # to see more options
# when passing stdin/stdout via the cli, use the -T flag
echo 'https://example.com' | docker-compose run -T archivebox add
docker-compose run -T archivebox list --html --with-headers > index.html
This is the recommended way to run ArchiveBox because it includes all the extractors like: chrome, wget, youtube-dl, git, etc., full-text search w/ sonic, and many other great features.
Get ArchiveBox with docker on macOS/Linux/Windows
First make sure you have Docker installed: https://docs.docker.com/get-docker/
# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init --setup
# start the webserver and open the UI (optional)
docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000
open http://127.0.0.1:8000
# you can also add links and manage your archive via the CLI:
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com'
docker run -v $PWD:/data -it archivebox/archivebox status
docker run -v $PWD:/data -it archivebox/archivebox help # to see more options
# when passing stdin/stdout via the cli, use only -i (not -it)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
docker run -v $PWD:/data -i archivebox/archivebox list --html --with-headers > index.html
Get ArchiveBox with apt on Ubuntu/Debian
This method should work on all Ubuntu/Debian based systems, including x86, amd64, arm7, and arm8 CPUs (e.g. Raspberry Pis >=3).
If you’re on Ubuntu >= 20.04, add the apt repository with add-apt-repository:
(on other Ubuntu/Debian-based systems follow the ♰ instructions below)
# add the repo to your sources and install the archivebox package using apt
sudo apt install software-properties-common
sudo add-apt-repository -u ppa:archivebox/archivebox
sudo apt install archivebox
# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
archivebox init --setup
# start the webserver and open the web UI (optional)
archivebox server 0.0.0.0:8000
open http://127.0.0.1:8000
# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help # to see more options
♰ On other Ubuntu/Debian-based systems, add these sources directly to /etc/apt/sources.list.d/archivebox.list:
echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" > /etc/apt/sources.list.d/archivebox.list
echo "deb-src http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" >> /etc/apt/sources.list.d/archivebox.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
sudo apt update
sudo apt install archivebox
archivebox setup
archivebox --version
# then scroll back up and continue the initialization instructions above
(you may need to install some other dependencies manually however)
Get ArchiveBox with brew on macOS
First make sure you have Homebrew installed: https://brew.sh/#install
# install the archivebox package using homebrew
brew install archivebox/archivebox/archivebox
# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
archivebox init --setup
# start the webserver and open the web UI (optional)
archivebox server 0.0.0.0:8000
open http://127.0.0.1:8000
# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help # to see more options
Get ArchiveBox with pip on any other platform (some extras must be installed manually)
First make sure you have Python >= v3.7 and Node >= v12 installed.
# install the archivebox package using pip3
pip3 install archivebox
# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
archivebox init --setup
# Install any missing extras like wget/git/ripgrep/etc. manually as needed
# start the webserver and open the web UI (optional)
archivebox server 0.0.0.0:8000
open http://127.0.0.1:8000
# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help # to see more options
⚡️ CLI Usage¶
# archivebox [subcommand] [--args]
# docker-compose run archivebox [subcommand] [--args]
# docker run -v $PWD:/data -it archivebox/archivebox [subcommand] [--args]
archivebox init --setup # safe to run init multiple times (also how you update versions)
archivebox --version
archivebox help
archivebox setup/init/config/status/manage to administer your collection
archivebox add/schedule/remove/update/list/shell/oneshot to manage Snapshots in the archive
archivebox schedule to pull in fresh URLs regularly from bookmarks/history/Pocket/Pinboard/RSS/etc.
🖥 Web UI Usage¶
archivebox manage createsuperuser
archivebox server 0.0.0.0:8000
Then open http://127.0.0.1:8000 to view the UI.
# you can also configure whether or not login is required for most features
archivebox config --set PUBLIC_INDEX=False
archivebox config --set PUBLIC_SNAPSHOTS=False
archivebox config --set PUBLIC_ADD_VIEW=False
🗄 SQL/Python/Filesystem Usage¶
sqlite3 ./index.sqlite3 # run SQL queries on your index
archivebox shell # explore the Python API in a REPL
ls ./archive/*/index.html # or inspect snapshots on the filesystem


DEMO:
https://demo.archivebox.io
Usage | Configuration | Caveats

Overview¶
Input formats¶
ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, Browser bookmarks, Browser history, plain text, HTML, markdown, and more!
Click these links for instructions on how to prepare your links from these sources:
TXT, RSS, XML, JSON, CSV, SQL, HTML, Markdown, or any other text-based format…
Browser history or browser bookmarks (see instructions for: Chrome, Firefox, Safari, IE, Opera, and more…)
Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved, Wallabag, Unmark.it, OneTab, and more…
# archivebox add --help
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'
echo 'http://example.com' | archivebox add
echo 'any_text_with [urls](https://example.com) in it' | archivebox add
# (if using docker add -i when piping stdin)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
# (if using docker-compose add -T when piping stdin / stdout)
echo 'https://example.com' | docker-compose run -T archivebox add
See the Usage: CLI page for documentation and examples.
It also includes a built-in scheduled import feature with archivebox schedule and a browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly/on-demand.
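For example, to check a feed for new URLs once a day (the feed URL below is just an illustrative placeholder):
archivebox schedule --every=day --depth=1 'https://example.com/some/feed.rss'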
Archive Layout¶
All of ArchiveBox’s state (including the index, snapshot data, and config file) is stored in a single folder called the “ArchiveBox data folder”. All archivebox CLI commands must be run from inside this folder, and you first create it by running archivebox init.
The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard index.sqlite3 database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the ./archive/ subfolder.
./
index.sqlite3
ArchiveBox.conf
archive/
...
1617687755/
index.html
index.json
screenshot.png
media/some_video.mp4
warc/1617687755.warc.gz
git/somerepo.git
...
Each snapshot subfolder ./archive/<timestamp>/ includes a static index.json and index.html describing its contents, and the snapshot extractor outputs are plain files within the folder.
Output formats¶
Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
./archive/<timestamp>/*
- Index: index.html & index.json (HTML and JSON index files containing metadata and details)
- Title, Favicon, Headers: response headers, site favicon, and parsed site title
- SingleFile: singlefile.html (HTML snapshot rendered with headless Chrome using SingleFile)
- Wget Clone: example.com/page-name.html (wget clone of the site, with warc/<timestamp>.gz)
- Chrome Headless:
  - PDF: output.pdf (printed PDF of the site using headless Chrome)
  - Screenshot: screenshot.png (1440x900 screenshot of the site using headless Chrome)
  - DOM Dump: output.html (DOM dump of the HTML after rendering using headless Chrome)
- Article Text: article.html/json (article text extraction using Readability & Mercury)
- Archive.org Permalink: archive.org.txt (a link to the saved site on archive.org)
- Audio & Video: media/ (all audio/video files + playlists, including subtitles & metadata, with youtube-dl)
- Source Code: git/ (clone of any repository found on github, bitbucket, or gitlab links)
- More coming soon! See the Roadmap…
It does everything out-of-the-box by default, but you can disable or tweak individual archive methods via environment variables / config.
# archivebox config --help
archivebox config # see all currently configured options
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
Static Archive Exporting¶
You can export the main index to browse it statically without needing to run a server.
Note about large exports: These exports are not paginated, so exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the archivebox list command to export specific Snapshots or ranges.
# archivebox list --help
archivebox list --html --with-headers > index.html # export to static html table
archivebox list --json --with-headers > index.json # export to json blob
archivebox list --csv=timestamp,url,title > index.csv # export to csv spreadsheet
# (if using docker-compose, add the -T flag when piping)
docker-compose run -T archivebox list --html --filter-type=search snozzberries > index.json
The paths in the static exports are relative; make sure to keep them next to your ./archive folder when backing them up or viewing them.
Dependencies¶
For better security, easier updating, and to avoid polluting your host system with extra dependencies, it is strongly recommended to use the official Docker image with everything preinstalled for the best experience.
To achieve high fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools and libraries that specialize in extracting different types of content. These optional dependencies used for archiving sites include:
- chromium / chrome (for screenshots, PDF, DOM HTML, and headless JS scripts)
- node & npm (for readability, mercury, and singlefile)
- wget (for plain HTML, static files, and WARC saving)
- curl (for fetching headers, favicon, and posting to Archive.org)
- youtube-dl (for audio, video, and subtitles)
- git (for cloning git repos)
- and more as we grow…
You don’t need to install every dependency to use ArchiveBox. ArchiveBox will automatically disable extractors that rely on dependencies that aren’t installed, based on what is configured and available in your $PATH.
If using Docker, you don’t have to install any of these manually, all dependencies are set up properly out-of-the-box.
However, if you prefer not to use Docker, you can install ArchiveBox and its dependencies using your system package manager or pip directly on any Linux/macOS system. Just make sure to keep the dependencies up-to-date and check that ArchiveBox isn’t reporting any incompatibility with the versions you install.
# install python3 and archivebox with your system package manager
# apt/brew/pip/etc install ... (see Quickstart instructions above)
archivebox setup # auto install all the extractors and extras
archivebox --version # see info and check validity of installed dependencies
Installing directly on Windows without Docker or WSL/WSL2/Cygwin is not officially supported, but some advanced users have reported getting it working.

Caveats¶
Archiving Private URLs¶
If you’re importing URLs containing secret slugs or pages with private content (e.g. Google Docs, unlisted videos, etc.), you may want to disable some of the extractor modules to avoid leaking private URLs to 3rd party APIs during the archiving process.
# don't do this:
archivebox add 'https://docs.google.com/document/d/12345somelongsecrethere'
archivebox add 'https://example.com/any/url/you/want/to/keep/secret/'
# ...without first disabling the extractors that share the URL with 3rd party APIs:
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False # disable saving all URLs in Archive.org
# if extra paranoid or anti-google:
archivebox config --set SAVE_FAVICON=False # disable favicon fetching (it calls a google API)
archivebox config --set CHROME_BINARY=chromium # ensure it's using Chromium instead of Chrome
Security Risks of Viewing Archived JS¶
Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context and typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks. See the Security Overview page for more details.
# visiting an archived page with malicious JS:
https://127.0.0.1:8000/archive/1602401954/example.com/index.html
# example.com/index.js can now make a request to read everything from:
https://127.0.0.1:8000/index.html
https://127.0.0.1:8000/archive/*
# then example.com/index.js can send it off to some evil server
Saving Multiple Snapshots of a Single URL¶
Support for saving multiple snapshots of each site over time will be added eventually (along with the ability to view diffs of the changes between runs). For now ArchiveBox is designed to only archive each URL with each extractor type once. A workaround to take multiple snapshots of the same URL is to make them slightly different by adding a hash:
archivebox add 'https://example.com#2020-10-24'
...
archivebox add 'https://example.com#2020-10-25'
Storage Requirements¶
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything.
ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you’re saving audio & video using SAVE_MEDIA=True and whether you lower MEDIA_MAX_SIZE=750mb.
Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don’t need. Don’t store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the archive/ folder.
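For example, a minimal sketch using the two settings mentioned above (the values are illustrative):
archivebox config --set SAVE_MEDIA=False        # skip audio/video downloads entirely
archivebox config --set MEDIA_MAX_SIZE=250m     # or keep media but cap the size per file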
Try to keep the index.sqlite3 file on a local drive (not a network mount), and ideally on an SSD for maximum performance; however, the archive/ folder can be on a network mount or spinning HDD.
Background & Motivation¶
The aim of ArchiveBox is to enable more of the internet to be archived by empowering people to self-host their own archives. The intent is for all the web content you care about to be viewable with common software in 50 - 100 years without needing to run ArchiveBox or other specialized software to replay it.
Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.
Whether it’s to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010s Flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.
The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don’t think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.
Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.
Comparison to Other Projects¶

▶ Check out our community page for an index of web archiving initiatives and projects.
A variety of open and closed-source archiving projects exist, but few provide a nice UI and CLI to manage a large, high-fidelity archive collection over time.
ArchiveBox tries to be a robust, set-and-forget archiving solution suitable for archiving RSS feeds, bookmarks, or your entire browsing history (beware, it may be too big to store), ~~including private/authenticated content that you wouldn’t otherwise share with a centralized service~~ (this is not recommended due to JS replay security concerns).
Comparison With Centralized Public Archives¶
Not all content is suitable to be archived in a centralized collection, whether because it’s private, copyrighted, too large, or too complex. ArchiveBox hopes to fill that gap.
By having each user store their own content locally, we can save much larger portions of everyone’s browsing history than a shared centralized service would be able to handle. The eventual goal is to work towards federated archiving where users can share portions of their collections with each other.
Comparison With Other Self-Hosted Archiving Options¶
ArchiveBox differentiates itself from similar self-hosted projects by providing a comprehensive CLI for managing your archive, a Web UI that can be used either independently or together with the CLI, and a simple on-disk data format that can be used without either.
ArchiveBox is neither the highest fidelity, nor the simplest tool available for self-hosted archiving, rather it’s a jack-of-all-trades that tries to do most things well by default. It can be as simple or advanced as you want, and is designed to do everything out-of-the-box but be tuned to suit your needs.
If being able to archive very complex interactive pages with JS and video is paramount, check out ArchiveWeb.page and ReplayWeb.page.
If you prefer a simpler, leaner solution that archives page text in markdown and provides note-taking abilities, check out Archivy or 22120.
For more alternatives, see our list here…

Internet Archiving Ecosystem¶
Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open-source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!

- Community Wiki
  - The Master Lists: Community-maintained indexes of archiving tools and institutions.
  - Web Archiving Software: Open source tools and projects in the internet archiving space.
  - Reading List: Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.
  - Communities: A collection of the most active internet archiving communities and initiatives.
- Check out the ArchiveBox Roadmap and Changelog
- Learn why archiving the internet is important by reading the “On the Importance of Web Archiving” blog post.
- Reach out to me for questions and comments via @ArchiveBoxApp or @theSquashSH on Twitter
Need help building a custom archiving solution?
✨ Hire the team that helps build ArchiveBox to work on your project. (we’re @MonadicalSAS on Twitter)
(They also do general software consulting across many industries)

Documentation¶

We use the Github wiki system and Read the Docs (WIP) for documentation.
You can also access the docs locally by looking in the ArchiveBox/docs/ folder.
Getting Started¶
Reference¶
ArchiveBox Development¶
All contributions to ArchiveBox are welcome! Check our issues and Roadmap for things to work on, and please open an issue to discuss your proposed implementation before working on things! Otherwise we may have to close your PR if it doesn’t align with our roadmap.
Low hanging fruit / easy first tickets:
Setup the dev environment¶
Click to expand...
1. Clone the main code repo (making sure to pull the submodules as well)¶
git clone --recurse-submodules https://github.com/ArchiveBox/ArchiveBox
cd ArchiveBox
git checkout dev # or the branch you want to test
git submodule update --init --recursive
git pull --recurse-submodules
2. Option A: Install the Python, JS, and system dependencies directly on your machine¶
# Install ArchiveBox + python dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]'
# or: pipenv install --dev && pipenv shell
# Install node dependencies
npm install
# or
archivebox setup
# Check to see if anything is missing
archivebox --version
# install any missing dependencies manually, or use the helper script:
./bin/setup.sh
2. Option B: Build the docker container and use that for development instead¶
# Optional: develop via docker by mounting the code dir into the container
# if you edit e.g. ./archivebox/core/models.py on the docker host, runserver
# inside the container will reload and pick up your changes
docker build . -t archivebox
docker run -it archivebox init --setup
docker run -it -p 8000:8000 \
-v $PWD/data:/data \
-v $PWD/archivebox:/app/archivebox \
archivebox server 0.0.0.0:8000 --debug --reload
# (remove the --reload flag and add the --nothreading flag when profiling with the django debug toolbar)
Common development tasks¶
See the ./bin/ folder and read the source of the bash scripts within.
You can also run all these in Docker. For more examples see the Github Actions CI/CD tests that are run: .github/workflows/*.yaml.
Run in DEBUG mode¶
Click to expand...
archivebox config --set DEBUG=True
# or
archivebox server --debug ...
Build and run a Github branch¶
Click to expand...
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
docker run -it -v $PWD:/data archivebox:dev ...
Run the linters¶
Click to expand...
./bin/lint.sh
(uses flake8 and mypy)
Run the integration tests¶
Click to expand...
./bin/test.sh
(uses pytest -s)
Make migrations or enter a django shell¶
Click to expand...
Make sure to run this whenever you change things in models.py.
cd archivebox/
./manage.py makemigrations
cd path/to/test/data/
archivebox shell
archivebox manage dbshell
Build the docs, pip package, and docker image¶
Click to expand...
(Normally CI takes care of this, but these scripts can be run to do it manually)
./bin/build.sh
# or individually:
./bin/build_docs.sh
./bin/build_pip.sh
./bin/build_deb.sh
./bin/build_brew.sh
./bin/build_docker.sh
Roll a release¶
Click to expand...
(Normally CI takes care of this, but these scripts can be run to do it manually)
./bin/release.sh
# or individually:
./bin/release_docs.sh
./bin/release_pip.sh
./bin/release_deb.sh
./bin/release_brew.sh
./bin/release_docker.sh
Further Reading¶
- Home: https://archivebox.io
- Demo: https://demo.archivebox.io
- Docs: https://docs.archivebox.io
- Wiki: https://wiki.archivebox.io
- Issues: https://issues.archivebox.io
- Forum: https://forum.archivebox.io
- Releases: https://releases.archivebox.io
- Donations: https://github.com/sponsors/pirate

This project is maintained mostly in my spare time with help from generous contributors and Monadical (✨ hire them for dev work!).
Sponsor this project on Github
✨ Have spare CPU/disk/bandwidth and want to help the world? Check out our Good Karma Kit…
Getting Started¶
Quickstart¶

▶️ It only takes about 5 minutes to get up and running with ArchiveBox.
ArchiveBox officially supports macOS, Ubuntu/Debian, and BSD, but likely runs on many other systems. You can run it on any system that supports Docker, including Windows (using Docker in WSL2).
If you want to use Docker or Docker Compose to run ArchiveBox, see the [[Docker]] page.
First, we install the ArchiveBox dependencies, then we create a folder to store the archive data, and finally, we import the list of links to the archive by running: archivebox add < [links_file]
1. Set up ArchiveBox¶
We recommend using Docker because it has all the extractors and dependencies working out-of-the-box:
# first make sure you have docker: https://docs.docker.com/get-docker/
# then run this to get started with a collection in the current directory
docker run -v $PWD:/data -it archivebox/archivebox init
# alternatively, install ArchiveBox and its dependencies directly on your system without docker
# (script prompts for user confirmation before installing anything)
curl https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/bin/setup.sh | sh
# or follow the manual setup instructions if you don't like using curl | sh
(The above are shell commands to run. If you’re not used to those, consult your operating system’s manual for how to run a terminal emulator.)

For more details, including the manual setup and docker instructions, see the [[Install]] page.
2. Get your list of URLs to archive¶
Follow the links here to find instructions for exporting a list of URLs from each service.
- Pinboard
- Instapaper
- Reddit Saved Posts
- Shaarli
- Unmark.it
- Wallabag
- Chrome Bookmarks
- Firefox Bookmarks
- Safari Bookmarks
- Opera Bookmarks
- Internet Explorer Bookmarks
- Chrome History:
./bin/export-browser-history.sh --chrome
- Firefox History:
./bin/export-browser-history.sh --firefox
- Safari History:
./bin/export-browser-history.sh --safari
- Other File or URL: (e.g. RSS feed url, text file path) pass as second argument in the next step
(If any of these links are broken, please submit an issue and I’ll fix it)
3. Add your URLs to the archive¶
Pass in URLs directly, import a list of links from a file, or import from a feed URL. All via stdin:
# if using docker
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com'
# or if not using docker
archivebox add 'https://example.com'
# any text containing links can also be passed in via stdin (works with docker as well)
curl https://getpocket.com/users/example/feed/all | archivebox add
✅ Done!¶
Open ./index.html to view your archive. (favicons will appear next to each title once they have finished downloading)
You can also use the interactive Web UI to view/manage/add links to your archive:
# with docker:
docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
# or without docker:
archivebox server
open http://127.0.0.1:8000
Next Steps:
- Read [[Usage]] to learn about the various CLI and web UI functions
- Read [[Configuration]] to learn about the various archive method options
- Read [[Scheduled Archiving]] to learn how to set up automatic daily archiving
- Read [[Publishing Your Archive]] if you want to host your archive for others to access online
- Read [[Troubleshooting]] if you encounter any problems
Install¶
ArchiveBox only has a few main dependencies apart from python3, and they can all be installed using your normal package manager. It usually takes 1min to get up and running if you use the helper script, or about 5min if you install everything manually.
Supported Systems¶
ArchiveBox officially supports the following operating systems:


- macOS: >=10.12 (with homebrew)
- Linux: Ubuntu, Debian, etc (with apt)
- BSD: FreeBSD, OpenBSD, NetBSD etc (with pkg)
Other systems that are not officially supported but probably work to varying degrees:


- Windows: Via [[Docker]] or WSL
- Other Linux distros: Fedora, SUSE, Arch, CentOS, etc.
Platforms other than Linux, BSD, and macOS are untested, but you can probably get it working on them without too much effort.
It’s recommended to use a filesystem with compression and/or deduplication abilities (e.g. ZFS or BTRFS) for maximum archive storage efficiency.
You will also need 500MB of RAM (bare minimum), though 2GB or greater is recommended. You may be able to reduce the RAM requirements if you disable all the Chrome-based archiving methods with USE_CHROME=False.
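For example, a minimal sketch of turning that off via config:
archivebox config --set USE_CHROME=False   # disables the Chrome-based extractors (screenshots, PDFs, DOM dumps, etc.)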
Dependencies¶
Not all the dependencies are required for all modes. If you disable some archive methods you can avoid those dependencies; for example, if you set FETCH_MEDIA=False you don’t need to install youtube-dl, and if you set FETCH_[PDF,SCREENSHOT,DOM]=False you don’t need chromium.
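For example, a hedged sketch of skipping the youtube-dl and chromium dependencies for a single import, using the FETCH_* names from the paragraph above (the export filename is illustrative):
env FETCH_MEDIA=False FETCH_PDF=False FETCH_SCREENSHOT=False FETCH_DOM=False archivebox add < ~/Downloads/bookmarks_export.html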
- python3 >= 3.7
- wget >= 1.16
- chromium >= 59 (google-chrome >= v59 works fine as well)
- youtube-dl
- curl (usually already on most systems)
- git (usually already on most systems)
More info:
- For help installing these, see the Manual Setup, [[Troubleshooting]] and [[Chromium Install]] pages.
- To use specific binaries for dependencies, see the Configuration: Dependencies page.
- To disable unwanted dependencies, see the Configuration: Archive Method Toggles page.
Automatic Setup¶
If you’re on Linux with apt, or macOS with brew, there is an automatic setup script provided to install all the dependencies. BSD, Windows, and other OS users should follow the Manual Setup or [[Docker]] instructions.
# docker or the manual setup are preferred on all platforms now, if you want to use the old install script you can run:
curl https://raw.githubusercontent.com/pirate/ArchiveBox/master/bin/setup.sh | sh
The script explains what it installs beforehand, and will prompt for user confirmation before making any changes to your system.

After running the setup script, continue with the [[Quickstart]] guide…
Manual Setup¶
If you don’t like running random setup scripts off the internet (:+1:), you can follow these manual setup instructions.
1. Install dependencies¶
macOS¶
brew tap homebrew-ffmpeg/ffmpeg
brew install homebrew-ffmpeg/ffmpeg/ffmpeg --with-fdk-aac
brew install python3 git wget curl youtube-dl
brew cask install chromium # Skip this if you already have Google Chrome/Chromium installed in /Applications/
Ubuntu/Debian¶
apt install python3 python3-pip python3-distutils git wget curl youtube-dl
apt install chromium-browser # Skip this if you already have Google Chrome/Chromium installed
BSD¶
FreeBSD:
pkg install python git wget curl youtube-dl
pkg install chromium-browser # Skip this if you already have Google Chrome/Chromium installed
OpenBSD:
pkg_add python3 git wget curl youtube-dl chromium
Install ArchiveBox using pip¶
python3 -m pip install --upgrade archivebox
Check that everything worked and the versions are high enough.¶
python3 --version | head -n 1 &&
git --version | head -n 1 &&
wget --version | head -n 1 &&
curl --version | head -n 1 &&
youtube-dl --version | head -n 1 &&
echo "[√] All dependencies installed."
archivebox version
If you have issues setting up Chromium / Google Chrome, see the [[Chromium Install]] page for more detailed setup instructions.
2. Get your bookmark export file¶
Follow the [[Quickstart]] guide to download your bookmarks export file containing a list of links to archive.
3. Run archivebox¶
# create a new folder to hold your data and cd into it
mkdir data && cd data
archivebox init
archivebox version
archivebox add < ~/Downloads/bookmarks_export.html
You can also use the update subcommand to resume the archive update at a specific timestamp: archivebox update --resume=153242424324.123.
Next Steps¶
- Read [[Usage]] to learn how to use the ArchiveBox CLI and HTML output
- Read [[Configuration]] to learn about the various archive method options
- Read [[Scheduled Archiving]] to learn how to set up automatic daily archiving
- Read [[Publishing Your Archive]] if you want to host your archive for others to access online
- Read [[Troubleshooting]] if you encounter any problems
Docker Setup¶
First, if you don’t already have docker installed, follow the official install instructions for Linux, macOS, or Windows https://docs.docker.com/install/#supported-platforms.
Then see the [[Docker]] page for next steps.
Docker¶
Overview¶
Running ArchiveBox with Docker allows you to manage it in a container without exposing it to the rest of your system. Usage with Docker is similar to usage of ArchiveBox normally, with a few small differences.
Make sure you have Docker installed and set up on your machine before following these instructions. If you don’t already have Docker installed, follow the official install instructions for Linux, macOS, or Windows here: https://docs.docker.com/install/#supported-platforms.

- Overview
- Docker Compose (recommended way)
- Plain Docker
Official Docker Hub image: https://hub.docker.com/r/archivebox/archivebox
Usage:
docker run -v $PWD:/data archivebox/archivebox init
docker run -v $PWD:/data archivebox/archivebox add 'https://example.com'
docker run -v $PWD:/data -it archivebox/archivebox manage createsuperuser
docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000

Docker Compose¶
An example docker-compose.yml config with ArchiveBox and an Nginx server to serve the archive is included in the project root. You can edit it as you see fit, or just run it as it comes out-of-the-box.
Just make sure you have a Docker version that’s new enough to support the version: 3 format:
docker --version
Docker version 18.09.1, build 4c52b90 # must be >= 17.04.0
Setup¶
mkdir archivebox && cd archivebox
wget https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml
mkdir -p etc/sonic
wget https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/etc/sonic/config.cfg -O etc/sonic/config.cfg
docker-compose up -d
docker-compose run archivebox init
docker-compose run archivebox manage createsuperuser
docker-compose run archivebox add 'https://example.com'
Usage¶
First, make sure you’re cd’ed into the same folder as your docker-compose.yml file (e.g. the project root) and that your containers have been started with docker-compose up -d.
Then open http://127.0.0.1:8000 or data/index.html to view the archive (HTTP, not HTTPS).
To add new URLs, you can use docker-compose just like the normal archivebox <subcommand> [args] CLI.
To add an individual link or list of links, pass in URLs via stdin.
echo "https://example.com" | docker-compose run archivebox add
To import links from a file you can either cat the file and pass it via stdin like above, or move it into your data folder so that ArchiveBox can access it from within the container.
mv ~/Downloads/bookmarks.html data/sources/bookmarks.html
docker-compose run archivebox add /data/sources/bookmarks.html
docker-compose run archivebox add < data/sources/bookmarks.html
To pull in links from a feed or remote file, pass the URL or path to the feed as an argument.
docker-compose run archivebox add --depth=1 https://example.com/some/feed.rss
The depth argument controls whether you want to save the links contained in that URL as well, or only the specified URL.
Accessing the data¶
The outputted archive data is stored in data/
(relative to the project root), or whatever folder path you specified in the docker-compose.yml
volumes:
section. Make sure the data/
folder on the host has permissions initially set to 777
so that the ArchiveBox command is able to set it to the specified OUTPUT_PERMISSIONS
config setting on the first run.
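For example, one way to pre-create the folder with open permissions before the first run (assuming the default ./data path from the included docker-compose.yml):
mkdir -p data && chmod 777 data
docker-compose run archivebox init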
To access your archive, you can open data/index.html directly, or you can use the provided Django development server running inside docker on http://127.0.0.1:8000.
Configuration¶
ArchiveBox running with docker-compose accepts all the same environment variables as normal, see the full list on the [[Configuration]] page.
The recommended way to pass in config variables is to edit the environment: section in docker-compose.yml directly, or add an env_file: ./path/to/ArchiveBox.conf line before environment: to import variables from an env file.
Example of adding config options to docker-compose.yml:
...
services:
archivebox:
...
environment:
- USE_COLOR=False
- SHOW_PROGRESS=False
- CHECK_SSL_VALIDITY=False
- RESOLUTION=1900,1820
- MEDIA_TIMEOUT=512000
...
You can also specify an env file via the CLI when running compose using docker-compose --env-file=/path/to/config.env ..., although you must still list the variables you want passed down to the ArchiveBox container in the environment: section.
If you want to access your archive server with HTTPS, put a reverse proxy like Nginx or Caddy in front of http://127.0.0.1:8098 to do SSL termination. You can find many instructions for doing this online if you search “SSL reverse proxy”.
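For example, one possible approach is Caddy’s built-in reverse proxy mode, which handles the TLS certificate automatically (the domain below is a hypothetical placeholder):
caddy reverse-proxy --from archive.example.com --to 127.0.0.1:8098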
Docker¶
Setup¶
Fetch and run the ArchiveBox Docker image to create your initial archive.
echo 'https://example.com' | docker run -it -v $PWD:/data archivebox/archivebox add
Replace $PWD in the command above with the full path to a folder to use to store your archive on the host, or the name of a Docker data volume.
Make sure the data folder you use on the host is either a new, uncreated path, or if it already exists, make sure it has permissions initially set to 777 so that the ArchiveBox command is able to set it to the specified OUTPUT_PERMISSIONS config setting on the first run.
Usage¶
To add a single URL to the archive or a list of links from a file, pipe them in via stdin. This will archive each link passed in.
echo 'https://example.com' | docker run -it -v $PWD:/data archivebox/archivebox add
# or
docker run -it -v $PWD:/data archivebox/archivebox add < bookmarks.html
To add a list of pages via feed URL or remote file, pass the URL of the feed as an argument.
docker run -it -v $PWD:/data archivebox/archivebox add 'https://example.com/some/rss/feed.xml'
The depth argument controls whether you want to save the links contained in that URL as well, or only the specified URL.
Accessing the data¶
Using a bind folder¶
Use the flag:
-v /full/path/to/folder/on/host:/data
This will use the folder /full/path/to/folder/on/host on your host to store the ArchiveBox output.
Using a named Docker data volume¶
(not recommended unless you know what you’re doing)
docker volume create archivebox-data
Then use the flag:
-v archivebox-data:/data
You can mount your data volume using standard docker tools, or access the contents directly here: /var/lib/docker/volumes/archivebox-data/_data (on most Linux systems)
On a Mac you’ll have to enter the base Docker Linux VM first to access the volume data:
screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
cd /var/lib/docker/volumes/archivebox-data/_data
Configuration¶
The easiest way is to use a .env file or add your config to your docker-compose.yml environment: section.
The next easiest way to get/set config is using the archivebox CLI:
docker-compose run archivebox config --get RESOLUTION
docker-compose run archivebox config --set RESOLUTION=1440,900
# or
docker run -it -v $PWD:/data archivebox/archivebox config --set MEDIA_TIMEOUT=120
ArchiveBox in Docker accepts all the same environment variables as normal, see the list on the [[Configuration]] page.
To set environment variables for a single run, you can use the env KEY=VAL ... command, -e KEY=VAL, or --env-file=somefile.env.
echo 'https://example.com' | docker run -it -v $PWD:/data -e FETCH_SCREENSHOT=False archivebox/archivebox add
docker run -i -v $PWD:/data --env-file=ArchiveBox.env archivebox/archivebox
You can also edit the data/ArchiveBox.conf file directly and the changes will take effect on the next run.
General¶
Usage¶
▶️ Make sure the dependencies are fully installed before running any ArchiveBox commands.
ArchiveBox API Reference:

- CLI Usage: Docs and examples for the ArchiveBox command line interface.
- UI Usage: Docs and screenshots for the outputted HTML archive interface.
- Disk Layout: Description of the archive folder structure and contents.
Related:
- [[Docker]]: Learn about ArchiveBox usage with Docker and Docker Compose
- [[Configuration]]: Learn about the various archive method options
- [[Scheduled Archiving]]: Learn how to set up automatic daily archiving
- [[Publishing Your Archive]]: Learn how to host your archive for others to access
- [[Troubleshooting]]: Resources if you encounter any problems
CLI Usage¶

All three of these ways of running ArchiveBox are equivalent and interchangeable:
- archivebox [subcommand] [...args] (using the PyPI package installed via pip install archivebox)
- docker run -it -v $PWD:/data archivebox/archivebox [subcommand] [...args] (using the official Docker image)
- docker-compose run archivebox [subcommand] [...args] (using the official Docker image w/ Docker Compose)
You can share a single archivebox data directory between Docker and non-Docker instances as well, allowing you to run the server in a container but still execute CLI commands on the host for example.
For more examples see the [[Docker]] page.
- Run ArchiveBox with configuration options
- Import a single URL
- Import a list of URLs from a text file
- Import list of links from browser history
Run ArchiveBox with configuration options¶
You can set environment variables in your shell profile, a config file, or by using the env command.
# via the CLI
archivebox config --set TIMEOUT=3600
# by modifying the config file
nano ArchiveBox.conf
# TIMEOUT=3600
# or by using environment variables
env TIMEOUT=3600 archivebox add 'https://example.com'
See the [[Configuration]] page for more details about the available options and ways to pass config. If you’re using Docker, also make sure to read the Configuration section on the [[Docker]] page.
Import a single URL¶
archivebox add 'https://example.com'
# or
echo 'https://example.com' | archivebox add
You can also add --depth=1 to any of these commands if you want to recursively archive the URLs and all URLs one hop away (e.g. all the outlinks on a page + the page itself).
Import a list of URLs from a text file¶
cat urls_to_archive.txt | archivebox add
# or
archivebox add < urls_to_archive.txt
# or
curl https://getpocket.com/users/USERNAME/feed/all | archivebox add
You can also pipe in RSS, XML, Netscape, or any of the other supported import formats via stdin.
archivebox add < ~/Downloads/browser_bookmarks_export.html
# or
archivebox add < ~/Downloads/pinboard_bookmarks.json
# or
archivebox add < ~/Downloads/other_links.txt
Import list of links from browser history¶
Look in the bin/ folder of this repo to find a script to parse your browser’s SQLite history database for URLs.
Specify the type of the browser as the first argument, and optionally the path to the SQLite history file as the second argument.
./bin/export-browser-history --chrome
archivebox add < output/sources/chrome_history.json
# or
./bin/export-browser-history --firefox
archivebox add < output/sources/firefox_history.json
# or
./bin/export-browser-history --safari
archivebox add < output/sources/safari_history.json
UI Usage¶
archivebox server
open http://127.0.0.1:8000
Or if you prefer to use the static HTML UI instead of the interactive UI provided by the server, you can open ./index.html in a browser. You should see something like this.
You can sort by column, search using the box in the upper right, and see the total number of links at the bottom.
Click the Favicon under the “Files” column to go to the details page for each link.


Disk Layout¶
The OUTPUT_DIR folder (usually whatever folder you run the archivebox command in) contains the UI HTML and archived data with the structure outlined below.
- data/
- index.sqlite3 # Main index of all archived URLs
- ArchiveBox.conf # Main config file in ini format
- archive/
- 155243135/ # Archived links are stored in folders by timestamp
- index.json # Index/details page for individual archived link
- index.html
# Archive method outputs:
- warc/
- media/
- git/
...
- sources/ # Each imported URL list is saved as a copy here
- getpocket.com-1552432264.txt
- stdin-1552291774.txt
...
Large Archives¶
I’ve found it takes about an hour to download 1000 articles, and they’ll take up roughly 1GB. Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
Storage requirements go up immensely if you’re using FETCH_MEDIA=True and are archiving many pages with audio & video.
You can run it in parallel by manually splitting your URLs into separate chunks:
archivebox add < urls_chunk_1.txt &
archivebox add < urls_chunk_2.txt &
archivebox add < urls_chunk_3.txt &
(though this may not be faster if you have a very large collection/main index)
Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running).
If you already imported a huge list of bookmarks and want to import only new bookmarks, you can use the ONLY_NEW environment variable. This is useful if you want to import a bookmark dump periodically and want to skip broken links which are already in the index.
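For example (the export filename below is illustrative):
env ONLY_NEW=True archivebox add < ~/Downloads/bookmarks_export.html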
Python Shell Usage¶
Explore the Python API a bit to see what’s available using the archivebox shell:
$ archivebox shell
[i] [2020-09-17 16:57:07] ArchiveBox v0.4.21: archivebox shell
> /Users/squash/Documents/opt/ArchiveBox/data
# Shell Plus Model Imports
from core.models import Snapshot
from django.contrib.admin.models import LogEntry
from django.contrib.auth.models import Group, Permission, User
from django.contrib.contenttypes.models import ContentType
from django.contrib.sessions.models import Session
# Shell Plus Django Imports
from django.core.cache import cache
from django.conf import settings
from django.contrib.auth import get_user_model
from django.db import transaction
from django.db.models import Avg, Case, Count, F, Max, Min, Prefetch, Q, Sum, When
from django.utils import timezone
from django.urls import reverse
from django.db.models import Exists, OuterRef, Subquery
# ArchiveBox Imports
from archivebox.core.models import Snapshot, User
from archivebox import *
help
version
init
config
add
remove
update
list
shell
server
status
manage
oneshot
schedule
[i] Welcome to the ArchiveBox Shell!
https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Shell-Usage
Hint: Example use:
print(Snapshot.objects.filter(is_archived=True).count())
Snapshot.objects.get(url="https://example.com").as_json()
add("https://example.com/some/new/url")
Python API Usage¶
import os
DATA_DIR = os.path.expanduser('~/some/path/containing/your/archivebox/data')  # expand ~ so chdir works
os.chdir(DATA_DIR)
from archivebox.main import check_data_folder, setup_django, add, remove, server
check_data_folder(DATA_DIR)
setup_django(DATA_DIR)
add('https://example.com', index_only=True, out_dir=DATA_DIR)
remove(...)
server(...)
...
For more information see the Python API Reference.
Configuration¶
▶️ The full ArchiveBox config file definition with defaults can be found here: archivebox/config.py.
Configuration of ArchiveBox is done by using the archivebox config command, modifying the ArchiveBox.conf file in the data folder, or by using environment variables. All three methods work equivalently when using Docker as well.
Some equivalent examples of setting some configuration options:
archivebox config --set CHROME_BINARY=google-chrome-stable
# OR
echo "CHROME_BINARY=google-chrome-stable" >> ArchiveBox.conf
# OR
env CHROME_BINARY=google-chrome-stable archivebox add ~/Downloads/bookmarks_export.html
Environment variables take precedence over the config file, which is useful if you only want to use a certain option temporarily during a single run.

Available Configuration Options:
- General Settings: Archiving process, output format, and timing.
- Archive Method Toggles: On/off switches for methods.
- Archive Method Options: Method tunables and parameters.
- Shell Options: Format & behavior of CLI output.
- Dependency Options: Specify exact paths to dependencies.
In case this document is ever out of date, it’s recommended to read the code that loads the config directly in archivebox/config.py.

General Settings¶
General options around the archiving process, output format, and timing.
OUTPUT_DIR¶
Possible Values: [.] / ~/archivebox / …
Path to an output folder to store the archive in.
Defaults to the current folder you’re in ./ ($PWD) when you run the archivebox command.
Note: make sure the user running ArchiveBox has permissions set to allow writing to this folder!
OUTPUT_PERMISSIONS¶
Possible Values: [755] / 644 / …
Permissions to set the output directory and file contents to.
This is useful when running ArchiveBox inside Docker as root and you need to explicitly set the permissions to something that the users on the host can access.
ONLY_NEW¶
Possible Values: [True] / False
Toggle whether or not to attempt rechecking old links when adding new ones, or leave old incomplete links alone and only archive the new links.
By default, ArchiveBox will only archive new links on each import. If you want it to go back through all links in the index and download any missing files on every run, set this to False.
Note: Regardless of how this is set, ArchiveBox will never re-download sites that have already succeeded previously. When this is False it only attempts to fix previous pages that have missing archive extractor outputs; it does not re-archive pages that have already been successfully archived.
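For example, a minimal sketch of making future imports re-check existing links for missing outputs (the filename is illustrative):
archivebox config --set ONLY_NEW=False
archivebox add < ~/Downloads/bookmarks_export.html   # now also re-checks old links for missing outputs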
TIMEOUT¶
Possible Values: [60] / 120 / …
Maximum allowed download time per archive method for each link in seconds. If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.
Note: Do not set this to anything less than 15 seconds as it will cause Chrome to hang indefinitely and many sites to fail completely.
MEDIA_TIMEOUT¶
Possible Values: [3600] / 120 / …
Maximum allowed download time in seconds for fetching media when SAVE_MEDIA=True. This timeout is separate from and usually much longer than TIMEOUT because media downloaded with youtube-dl can often be quite large and take many minutes/hours to download. Tweak this setting based on your network speed and the maximum media file size you plan on downloading.
Note: Do not set this to anything less than 10 seconds as it can often take 5-10 seconds for youtube-dl just to parse the page before it starts downloading media files.
Related options: SAVE_MEDIA
TEMPLATES_DIR¶
Possible Values: [$REPO_DIR/archivebox/templates] / /path/to/custom/templates / …
Path to a directory containing custom index html templates for theming your archive output. Files found in the folder at the specified path can override any of the defaults in the archivebox/themes directory. If you’ve used django before, this works exactly the same way that django template overrides work (because it uses django under the hood).
Related options: FOOTER_INFO
URL_BLACKLIST¶
Possible Values: [None] / .+\.exe$ / 'http(s)?:\/\/(.+)?example.com\/.*' / …
A regular expression used to exclude certain URLs from the archive. You can use it if there are certain domains, extensions, or other URL patterns that you want to ignore whenever they get imported. Blacklisted URLs won’t be included in the index, and their page content won’t be archived.
When building your blacklist, you can check whether a given URL matches your regex expression like so:
>>> import re
>>> URL_BLACKLIST = r'http(s)?:\/\/(.+)?(youtube\.com)|(amazon\.com)\/.*' # replace this with your regex to test
>>> test_url = 'https://test.youtube.com/example.php?abc=123'
>>> bool(re.compile(URL_BLACKLIST, re.IGNORECASE).match(test_url))
True
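For example, once you’re happy with the pattern, you can set it via the CLI (the regex below is just the illustrative one from the snippet above; note the single quotes protecting it from the shell):
archivebox config --set URL_BLACKLIST='http(s)?:\/\/(.+)?(youtube\.com)|(amazon\.com)\/.*'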
Related options: SAVE_MEDIA, SAVE_GIT, GIT_DOMAINS
Archive Method Toggles¶
High-level on/off switches for all the various methods used to archive URLs.
SAVE_TITLE¶
Possible Values: [True] / False
By default ArchiveBox uses the title provided by the import file, but not all types of imports provide titles (e.g. plain text lists of URLs). When this is True, ArchiveBox downloads the page (and follows all redirects), then it attempts to parse the link’s title from the first <title></title> tag found in the response. It may be buggy or not work for certain sites that use JS to set the title; disabling it will lead to links imported without a title showing up with their URL as the title in the UI.
Related options: ONLY_NEW, CHECK_SSL_VALIDITY
SAVE_FAVICON¶
Possible Values: [True] / False
Fetch and save the favicon for the URL from Google’s public favicon service: https://www.google.com/s2/favicons?domain={domain}. Set this to False if you don’t need favicons.
Related options: TEMPLATES_DIR, CHECK_SSL_VALIDITY, CURL_BINARY
SAVE_WGET¶
Possible Values: [True] / False
Fetch page with wget, and save responses into folders for each domain, e.g. example.com/index.html, with .html appended if not present. For a full list of options used during the wget download process, see the archivebox/archive_methods.py:save_wget(...) function.
Related options: TIMEOUT, SAVE_WGET_REQUISITES, CHECK_SSL_VALIDITY, COOKIES_FILE, WGET_USER_AGENT, SAVE_WARC, WGET_BINARY
SAVE_WARC
¶
Possible Values: [True
]/False
Save a timestamped WARC archive of all the page requests and responses during the wget archive process.
Related options:TIMEOUT
, SAVE_WGET_REQUISITES
, CHECK_SSL_VALIDITY
, COOKIES_FILE
, WGET_USER_AGENT
, SAVE_WGET
, WGET_BINARY
SAVE_PDF
¶
Possible Values: [True
]/False
Print page as PDF.
Related options:TIMEOUT
, CHECK_SSL_VALIDITY
, CHROME_USER_DATA_DIR
, CHROME_BINARY
SAVE_SCREENSHOT
¶
Possible Values: [True
]/False
Fetch a screenshot of the page.
Related options:RESOLUTION
, TIMEOUT
, CHECK_SSL_VALIDITY
, CHROME_USER_DATA_DIR
, CHROME_BINARY
SAVE_DOM
¶
Possible Values: [True
]/False
Fetch a DOM dump of the page.
Related options:TIMEOUT
, CHECK_SSL_VALIDITY
, CHROME_USER_DATA_DIR
, CHROME_BINARY
SAVE_SINGLEFILE
¶
Possible Values: [True
]/False
Fetch an HTML file with all assets embedded using Single File.
Related options:TIMEOUT
, CHECK_SSL_VALIDITY
, CHROME_USER_DATA_DIR
, CHROME_BINARY
, SINGLEFILE_BINARY
SAVE_READABILITY
¶
Possible Values: [True
]/False
Extract article text, summary, and byline using Mozilla’s Readability library.
Unlike the other methods, this does not download any additional files, so it’s practically free from a disk usage perspective. It works by using any existing downloaded HTML version (e.g. wget, DOM dump, singlefile) and piping it into readability.
Related options:TIMEOUT
, SAVE_WGET
, SAVE_DOM
, SAVE_SINGLEFILE
SAVE_GIT
¶
Possible Values: [True
]/False
Fetch any git repositories on the page.
Related options:TIMEOUT
, GIT_DOMAINS
, CHECK_SSL_VALIDITY
, GIT_BINARY
SAVE_MEDIA
¶
Possible Values: [True
]/False
Fetch all audio, video, annotations, and media metadata on the page using youtube-dl
. Warning, this can use up a lot of storage very quickly.
Related options:MEDIA_TIMEOUT
, CHECK_SSL_VALIDITY
, YOUTUBEDL_BINARY
SUBMIT_ARCHIVE_DOT_ORG
¶
Possible Values: [True
]/False
Submit the page’s URL to be archived on Archive.org. (The Internet Archive)
Related options:TIMEOUT
, CHECK_SSL_VALIDITY
, CURL_BINARY
Archive Method Options¶
Specific options for individual archive methods above. Some of these are shared between multiple archive methods, others are specific to a single method.
CHECK_SSL_VALIDITY
¶
Possible Values: [True
]/False
Whether to enforce HTTPS certificate and HSTS chain of trust when archiving sites. Set this to False
if you want to archive pages even if they have expired or invalid certificates. Be aware that when False
you cannot guarantee that you have not been man-in-the-middle’d while archiving content, so the content cannot be verified to be what’s on the original site.
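For example, to archive a single page served with an expired or self-signed certificate without changing the saved config (the URL is illustrative):
env CHECK_SSL_VALIDITY=False archivebox add 'https://expired.example.com/'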
SAVE_WGET_REQUISITES
¶
Possible Values: [True
]/False
Fetch images/CSS/JS with wget. (True is highly recommended; otherwise you won’t download many of the assets needed to render the page, like images, JS, and CSS.)
Related options:TIMEOUT
, SAVE_WGET
, SAVE_WARC
, WGET_BINARY
RESOLUTION
¶
Possible Values: [1440,2000
]/1024,768
/…Screenshot resolution in pixels width,height.
Related options:SAVE_SCREENSHOT
CURL_USER_AGENT
¶
Possible Values: [Curl/1.19.1
]/"Mozilla/5.0 ..."
/…This is the user agent to use during curl archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you’re getting blocked by servers for having an unknown/blacklisted user agent.
Related options:USE_CURL
, SAVE_TITLE
, CHECK_SSL_VALIDITY
, CURL_BINARY
, WGET_USER_AGENT
, CHROME_USER_AGENT
WGET_USER_AGENT
¶
Possible Values: [Wget/1.19.1
]/"Mozilla/5.0 ..."
/…This is the user agent to use during wget archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you’re getting blocked by servers for having an unknown/blacklisted user agent.
Related options:SAVE_WGET
, SAVE_WARC
, CHECK_SSL_VALIDITY
, WGET_BINARY
, CHROME_USER_AGENT
CHROME_USER_AGENT
¶
Possible Values: ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/73.0.3683.75 Safari/537.36"
]/"Mozilla/5.0 ..."
/…
This is the user agent to use during Chrome headless archiving. If you’re being blocked by many sites, you can set this to hide the Headless string that reveals to servers that you’re using a headless browser.
Related options:SAVE_PDF
, SAVE_SCREENSHOT
, SAVE_DOM
, CHECK_SSL_VALIDITY
, CHROME_USER_DATA_DIR
, CHROME_HEADLESS
, CHROME_BINARY
, WGET_USER_AGENT
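For example, to make wget and curl identify as a regular desktop Chrome browser (the UA string below is illustrative; substitute whatever your target sites expect):
archivebox config --set WGET_USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'
archivebox config --set CURL_USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'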
GIT_DOMAINS
¶
Possible Values: [github.com,bitbucket.org,gitlab.com
]/git.example.com
/…Domains from which to attempt downloading git repositories using git clone.
Related options:SAVE_GIT
, CHECK_SSL_VALIDITY
COOKIES_FILE
¶
Possible Values: [None
]//path/to/cookies.txt
/…Cookies file to pass to wget. To capture sites that require a user to be logged in, you can specify a path to a netscape-format cookies.txt
file for wget to use. You can generate this file by using a browser extension to export your cookies in this format, or by using wget with --save-cookies
.
Related options:SAVE_WGET
, SAVE_WARC
, CHECK_SSL_VALIDITY
, WGET_BINARY
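For example, after exporting a netscape-format cookies.txt from your browser (the path is illustrative):
archivebox config --set COOKIES_FILE=/path/to/cookies.txt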
CHROME_USER_DATA_DIR
¶
Possible Values: [~/.config/google-chrome
]//tmp/chrome-profile
/…Path to a Chrome user profile directory. To capture sites that require a user to be logged in, you can specify a path to a chrome user profile (which loads the cookies needed for the user to be logged in). If you don’t have an existing Chrome profile, create one with chromium-browser --user-data-dir=/tmp/chrome-profile
, and log into the sites you need. Then set CHROME_USER_DATA_DIR=/tmp/chrome-profile
to make ArchiveBox use that profile.
Note: Make sure the path does not have Default at the end (it should be the parent folder of Default), e.g. set it to CHROME_USER_DATA_DIR=~/.config/chromium and not CHROME_USER_DATA_DIR=~/.config/chromium/Default.
By default, when set to None, ArchiveBox tries all the default User Data Dir paths in order, as listed at: https://chromium.googlesource.com/chromium/src/+/HEAD/docs/user_data_dir.md
Related options:SAVE_PDF
, SAVE_SCREENSHOT
, SAVE_DOM
, CHECK_SSL_VALIDITY
, CHROME_HEADLESS
, CHROME_BINARY
CHROME_HEADLESS
¶
Possible Values: [True
]/False
Whether or not to use Chrome/Chromium in --headless
mode (no browser UI displayed). When set to False
, the full Chrome UI will be launched each time it’s used to archive a page, which greatly slows down the process but allows you to watch in real-time as it saves each page.
Related options:SAVE_PDF
, SAVE_SCREENSHOT
, SAVE_DOM
, CHROME_USER_DATA_DIR
, CHROME_BINARY
CHROME_SANDBOX
¶
Possible Values: [True
]/False
Whether or not to use the Chrome sandbox when archiving.
If you see an error message like this, it means you are trying to run ArchiveBox as root:
:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180
Note: Do not run ArchiveBox as root! The solution to this error is not to override it by setting CHROME_SANDBOX=False; it’s to create another user (e.g. www-data) and run ArchiveBox under that new, less privileged user. This is a security-critical setting; only set this to False if you’re running ArchiveBox inside a container or VM where it doesn’t have access to the rest of your system!
Related options:SAVE_PDF
, SAVE_SCREENSHOT
, SAVE_DOM
, CHECK_SSL_VALIDITY
, CHROME_USER_DATA_DIR
, CHROME_HEADLESS
, CHROME_BINARY
Shell Options¶
Options around the format of the CLI output.
USE_COLOR
¶
Possible Values: [True
]/False
Colorize console output. Defaults to True
if stdin is a TTY (interactive session), otherwise False
(e.g. if run in a script or piped into a file).


SHOW_PROGRESS
¶
Possible Values: [True
]/False
Show real-time progress bar in console output. Defaults to True
if stdin is a TTY (interactive session), otherwise False
(e.g. if run in a script or piped into a file).

Dependency Options¶
Options for defining which binaries to use for the various archive method dependencies.
CHROME_BINARY
¶
Possible Values: [chromium-browser
]//usr/local/bin/google-chrome
/…Path or name of the Google Chrome / Chromium binary to use for all the headless browser archive methods.
Without setting this environment variable, ArchiveBox by default looks for the following binaries in $PATH, in this order:
chromium-browser
chromium
google-chrome
google-chrome-stable
google-chrome-unstable
google-chrome-beta
google-chrome-canary
google-chrome-dev
You can override the default behavior to search for any available bin by setting the environment variable to your preferred Chrome binary name or path.
The chrome/chromium dependency is optional and only required for screenshot, PDF, and DOM dump output; it can be safely ignored if those three methods are disabled.
Related options:SAVE_PDF
, SAVE_SCREENSHOT
, SAVE_DOM
, SAVE_SINGLEFILE
, CHROME_USER_DATA_DIR
, CHROME_HEADLESS
, CHROME_SANDBOX
WGET_BINARY
¶
Possible Values: [wget
]//usr/local/bin/wget
/…Path or name of the wget binary to use.
YOUTUBEDL_BINARY
¶
Possible Values: [youtube-dl
]//usr/local/bin/youtube-dl
/…Path or name of the youtube-dl binary to use.
Related options:SAVE_MEDIA
GIT_BINARY
¶
Possible Values: [git
]//usr/local/bin/git
/…Path or name of the git binary to use.
Related options:SAVE_GIT
CURL_BINARY
¶
Possible Values: [curl
]//usr/local/bin/curl
/…Path or name of the curl binary to use.
Related options:SAVE_FAVICON
, SUBMIT_ARCHIVE_DOT_ORG
SINGLEFILE_BINARY
¶
Possible Values: [single-file
]//usr/local/bin/single-file
/…Path or name of the SingleFile binary to use.
This can be installed using npm install -g git+https://github.com/gildas-lormeau/SingleFile.git
.
Related options:SAVE_SINGLEFILE
, CHROME_BINARY
, CHROME_USER_DATA_DIR
, CHROME_HEADLESS
, CHROME_SANDBOX
READABILITY_BINARY
¶
Possible Values: [readability-extractor
]//usr/local/bin/readability-extractor
/…Path or name of the Readability extractor binary to use.
This can be installed using npm install -g git+https://github.com/pirate/readability-extractor.git
.
Related options:SAVE_READABILITY

Troubleshooting¶
▶️ If you need help or have a question, you can open an issue or reach out on Twitter.
What are you having an issue with?:
Installing¶
Make sure you’ve followed the Manual Setup guide in the [[Install]] instructions first. Then check here for help depending on what component you need help with:
Python¶
On some Linux distributions the python3 package might not be recent enough. If this is the case for you, resort to installing a recent enough version manually.
add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6
If you still need help, the official Python docs are a good place to start.
Chromium/Google Chrome¶
For more info, see the [[Chromium Install]] page.
archive.py
depends on being able to access a chromium-browser
/google-chrome
executable. The executable used
defaults to chromium-browser
but can be manually specified with the environment variable CHROME_BINARY
:
env CHROME_BINARY=/usr/local/bin/chromium-browser ./archive ~/Downloads/bookmarks_export.html
- Test to make sure you have Chrome on your
$PATH
with:
which chromium-browser || which google-chrome
If no executable is displayed, follow the setup instructions to install and link one of them.
- If a path is displayed, the next step is to check that it’s runnable:
chromium-browser --version || google-chrome --version
If no version is displayed, try the setup instructions again, or confirm that you have permission to access chrome.
- If a version is displayed and it’s
<59
, upgrade it:
apt upgrade chromium-browser -y
# OR
brew cask upgrade chromium
- If a version is displayed and it’s
>=59
, make sure archive.py
is running the right one:
env CHROME_BINARY=/path/from/step/1/chromium-browser ./archive bookmarks_export.html # replace the path with the one you got from step 1
Wget & Curl¶
If you’re missing wget
or curl
, simply install them using apt
or your package manager of choice.
See the “Manual Setup” instructions for more details.
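For example (package names are the standard ones for each package manager):
apt install wget curl        # Debian/Ubuntu
brew install wget curl       # macOS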
If wget times out or randomly fails to download some sites that you have confirmed are online,
upgrade wget to the most recent version with brew upgrade wget
or apt upgrade wget
. There is
a bug in versions <=1.19.1_1
that caused wget to fail for perfectly valid sites.
Archiving¶
No links parsed from export file¶
Please open an issue with a description of where you got the export, and preferably your export file attached (you can redact the links). We’ll fix the parser to support your format.
Lots of skipped sites¶
If you’ve run the archiver before, it won’t re-download sites on subsequent runs; it will only download new links.
If you haven’t already run it, make sure you have a working internet connection and that the parsed URLs look correct.
You can check the archive.py
output or index.html
to see what links it’s downloading.
If you’re still having issues, try deleting or moving the output/archive
folder (back it up first!) and running ./archive
again.
Lots of errors¶
Make sure you have all the dependencies installed and that you’re able to visit the links from your browser normally. Open an issue with a description of the errors if you’re still having problems.
Lots of broken links from the index¶
Not all sites can be effectively archived with every method; that’s why it’s best to use a combination of wget, PDF, and screenshot outputs.
If it seems like more than 10-20% of sites in the archive are broken, open an issue
with some of the URLs that failed to be archived and I’ll investigate.
Removing unwanted links from the index¶
If you accidentally added lots of unwanted links into the index and they slow down your archiving, you can use the bin/purge script to remove them from your index; it removes everything matching the Python regexes you pass into it, e.g.: bin/purge -r 'amazon\.com' -r 'google\.com'. It will prompt before removing links from the index, but for extra safety you may want to back up index.json first (or put it under version control).
Security Overview¶
Usage Modes¶
ArchiveBox has three common usage modes outlined below.

Public Mode [Default]¶
This is the default (lax) mode, intended for archiving public (non-secret) URLs without authenticating the headless browser. This is the mode to use if you’re archiving news articles, audio, video, etc. from your browser bookmarks to a folder published on your webserver. It allows you to access and link to content on http://your.archive.com/archive... after the originals go down.
This mode should not be used for archiving entire browser history or authenticated private content like Google Docs, paywalled content, invite-only subreddits, etc.
IMPORTANT: Don’t use ArchiveBox for private archived content right now as we’re in the middle of resolving some security issues with how JS is executed in archived content.¶
See here for more info: Architecture: Archived JS executes in a context shared with all other archived content

Private Mode¶
ArchiveBox is designed to be able to archive content that requires authentication or cookies. This includes paywalled content, private forums, LAN-only content, etc.
To get started, set CHROME_USER_DATA_DIR
and COOKIES_FILE
to point to a Chrome user folder that has your sessions and a wget cookies.txt
file respectively.
If you’re importing private links or authenticated content, you probably don’t want to share your archive folder publicly on a webserver, so don’t follow the [[Publishing Your Archive]] instructions unless you are only serving it on a trusted LAN or have some sort of authentication in front of it. Make sure to point ArchiveBox to an output folder with conservative permissions, as it may contain archived content with secret session tokens or pieces of your user data. You may also wish to encrypt the archive using an encrypted disk image or filesystem like ZFS as it will contain all requests and response data, including session keys, user data, usernames, etc.
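For example, a minimal sketch pointing ArchiveBox at a logged-in Chrome profile and a wget cookies file (paths are illustrative):
archivebox config --set CHROME_USER_DATA_DIR=/home/you/chrome-profile
archivebox config --set COOKIES_FILE=/home/you/cookies.txt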

Stealth Mode¶
If you want ArchiveBox to be less noisy and avoid leaking any URLs to 3rd-party APIs during archiving, you can disable the options below. Disabling these is recommended if you plan on archiving any sites that use secret tokens in the URL to grant access to private content without authentication, e.g. Google Docs, CodiMD notepads, etc.
- https://web.archive.org/save/{url} : when SUBMIT_ARCHIVE_DOT_ORG is True, full URLs are submitted to the Wayback Machine for archiving, but no cookies or content from the local authenticated archive are shared
- https://www.google.com/s2/favicons?domain={domain} : when FETCH_FAVICON is True, the domains for each link are shared in order to get the favicon, but not the full URL
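For example, using the config CLI (note that in recent versions the favicon toggle is named SAVE_FAVICON rather than the older FETCH_FAVICON):
archivebox config --set SUBMIT_ARCHIVE_DOT_ORG=False
archivebox config --set SAVE_FAVICON=False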
Do not run as root¶

Do not run ArchiveBox as root for a number of reasons:
- Chrome will execute as root and fail immediately, because Chrome sandboxing is pointless when the data directory is opened as root (do not set CHROME_SANDBOX=False just to bypass that error!)
- All dependencies will be run as root; if any of them have a vulnerability that’s exploited by sites you’re archiving, you’re opening yourself up to full system compromise
- ArchiveBox does lots of HTML parsing, filesystem access, and shell command execution. A bug in any one of those subsystems could potentially lead to deleted/damaged data on your hard drive, or full system compromise, unless it’s restricted to a user that only has permissions to access the directories it needs
- Do you really trust a project created by a Github user called @pirate 😉? Why give a random program off the internet root access to your entire system? (I don’t have malicious intent, I’m just saying in principle you should not be running random Github projects as root)
Instead, you should run ArchiveBox as your normal user, or create a user with less privileged access:
groupadd -r archivebox                                # create the group first so useradd -g succeeds
useradd -r -g archivebox -G audio,video archivebox    # the audio & video groups are used by chrome
mkdir -p /home/archivebox/data
chown -R archivebox:archivebox /home/archivebox
...
sudo -u archivebox archivebox add ...
~~If you absolutely must run it as root for some reason, a footgun is provided: you can set ALLOW_ROOT=True
via environment variable or in your ArchiveBox.conf file.~~ This footgun option was removed (I’m sorry, the support burden of helping people who messed up their systems by running this as root was too high).

Output Folder¶
Permissions¶
What are the permissions on the archive folder? Limit access to the fewest possible users by checking folder ownership and setting OUTPUT_PERMISSIONS
accordingly.
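For example (user, group, and mode are illustrative; pick the most restrictive values that still let your archiving user and webserver read what they need):
chown -R archivebox:archivebox /path/to/your/archive
archivebox config --set OUTPUT_PERMISSIONS=755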
Filesystem¶
How much are you planning to archive? Only a few bookmarked articles, or thousands of pages of browsing history a day? If it’s only 1-50 pages a day, you can probably just stick it in a normal folder on your hard drive, but if you want to go over 100 pages a day, you will likely want to put your archive on a compressed/deduplicated/encrypted disk image or filesystem like ZFS.
Publishing¶
Are you publishing your archive? If so, make sure you’re only serving it as HTML and not accidentally running it as php or cgi, and put it on its own domain not shared with other services. This is done in order to avoid cookies leaking between your main domain and domains hosting content you don’t control. Many companies put user provided files on separate domains like googleusercontent.com and github.io to avoid this problem.
Published archives automatically include a robots.txt with Disallow: / to block search engines from indexing them. You may still wish to publish your contact info in the index footer (using FOOTER_INFO) so that you can respond to any DMCA and copyright takedown notices if you accidentally rehost copyrighted content.
Publishing Your Archive¶
There are two ways to publish your archive: using the archivebox server
or by exporting and hosting it as static HTML.
1. Use the built-in webserver¶
# set the permissions depending on how public/locked down you want it to be
archivebox config --set PUBLIC_INDEX=True
archivebox config --set PUBLIC_SNAPSHOTS=True
archivebox config --set PUBLIC_ADD_VIEW=True
# create an admin username and password for yourself
archivebox manage createsuperuser
# then start the webserver and open the web UI in your browser
archivebox server 0.0.0.0:8000
open http://127.0.0.1:8000
This server is enabled out-of-the-box if you’re using docker-compose
to run ArchiveBox,
and there is a commented-out example nginx config with SSL set up as well.
2. Export and host it as static HTML¶
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
# then upload the entire output folder containing index.html and archive/ somewhere
# e.g. github pages or another static hosting provider
# you can also serve it with the simple python HTTP server
python3 -m http.server --bind 0.0.0.0 --directory . 8000
open http://127.0.0.1:8000
Here’s a sample nginx configuration that works to serve your static archive folder:
location / {
alias /path/to/your/ArchiveBox/data/;
index index.html;
autoindex on;
try_files $uri $uri/ =404;
}
Make sure you’re not running any content as CGI or PHP, you only want to serve static files!
URLs look like: https://archive.example.com/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem.html
Security Concerns¶
Re-hosting other people’s content has security implications for any other sites sharing your hosting domain. Make sure you understand the dangers of hosting untrusted archived HTML/JS/CSS on a shared domain. Due to the security risk of serving some malicious JS you archived by accident, it’s best to put this on a domain or subdomain of its own to keep cookies separate and help limit the effectiveness of CSRF attacks and other nastiness.
Copyright Concerns¶
Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons; it’s up to you to host responsibly and respond to takedown requests appropriately.
You may also want to blacklist your archive in /robots.txt
if you don’t want to be publicly associated with all the links you archive via search engine results.
Please modify the FOOTER_INFO
config variable to add your contact info to the footer of your index.
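For example (the contact address is a placeholder):
archivebox config --set FOOTER_INFO='Content is hosted for personal archiving purposes only. Contact admin@example.com for any takedown requests.'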
Scheduled Archiving¶
Using Cron¶
To schedule regular archiving you can use any task scheduler like cron
, at
, systemd
, etc.
ArchiveBox ignores links that are imported multiple times (keeping the earliest version that it’s seen). This means you can add cron jobs that regularly poll the same file or URL for new links, adding only new ones as necessary.
For some example configs, see the etc/cron.d
and etc/supervisord
folders.
Examples¶
Example: Import Firefox browser history every 24 hours¶
This example exports your browser history and archives it once a day:
Create /opt/ArchiveBox/bin/firefox_custom.sh
:
#!/bin/bash
cd /opt/ArchiveBox
./bin/archivebox-export-browser-history --firefox ./output/sources/firefox_history.json
archivebox add < ./output/sources/firefox_history.json >> /var/log/ArchiveBox.log
Then create a new file /etc/cron.d/ArchiveBox-Firefox
to tell cron to run your script every 24 hours:
0 0 * * * www-data /opt/ArchiveBox/bin/firefox_custom.sh
Example: Import an RSS feed from Pocket every 12 hours¶
This example imports your Pocket bookmark feed and archives any new links every 12 hours:
First, set your Pocket RSS feed to “public” under https://getpocket.com/privacy_controls.
Create /opt/ArchiveBox/bin/pocket_custom.sh
:
#!/bin/bash
cd /opt/ArchiveBox
curl https://getpocket.com/users/yourusernamegoeshere/feed/all | archivebox add >> /var/log/ArchiveBox.log
Then create a new file /etc/cron.d/ArchiveBox-Pocket
to tell cron to run your script every 12 hours:
0 */12 * * * www-data /opt/ArchiveBox/bin/pocket_custom.sh
Chromium Install¶
By default, ArchiveBox looks for any existing installed version of Chrome/Chromium and uses it if found. You can optionally install a specific version and set the environment variable CHROME_BINARY
to force ArchiveBox to use that one, e.g.:
CHROME_BINARY=google-chrome-beta
CHROME_BINARY=/usr/bin/chromium-browser
CHROME_BINARY='/Applications/Chromium.app/Contents/MacOS/Chromium'
If you don’t already have Chrome installed, I recommend installing Chromium instead of Google Chrome, as it’s the open-source fork of Chrome that doesn’t send as much tracking data to Google.
Check for existing Chrome/Chromium install:

google-chrome --version || chromium-browser --version
Google Chrome 73.0.3683.75 beta # should be >v59
Installing Chromium¶
macOS¶
If you already have /Applications/Chromium.app
, you don’t need to run this.
brew install chromium
Ubuntu/Debian¶
If you already have chromium-browser >= v59 installed (run chromium-browser --version to check), you don’t need to run this.
apt update
apt install chromium-browser
Installing Google Chrome¶
macOS¶
If you already have /Applications/Google Chrome.app
, you don’t need to run this.
brew install google-chrome
Ubuntu/Debian¶
If you already have google-chrome >= v59 installed (run google-chrome --version to check), you don’t need to run this.
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
apt update
apt install google-chrome-beta
Troubleshooting¶
If you encounter problems setting up Google Chrome or Chromium, see the Troubleshooting page.
API Reference¶
archivebox¶
archivebox package¶
Subpackages¶
archivebox.cli package¶
- archivebox.cli.list_subcommands() → Dict[str, str]: find and import all valid archivebox_<subcommand>.py files in CLI_DIR
- archivebox.cli.run_subcommand(subcommand: str, subcommand_args: List[str] = None, stdin: Optional[IO] = None, pwd: Union[pathlib.Path, str, None] = None) → None: Run a given ArchiveBox subcommand with the given list of args
- archivebox.cli.help(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Print the ArchiveBox help message and usage
- archivebox.cli.version(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Print the ArchiveBox version and dependency information
- archivebox.cli.init(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Initialize a new ArchiveBox collection in the current directory
- archivebox.cli.config(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Get and set your ArchiveBox project configuration values
- archivebox.cli.setup(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Automatically install all ArchiveBox dependencies and extras
- archivebox.cli.add(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Add a new URL or list of URLs to your archive
- archivebox.cli.remove(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Remove the specified URLs from the archive
- archivebox.cli.update(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Import any new links from subscriptions and retry any previously failed/skipped links
- archivebox.cli.list(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: List, filter, and export information about archive entries
- archivebox.cli.status(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Print out some info and statistics about the archive collection
- archivebox.cli.shell(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Enter an interactive ArchiveBox Django shell
- archivebox.cli.manage(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Run an ArchiveBox Django management command
- archivebox.cli.server(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Run the ArchiveBox HTTP server
- archivebox.cli.oneshot(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Create a single URL archive folder with an index.json and index.html, and all the archive method outputs. You can run this to archive single pages without needing to create a whole collection with archivebox init.
- archivebox.cli.schedule(args: Optional[List[str]] = None, stdin: Optional[IO] = None, pwd: Optional[str] = None) → None: Set ArchiveBox to regularly import URLs at specific times using cron
archivebox.config package¶
ArchiveBox config definitions (including defaults and dynamic config options).
Config Usage Example:
archivebox config --set MEDIA_TIMEOUT=600
env MEDIA_TIMEOUT=600 USE_COLOR=False ... archivebox [subcommand] ...
Config Precedence Order:
- cli args (--update-all / --index-only / etc.)
- shell environment vars (env USE_COLOR=False archivebox add '...')
- config file (echo "SAVE_FAVICON=False" >> ArchiveBox.conf)
- defaults (defined below in Python)
Documentation:
- archivebox.config.get_real_name(key: str) → str: get the current canonical name for a given deprecated config key
- archivebox.config.load_config_val(key: str, default: … = None, type: Optional[Type[CT_co]] = None, aliases: Optional[Tuple[str, ...]] = None, config: Optional[archivebox.config_stubs.ConfigDict] = None, env_vars: Optional[os._Environ] = None, config_file_vars: Optional[Dict[str, str]] = None) → …: parse bool, int, and str key=value pairs from env
- archivebox.config.load_config_file(out_dir: str = None) → Optional[Dict[str, str]]: load the ini-formatted config file from OUTPUT_DIR/Archivebox.conf
- archivebox.config.write_config_file(config: Dict[str, str], out_dir: str = None) → archivebox.config_stubs.ConfigDict: write the ini-formatted config file to OUTPUT_DIR/Archivebox.conf
- archivebox.config.load_config(defaults: Dict[str, archivebox.config_stubs.ConfigDefault], config: Optional[archivebox.config_stubs.ConfigDict] = None, out_dir: Optional[str] = None, env_vars: Optional[os._Environ] = None, config_file_vars: Optional[Dict[str, str]] = None) → archivebox.config_stubs.ConfigDict
- archivebox.config.stdout(*args, color: Optional[str] = None, prefix: str = '', config: Optional[archivebox.config_stubs.ConfigDict] = None) → None
- archivebox.config.stderr(*args, color: Optional[str] = None, prefix: str = '', config: Optional[archivebox.config_stubs.ConfigDict] = None) → None
- archivebox.config.hint(text: Union[Tuple[str, ...], List[str], str], prefix=' ', config: Optional[archivebox.config_stubs.ConfigDict] = None) → None
- archivebox.config.bin_version(binary: Optional[str]) → Optional[str]: check the presence and return the valid version line of a specified binary
- archivebox.config.find_chrome_binary() → Optional[str]: find any installed chrome binaries in the default locations
- archivebox.config.find_chrome_data_dir() → Optional[str]: find any installed chrome user data directories in the default locations
- archivebox.config.get_code_locations(config: archivebox.config_stubs.ConfigDict) → Dict[str, …]
- archivebox.config.get_external_locations(config: archivebox.config_stubs.ConfigDict) → …
- archivebox.config.get_data_locations(config: archivebox.config_stubs.ConfigDict) → …
- archivebox.config.get_dependency_info(config: archivebox.config_stubs.ConfigDict) → …
- archivebox.config.get_chrome_info(config: archivebox.config_stubs.ConfigDict) → …
- archivebox.config.check_system_config(config: archivebox.config_stubs.ConfigDict = …) → None
- archivebox.config.check_dependencies(config: archivebox.config_stubs.ConfigDict = …, show_help: bool = True) → None
-
archivebox.config.
check_data_folder
(out_dir: Union[str, pathlib.Path, None] = None, config: archivebox.config_stubs.ConfigDict = CONFIG) → None[source]¶
-
archivebox.config.
check_migrations
(out_dir: Union[str, pathlib.Path, None] = None, config: archivebox.config_stubs.ConfigDict = CONFIG)[source]¶
-
archivebox.config.
setup_django
(out_dir: pathlib.Path = None, check_db=False, config: archivebox.config_stubs.ConfigDict = CONFIG, in_memory_db=False) → None[source]¶
-
archivebox.config.
TERM_WIDTH
()¶
archivebox.core package¶
-
class
archivebox.core.migrations.0001_initial.
Migration
(name, app_label)[source]¶ Bases:
django.db.migrations.migration.Migration
-
initial
= True¶
-
dependencies
= []¶
-
operations
= [<CreateModel name='Snapshot', fields=[('id', <django.db.models.fields.UUIDField>), ('url', <django.db.models.fields.URLField>), ('timestamp', <django.db.models.fields.CharField>), ('title', <django.db.models.fields.CharField>), ('tags', <django.db.models.fields.CharField>), ('added', <django.db.models.fields.DateTimeField>), ('updated', <django.db.models.fields.DateTimeField>)]>]¶
-
-
class
archivebox.core.admin.
ArchiveResultInline
(parent_model, admin_site)[source]¶ Bases:
django.contrib.admin.options.TabularInline
-
model
¶ alias of
core.models.ArchiveResult
-
media
¶
-
-
class
archivebox.core.admin.
TagInline
(parent_model, admin_site)[source]¶ Bases:
django.contrib.admin.options.TabularInline
-
model
¶ alias of
core.models.Snapshot_tags
-
media
¶
-
-
class
archivebox.core.admin.
AutocompleteTags
[source]¶ Bases:
object
-
model
¶ alias of
core.models.Tag
-
search_fields
= ['name']¶
-
-
class
archivebox.core.admin.
SnapshotActionForm
(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]¶ Bases:
django.contrib.admin.helpers.ActionForm
-
base_fields
= {'action': <django.forms.fields.ChoiceField object>, 'select_across': <django.forms.fields.BooleanField object>, 'tags': <django.forms.models.ModelMultipleChoiceField object>}¶
-
declared_fields
= {'action': <django.forms.fields.ChoiceField object>, 'select_across': <django.forms.fields.BooleanField object>, 'tags': <django.forms.models.ModelMultipleChoiceField object>}¶
-
media
¶
-
-
class
archivebox.core.admin.
SnapshotAdmin
(model, admin_site)[source]¶ Bases:
core.mixins.SearchResultsAdminMixin
,django.contrib.admin.options.ModelAdmin
-
list_display
= ('added', 'title_str', 'files', 'size', 'url_str')¶
-
sort_fields
= ('title_str', 'url_str', 'added', 'files')¶
-
readonly_fields
= ('info', 'bookmarked', 'added', 'updated')¶
-
search_fields
= ('id', 'url', 'timestamp', 'title', 'tags__name')¶
-
fields
= ('timestamp', 'url', 'title', 'tags', 'info', 'bookmarked', 'added', 'updated')¶
-
list_filter
= ('added', 'updated', 'tags', 'archiveresult__status')¶
-
ordering
= ['-added']¶
-
actions
= ['add_tags', 'remove_tags', 'update_titles', 'update_snapshots', 'resnapshot_snapshot', 'overwrite_snapshots', 'delete_snapshots']¶
-
autocomplete_fields
= ['tags']¶
-
inlines
= [<class 'archivebox.core.admin.ArchiveResultInline'>]¶
-
list_per_page
= 40¶
-
action_form
¶ alias of
SnapshotActionForm
-
get_queryset
(request)[source]¶ Return a QuerySet of all model instances that can be edited by the admin site. This is used by changelist_view.
-
media
¶
-
-
class
archivebox.core.admin.
TagAdmin
(model, admin_site)[source]¶ Bases:
django.contrib.admin.options.ModelAdmin
-
list_display
= ('slug', 'name', 'num_snapshots', 'snapshots', 'id')¶
-
sort_fields
= ('id', 'name', 'slug')¶
-
readonly_fields
= ('id', 'num_snapshots', 'snapshots')¶
-
search_fields
= ('id', 'name', 'slug')¶
-
fields
= ('id', 'num_snapshots', 'snapshots', 'name', 'slug')¶
-
actions
= ['delete_selected']¶
-
ordering
= ['-id']¶
-
media
¶
-
-
class
archivebox.core.admin.
ArchiveResultAdmin
(model, admin_site)[source]¶ Bases:
django.contrib.admin.options.ModelAdmin
-
list_display
= ('id', 'start_ts', 'extractor', 'snapshot_str', 'tags_str', 'cmd_str', 'status', 'output_str')¶
-
sort_fields
= ('start_ts', 'extractor', 'status')¶
-
readonly_fields
= ('id', 'uuid', 'snapshot_str', 'tags_str')¶
-
search_fields
= ('id', 'uuid', 'snapshot__url', 'extractor', 'output', 'cmd_version', 'cmd', 'snapshot__timestamp')¶
-
fields
= ('id', 'uuid', 'snapshot_str', 'tags_str', 'snapshot', 'extractor', 'status', 'start_ts', 'end_ts', 'output', 'pwd', 'cmd', 'cmd_version')¶
-
autocomplete_fields
= ['snapshot']¶
-
list_filter
= ('status', 'extractor', 'start_ts', 'cmd_version')¶
-
ordering
= ['-start_ts']¶
-
list_per_page
= 40¶
-
media
¶
-
-
class
archivebox.core.admin.
ArchiveBoxAdmin
(name='admin')[source]¶ Bases:
django.contrib.admin.sites.AdminSite
-
site_header
= 'ArchiveBox'¶
-
index_title
= 'Links'¶
-
site_title
= 'Index'¶
-
-
archivebox.core.admin.
path
(route, view, kwargs=None, name=None, *, Pattern=<class 'django.urls.resolvers.RoutePattern'>)¶
-
archivebox.core.urls.
path
(route, view, kwargs=None, name=None, *, Pattern=<class 'django.urls.resolvers.RoutePattern'>)¶
-
class
archivebox.core.views.
PublicIndexView
(**kwargs)[source]¶ Bases:
django.views.generic.list.ListView
-
template_name
= 'public_index.html'¶
-
model
¶ alias of
core.models.Snapshot
-
paginate_by
= 40¶
-
ordering
= ['-added']¶
-
WSGI config for archivebox project.
It exposes the WSGI callable as a module-level variable named application.
For more information on this file, see https://docs.djangoproject.com/en/2.1/howto/deployment/wsgi/
archivebox.extractors package¶
-
archivebox.extractors.archive_org.
should_save_archive_dot_org
(link: archivebox.index.schema.Link, out_dir: Optional[pathlib.Path] = None, overwrite: Optional[bool] = False) → bool[source]¶
-
class
archivebox.extractors.title.
TitleParser
(*args, **kwargs)[source]¶ Bases:
html.parser.HTMLParser
-
title
¶
-
-
archivebox.extractors.wget.
should_save_wget
(link: archivebox.index.schema.Link, out_dir: Optional[pathlib.Path] = None, overwrite: Optional[bool] = False) → bool[source]¶
-
archivebox.extractors.
archive_link
(link: archivebox.index.schema.Link, overwrite: bool = False, methods: Optional[Iterable[str]] = None, out_dir: Optional[pathlib.Path] = None) → archivebox.index.schema.Link[source]¶ download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp
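For illustration, a minimal sketch of driving this extractor pipeline from Python; this is a hypothetical recipe rather than an official API, and it assumes an already-initialized collection and that 'wget' / 'title' are valid extractor method names:

from pathlib import Path
from archivebox.config import OUTPUT_DIR, setup_django
from archivebox.extractors import archive_link
from archivebox.index import load_main_index

setup_django(out_dir=Path(OUTPUT_DIR))              # boot Django so the ORM/index is usable
links = load_main_index(out_dir=Path(OUTPUT_DIR))   # Link objects parsed from the SQLite index
for link in links[:1]:
    # re-run only a subset of extractors for the first link (method names assumed valid)
    archive_link(link, overwrite=False, methods=['wget', 'title'])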
archivebox.index package¶
-
archivebox.index.html.
parse_html_main_index
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Iterator[str][source]¶ parse an archive index html file and return the list of urls
-
archivebox.index.html.
generate_index_from_links
(links: List[archivebox.index.schema.Link], with_headers: bool)[source]¶
-
archivebox.index.html.
main_index_template
(links: List[archivebox.index.schema.Link], template: str = 'static_index.html') → str[source]¶ render the template for the entire main index
-
archivebox.index.html.
write_html_link_details
(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → None[source]¶
-
archivebox.index.json.
generate_json_index_from_links
(links: List[archivebox.index.schema.Link], with_headers: bool)[source]¶
-
archivebox.index.json.
parse_json_main_index
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Iterator[archivebox.index.schema.Link][source]¶ parse an archive index json file and return the list of links
-
archivebox.index.json.
write_json_link_details
(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → None[source]¶ write a json file with some info about the link
-
archivebox.index.json.
parse_json_link_details
(out_dir: Union[pathlib.Path, str], guess: Optional[bool] = False) → Optional[archivebox.index.schema.Link][source]¶ load the json link index from a given directory
-
archivebox.index.json.
parse_json_links_details
(out_dir: Union[pathlib.Path, str]) → Iterator[archivebox.index.schema.Link][source]¶ read through all the archive data folders and return the parsed links
-
class
archivebox.index.json.
ExtendedEncoder
(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶ Bases:
json.encoder.JSONEncoder
Extended json serializer that supports serializing several model fields and objects
-
default
(obj)[source]¶ Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError). For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
-
WARNING: THIS FILE IS ALL LEGACY CODE TO BE REMOVED.
DO NOT ADD ANY NEW FEATURES TO THIS FILE, NEW CODE GOES HERE: core/models.py
-
class
archivebox.index.schema.
ArchiveResult
(cmd: List[str], pwd: Union[str, NoneType], cmd_version: Union[str, NoneType], output: Union[str, Exception, NoneType], status: str, start_ts: datetime.datetime, end_ts: datetime.datetime, index_texts: Union[List[str], NoneType] = None, schema: str = 'ArchiveResult')[source]¶ Bases:
object
-
index_texts
= None¶
-
schema
= 'ArchiveResult'¶
-
duration
¶
-
-
class
archivebox.index.schema.
Link
(timestamp: str, url: str, title: Union[str, NoneType], tags: Union[str, NoneType], sources: List[str], history: Dict[str, List[archivebox.index.schema.ArchiveResult]] = <factory>, updated: Union[datetime.datetime, NoneType] = None, schema: str = 'Link')[source]¶ Bases:
object
-
updated
= None¶
-
schema
= 'Link'¶
-
snapshot_id
¶
-
link_dir
¶
-
archive_path
¶
-
archive_size
¶
-
url_hash
¶
-
scheme
¶
-
extension
¶
-
domain
¶
-
path
¶
-
basename
¶
-
base_url
¶
-
bookmarked_date
¶
-
updated_date
¶
-
archive_dates
¶
-
oldest_archive_date
¶
-
newest_archive_date
¶
-
num_outputs
¶
-
num_failures
¶
-
is_static
¶
-
is_archived
¶
-
-
archivebox.index.sql.
parse_sql_main_index
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Iterator[archivebox.index.schema.Link][source]¶
-
archivebox.index.sql.
remove_from_sql_main_index
(snapshots: django.db.models.query.QuerySet, atomic: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶
-
archivebox.index.sql.
write_sql_main_index
(links: List[archivebox.index.schema.Link], out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶
-
archivebox.index.sql.
write_sql_link_details
(link: archivebox.index.schema.Link, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶
-
archivebox.index.sql.
list_migrations
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → List[Tuple[bool, str]][source]¶
-
archivebox.index.
merge_links
(a: archivebox.index.schema.Link, b: archivebox.index.schema.Link) → archivebox.index.schema.Link[source]¶ deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.
-
archivebox.index.
validate_links
(links: Iterable[archivebox.index.schema.Link]) → List[archivebox.index.schema.Link][source]¶
-
archivebox.index.
archivable_links
(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link][source]¶ remove chrome://, about://, or other schemed links that can’t be archived
-
archivebox.index.
fix_duplicate_links
(sorted_links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link][source]¶ ensures that all non-duplicate links have monotonically increasing timestamps
-
archivebox.index.
sorted_links
(links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link][source]¶
-
archivebox.index.
links_after_timestamp
(links: Iterable[archivebox.index.schema.Link], resume: Optional[float] = None) → Iterable[archivebox.index.schema.Link][source]¶
-
archivebox.index.
lowest_uniq_timestamp
(used_timestamps: collections.OrderedDict, timestamp: str) → str[source]¶ resolve duplicate timestamps by appending a decimal 1234, 1234 -> 1234.1, 1234.2
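A small sketch of the behavior described by the docstring, assuming the OrderedDict maps already-used timestamps to their links and that the caller records each returned value:

from collections import OrderedDict
from archivebox.index import lowest_uniq_timestamp

used = OrderedDict()
first = lowest_uniq_timestamp(used, '1234')    # expected: '1234'
used[first] = 'link a'
second = lowest_uniq_timestamp(used, '1234')   # expected: '1234.1' per the docstring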
-
archivebox.index.
write_main_index
(links: List[archivebox.index.schema.Link], out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Writes links to sqlite3 file for a given list of links
-
archivebox.index.
load_main_index
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs'), warn: bool = True) → List[archivebox.index.schema.Link][source]¶ parse and load existing index with any new links from import_path merged in
-
archivebox.index.
load_main_index_meta
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Optional[dict][source]¶
-
archivebox.index.
parse_links_from_source
(source_path: str, root_url: Optional[str] = None, parser: str = 'auto') → Tuple[List[archivebox.index.schema.Link], List[archivebox.index.schema.Link]][source]¶
-
archivebox.index.
fix_duplicate_links_in_index
(snapshots: django.db.models.query.QuerySet, links: Iterable[archivebox.index.schema.Link]) → Iterable[archivebox.index.schema.Link][source]¶ Given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB.
-
archivebox.index.
dedupe_links
(snapshots: django.db.models.query.QuerySet, new_links: List[archivebox.index.schema.Link]) → List[archivebox.index.schema.Link][source]¶ Link validation happens at an earlier stage; this method focuses on the actual deduplication and timestamp fixing.
-
archivebox.index.
write_link_details
(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, skip_sql_index: bool = False) → None[source]¶
-
archivebox.index.
load_link_details
(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → archivebox.index.schema.Link[source]¶ check for an existing link archive in the given directory, and load+merge it into the given link dict
-
archivebox.index.
q_filter
(snapshots: django.db.models.query.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.query.QuerySet[source]¶
-
archivebox.index.
search_filter
(snapshots: django.db.models.query.QuerySet, filter_patterns: List[str], filter_type: str = 'search') → django.db.models.query.QuerySet[source]¶
-
archivebox.index.
snapshot_filter
(snapshots: django.db.models.query.QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → django.db.models.query.QuerySet[source]¶
-
archivebox.index.
get_indexed_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ indexed links without checking archive status or data directory validity
-
archivebox.index.
get_archived_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ indexed links that are archived with a valid data directory
-
archivebox.index.
get_unarchived_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ indexed links that are unarchived with no data directory or an empty data directory
-
archivebox.index.
get_present_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ dirs that actually exist in the archive/ folder
-
archivebox.index.
get_valid_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ dirs with a valid index matched to the main index and archived content
-
archivebox.index.
get_invalid_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized
-
archivebox.index.
get_duplicate_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ dirs that conflict with other directories that have the same link URL or timestamp
-
archivebox.index.
get_orphaned_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ dirs that contain a valid index but aren’t listed in the main index
-
archivebox.index.
get_corrupted_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ dirs that don’t contain a valid index and aren’t listed in the main index
-
archivebox.index.
get_unrecognized_folders
(snapshots, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶ dirs that don’t contain recognizable archive data and aren’t listed in the main index
archivebox.parsers package¶
-
archivebox.parsers.generic_json.
parse_generic_json_export
(json_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse JSON-format bookmarks export files (produced by pinboard.in/export/, or wallabag)
-
archivebox.parsers.generic_json.
PARSER
(json_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse JSON-format bookmarks export files (produced by pinboard.in/export/, or wallabag)
-
archivebox.parsers.generic_rss.
parse_generic_rss_export
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse RSS XML-format files into links
-
archivebox.parsers.generic_rss.
PARSER
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse RSS XML-format files into links
-
archivebox.parsers.generic_txt.
parse_generic_txt_export
(text_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse links from a text file, ignoring other text
-
archivebox.parsers.generic_txt.
PARSER
(text_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse links from a text file, ignoring other text
-
archivebox.parsers.medium_rss.
parse_medium_rss_export
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse Medium RSS feed files into links
-
archivebox.parsers.medium_rss.
PARSER
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse Medium RSS feed files into links
-
archivebox.parsers.netscape_html.
parse_netscape_html_export
(html_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse netscape-format bookmarks export files (produced by all browsers)
-
archivebox.parsers.netscape_html.
PARSER
(html_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse netscape-format bookmarks export files (produced by all browsers)
-
archivebox.parsers.pinboard_rss.
parse_pinboard_rss_export
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse Pinboard RSS feed files into links
-
archivebox.parsers.pinboard_rss.
PARSER
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse Pinboard RSS feed files into links
-
archivebox.parsers.pocket_html.
parse_pocket_html_export
(html_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse Pocket-format bookmarks export files (produced by getpocket.com/export/)
-
archivebox.parsers.pocket_html.
PARSER
(html_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse Pocket-format bookmarks export files (produced by getpocket.com/export/)
-
archivebox.parsers.shaarli_rss.
parse_shaarli_rss_export
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link][source]¶ Parse Shaarli-specific RSS XML-format files into links
-
archivebox.parsers.shaarli_rss.
PARSER
(rss_file: IO[str], **_kwargs) → Iterable[archivebox.index.schema.Link]¶ Parse Shaarli-specific RSS XML-format files into links
Everything related to parsing links from input sources.
For a list of supported services, see the README.md. For examples of supported import formats see tests/.
-
archivebox.parsers.
parse_links_memory
(urls: List[str], root_url: Optional[str] = None)[source]¶ parse a list of URLs without touching the filesystem
-
archivebox.parsers.
parse_links
(source_file: str, root_url: Optional[str] = None, parser: str = 'auto') → Tuple[List[archivebox.index.schema.Link], str][source]¶ parse a list of URLs with their metadata from an RSS feed, bookmarks export, or text file
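As a rough usage sketch (assuming the second element of the returned tuple is the name of the parser that matched; the file path is just a placeholder):

from archivebox.parsers import parse_links

# any supported format (RSS, Pocket export, Netscape HTML, plain text, ...)
# should be auto-detected with parser='auto'
links, parser_name = parse_links('bookmarks_export.html', parser='auto')
print(f'parsed {len(links)} links using the {parser_name} parser')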
-
archivebox.parsers.
run_parser_functions
(to_parse: IO[str], timer, root_url: Optional[str] = None, parser: str = 'auto') → Tuple[List[archivebox.index.schema.Link], Optional[str]][source]¶
-
archivebox.parsers.
save_text_as_source
(raw_text: str, filename: str = '{ts}-stdin.txt', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → str[source]¶
-
archivebox.parsers.
save_file_as_source
(path: str, timeout: int = 60, filename: str = '{ts}-{basename}.txt', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → str[source]¶ download a given url’s content into output/sources/domain-<timestamp>.txt
Submodules¶
archivebox.main module¶
-
archivebox.main.
help
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Print the ArchiveBox help message and usage
-
archivebox.main.
version
(quiet: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Print the ArchiveBox version and dependency information
-
archivebox.main.
run
(subcommand: str, subcommand_args: Optional[List[str]], stdin: Optional[IO] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Run a given ArchiveBox subcommand with the given list of args
-
archivebox.main.
init
(force: bool = False, quick: bool = False, setup: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Initialize a new ArchiveBox collection in the current directory
-
archivebox.main.
status
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Print out some info and statistics about the archive collection
-
archivebox.main.
oneshot
(url: str, extractors: str = '', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs'))[source]¶ Create a single URL archive folder with an index.json and index.html, and all the archive method outputs. You can run this to archive single pages without needing to create a whole collection with archivebox init.
-
archivebox.main.
add
(urls: Union[str, List[str]], tag: str = '', depth: int = 0, update_all: bool = False, index_only: bool = False, overwrite: bool = False, init: bool = False, extractors: str = '', parser: str = 'auto', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → List[archivebox.index.schema.Link][source]¶ Add a new URL or list of URLs to your archive
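A hedged sketch of the (BETA) Python API equivalent of the CLI workflow; it assumes you run it from inside an already-created collection directory:

from archivebox.main import init, add, list_all

init(setup=True)                 # roughly `archivebox init --setup`
add('https://example.com')       # roughly `archivebox add https://example.com`
for link in list_all(filter_patterns=['https://example.com'], filter_type='exact'):
    print(link.base_url, link.archive_path)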
-
archivebox.main.
remove
(filter_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', snapshots: Optional[django.db.models.query.QuerySet] = None, after: Optional[float] = None, before: Optional[float] = None, yes: bool = False, delete: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → List[archivebox.index.schema.Link][source]¶ Remove the specified URLs from the archive
-
archivebox.main.
update
(resume: Optional[float] = None, only_new: bool = True, index_only: bool = False, overwrite: bool = False, filter_patterns_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: Optional[str] = None, status: Optional[str] = None, after: Optional[str] = None, before: Optional[str] = None, extractors: str = '', out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → List[archivebox.index.schema.Link][source]¶ Import any new links from subscriptions and retry any previously failed/skipped links
-
archivebox.main.
list_all
(filter_patterns_str: Optional[str] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', status: Optional[str] = None, after: Optional[float] = None, before: Optional[float] = None, sort: Optional[str] = None, csv: Optional[str] = None, json: bool = False, html: bool = False, with_headers: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Iterable[archivebox.index.schema.Link][source]¶ List, filter, and export information about archive entries
-
archivebox.main.
list_links
(snapshots: Optional[django.db.models.query.QuerySet] = None, filter_patterns: Optional[List[str]] = None, filter_type: str = 'exact', after: Optional[float] = None, before: Optional[float] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Iterable[archivebox.index.schema.Link][source]¶
-
archivebox.main.
list_folders
(links: List[archivebox.index.schema.Link], status: str, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → Dict[str, Optional[archivebox.index.schema.Link]][source]¶
-
archivebox.main.
setup
(out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Automatically install all ArchiveBox dependencies and extras
-
archivebox.main.
config
(config_options_str: Optional[str] = None, config_options: Optional[List[str]] = None, get: bool = False, set: bool = False, reset: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Get and set your ArchiveBox project configuration values
-
archivebox.main.
schedule
(add: bool = False, show: bool = False, clear: bool = False, foreground: bool = False, run_all: bool = False, quiet: bool = False, every: Optional[str] = None, depth: int = 0, overwrite: bool = False, import_path: Optional[str] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs'))[source]¶ Set ArchiveBox to regularly import URLs at specific times using cron
-
archivebox.main.
server
(runserver_args: Optional[List[str]] = None, reload: bool = False, debug: bool = False, init: bool = False, quick_init: bool = False, createsuperuser: bool = False, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.6.2/docs')) → None[source]¶ Run the ArchiveBox HTTP server
archivebox.manage module¶
archivebox.system module¶
-
archivebox.system.
run
(*args, input=None, capture_output=True, timeout=None, check=False, text=False, start_new_session=True, **kwargs)[source]¶ Patched version of subprocess.run that kills forked child subprocesses and fixes blocking IO that made timeout= ineffective. Mostly copied from https://github.com/python/cpython/blob/master/Lib/subprocess.py
-
archivebox.system.
atomic_write
(path: Union[pathlib.Path, str], contents: Union[dict, str, bytes], overwrite: bool = True) → None[source]¶ Safe atomic write to filesystem by writing to temp file + atomic rename
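For example (per the signature, contents may be a dict, str, or bytes; a dict is presumably serialized to JSON before writing):

from archivebox.system import atomic_write

# written to a temp file first, then renamed over the target,
# so readers never observe a half-written file
atomic_write('index.json', {'url': 'https://example.com', 'title': 'Example'})
atomic_write('notes.txt', 'plain string contents are written as-is\n')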
-
archivebox.system.
chmod_file
(path: str, cwd: str = '.', permissions: str = '755') → None[source]¶ chmod -R <permissions> <cwd>/<path>
-
archivebox.system.
copy_and_overwrite
(from_path: Union[str, pathlib.Path], to_path: Union[str, pathlib.Path])[source]¶ copy a given file or directory to a given path, overwriting the destination
-
archivebox.system.
get_dir_size
(path: Union[str, pathlib.Path], recursive: bool = True, pattern: Optional[str] = None) → Tuple[int, int, int][source]¶ get the total disk size of a given directory, optionally summing up recursively and limiting to a given filter list
-
class
archivebox.system.
suppress_output
(stdout=True, stderr=True)[source]¶ Bases:
object
A context manager for doing a “deep suppression” of stdout and stderr in Python, i.e. will suppress all print, even if the print originates in a compiled C/Fortran sub-function.
This will not suppress raised exceptions, since exceptions are printed to stderr just before a script exits, and after the context manager has exited (at least, I think that is why it lets exceptions through).
with suppress_stdout_stderr():
    rogue_function()
archivebox.util module¶
-
archivebox.util.
detect_encoding
(rawdata)¶
-
archivebox.util.
scheme
(url)¶
-
archivebox.util.
without_scheme
(url)¶
-
archivebox.util.
without_query
(url)¶
-
archivebox.util.
without_fragment
(url)¶
-
archivebox.util.
without_path
(url)¶
-
archivebox.util.
path
(url)¶
-
archivebox.util.
basename
(url)¶
-
archivebox.util.
domain
(url)¶
-
archivebox.util.
query
(url)¶
-
archivebox.util.
fragment
(url)¶
-
archivebox.util.
extension
(url)¶
-
archivebox.util.
base_url
(url)¶
-
archivebox.util.
without_www
(url)¶
-
archivebox.util.
without_trailing_slash
(url)¶
-
archivebox.util.
hashurl
(url)¶
-
archivebox.util.
urlencode
(s)¶
-
archivebox.util.
urldecode
(s)¶
-
archivebox.util.
htmlencode
(s)¶
-
archivebox.util.
htmldecode
(s)¶
-
archivebox.util.
short_ts
(ts)¶
-
archivebox.util.
ts_to_date_str
(ts)¶
-
archivebox.util.
ts_to_iso
(ts)¶
-
archivebox.util.
enforce_types
(func)[source]¶ Enforce function arg and kwarg types at runtime using their Python 3 type hints
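For instance, a hypothetical function decorated with it should reject mismatched argument types at call time:

from archivebox.util import enforce_types

@enforce_types
def shorten(url: str, max_len: int = 50) -> str:
    return url if len(url) <= max_len else url[:max_len] + '...'

shorten('https://example.com/some/long/path')   # OK
shorten(12345)                                  # expected to raise a TypeError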
-
archivebox.util.
docstring
(text: Optional[str])[source]¶ attach the given docstring to the decorated function
-
archivebox.util.
str_between
(string: str, start: str, end: str = None) → str[source]¶ (<abc>12345</def>, <abc>, </def>) -> 12345
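i.e., roughly:

from archivebox.util import str_between

str_between('<abc>12345</def>', '<abc>', '</def>')   # -> '12345'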
-
archivebox.util.
parse_date
(date: Any) → Optional[datetime.datetime][source]¶ Parse unix timestamps, iso format, and human-readable strings
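For example (per the docstring; the exact set of accepted formats may vary):

from archivebox.util import parse_date

parse_date('1611350400')             # unix timestamp string
parse_date('2021-01-23T00:00:00Z')   # ISO 8601 string
parse_date('Jan 23, 2021')           # human-readable strings are also accepted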
-
archivebox.util.
download_url
(url: str, timeout: int = None) → str[source]¶ Download the contents of a remote url and return the text
-
archivebox.util.
get_headers
(url: str, timeout: int = None) → str[source]¶ Download the contents of a remote url and return the headers
-
archivebox.util.
chrome_args
(**options) → List[str][source]¶ helper to build up a chrome shell command with arguments
-
archivebox.util.
ansi_to_html
(text)[source]¶ Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html
-
class
archivebox.util.
AttributeDict
(*args, **kwargs)[source]¶ Bases:
dict
Helper to allow accessing dict values via Example.key or Example[‘key’]
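For example:

from archivebox.util import AttributeDict

conf = AttributeDict({'TIMEOUT': 60, 'SAVE_WGET': True})
conf.TIMEOUT        # 60   (attribute-style access)
conf['SAVE_WGET']   # True (normal dict access still works)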
-
class
archivebox.util.
ExtendedEncoder
(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶ Bases:
json.encoder.JSONEncoder
Extended json serializer that supports serializing several model fields and objects
-
default
(obj)[source]¶ Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError). For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
-
Module contents¶
Meta¶
Roadmap¶

▶️ Comment here to discuss the contribution roadmap: Official Roadmap Discussion.
Planned Specification¶
(this is not set in stone, just a rough estimate)
v0.5: Remove live-updated JSON & HTML index in favor of archivebox export¶
- use SQLite as the main db and export staticfile indexes once at the end of the whole process instead of live-updating them during each extractor run (i.e. remove patch_main_index)
- create an archivebox export command
- create a public view to replace index.html / old.html used for non-logged-in users
v0.6: Code cleanup / refactor¶
- move config loading logic into settings.py
- move all the extractors into “plugin” style folders that register their own config
- right now, the paths of the extractor output are scattered all over the codebase, e.g. output.pdf (they should be moved to constants at the top of each plugin config file)
- make out_dir, link_dir, extractor_dir naming consistent across the codebase
- convert all os.path calls and raw string paths to Pathlib
v0.7: Schema improvements¶
- remove timestamps as primary keys in favor of hashes, UUIDs, or some other slug
- create a migration system for folder layout independent of the index (mv is atomic at the FS level, so we just need a transaction.atomic(): move(oldpath, newpath); snap.data_dir = newpath; snap.save(); see the sketch after this list)
- make Tag a real model, ManyToMany with Snapshots
- allow multiple Snapshots of the same site over time + CLI / UI to manage them, + migration from the old-style #2020-01-01 hack to proper versioned snapshots
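A rough sketch of that migration idea, assuming a hypothetical Snapshot model with a data_dir field (the names are illustrative, not the final schema):

    import shutil
    from django.db import transaction

    def move_snapshot_dir(snapshot, new_path: str):
        # Wrap the filesystem move and the index update in one DB transaction,
        # so the index never records a folder that was not actually moved.
        with transaction.atomic():
            shutil.move(snapshot.data_dir, new_path)  # mv is atomic at the filesystem level
            snapshot.data_dir = new_path
            snapshot.save()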
v0.8: Security¶
- Add CSRF/CSP/XSS protection to rendered archive pages
- Provide a secure reverse proxy in front of the archivebox server in docker-compose.yml
- Create a UX flow for users to set up session cookies / auth for archiving private sites
- cookies for wget, curl, and other low-level commands
- localStorage, cookies, and IndexedDB setup for Chrome archiving methods
v0.9: Performance¶
- set up huey and break the archiving process into tasks on a queue that a worker pool executes (see the sketch after this list)
- set up pyppeteer2 to wrap Chrome so that it isn’t opened and closed during each extractor
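For context, a minimal sketch of what a huey-backed task queue could look like (the queue location, task name, and placeholder body are assumptions, not ArchiveBox's actual implementation):

    from huey import SqliteHuey

    huey = SqliteHuey(filename='data/queue.sqlite3')  # hypothetical queue location

    @huey.task(retries=2)
    def run_extractor(extractor_name: str, snapshot_id: str):
        # Placeholder body: a worker pool (huey_consumer) would run one extractor per task here
        print(f'archiving {snapshot_id} with {extractor_name}')

    # Enqueueing is a normal function call; the consumer process picks it up and runs it
    run_extractor('screenshot', 'abc123')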
v1.0: Full headless browser control¶
- run user-scripts / extensions in the context of the page during archiving
- community userscripts for unrolling twitter threads, reddit threads, youtube comment sections, etc.
- pywb-based headless browser session recording and warc replay
- archive proxy support
- support sending upstream requests through an external proxy
- support for exposing a proxy that archives all downstream traffic
…
v2.0: Federated or distributed archiving + paid hosted service offering¶
- ZFS / Merkle tree for storing archive output subresource hashes
- DHT for assigning Merkle tree hash:file shards to nodes
- tag system for tagging certain hashes with human-readable names, e.g. title, url, tags, filetype etc.
- distributed tag lookup system
Major long-term changes¶
- release pip, apt, pkg, and brew packaged distributions for installing ArchiveBox
- add an optional web GUI for managing sources, adding new links, and viewing the archive
- switch to django + sqlite db with migrations system & json/html export for managing archive schema changes and persistence
- modularize internals to allow importing individual components
- switch to sha256 of URL as unique link ID (see the hashing sketch after this list)
- support storing multiple snapshots of pages over time
- support custom user puppeteer scripts to run while archiving (e.g. for expanding reddit threads, scrolling thread on twitter, etc)
- support named collections of archived content with different user access permissions
- support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system
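As a rough illustration of the sha256-based link ID idea above (the link_id helper is hypothetical, not ArchiveBox's actual schema):

    import hashlib

    def link_id(url: str) -> str:
        # Hypothetical helper: derive a stable, collision-resistant ID from a URL
        return hashlib.sha256(url.encode('utf-8')).hexdigest()

    print(link_id('https://example.com'))  # same 64-character hex digest every run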
Smaller planned features¶
- support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
- body text extraction to markdown (using fathom?)
- featured image / thumbnail extraction
- auto-tagging links based on important/frequent keywords in extracted text (like pocket)
- automatic article summary paragraphs from extracted text with nlp summarization library
- full-text search of extracted text with elasticsearch/elasticlunr/ag
- download closed-caption subtitles from Youtube and other video sites for full-text indexing of video content
- try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
- And more in the issues list…
IMPORTANT: Please don’t work on any of these major long-term tasks without contacting me first. Work is already in progress for many of them, and I may have to reject your PR if it doesn’t align with the existing work!
Changelog¶
▶️ If you’re having an issue with a breaking change, or migrating your data between versions, open an issue to get help.
ArchiveBox was previously named Pocket Archive Stream and then Bookmark Archiver.
See the releases page for versioned source downloads and full changelog.
🍰 Many thanks to our 60+ contributors and everyone in the web archiving community! 🏛
- v0.4.9 released
- pip install archivebox (https://pypi.org/project/archivebox/) and docker run archivebox/archivebox (https://hub.docker.com/r/archivebox/archivebox)
- https://archivebox.readthedocs.io/en/latest/
- https://github.com/ArchiveBox/ArchiveBox/releases
- easy migration from previous versions:
  cd path/to/your/archive/folder
  archivebox init
  archivebox add 'https://example.com'
  archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
- full transition to a Django SQLite DB with migrations (making upgrades between versions much safer now)
- maintains an intuitive and helpful CLI that’s backwards-compatible with all previous archivebox data versions
- uses argparse instead of a hand-written CLI system: see archivebox/cli/archivebox.py
- new subcommands-based CLI for archivebox (see below)
- new Web UI with pagination, better search, filtering, permissions, and more
- 30+ assorted bugfixes, new features, and tickets closed
- for more info, see: https://github.com/ArchiveBox/ArchiveBox/releases/tag/v0.4.9
- v0.2.4 released
- better archive corruption guards (check structure invariants on every parse & save)
- remove title prefetching in favor of new FETCH_TITLE archive method
- slightly improved CLI output for parsing and remote url downloading
- re-save index after archiving completes to update titles and urls
- remove redundant derivable data from link json schema
- markdown link parsing support
- faster link parsing and better symbol handling using a new compiled URL_REGEX
- v0.2.3 released
- fixed issues with parsing titles including trailing tags
- fixed issues with titles defaulting to URLs instead of attempting to fetch
- fixed issue where bookmark timestamps from RSS would be ignored and current ts used instead
- fixed issue where ONLY_NEW would overwrite existing links in archive with only new ones
- fixed lots of issues with URL parsing by using urllib.parse instead of hand-written lambdas
- ignore robots.txt when using wget (ssshhh don’t tell anyone 😁)
- fix RSS parser bailing out when there’s whitespace around XML tags
- fix issue with browser history export trying to run ls on wrong directory
- v0.2.2 released
- Shaarli RSS export support
- Fix issues with plain text link parsing including quotes, whitespace, and closing tags in URLs
- add USER_AGENT to archive.org submissions so they can track archivebox usage
- remove all icons similar to archive.org branding from archive UI
- hide some of the noisier youtubedl and wget errors
- set permissions on youtubedl media folder
- fix chrome data dir incorrect path and quoting
- better chrome binary finding
- show which parser is used when importing links, show progress when fetching titles
- v0.2.1 released with new logo
- ability to import plain lists of links and almost all other raw filetypes
- WARC saving support via wget
- Git repository downloading with git clone
- Media downloading with youtube-dl (video, audio, subtitles, description, playlist, etc)
- v0.2.0 released with new name
- renamed from Bookmark Archiver -> ArchiveBox
- v0.1.0 released
- support for browser history exporting added with ./bin/archivebox-export-browser-history
- support for chrome --dump-dom to output full page HTML after JS executes
- v0.0.3 released
- support for chrome --user-data-dir to archive sites that need logins
- fancy individual html & json indexes for each link
- smartly append new links to existing index instead of overwriting
- v0.0.2 released
- proper HTML templating instead of format strings (thanks to https://github.com/bardisty!)
- refactored into separate files, wip audio & video archiving
- v0.0.1 released
- Index links now work without nginx url rewrites, archive can now be hosted on github pages
- added setup.sh script & docstrings & help commands
- made Chromium the default instead of Google Chrome (yay free software)
- added env-variable configuration (thanks to https://github.com/hannah98!)
- renamed from Pocket Archive Stream -> Bookmark Archiver
- added Netscape-format export support (thanks to https://github.com/ilvar!)
- added Pinboard-format export support (thanks to https://github.com/sconeyard!)
- front-page of HN, oops! apparently I have users to support now :grin:?
- added Pocket-format export support
- v0.0.0 released: created Pocket Archive Stream 2017/05/05
Donations¶
Patreon: https://www.patreon.com/theSquashSH
Paypal: https://paypal.me/NicholasSweeting
I develop this project solely in my spare time right now. If you want to help me keep it alive and flourishing, donate to support more development!
If you have any questions or want to partner with this project, contact me at: archivebox-hello@sweeting.me
Web Archiving Community¶
🔢 Just getting started and want to learn more about why Web Archiving is important?
Check out this article: On the Importance of Web Archiving.
The internet archiving community is surprisingly far-reaching and almost universally friendly!
Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open source tool for your web archiving need, or just want to see where archivists hang out online, this is my attempt at an index of the entire web archiving community.

- The Master Lists: Community-maintained indexes of web archiving tools and groups by IIPC, COPTR, ArchiveTeam, Wikipedia, & the ASA.
- Web Archiving Software: Open source tools and projects in the internet archiving space.
- Reading List: Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.
- Communities: A collection of the most active internet archiving communities and initiatives.
The Master Lists¶

Indexes of archiving institutions and software maintained by other people. If there’s anything archivists love doing, it’s making lists.
- COPTR Wiki of Web Archiving Tools (COPTR)
- Awesome Web Archiving Tools (IIPC)
- Spreadsheet Comparison of Archiving Tools (DataTogether)
- Awesome Web Crawling Tools
- Awesome Web Scraping Tools
- ArchiveTeam’s List of Software (ArchiveTeam.org)
- List of Web Archiving Initiatives (Wikipedia.org)
- Directory of Archiving Organizations (American Society of Archivists)
Web Archiving Projects¶


Bookmarking Services¶
- Pocket Premium Bookmarking tool that provides an archiving service in their paid version, run by Mozilla
- Pinboard Bookmarking tool that provides archiving in a paid version, run by a single independent developer
- Instapaper Bookmarking alternative to Pocket/Pinboard (with no archiving)
- Wallabag / Wallabag.it Self-hostable web archiving server that can import via RSS
- Shaarli Self-hostable bookmark tagging, archiving, and sharing service
- ReadWise A paid Pocket/Pinboard alternative that includes article snippet and highlight saving
From the Archive.org & Archive-It teams¶


- Archive.org The O.G. wayback machine provided publicly by the Internet Archive (Archive.org)
- Archive-It commercial Wayback-Machine solution
- Heritrix The king of internet archiving crawlers, powers the Wayback Machine
- Brozzler chrome headless crawler + WARC archiver maintained by Archive.org
- WarcProx warc proxy recording and playback utility
- WarcTools utilities for dealing with WARCs
- Grab-Site An easy preconfigured web crawler designed for backing up websites
- WPull A pure python implementation of wget with WARC saving
- More on their Github…
From the Rhizome.org/WebRecorder.io/Conifer team¶


- Conifer by Rhizome.org An open-source personal archiving server that uses pywb under the hood, previously known as Webrecorder.io
- Webrecorder.net Suite of open source projects and tools, led by Ilya Kreymer, to capture interactive websites and replay them at a later time as accurately as possible
- pywb The python wayback machine, the codebase forked off archive.org that powers webrecorder
- warcit Create a warc file out of a folder full of assets
- WebArchivePlayer A tool for replaying web archives
- warcio fast streaming asynchronous WARC reader and writer
- More on their Github…
From the Old Dominion University: Web Science Team¶
- ipwb A distributed web archiving solution using pywb with ipfs for storage
- archivenow tool that pushes urls into all the online archive services like Archive.is and Archive.org
- node-warc Parse And Create Web ARChive (WARC) files with node.js
- WAIL Web archiver GUI using Heritrix and OpenWayback
- Squidwarc User-scriptable, archival crawler using Chrome
- WAIL (Electron) Electron app version of the original wail for creating and interacting with web archives
- warcreate a Chrome extension for creating WARCs from any webpage
- More on their Github…
From the Archives Unleashed Team¶

- AUT Archives Unleashed Toolkit for analyzing web archives (formerly WarcBase)
- Warclight A Rails engine for finding and searching web archives
- More on their Github…

From the IIPC team¶
- OpenWayback Open source project developing core Wayback-Machine components
- awesome-web-archiving Large list of archiving projects and orgs
- JWARC A Java library for reading and writing WARC files.
- More on their Github…
Other Public Archiving Services¶

- https://perma.cc
- https://www.pagefreezer.com
- https://www.smarsh.com
- https://www.stillio.com
- https://archive.is / https://archive.today
- https://archive.st
- http://theoldnet.com
- https://timetravel.mementoweb.org/
- https://freezepage.com/
- https://webcitation.org/archive
- https://archiveofourown.org/
- https://megalodon.jp/
- https://www.webarchive.org.uk/ukwa/
- https://github.com/HelloZeroNet/ZeroNet (super cool project)
- Google, Bing, DuckDuckGo, and other search engine caches
Other ArchiveBox Alternatives¶
- browsertrix-crawler / ArchiveWeb.page + ReplayWeb.page + pywb Webrecorder.io’s archiving suite has the highest fidelity, and can flawlessly archive YouTube, Twitter, FB and other complex, JS-heavy SPAs
- SingleFile Web Extension / CLI util for Firefox and Chrome to save a web page as a single HTML file
- Memex by Worldbrain.io a beautiful, user-friendly browser extension that archives all history with full-text search, annotation support, and more
- Hypothes.is a web/pdf/ebook annotation tool that also archives content
- Reminiscence extremely similar to ArchiveBox, uses a Django backend + UI and provides auto-tagging and summary features with NLTK
- Shaarchiver very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
- Archivy Python-based self-hosted knowledge base embedded into your filesystem
- Polarized a desktop application for bookmarking, annotating, and archiving articles offline
- 22120 Archiving tool that uses the Chrome debugger protocol to save each page as-loaded in the browser
- Photon a fast crawler with archiving and asset extraction support
- LinkAce A self-hosted bookmark management tool that saves snapshots to archive.org
- Trilium Personal web UI based knowledge-base with web clipping and note-taking
- Herodotus Django-based web archiving tool with a focus on collecting text-based content
- Buku Browser-independent bookmark manager CLI written in Python3 and SQLite3
- ReadableWebProxy A proxying archiver that downloads content from sites and can snapshot multiple versions of sites over time
- Perkeep “Perkeep lets you permanently keep your stuff, for life.”
- Fetching.io A personal search engine/archiver that lets you search through all archived websites that you’ve bookmarked
- Fossilo A commercial archiving solution that appears to be very similar to ArchiveBox
- Archivematica web GUI for institutional long-term archiving of web and other content
- Headless Chrome Crawler distributed web crawler built on puppeteer with screenshots
- WWWOFFLE old proxying recorder software similar to ArchiveBox
- Erised Super simple CLI utility to bookmark and archive webpages
- Zotero collect, organize, cite, and share research (mainly for technical/scientific papers & citations)
- TiddlyWiki Non-linear bookmark and note-taking tool with archiving support
- Joplin Desktop + mobile app for knowledge-base-style info collection and notes (w/ optional plugin for archiving)
- Hunchly A paid web archiving / session recording tool designed for OSINT
Smaller Utilities¶
Random helpful utilities for web archiving, WARC creation and replay, and more…
- https://github.com/karlicoss/promnesia A browser extension that collects and collates all the URLs you visit into a hierarchical/graph structure with metadata
- https://github.com/vrtdev/save-page-state A Chrome extension for saving the state of a page in multiple formats
- https://github.com/jsvine/waybackpack command-line tool that lets you download the entire Wayback Machine archive for a given URL
- https://github.com/hartator/wayback-machine-downloader Download an entire website from the Internet Archive Wayback Machine.
- https://github.com/Lifesgood123/prevent-link-rot Replace any broken URLs in some content with Wayback machine URL equivalents
- https://en.archivarix.com download an archived page or entire site from the Wayback Machine
- https://proofofexistence.com prove that a certain file existed at a given time using the blockchain
- https://github.com/chfoo/warcat for merging, extracting, and verifying WARC files
- https://github.com/mozilla/readability tool for extracting article contents and text
- https://github.com/mholt/timeliner All your digital life on a single timeline, stored locally
- https://github.com/wkhtmltopdf/wkhtmltopdf Webkit HTML to PDF archiver/saver
- Sheetsee-Pocket project that provides a pretty auto-updating index of your Pocket links (without archiving them)
- Pocket -> IFTTT -> Dropbox Post by Christopher Su on his Pocket saving IFTTT recipe
- http://squidman.net/squidman/index.html
- https://wordpress.org/plugins/broken-link-checker/
- https://github.com/ArchiveTeam/wpull
- http://freedup.org/
- https://en.wikipedia.org/wiki/Furl
- https://preservica.com/digital-archive-software-1/active-digital-preservation For-profit company offering a digital preservation software suite
- https://github.com/karlicoss/grasp capture webpages from Firefox and Chrome into Org-mode documents
- And many more on the other lists…
Reading List¶
A collection of blog posts and articles about internet archiving, contact me / open an issue if you want to add a link here!
Blogs¶

- https://blog.archive.org
- https://netpreserveblog.wordpress.com
- https://blog.conifer.rhizome.org/ (formerly https://blog.webrecorder.io/)
- https://ws-dl.blogspot.com
- https://siarchives.si.edu/blog
- https://parameters.ssrc.org
- https://sr.ithaka.org/publications
- https://ait.blog.archive.org
- https://brewster.kahle.org
- https://ianmilligan.ca
- https://medium.com/@giovannidamiola
Articles¶
- https://parameters.ssrc.org/2018/09/on-the-importance-of-web-archiving/
- https://theconversation.com/your-internet-data-is-rotting-115891
- https://www.bbc.com/future/story/20190401-why-theres-so-little-left-of-the-early-internet
- https://sr.ithaka.org/publications/the-state-of-digital-preservation-in-2018/
- https://gizmodo.com/delete-never-the-digital-hoarders-who-collect-tumblrs-1832900423
- https://siarchives.si.edu/blog/we-are-not-alone-progress-digital-preservation-community
- https://www.gwern.net/Archiving-URLs
- http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-call-for-a-distributed-web-2/
- https://lwn.net/Articles/766374/
- https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
- https://medium.com/@giovannidamiola/making-the-internet-archives-full-text-search-faster-30fb11574ea9
- https://xkcd.com/1909/
- https://samsaffron.com/archive/2012/06/07/testing-3-million-hyperlinks-lessons-learned#comment-31366
- https://www.gwern.net/docs/linkrot/2011-muflax-backup.pdf
- https://thoughtstreams.io/higgins/permalinking-vs-transience/
- http://ait.blog.archive.org/files/2014/04/archiveit_life_cycle_model.pdf
- https://blog.archive.org/2016/05/26/web-archiving-with-national-libraries/
- https://blog.archive.org/2014/10/28/building-libraries-together/
- https://ianmilligan.ca/2018/03/27/ethics-and-the-archived-web-presentation-the-ethics-of-studying-geocities/
- https://ianmilligan.ca/2018/05/22/new-article-if-these-crawls-could-talk-studying-and-documenting-web-archives-provenance/
- https://ws-dl.blogspot.com/2019/02/2019-02-08-google-is-being-shuttered.html
If any of these links are dead, you can find an archived version on https://archive.sweeting.me.
ArchiveBox-Specific Posts, Tutorials, and Guides¶
- https://www.cyberpunks.com/preserve-the-internet-with-archivebox/
- https://nixintel.info/osint-tools/make-your-own-internet-archive-with-archive-box/
- “How to install ArchiveBox to preserve websites you care about” https://blog.sleeplessbeastie.eu/2019/06/19/how-to-install-archivebox-to-preserve-websites-you-care-about/
- “How to remotely archive websites using ArchiveBox” https://blog.sleeplessbeastie.eu/2019/06/26/how-to-remotely-archive-websites-using-archivebox/
- “How to use CutyCapt inside ArchiveBox” https://blog.sleeplessbeastie.eu/2019/07/10/how-to-use-cutycapt-inside-archivebox/
- “Automate ArchiveBox with Google Spreadsheet to Backup your internet” https://manfred.life/archivebox
- “【デモ有♪】ConoHaのArchiveBoxアプリケーションを使ってみたよ” https://qiita.com/CloudRemix/items/691caf91efa3ef19a7ad
- “WEB-ARCHIV TEIL 8: WALLABAG UND ARCHIVEBOX” http://webermartin.net/blog/web-archiv-teil-8-wallabag-und-archivebox/
- https://metaxyntax.neocities.org/entries/7.html
ArchiveBox Discussions in News & Social Media¶

- Aggregators: ProductHunt, AlternativeTo, SteemHunt, Recurse Center: The Joy of Computing, Github Changelog, Dev.To Ultra List, O’Reilly 4 Short Links, JaxEnter
- Blog Posts & Podcasts: Korben.info, Defining Desktop Linux Podcast #296 (0:55:00), Binärgewitter Podcast #221, Schrankmonster.de, La Ferme Du Web
- Hacker News: #1, #2, #3, #4
- Reddit r/DataHoarder: #1, #2, #3, #4, #5, #6
- Reddit r/SelfHosted: #1, #2
- Twitter: Python Trending, PyCoder’s Weekly, Python Hub, Smashing Magazine
- More on: Twitter, Reddit, HN, Google…
Communities¶
Most Active Communities¶

- The Internet Archive (Archive.org) (USA)
- International Internet Preservation Consortium (IIPC) (International)
- The Archive Team, URL Team, r/ArchiveTeam (International)
- Rhizome.org The digital preservation group that works on Conifer (formerly Webrecorder.io) (USA)
- Webrecorder.net Formerly known as Webrecorder.io, a project led by Ilya Kreymer that researches and develops web archiving tools widely used by the community
- Old Dominion University: Web Science and Digital Libraries (WS-DL @ ODU) (Virginia, USA)
- r/DataHoarder, r/Archivists, r/DHExchange (International)
- The Eye Non-profit working on content archival and long-term preservation (Europe)
- Digital Preservation Coalition & their Software Tool Registry (COPTR) (UK & Wales)
- Archives Unleashed Project and UAP Github (Canada)
Web Archiving Communities¶

Follow these technological and organizational archiving hubs for the latest archiving news.
- Canadian Web Archiving Coalition (Canada)
- Web Archives for Historical Research Group (Canada)
- Smithsonian Institution Archives: Digital Curation (Washington D.C., USA)
- National Digital Stewardship Alliance (NDSA) (USA)
- Digital Library Federation (DLF) (USA)
- Council on Library and Information Resources (CLIR) (USA)
- Digital Curation Centre (DCC) (UK)
- ArchiveMatica & their Community Wiki (International)
- Professional Development Institutes for Digital Preservation (POWRR) (USA)
- Institute of Museum and Library Services (IMLS) (USA)
- Stanford Libraries Web Archiving (USA)
- Society of American Archivists: Electronic Records (SAA) (USA)
- BitCurator Consortium (BCC) (USA)
- Ethics & Archiving the Web Conference (Rhizome/Webrecorder.io) (USA)
- Archivists Round Table of NYC (USA)
General Archiving Foundations, Coalitions, Initiatives, and Institutes¶

Find your local archiving group in the list and see how you can contribute!
- Community Archives and Heritage Group (UK & Ireland)
- Open Preservation Foundation (OPF) (UK & Europe)
- Software Preservation Network (International)
- ITHAKA, Portico, JSTOR, ARTSTOR, S+R (USA)
- Archives and Records Association (UK & Ireland)
- Arkivrådet AAS (Sweden)
- Asociación Española de Archiveros, Bibliotecarios, Museologos y Documentalistas (ANABAD) (Spain)
- Associação dos Arquivistas Brasileiros (AAB) (Brazil)
- Associação Portuguesa de Bibliotecários, Archivistas e Documentalistas (BAD) (Portugal)
- Association des archivistes français (AAF) (France)
- Associazione Nazionale Archivistica Italiana (ANAI) (Italy)
- Australian Society of Archivists Inc. (Australia)
- International Council on Archives (ICA)
- International Records Management Trust (IRMT)
- Irish Society for Archives (Ireland)
- Koninklijke Vereniging van Archivarissen in Nederland (Netherlands)
- State Archives Administration of the People’s Republic of China (China)
- Academy of Certified Archivists
- Archivists and Librarians in the History of the Health Sciences
- Archivists for Congregations of Women Religious
- Archivists of Religious Institutions
- Association of Catholic Diocesan Archivists
- Association of Moving Image Archivists
- Council of State Archivists
- National Association of Government Archives and Records Administrators
- National Episcopal Historians and Archivists
- Archival Education and Research Institute
- Archives Leadership Institute
- Georgia Archives Institute
- Modern Archives Institute
- Western Archives Institute
- Association des archivistes du Québec
- Association of Canadian Archivists
- Canadian Council of Archives/Conseil canadien des archives
- Archives Association of British Columbia
- Archives Association of Ontario
- Archives Council of Prince Edward Island
- Archives Society of Alberta
- Association for Manitoba Archives
- Association of Newfoundland and Labrador Archives
- Council of Nova Scotia Archives
- Réseau des services d’archives du Québec
- Saskatchewan Council for Archives and Archivists
You can find more organizations and initiatives on these other lists:
- Wikipedia.org List of Web Archiving Initiatives
- SAA List of USA & Canada Based Archiving Organizations
- SAA List of International Archiving Organizations
- Digital Preservation Coalition’s Member List