
Welcome to ArchiveBox!

Just getting started?

Check out the Quickstart guide.

Need help with something?

Ping us on Twitter or GitHub.

Want to join the community?

See our Community Wiki page.

ArchiveBox

“The open-source self-hosted internet archive.”

Website | GitHub | Source | Bug Tracker

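# Create a folder for the archive data and install ArchiveBox with pip
mkdir my-archive; cd my-archive/
pip install archivebox

# Initialize the collection in this folder, then archive a first URL
archivebox init
archivebox add https://example.com
archivebox info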

Documentation

  • Contents
  • Overview
    • Key Features
    • 🤝 Professional Integration
    • Quickstart
    • Overview
    • Background & Motivation
    • Documentation
    • ArchiveBox Development
  • Getting Started
    • Quickstart
    • Install
    • Docker
    • Configuration
    • Security Overview
    • Usage
  • Guides
    • Setting Up Storage
    • Setting Up Authentication
    • Setting Up Search
    • Publishing Your Archive
    • Scheduled Archiving
    • Chrome / Chromium Setup
    • Setting Up a Chromium User Profile
    • Upgrading Versions
    • Merging Collections
    • Troubleshooting
  • API Reference
    • Filesystem
    • SQL API
    • REST API
    • Python API
  • Meta
    • Roadmap
    • Changelog
    • Supporting Development
    • Web Archiving Community

© Copyright 2024 ArchiveBox™.