Welcome to ArchiveBox!

Just getting started?

Check out the Quickstart guide.

Need help with something?

Ping us on Twitter or GitHub.

Want to join the community?

See our Community Wiki page.

ArchiveBox

“The open-source self-hosted internet archive.”

Website | GitHub | Source | Bug Tracker

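# Create a folder to hold the archive data, then install ArchiveBox
mkdir my-archive; cd my-archive/
pip install archivebox

# Initialize the collection in the current folder and archive a first URL
archivebox init
archivebox add https://example.com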
archivebox status
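
Once a single URL is archived, a natural next step is importing several at once. The sketch below is only an illustration: the file name urls.txt and the two example URLs are assumptions, not part of the original quickstart. It writes a plain-text list with one URL per line and pipes it into archivebox add on stdin, and it assumes the command is run from inside an initialized archive folder such as my-archive/.

```shell
# Hypothetical file name; any plain-text file with one URL per line would do.
urls_file="urls.txt"

# Write two placeholder URLs, one per line.
printf '%s\n' 'https://example.com' 'https://example.org' > "$urls_file"

# Feed the list to ArchiveBox on stdin, but only if the CLI is installed.
if command -v archivebox >/dev/null 2>&1; then
    archivebox add < "$urls_file"
else
    echo "archivebox not found; install it first with: pip install archivebox"
fi
```

The guard around the archivebox call keeps the snippet harmless on a machine where ArchiveBox is not installed yet.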

Documentation

  • Intro
    • Key Features
    • 🤝 Professional Integration
    • Quickstart
    • Overview
    • Background & Motivation
    • Documentation
    • ArchiveBox Development
  • Getting Started
    • Quickstart
    • Install
    • Docker
  • General
    • Usage
    • Configuration
    • Troubleshooting
    • Security Overview
    • Publishing Your Archive
    • Scheduled Archiving
    • Chromium Install
    • Setting Up a Chromium User Profile
    • Upgrade your ArchiveBox collection to a new version
    • Merge two or more existing archives
    • Modify the ArchiveBox SQLite3 DB directly
    • Database Troubleshooting
    • Related Documents
  • API Reference
    • Configuration Options
    • Data Folder Layout
    • Command Line Interface
    • Web Interface
    • Python API
    • REST API
  • Meta
    • Roadmap
    • Changelog
    • Supporting Development
    • Web Archiving Community
© Copyright 2023 ©️ ArchiveBox ™️. Revision 0bd83076.