Roadmap

▶️ Comment here to discuss the contribution roadmap: Official Roadmap Discussion.


CLI Usage

Note: these ways of running ArchiveBox are all equivalent:

  • archivebox [subcommand] [...args]
  • python3 -m archivebox [subcommand] [...args]
  • docker run -v $PWD:/data nikisweeting/archivebox [subcommand] [...args]
  • docker-compose run archivebox [subcommand] [...args]

$ pip install archivebox

...
Installing collected packages: archivebox
  Running setup.py install for archivebox ... done
Successfully installed archivebox-0.4.9

Developers working on the ArchiveBox codebase should install the project in “linked” mode for development: pipenv install; pip install -e .

$ archivebox [version|--version]

ArchiveBox v0.4.9

[i] Dependency versions:
 √  PYTHON_BINARY            /opt/ArchiveBox/.venv/bin/python3.7           v3.7            valid
 √  DJANGO_BINARY            /opt/ArchiveBox/.venv/lib/python3.7/site-packages/django/bin/django-admin.py v2.2.0          valid
 √  CURL_BINARY              /usr/bin/curl                                                          v7.54.0         valid
 √  WGET_BINARY              /usr/local/bin/wget                                                    v1.20.1         valid
 √  GIT_BINARY               /usr/local/bin/git                                                     v2.20.1         valid
 √  YOUTUBEDL_BINARY         /opt/ArchiveBox/.venv/bin/youtube-dl          v2019.04.17     valid
 √  CHROME_BINARY            /Applications/Google Chrome.app/Contents/MacOS/Google Chrome           v74.0.3729.91   valid

[i] Folder locations:
 √  REPO_DIR                 /opt/ArchiveBox                               28 files        valid
 √  PYTHON_DIR               /opt/ArchiveBox/archivebox                    14 files        valid
 √  LEGACY_DIR               /opt/ArchiveBox/archivebox/legacy             15 files        valid
 √  TEMPLATES_DIR            /opt/ArchiveBox/archivebox/legacy/templates   7 files         valid
 √  OUTPUT_DIR               /opt/ArchiveBox/archivebox/data               10 files        valid
 √  SOURCES_DIR              /opt/ArchiveBox/archivebox/data/sources       1 files         valid
 √  LOGS_DIR                 /opt/ArchiveBox/archivebox/data/logs          0 files         valid
 √  ARCHIVE_DIR              /opt/ArchiveBox/archivebox/data/archive       2 files         valid
 √  CHROME_USER_DATA_DIR     /Users/squash/Library/Application Support/Chromium                     2 files         valid
 -  COOKIES_FILE                                                                                    -               disabled

$ archivebox [help|-h|--help]

ArchiveBox: The self-hosted internet archive.

Documentation:
    https://github.com/pirate/ArchiveBox/wiki

UI Usage:
    Open output/index.html to view your archive.

CLI Usage:
    mkdir data; cd data/
    archivebox init

    echo 'https://example.com/some/page' | archivebox add
    archivebox add https://example.com/some/other/page
    archivebox add --depth=1 ~/Downloads/bookmarks_export.html
    archivebox add --depth=1 https://example.com/feed.rss
    archivebox update --resume=15109948213.123

$ archivebox init

Initialize a new “collection” folder: a complete archive containing an ArchiveBox.conf config file, an index of all the archived pages, and the archived content for each page.

$ mkdir ~/my-archive && cd ~/my-archive
$ archivebox init
[+] Initializing a new ArchiveBox collection in this folder...
    ~/my-archive
------------------------------------------------------------------

[+] Building archive folder structure...
    √ ~/my-archive/sources
    √ ~/my-archive/archive
    √ ~/my-archive/logs

[+] Building main SQL index and running migrations...
    √ ~/my-archive/index.sqlite3

    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    Applying contenttypes.0001_initial... OK
    Applying auth.0001_initial... OK
    Applying admin.0001_initial... OK
    ...

[*] Collecting links from any existing index or archive folders...
    √ Loaded 30 links from existing main index...
    ! Skipped adding 2 orphaned link data directories that would have overwritten existing data.
    ! Skipped adding 2 corrupted/unrecognized link data directories that could not be read.
        For more information about the link data directories that were skipped, run:
            archivebox status
            archivebox list --status=invalid
            archivebox list --status=orphaned
            archivebox list --status=duplicate

[*] [2019-04-24 15:41:11] Writing 30 links to main index...
    √ ~/my-archive/index.sqlite3
    √ ~/my-archive/index.json
    √ ~/my-archive/index.html

------------------------------------------------------------------
[√] Done. A new ArchiveBox collection was initialized (30 links).

    To view your archive index, open:
        ~/my-archive/index.html

    To add new links, you can run:
        archivebox add 'https://example.com'

    For more usage and examples, run:
        archivebox help

$ archivebox status

Print out some info and statistics about the archive collection.

$ archivebox status
[*] Scanning archive collection main index...
    /Users/squash/Documents/Code/ArchiveBox/data/*
    Size: 209.3 KB across 3 files

    > JSON Main Index: 30 links      (found in index.json)
    > SQL Main Index: 30 links       (found in index.sqlite3)
    > HTML Main Index: 30 links      (found in index.html)
    > JSON Link Details: 1 links     (found in archive/*/index.json)
    > Admin: 0 users                 (found in index.sqlite3)

    Hint: You can create an admin user by running:
        archivebox manage createsuperuser

[*] Scanning archive collection link data directories...
    /Users/squash/Documents/Code/ArchiveBox/data/archive/*
    Size: 1.6 MB across 46 files in 50 directories

    > indexed: 30                    (indexed links without checking archive status or data directory validity)
      > archived: 1                  (indexed links that are archived with a valid data directory)
      > unarchived: 29               (indexed links that are unarchived with no data directory or an empty data directory)

    > present: 1                     (dirs that are expected to exist based on the main index)
      > valid: 1                     (dirs with a valid index matched to the main index and archived content)
      > invalid: 0                   (dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized)
        > duplicate: 0               (dirs that conflict with other directories that have the same link URL or timestamp)
        > orphaned: 0                (dirs that contain a valid index but aren't listed in the main index)
        > corrupted: 0               (dirs that don't contain a valid index and aren't listed in the main index)
        > unrecognized: 0            (dirs that don't contain recognizable archive data and aren't listed in the main index)

    Hint: You can list link data directories by status like so:
        archivebox list --status=<status>  (e.g. indexed, corrupted, archived, etc.)

$ archivebox add

--only-new

Controls whether to only add new links or also retry previously failed/skipped links.

--index-only

Pass this to only add the links to the main index without archiving them.

--mirror

Archive an entire site (finding all linked pages below it on the same domain)

--depth

Controls how many hops outward to follow links from the given URL: 0 archives only the page itself and follows no outlinks; 1 archives the page plus every page it links to; 2 follows a maximum of two hops outward; and so on…

--crawler=[type]

Controls which crawler to use in order to find outlinks in a given page.

url

The URL of the page you want to archive.

< stdin

URLs to be added can also be piped in via stdin instead of being passed as arguments.
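
For example, piping a URL in via stdin and combining it with the --index-only flag documented above (the URL is a placeholder):

$ echo 'https://example.com/some/page' | archivebox add --index-only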

$ archivebox add --depth=1 https://example.com
[+] [2019-03-30 18:36:41] Adding 1 new url and all pages 1 hop out: https://example.com
[*] [2019-03-30 18:36:42] Saving main index files...
    √ ./index.json
    √ ./index.html
[▶] [2019-03-30 18:36:42] Updating archive content...
[+] [2019-03-30 18:36:42] "Using Geolocation Data to Understand Consumer Behavior During Severe Weather Events"
    https://orbitalinsight.com/using-geolocation-data-understand-consumer-behavior-severe-weather-events
    > ./archive/1553789823
        > wget
        > warc
        > media
        > screenshot
[√] [2019-03-30 18:39:00] Update of 37 pages complete (2.08 sec)
    - 35 links skipped
    - 0 links updated
    - 2 links had errors
[*] [2019-03-30 18:39:00] Saving main index files...
    √ ./index.json
    √ ./index.html

    To view your archive, open:
        /Users/example/ArchiveBox/index.html

$ archivebox schedule

Use python-crontab to add, remove, and edit regularly scheduled archive update jobs.

--run-all

Run all the scheduled jobs once immediately, independent of their configured schedules

--foreground

Launch ArchiveBox as a long-running foreground task instead of using cron.

--show

Print a list of currently active ArchiveBox cron jobs

--clear

Stop all ArchiveBox scheduled runs, clear it completely from cron

--add

Add a new scheduled ArchiveBox update job to cron

--quiet

Don’t warn about many jobs potentially using up storage space.

--every=[schedule]

The schedule to run the command can be either:

  • minute/hour/day/week/month/year
  • or a cron-formatted schedule like "*/2 * * * *", "0 */10 * * *", …

import_path

The path to a local file or remote URL to check for new links.

$ archivebox schedule --show
@hourly cd /opt/ArchiveBox/data && /opt/ArchiveBox/.venv/bin/archivebox add "https://getpocket.com/users/nikisweeting/feed/all" 2>&1 > /opt/ArchiveBox/data/logs/archivebox.log # archivebox_schedule
$ archivebox schedule --add --every=hour https://getpocket.com/users/nikisweeting/feed/all

[√] Scheduled new ArchiveBox cron job for user: squash (1 jobs are active).
  > @hourly cd /Users/squash/Documents/Code/ArchiveBox/data && /Users/squash/Documents/Code/ArchiveBox/.venv/bin/archivebox add "https://getpocket.com/users/nikisweeting/feed/all" 2>&1 > /Users/squash/Documents/Code/ArchiveBox/data/logs/archivebox.log # archivebox_schedule

[!] With the current cron config, ArchiveBox is estimated to run >365 times per year.
    Congrats on being an enthusiastic internet archiver! 👌

    Make sure you have enough storage space available to hold all the data.
    Using a compressed/deduped filesystem like ZFS is recommended if you plan on archiving a lot.

$ archivebox config

(no args)

Print the entire config to stdout.

--get KEY

Get the given config key:value and print it to stdout.

--set KEY=VALUE

Set the given config key:value in the current collection’s config file.

< stdin

Config KEY=VALUE pairs can also be piped in via stdin when using --set.

$ archivebox config
OUTPUT_DIR="output"
OUTPUT_PERMISSIONS=755
ONLY_NEW=False
...
$ archivebox config --get CHROME_VERSION
Google Chrome 74.0.3729.40 beta
$ archivebox config --set USE_CHROME=False
USE_CHROME=False
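
KEY=VALUE pairs can also be piped in, e.g. (a sketch assuming --set accepts pairs on stdin, as described above; ONLY_NEW is one of the keys shown in the config output):

$ echo "ONLY_NEW=True" | archivebox config --set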

$ archivebox update

Check all subscribed feeds for new links, archive them, and retry any previously failed pages.

(no args)

Update the index and go through each page, retrying any that failed previously.

--only-new

By default, update retries previously failed/skipped pages as well; pass this flag to only archive newly added links, without going through the whole archive and re-attempting previously failed ones.

--resume=[timestamp]

Resume the update process from a specific URL timestamp.

--snapshot

[TODO] By default, ArchiveBox never re-archives a page after the first successful archive. Pass this option to take a new snapshot of every page, even if an existing version exists.
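
For example, using the flags documented above (the timestamp is a placeholder):

$ archivebox update --only-new
$ archivebox update --resume=1554263415.0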

$ archivebox list

--csv=COLUMNS

Print the output in CSV format, with the specified columns, e.g. --csv=timestamp,base_url,is_archived

--json

Print the output in JSON format (with all the link attributes included in the JSON output).

--filter=REGEX

Print only URLs matching a specified regex, e.g. --filter='.*github.com.*'

--before=TIMESTAMP / --after=TIMESTAMP

Print only URLs before or after a given timestamp, e.g. --before=1554263415.2 or --after=1554260000

$ archivebox list --sort=timestamp
http://www.iana.org/domains/example
https://github.com/pirate/ArchiveBox/wiki
https://github.com/pirate/ArchiveBox/commit/0.4.0
https://github.com/pirate/ArchiveBox
https://archivebox.io
$ archivebox list --sort=timestamp --csv=timestamp,url
timestamp,url
1554260947,http://www.iana.org/domains/example
1554263415,https://github.com/pirate/ArchiveBox/wiki
1554263415.0,https://github.com/pirate/ArchiveBox/commit/0.4.0
1554263415.1,https://github.com/pirate/ArchiveBox
1554263415.2,https://archivebox.io
$ archivebox list --sort=timestamp --csv=timestamp,url --after=1554263415.0
timestamp,url
1554263415,https://github.com/pirate/ArchiveBox/wiki
1554263415.0,https://github.com/pirate/ArchiveBox/commit/0.4.0
1554263415.1,https://github.com/pirate/ArchiveBox
1554263415.2,https://archivebox.io
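
Flags can be combined; for example, a hypothetical invocation exporting only the matching links as CSV:

$ archivebox list --csv=timestamp,url --filter='.*github.com.*'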

$ archivebox remove

--yes

Proceed with removal without prompting the user for confirmation.

--delete

Also delete all the matching links snapshot data folders and content files.

--filter-type

Defaults to exact, but can be set to any of exact, substring, domain, or regex.

pattern

The filter pattern used to match links in the index. Matching links are removed.

--before=TIMESTAMP / --after=TIMESTAMP

Remove any URLs bookmarked before/after the given timestamp, e.g. --before=1554263415.2 or --after=1554260000.

$ archivebox remove --delete --filter-type=regex 'http(s)?:\/\/(.+)?(youtube\.com|example\.com)\/?.*'
[*] Finding links in the archive index matching these regex patterns:
    http(s)?:\/\/(.+)?(youtube\.com|example\.com)\/?.*

---------------------------------------------------------------------------------------------------
timestamp        | is_archived      | num_outputs      | url
"1554984695"     | true             | 7                | "https://example.com"
---------------------------------------------------------------------------------------------------

[i] Found 1 matching URLs to remove.
    1 Links will be de-listed from the main index, and their archived content folders will be deleted from disk.
    (1 data folders with 7 archived files will be deleted!)

[?] Do you want to proceed with removing these 1 links?
    y/[n]: y

[*] [2019-04-11 08:11:57] Saving main index files...
    √ /opt/ArchiveBox/data/index.json
    √ /opt/ArchiveBox/data/index.html

[√] Removed 1 out of 1 links from the archive index.
    Index now contains 0 links.
$ archivebox remove --yes --delete --filter-type=domain example.com
...

$ archivebox manage

Run a Django management command in the context of the current archivebox data directory.

[command] [...args]

The name of the management command to run, e.g.: help, migrate, changepassword, createsuperuser, etc.

$ archivebox manage help
Type 'archivebox manage help <subcommand>' for help on a specific subcommand.

Available subcommands:

[auth]
    changepassword
    createsuperuser

[contenttypes]
    remove_stale_contenttypes

[core]
    archivebox

...
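
For example, to create an admin user for the web UI (as suggested by the archivebox status hint above):

$ archivebox manage createsuperuser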

$ archivebox server

--bind=[ip:port]

The address:port combo to run the web UI server on, defaults to 127.0.0.1:8000.

$ archivebox server
[+] Starting ArchiveBox webserver...
Watching for file changes with StatReloader
Performing system checks...

System check identified no issues (0 silenced).
April 23, 2019 - 01:40:52
Django version 2.2, using settings 'core.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
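
To listen on all interfaces instead of just localhost, pass the --bind flag documented above:

$ archivebox server --bind=0.0.0.0:8000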

$ archivebox proxy

Run a live HTTP/HTTPS proxy that records all traffic into WARC files using pywb.

--bind=[ip:port]

The address:port combo to run the proxy on, defaults to 127.0.0.1:8010.

--record

Save all traffic visited through the proxy to the archive.

--replay

Attempt to serve all pages visited through the proxy from the archive.
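
A hypothetical invocation combining the flags above (this subcommand is still on the roadmap, so the exact flags may change):

$ archivebox proxy --bind=127.0.0.1:8010 --record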

$ archivebox shell

Drop into an ArchiveBox Django shell with access to all models and data.

$ archivebox shell
Loaded archive data folder ~/example_collection...
Python 3.7.2 (default, Feb 12 2019, 08:15:36)

In [1]: url_to_archive = Link.objects.filter(is_archived=True).values_list('url', flat=True)
...
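Any Django ORM query works in this shell; for instance, a hypothetical query against the same Link model used above:

In [2]: Link.objects.filter(url__icontains='github.com').count()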

$ archivebox oneshot

Create a single-URL archive folder containing an index.json, an index.html, and all the archive method outputs. Use this to archive individual pages without needing to create a whole collection with archivebox init.

--out-dir=[path]

Path to save the single archive folder to, e.g. ./example.com_archive.

[--all|--media|--wget|...]

Which archive methods to use when saving the URL.
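
A hypothetical invocation using the options above (the argument order is an assumption):

$ archivebox oneshot --all --out-dir=./example.com_archive 'https://example.com'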

Python Usage

API for normal ArchiveBox usage

from archivebox import add, subscribe, update

add('https://example.com', depth=2)
subscribe('https://example.com/some/feed.rss')
update(only_new=True)

API for All Useful Subcomponents

from archivebox import oneshot
from archivebox.crawl import rss
from archivebox.extract import media

links = rss.crawl(open('feed.rss', 'r').read())
assets = media.extract('https://youtube.com/watch?v=example')
oneshot('https://example.com', depth=2, out_dir='~/Desktop/example.com_archive')

Design

As of v0.4.0, ArchiveBox also writes the index to a SQLite3 file using the Django ORM (in addition to the usual JSON and HTML formats, which aren’t going away). To an end user it will still appear to be a single CLI application, with none of the Django complexity exposed. Django is used primarily because it allows safe migrations of a SQLite database; as the schema gets updated in the future, I don’t want to break people’s archives with every new version. It also allows the GUI server to start with many safe defaults and share much of the same codebase with the CLI and library components, including maintaining the archive database and managing a worker pool.
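
To illustrate why the ORM helps (a hypothetical sketch, not the actual models.py; field names are assumptions): declaring the index as a Django model gives the schema a migrations history, so a future column change becomes a generated migration instead of a breaking format change.

from django.db import models

class Link(models.Model):
    # hypothetical index row for one archived URL
    url = models.URLField(unique=True)
    timestamp = models.CharField(max_length=32, unique=True)
    title = models.CharField(max_length=512, null=True)
    is_archived = models.BooleanField(default=False)

Adding a field later would then just be `archivebox manage makemigrations && archivebox manage migrate`, leaving existing archives intact.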

There will be 3 primary use cases for archivebox, and all three will be served by the pip package:

  • simple CLI operation: archivebox add --depth=1 ./path/to/export.html (similar to the current archivebox CLI)
  • use of individual components as a library: from archivebox.extract import screenshot or archivebox oneshot --screenshot ...
  • usage in server mode with a GUI to add/remove links and create exports: archivebox server

Dependencies:

  • django (required)
  • sqlite (required)
  • headless chrome (required)
  • wget (required)
  • redis (optional, for web GUI only)
  • dramatiq (optional, for web GUI only)

When launched in webserver mode, archivebox will automatically spawn a pool of (dramatiq) workers, as large as the number of available CPUs, to use for crawling, archiving, and publishing.

When launched in CLI mode, it will use plain subprocesses for parallelism, without needing redis/dramatiq.
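
A minimal sketch of that CLI-mode approach, assuming a hypothetical archive_page() helper (the real worker code will differ):

import os
from concurrent.futures import ProcessPoolExecutor

def archive_page(url: str) -> str:
    """Hypothetical worker: run wget/chrome/etc. for a single page."""
    ...  # the archive methods would run here
    return url

if __name__ == '__main__':
    urls = ['https://example.com', 'https://example.org']
    # size the pool to the machine, mirroring the webserver-mode worker pool
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        for url in pool.map(archive_page, urls):
            print('archived', url)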

Code Folder Layout

  • archivebox/
    • core/
      • models.py
        • Archive = Dict[Page, Dict[Archiver, List[Asset]]] # a collection of archived pages
        • Crawl = List[Page] # a list of links to add to an archive
        • Page # an archived page with a unique url
        • Asset # a file archived from a page
      • util.py
      • settings.py
    • crawl/ (impl: detect_crawlable(Import) -> bool, crawl(Import) -> List[Page])
      • txt.py
      • rss.py
      • netscape.py
      • pocket.py
      • pinboard.py
      • html.py
    • extract/ (impl: detect_extractable(Page) -> bool, extract(Page) -> List[Asset]; see the sketch after this list)
      • wget.py
      • screenshot.py
      • pdf.py
      • dom.py
      • youtubedl.py
      • waybackmachine.py
      • solana.py
    • publish/ (impl: publish(Archive, output_format))
      • html.py
      • json.py
      • csv.py
      • sql.py
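
A hypothetical sketch of what one extract/ module implementing the interface above could look like (everything beyond the two documented function signatures is an assumption):

from typing import List

def detect_extractable(page: 'Page') -> bool:
    """Return True if this archive method applies to the given page."""
    # e.g. a screenshot extractor could handle any http(s) page
    return page.url.startswith(('http://', 'https://'))

def extract(page: 'Page') -> List['Asset']:
    """Run the archive method and return the Assets it produced."""
    raise NotImplementedError  # each module (wget.py, pdf.py, ...) fills this in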

Collection Data Folder Layout

  • ArchiveBox.conf
  • database/
    • sqlite.db
  • archive/
    • assets/<hash>/
  • logs/
    • server.log
    • crawl.log
    • archive.log

Exported Folder Layout

For publishing the archive as static html/json/csv/sql.

  • index.{html,json,csv,sql}
  • archive/
    • <timestamp>/
      • index.html
      • <url>/
        • index.{html,json,csv,sql}
      • assets/
        • hash.mp4
        • hash.txt
        • hash.mp3

The server will be runnable with docker / docker-compose as well:

version: '3'

services:
    archivebox:
        image: archivebox
        ports:
            - '8098:80'
        volumes:
            - ./data/:/data
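
With a compose file like the one above saved as docker-compose.yml, the server could then be started with:

$ docker-compose up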

Major long-term changes

  • release pip, apt, pkg, and brew packaged distributions for installing ArchiveBox
  • add an optional web GUI for managing sources, adding new links, and viewing the archive
  • switch to django + sqlite db with migrations system & json/html export for managing archive schema changes and persistence
  • modularize internals to allow importing individual components
  • switch to sha256 of URL as unique link ID
  • support storing multiple snapshots of pages over time
  • support custom user puppeteer scripts to run while archiving (e.g. for expanding reddit threads, scrolling thread on twitter, etc)
  • support named collections of archived content with different user access permissions
  • support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system

Smaller planned features

  • support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
  • body text extraction to markdown (using fathom?)
  • featured image / thumbnail extraction
  • auto-tagging links based on important/frequent keywords in extracted text (like pocket)
  • automatic article summary paragraphs from extracted text with nlp summarization library
  • full-text search of extracted text with elasticsearch/elasticlunr/ag
  • download closed-caption subtitles from Youtube and other video sites for full-text indexing of video content
  • try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
  • And more in the issues list

IMPORTANT: Please don’t work on any of these major long-term tasks without contacting me first, work is already in progress for many of these, and I may have to reject your PR if it doesn’t align with the existing work!