Roadmap¶

▶️ Comment here to discuss the contribution roadmap: Official Roadmap Discussion.

Planned Specification¶

To see how this spec has been scheduled / implemented / released so far, read these pull requests:

✅ v0.2.x
✅ v0.3.x
✅ v0.4.x
🛠 v0.5.x

API:

pip install archivebox
archivebox version
archivebox help
archivebox init
archivebox status
archivebox add
archivebox remove
archivebox schedule
archivebox config
archivebox update
archivebox list
archivebox oneshot
archivebox server
archivebox proxy
archivebox shell
archivebox manage
from archivebox import ...
from archivebox.component import ...

Design:

Overview
Dependencies
Code Layout
Data Layout
Export Layout

CLI Usage¶

Note: the following ways to run ArchiveBox are equivalent:

archivebox [subcommand] [...args]
python3 -m archivebox [subcommand] [...args]
docker run -v $PWD:/data nikisweeting/archivebox [subcommand] [...args]
docker-compose run archivebox [subcommand] [...args]

`$ pip install archivebox`¶

...
Installing collected packages: archivebox
  Running setup.py install for archivebox ... done
Successfully installed archivebox-0.4.9

Developers working on the ArchiveBox codebase should install the project in “linked” mode for development using: pipenv install; pip install -e .

`$ archivebox [version|--version]`¶

ArchiveBox v0.4.9

[i] Dependency versions:
 √  PYTHON_BINARY            /opt/ArchiveBox/.venv/bin/python3.7                                    v3.7            valid
 √  DJANGO_BINARY            /opt/ArchiveBox/.venv/lib/python3.7/site-packages/django/bin/django-admin.py v2.2.0          valid
 √  CURL_BINARY              /usr/bin/curl                                                          v7.54.0         valid
 √  WGET_BINARY              /usr/local/bin/wget                                                    v1.20.1         valid
 √  GIT_BINARY               /usr/local/bin/git                                                     v2.20.1         valid
 √  YOUTUBEDL_BINARY         /opt/ArchiveBox/.venv/bin/youtube-dl                                   v2019.04.17     valid
 √  CHROME_BINARY            /Applications/Google Chrome.app/Contents/MacOS/Google Chrome           v74.0.3729.91   valid

[i] Folder locations:
 √  REPO_DIR                 /opt/ArchiveBox                                                        28 files        valid
 √  PYTHON_DIR               /opt/ArchiveBox/archivebox                                             14 files        valid
 √  LEGACY_DIR               /opt/ArchiveBox/archivebox/legacy                                      15 files        valid
 √  TEMPLATES_DIR            /opt/ArchiveBox/archivebox/legacy/templates                            7 files         valid
 √  OUTPUT_DIR               /opt/ArchiveBox/archivebox/data                                        10 files        valid
 √  SOURCES_DIR              /opt/ArchiveBox/archivebox/data/sources                                1 files         valid
 √  LOGS_DIR                 /opt/ArchiveBox/archivebox/data/logs                                   0 files         valid
 √  ARCHIVE_DIR              /opt/ArchiveBox/archivebox/data/archive                                2 files         valid
 √  CHROME_USER_DATA_DIR     /Users/squash/Library/Application Support/Chromium                     2 files         valid
 -  COOKIES_FILE                                                                                    -               disabled

`$ archivebox [help|-h|--help]`¶

ArchiveBox: The self-hosted internet archive.

Documentation:
    https://github.com/pirate/ArchiveBox/wiki

UI Usage:
    Open output/index.html to view your archive.

CLI Usage:
    mkdir data; cd data/
    archivebox init

    echo 'https://example.com/some/page' | archivebox add
    archivebox add https://example.com/some/other/page
    archivebox add --depth=1 ~/Downloads/bookmarks_export.html
    archivebox add --depth=1 https://example.com/feed.rss
    archivebox update --resume=15109948213.123

`$ archivebox init`¶

Initialize a new “collection” folder, aka a complete archive containing an ArchiveBox.conf config file, an index of all the archived pages, and the archived content for each page.

$ mkdir ~/my-archive && cd ~/my-archive
$ archivebox init
[+] Initializing a new ArchiveBox collection in this folder...
    ~/my-archive
------------------------------------------------------------------

[+] Building archive folder structure...
    √ ~/my-archive/sources
    √ ~/my-archive/archive
    √ ~/my-archive/logs

[+] Building main SQL index and running migrations...
    √ ~/my-archive/index.sqlite3

    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    Applying contenttypes.0001_initial... OK
    Applying auth.0001_initial... OK
    Applying admin.0001_initial... OK
    ...

[*] Collecting links from any existing index or archive folders...
    √ Loaded 30 links from existing main index...
    ! Skipped adding 2 orphaned link data directories that would have overwritten existing data.
    ! Skipped adding 2 corrupted/unrecognized link data directories that could not be read.
        For more information about the link data directories that were skipped, run:
            archivebox status
            archivebox list --status=invalid
            archivebox list --status=orphaned
            archivebox list --status=duplicate

[*] [2019-04-24 15:41:11] Writing 30 links to main index...
    √ ~/my-archive/index.sqlite3
    √ ~/my-archive/index.json
    √ ~/my-archive/index.html

------------------------------------------------------------------
[√] Done. A new ArchiveBox collection was initialized (30 links).

    To view your archive index, open:
        ~/my-archive/index.html

    To add new links, you can run:
        archivebox add 'https://example.com'

    For more usage and examples, run:
        archivebox help

`$ archivebox status`¶

Print out some info and statistics about the archive collection.

$ archivebox status
[*] Scanning archive collection main index...
    /Users/squash/Documents/Code/ArchiveBox/data/*
    Size: 209.3 KB across 3 files

    > JSON Main Index: 30 links      (found in index.json)
    > SQL Main Index: 30 links       (found in index.sqlite3)
    > HTML Main Index: 30 links      (found in index.html)
    > JSON Link Details: 1 links     (found in archive/*/index.json)
    > Admin: 0 users                 (found in index.sqlite3)

    Hint: You can create an admin user by running:
        archivebox manage createsuperuser

[*] Scanning archive collection link data directories...
    /Users/squash/Documents/Code/ArchiveBox/data/archive/*
    Size: 1.6 MB across 46 files in 50 directories

    > indexed: 30                    (indexed links without checking archive status or data directory validity)
      > archived: 1                  (indexed links that are archived with a valid data directory)
      > unarchived: 29               (indexed links that are unarchived with no data directory or an empty data directory)

    > present: 1                     (dirs that are expected to exist based on the main index)
      > valid: 1                     (dirs with a valid index matched to the main index and archived content)
      > invalid: 0                   (dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized)
        > duplicate: 0               (dirs that conflict with other directories that have the same link URL or timestamp)
        > orphaned: 0                (dirs that contain a valid index but aren't listed in the main index)
        > corrupted: 0               (dirs that don't contain a valid index and aren't listed in the main index)
        > unrecognized: 0            (dirs that don't contain recognizable archive data and aren't listed in the main index)

    Hint: You can list link data directories by status like so:
        archivebox list --status=<status>  (e.g. indexed, corrupted, archived, etc.)

`$ archivebox add`¶

`--only-new`¶

Controls whether to only add new links or also retry previously failed/skipped links.

`--index-only`¶

Pass this to only add the links to the main index without archiving them.

`--mirror`¶

Archive an entire site (finding all linked pages below it on the same domain)

`--depth`¶

Controls how far to follow links from the given url. 0 sets it to only archive the page, and not follow any outlinks. 1 sets it to archive the page and follow one link outwards and archive those pages. 2 sets it to follow a maximum of two hops outwards, and so on…
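
The hop counting described above can be sketched as a breadth-first traversal with a depth limit. This is only an illustration under stated assumptions: `crawl_to_depth`, `outlinks`, and the hard-coded `GRAPH` are hypothetical names standing in for real page fetching, not part of the ArchiveBox API.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_to_depth(start_url: str, depth: int,
                   get_outlinks: Callable[[str], Iterable[str]]) -> List[str]:
    """Return start_url plus every page reachable within `depth` hops."""
    seen = {start_url}
    ordered = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, hops = queue.popleft()
        if hops >= depth:
            continue  # depth limit reached: do not follow outlinks from this page
        for link in get_outlinks(url):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
                queue.append((link, hops + 1))
    return ordered


# Hypothetical link graph standing in for real pages and their outlinks.
GRAPH: Dict[str, List[str]] = {
    'https://example.com': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/c'],
}


def outlinks(url: str) -> List[str]:
    return GRAPH.get(url, [])
```

With this graph, a depth of 0 yields only the start page, a depth of 1 adds `/a` and `/b`, and a depth of 2 also adds `/c`.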

`--crawler=[type]`¶

Controls which crawler to use in order to find outlinks in a given page.

`url`¶

The URL of the page you want to archive.

`< stdin`¶

URLs to be added can also be piped in via stdin instead of being passed as an argument.

$ archivebox add --depth=1 https://example.com
[+] [2019-03-30 18:36:41] Adding 1 new url and all pages 1 hop out: https://example.com
[*] [2019-03-30 18:36:42] Saving main index files...
    √ ./index.json
    √ ./index.html
[▶] [2019-03-30 18:36:42] Updating archive content...
[+] [2019-03-30 18:36:42] "Using Geolocation Data to Understand Consumer Behavior During Severe Weather Events"
    https://orbitalinsight.com/using-geolocation-data-understand-consumer-behavior-severe-weather-events
    > ./archive/1553789823
        > wget
        > warc
        > media
        > screenshot
[√] [2019-03-30 18:39:00] Update of 37 pages complete (2.08 sec)
    - 35 links skipped
    - 0 links updated
    - 2 links had errors
[*] [2019-03-30 18:39:00] Saving main index files...
    √ ./index.json
    √ ./index.html

    To view your archive, open:
        /Users/example/ArchiveBox/index.html

`$ archivebox schedule`¶

Use python-crontab to add, remove, and edit regularly scheduled archive update jobs.

`--run-all`¶

Run all the scheduled jobs once immediately, independent of their configured schedules

`--foreground`¶

Launch ArchiveBox as a long-running foreground task instead of using cron.

`--show`¶

Print a list of currently active ArchiveBox cron jobs

`--clear`¶

Stop all scheduled ArchiveBox runs and clear them completely from cron.

`--add`¶

Add a new scheduled ArchiveBox update job to cron

`--quiet`¶

Don’t warn about many jobs potentially using up storage space.

`--every=[schedule]`¶

The schedule to run the command can be either:

minute/hour/day/week/month/year
or a cron-formatted schedule like "0/2 * * * *"/"* 0/10 * * * *"/…
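
One way the two accepted forms might be resolved into a cron schedule is sketched here. Only the `hour` to `@hourly` mapping is confirmed by the example output in this section; the other entries, the `resolve_schedule` name, and the field-count check are assumptions made for illustration.

```python
# Hypothetical mapping from the named intervals to cron schedules.
# Only 'hour' -> '@hourly' appears in the example output; the rest are assumed.
SHORTCUTS = {
    'minute': '* * * * *',
    'hour': '@hourly',
    'day': '@daily',
    'week': '@weekly',
    'month': '@monthly',
    'year': '@yearly',
}


def resolve_schedule(every: str) -> str:
    """Turn an --every value into a cron schedule string."""
    every = every.strip()
    if every in SHORTCUTS:
        return SHORTCUTS[every]
    # Anything else is treated as a raw cron expression and passed through,
    # after a loose sanity check that it has at least five fields.
    if len(every.split()) < 5:
        raise ValueError(f'not a known interval or cron schedule: {every!r}')
    return every
```

Passing raw expressions through unchanged leaves the final validation to cron itself, which keeps the sketch small.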

`import_path`¶

The path to a local file, or a remote URL, to check for new links.

$ archivebox schedule --show
@hourly cd /opt/ArchiveBox/data && /opt/ArchiveBox/.venv/bin/archivebox add "https://getpocket.com/users/nikisweeting/feed/all" 2>&1 > /opt/ArchiveBox/data/logs/archivebox.log # archivebox_schedule

$ archivebox schedule --add --every=hour https://getpocket.com/users/nikisweeting/feed/all

[√] Scheduled new ArchiveBox cron job for user: squash (1 jobs are active).
  > @hourly cd /Users/squash/Documents/Code/ArchiveBox/data && /Users/squash/Documents/Code/ArchiveBox/.venv/bin/archivebox add "https://getpocket.com/users/nikisweeting/feed/all" 2>&1 > /Users/squash/Documents/Code/ArchiveBox/data/logs/archivebox.log # archivebox_schedule

[!] With the current cron config, ArchiveBox is estimated to run >365 times per year.
    Congrats on being an enthusiastic internet archiver! 👌

    Make sure you have enough storage space available to hold all the data.
    Using a compressed/deduped filesystem like ZFS is recommended if you plan on archiving a lot.

`$ archivebox config`¶

`(no args)`¶

Print the entire config to stdout.

`--get KEY`¶

Get the given config key:value and print it to stdout.

`--set KEY=VALUE`¶

Set the given config key:value in the current collection’s config file.
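
A minimal sketch of how a `KEY=VALUE` argument could be split and stored follows. It uses a plain dict in place of the collection's config file, and the helper names (`parse_assignment`, `config_set`, `config_get`) are hypothetical, not ArchiveBox functions.

```python
from typing import Dict, Tuple


def parse_assignment(arg: str) -> Tuple[str, str]:
    """Split a 'KEY=VALUE' argument on the first '=' only."""
    key, sep, value = arg.partition('=')
    if not sep or not key:
        raise ValueError(f'expected KEY=VALUE, got {arg!r}')
    return key, value


def config_set(config: Dict[str, str], arg: str) -> str:
    """Store the value and return the line that would be echoed back."""
    key, value = parse_assignment(arg)
    config[key] = value
    return f'{key}={value}'


def config_get(config: Dict[str, str], key: str) -> str:
    """Look up a single key; raises KeyError if it is not set."""
    return config[key]
```

Splitting on the first `=` only means values that themselves contain `=` survive intact.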

`< stdin`¶

$ archivebox config
OUTPUT_DIR="output"
OUTPUT_PERMISSIONS=755
ONLY_NEW=False
...

$ archivebox config --get CHROME_VERSION
Google Chrome 74.0.3729.40 beta

$ archivebox config --set USE_CHROME=False
USE_CHROME=False

`$ archivebox update`¶

Check all subscribed feeds for new links, archive them and retry any previously failed pages.

`(no args)`¶

Update the index and go through each page, retrying any that failed previously.

`--only-new`¶

By default, update always retries previously failed/skipped pages. Pass this flag to archive only newly added links, without going through the whole archive and attempting to fix previously failed links.

`--resume=[timestamp]`¶

Resume the update process from a specific URL timestamp.

`--snapshot`¶

[TODO] By default, ArchiveBox never re-archives pages after the first successful archive. Pass this option to take a new snapshot of every page even if there’s an existing version.

`$ archivebox list`¶

`--csv=COLUMNS`¶

Print the output in CSV format, with the specified columns, e.g. --csv=timestamp,base_url,is_archived
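
The column selection could be sketched with the standard `csv` module as below. The `links_to_csv` name and the dict-per-link representation are assumptions made for illustration, not the actual implementation.

```python
import csv
import io
from typing import Dict, Iterable


def links_to_csv(links: Iterable[Dict[str, object]], columns: str) -> str:
    """Render links as CSV, keeping only the comma-separated columns requested."""
    cols = [c.strip() for c in columns.split(',') if c.strip()]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=cols,
                            extrasaction='ignore',  # drop attributes not requested
                            lineterminator='\n')
    writer.writeheader()
    for link in links:
        writer.writerow(link)
    return out.getvalue()
```

Using `csv.DictWriter` rather than joining strings by hand means URLs containing commas or quotes are escaped correctly.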

`--json`¶

Print the output in JSON format (with all the link attributes included in the JSON output).

`--filter=REGEX`¶

Print only URLs matching a specified regex, e.g. --filter='.*github.com.*'

`--before=TIMESTAMP` / `--after=TIMESTAMP`¶

Print only URLs before or after a given timestamp, e.g. --before=1554263415.2 or --after=1554260000

$ archivebox list --sort=timestamp
http://www.iana.org/domains/example
https://github.com/pirate/ArchiveBox/wiki
https://github.com/pirate/ArchiveBox/commit/0.4.0
https://github.com/pirate/ArchiveBox
https://archivebox.io

$ archivebox list --sort=timestamp --csv=timestamp,url
timestamp,url
1554260947,http://www.iana.org/domains/example
1554263415,https://github.com/pirate/ArchiveBox/wiki
1554263415.0,https://github.com/pirate/ArchiveBox/commit/0.4.0
1554263415.1,https://github.com/pirate/ArchiveBox
1554263415.2,https://archivebox.io

$ archivebox list --sort=timestamp --csv=timestamp,url --after=1554263415.0
timestamp,url
1554263415,https://github.com/pirate/ArchiveBox/wiki
1554263415.0,https://github.com/pirate/ArchiveBox/commit/0.4.0
1554263415.1,https://github.com/pirate/ArchiveBox
1554263415.2,https://archivebox.io
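
The timestamp filtering shown in the listings above can be sketched as a numeric comparison. The `--after` output above includes the boundary value itself, so `after` is treated as inclusive here; treating `before` as inclusive too is an assumption, as is the `filter_by_time` name.

```python
from typing import List, Optional, Tuple

# The timestamp/url pairs from the example listing above.
LINKS: List[Tuple[str, str]] = [
    ('1554260947', 'http://www.iana.org/domains/example'),
    ('1554263415', 'https://github.com/pirate/ArchiveBox/wiki'),
    ('1554263415.0', 'https://github.com/pirate/ArchiveBox/commit/0.4.0'),
    ('1554263415.1', 'https://github.com/pirate/ArchiveBox'),
    ('1554263415.2', 'https://archivebox.io'),
]


def filter_by_time(links: List[Tuple[str, str]],
                   before: Optional[float] = None,
                   after: Optional[float] = None) -> List[Tuple[str, str]]:
    """Keep links whose timestamp lies within [after, before] (both inclusive)."""
    kept = []
    for timestamp, url in links:
        ts = float(timestamp)  # compare numerically, keep the original string
        if after is not None and ts < after:
            continue
        if before is not None and ts > before:
            continue
        kept.append((timestamp, url))
    return kept
```

Comparing as floats means `1554263415` and `1554263415.0` are treated as the same instant, which matches the example output.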

`$ archivebox remove`¶

`--yes`¶

Proceed with removal without prompting the user for confirmation.

`--delete`¶

Also delete all of the matching links’ snapshot data folders and content files.

`--filter-type`¶

Defaults to exact, but can be set to any of exact, substring, domain, or regex.

`pattern`¶

The filter pattern used to match links in the index. Matching links are removed.
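
The four filter types might behave roughly as follows. This is a sketch, not the actual matching code: the `matches` name is hypothetical, `domain` is assumed to mean an exact hostname match, and `regex` is assumed to search anywhere in the URL.

```python
import re
from urllib.parse import urlparse


def matches(url: str, pattern: str, filter_type: str = 'exact') -> bool:
    """Return True if the url matches the pattern under the given filter type."""
    if filter_type == 'exact':
        return url == pattern
    if filter_type == 'substring':
        return pattern in url
    if filter_type == 'domain':
        # Assumption: compare against the hostname exactly (no subdomain matching).
        return urlparse(url).hostname == pattern
    if filter_type == 'regex':
        # Assumption: the pattern may match anywhere in the URL.
        return re.search(pattern, url) is not None
    raise ValueError(f'unknown filter type: {filter_type!r}')
```

For example, `matches('https://example.com/page', 'example.com', 'domain')` is true, while the same pattern with the default `exact` type is not.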

`--before=TIMESTAMP` / `--after=TIMESTAMP`¶

Remove any URLs bookmarked before/after the given timestamp, e.g. --before=1554263415.2 or --after=1554260000.

$ archivebox remove --delete --filter-type=regex 'http(s)?:\/\/(.+)?(demo\.dev|example\.com)\/?.*'
[*] Finding links in the archive index matching these regex patterns:
    http(s)?:\/\/(.+)?(demo\.dev|example\.com)\/?.*

---------------------------------------------------------------------------------------------------
timestamp        | is_archived      | num_outputs      | url
"1554984695"     | true             | 7                | "https://example.com"
---------------------------------------------------------------------------------------------------

[i] Found 1 matching URLs to remove.
    1 Links will be de-listed from the main index, and their archived content folders will be deleted from disk.
    (1 data folders with 7 archived files will be deleted!)

[?] Do you want to proceed with removing these 1 links?
    y/[n]: y

[*] [2019-04-11 08:11:57] Saving main index files...
    √ /opt/ArchiveBox/data/index.json
    √ /opt/ArchiveBox/data/index.html

[√] Removed 1 out of 1 links from the archive index.
    Index now contains 0 links.

$ archivebox remove --yes --delete --filter-type=domain example.com
...

`$ archivebox manage`¶

Run a Django management command in the context of the current archivebox data directory.

`[command] [...args]`¶

The name of the management command to run, e.g.: help, migrate, changepassword, createsuperuser, etc.

$ archivebox manage help
Type 'archivebox manage help <subcommand>' for help on a specific subcommand.

Available subcommands:

[auth]
    changepassword
    createsuperuser

[contenttypes]
    remove_stale_contenttypes

[core]
    archivebox

...

`$ archivebox server`¶

`--bind=[ip:port]`¶

The address:port combo to run the web UI server on, defaults to 127.0.0.1:8000.
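
As a rough illustration, an `ip:port` value with the stated default could be parsed like this. The `parse_bind` helper is hypothetical, and it handles IPv4 addresses and hostnames only.

```python
from typing import Tuple

DEFAULT_BIND = '127.0.0.1:8000'  # default stated above


def parse_bind(bind: str = DEFAULT_BIND) -> Tuple[str, int]:
    """Split 'ip:port' into a (host, port) tuple, validating the port range."""
    host, sep, port = bind.rpartition(':')
    if not sep or not host or not port.isdigit():
        raise ValueError(f'expected ip:port, got {bind!r}')
    port_number = int(port)
    if not 0 < port_number < 65536:
        raise ValueError(f'port out of range: {port_number}')
    return host, port_number
```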

$ archivebox server
[+] Starting ArchiveBox webserver...
Watching for file changes with StatReloader
Performing system checks...

System check identified no issues (0 silenced).
April 23, 2019 - 01:40:52
Django version 2.2, using settings 'core.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

`$ archivebox proxy`¶

Run a live HTTP/HTTPS proxy that records all traffic into WARC files using pywb.

`--bind=[ip:port]`¶

The address:port combo to run the proxy on, defaults to 127.0.0.1:8010.

`--record`¶

Save all traffic visited through the proxy to the archive.

`--replay`¶

Attempt to serve all pages visited through the proxy from the archive.

`$ archivebox shell`¶

Drop into an ArchiveBox Django shell with access to all models and data.

$ archivebox shell
Loaded archive data folder ~/example_collection...
Python 3.7.2 (default, Feb 12 2019, 08:15:36)

In [1]: url_to_archive = Link.objects.filter(is_archived=True).values_list('url', flat=True)
...

`$ archivebox oneshot`¶

Create a single URL archive folder with an index.json and index.html, and all the archive method outputs. You can run this to archive single pages without needing to create a whole collection with archivebox init.

`--out-dir=[path]`¶

Path to save the single archive folder to, e.g. ./example.com_archive.

`[--all|--media|--wget|...]`¶

Which archive methods to use when saving the URL.

Python Usage¶

API for normal ArchiveBox usage¶

from archivebox import add, subscribe, update

add('https://example.com', depth=2)
subscribe('https://example.com/some/feed.rss')
update(only_new=True)

API for All Useful Subcomponents¶

from archivebox import oneshot
from archivebox.crawl import rss
from archivebox.extract import media

links = rss.crawl(open('feed.rss', 'r').read())
assets = media.extract('https://youtube.com/watch?v=example')
oneshot('https://example.com', depth=2, out_dir='~/Desktop/example.com_archive')

Design¶

As of v0.4.0, ArchiveBox also writes the index to a sqlite3 file using the Django ORM, in addition to the usual JSON and HTML formats, which aren’t going away. To an end user, it will still appear to be a single CLI application, and none of the Django complexity will be exposed. Django is used primarily because it allows for safe migrations of a sqlite database: as the schema gets updated in the future, I don’t want to break people’s archives with every new version. It also allows the GUI server to start with many safe defaults and to share much of the same codebase with the CLI and library components, including maintaining the archive database and managing a worker pool.

There will be 3 primary use cases for archivebox, and all three will be served by the pip package:

simple CLI operation: archivebox add --depth=1 ./path/to/export.html (similar to current archivebox CLI)
use of individual components as a library: from archivebox.extract import screenshot or archivebox oneshot --screenshot ...
usage in server mode with a GUI to add/remove links and create exports: archivebox server

Dependencies:¶

django (required)
sqlite (required)
headless chrome (required)
wget (required)
redis (optional, for web GUI only)
dramatiq (optional, for web GUI only)

When launched in webserver mode, archivebox will automatically spawn a pool of dramatiq workers, one per available CPU, to use for crawling, archiving, and publishing.

When launched in CLI mode, it will use normal subprocesses for parallelism, without redis/dramatiq.
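
The CLI-mode approach, running ordinary subprocesses with one slot per CPU, might look something like the sketch below. It uses a standard-library thread pool to launch the subprocesses rather than dramatiq, and `FAKE_ARCHIVER` is a hypothetical stand-in for an external tool such as wget.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Stand-in for an external archiving tool (hypothetical command).
FAKE_ARCHIVER = 'import sys; print("archived", sys.argv[1])'


def archive_page(url: str) -> str:
    """Run one archiving subprocess and return its trimmed stdout."""
    result = subprocess.run(
        [sys.executable, '-c', FAKE_ARCHIVER, url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def archive_all(urls: List[str]) -> List[str]:
    """Run up to one subprocess per CPU at a time, preserving input order."""
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(archive_page, urls))
```

Threads are enough here because each one only waits on its child process; the actual work happens in the subprocesses.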

Code Folder Layout¶

archivebox/
- core/
  - models.py
      Archive = Dict[Page, Dict[Archiver, List[Asset]]]  # a collection of archived pages
      Crawl = List[Page]                                 # list of links to add to an archive
      Page                                               # an archived page with unique url
      Asset                                              # a file archived from a page
  - util.py
  - settings.py
- crawl/ impl:
    detect_crawlable(Import) -> bool
    crawl(Import) -> List[Page]
  - txt.py
  - rss.py
  - netscape.py
  - pocket.py
  - pinboard.py
  - html.py
- extract/ impl:
    detect_extractable(Page) -> bool
    extract(Page) -> List[Asset]
  - wget.py
  - screenshot.py
  - pdf.py
  - dom.py
  - youtubedl.py
  - waybackmachine.py
  - solana.py
- publish/ impl: publish(Archive, output_format)
  - html.py
  - json.py
  - csv.py
  - sql.py
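
To make the planned module interfaces more concrete, here is a small sketch of how the types and the `crawl`/`extract` functions from the layout above could fit together. The function and type names follow the layout; the bodies, the `URL_PATTERN` regex, the `build_archive` helper, and the `'placeholder'` archiver key are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Page:
    url: str  # an archived page with a unique url


@dataclass(frozen=True)
class Asset:
    path: str  # a file archived from a page


Archiver = str  # assumption: an archiver is identified by its name
Archive = Dict[Page, Dict[Archiver, List[Asset]]]

URL_PATTERN = re.compile(r'https?://\S+')


# crawl/txt.py style module (sketch): find URLs in plain text.
def detect_crawlable(import_text: str) -> bool:
    return URL_PATTERN.search(import_text) is not None


def crawl(import_text: str) -> List[Page]:
    return [Page(url) for url in URL_PATTERN.findall(import_text)]


# extract/<method>.py style module (sketch): no real archiving happens here.
def detect_extractable(page: Page) -> bool:
    return page.url.startswith(('http://', 'https://'))


def extract(page: Page) -> List[Asset]:
    return [Asset(path='index.html')]  # placeholder output


def build_archive(import_text: str) -> Archive:
    """Crawl the import, then run the placeholder extractor on each page."""
    archive: Archive = {}
    if not detect_crawlable(import_text):
        return archive
    for page in crawl(import_text):
        if detect_extractable(page):
            archive[page] = {'placeholder': extract(page)}
    return archive
```

Keeping each crawler and extractor behind the same pair of functions is what would let individual components be imported and used on their own.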

Collection Data Folder Layout¶

ArchiveBox.conf
database/
- sqlite.db
archive/
- assets/<hash>/
logs/
- server.log
- crawl.log
- archive.log

Exported Folder Layout¶

For publishing the archive as static html/json/csv/sql.

index.html,json,csv,sql
archive/
- <timestamp>/
  - index.html
  - <url>/
    - index.html,json,csv,sql
  - assets/
    - hash.mp4
    - hash.txt
    - hash.mp3

The server will be runnable with docker / docker-compose as well:

version: '3'

services:
    archivebox:
        image: archivebox
        ports:
            - '8098:80'
        volumes:
            - ./data/:/data

Major long-term changes¶

release pip, apt, pkg, and brew packaged distributions for installing ArchiveBox
add an optional web GUI for managing sources, adding new links, and viewing the archive
switch to django + sqlite db with migrations system & json/html export for managing archive schema changes and persistence
modularize internals to allow importing individual components
switch to sha256 of URL as unique link ID
support storing multiple snapshots of pages over time
support custom user puppeteer scripts to run while archiving (e.g. for expanding reddit threads, scrolling thread on twitter, etc)
support named collections of archived content with different user access permissions
support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system
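
For the planned switch to a sha256-based link ID, the core computation would presumably be a one-liner like the sketch below. The `link_id` name is hypothetical, and whether the URL is normalized before hashing is left open here.

```python
import hashlib


def link_id(url: str) -> str:
    """Return the hex sha256 digest of the URL's UTF-8 bytes."""
    return hashlib.sha256(url.encode('utf-8')).hexdigest()
```

The digest is always 64 hex characters, and the same URL always maps to the same ID, unlike a timestamp-based key.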

Smaller planned features¶

support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
body text extraction to markdown (using fathom?)
featured image / thumbnail extraction
auto-tagging links based on important/frequent keywords in extracted text (like pocket)
automatic article summary paragraphs from extracted text with nlp summarization library
full-text search of extracted text with elasticsearch/elasticlunr/ag
download closed-caption subtitles from Youtube and other video sites for full-text indexing of video content
try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
And more in the issues list…

IMPORTANT: Please don’t work on any of these major long-term tasks without contacting me first. Work is already in progress for many of them, and I may have to reject your PR if it doesn’t align with the existing work!
