Roadmap
▶️ Comment here to discuss the contribution roadmap: Official Roadmap Discussion.
Planned Specification
(this is not set in stone, just a rough estimate)
v0.7: Schema improvements
move config loading logic into settings.py
move all the extractors into “plugin” style folders that register their own config
right now, the paths of the extractor output are scattered all over the codebase, e.g. output.pdf (should be moved to constants at the top of the plugin config file)
make out_dir, link_dir, extractor_dir, naming consistent across codebase
remove timestamps as primary keys in favor of hashes, UUIDs, or some other slug https://github.com/ArchiveBox/ArchiveBox/issues/74
create a migration system for folder layout independent of the index (mv is atomic at the FS level, so we just need a transaction.atomic(): move(oldpath, newpath); snap.data_dir = newpath; snap.save())
make Tag a real model ManyToMany with Snapshots
allow multiple Snapshots of the same site over time + CLI / UI to manage those, + migration from old style #2020-01-01 hack to proper versioned snapshots
upgrade from Django 3 to Django 5 https://github.com/ArchiveBox/ArchiveBox/issues/988
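The folder-layout migration idea above (move the folder, then update the index, as one unit of work) can be sketched roughly as follows. This is only an illustration, not ArchiveBox's implementation: it uses the stdlib sqlite3 module in place of the Django ORM, and the snapshot table, its data_dir column, and the migrate_snapshot_dir helper are all hypothetical names.

```python
import os
import sqlite3


def migrate_snapshot_dir(db: sqlite3.Connection, snapshot_id: int, new_path: str) -> None:
    """Move a snapshot folder and update its index row as one unit of work."""
    # Hypothetical schema: snapshot(id INTEGER PRIMARY KEY, data_dir TEXT)
    (old_path,) = db.execute(
        "SELECT data_dir FROM snapshot WHERE id = ?", (snapshot_id,)
    ).fetchone()

    # A rename is atomic when both paths live on the same filesystem.
    os.rename(old_path, new_path)
    try:
        with db:  # commits on success, rolls back if the block raises
            db.execute(
                "UPDATE snapshot SET data_dir = ? WHERE id = ?",
                (new_path, snapshot_id),
            )
    except Exception:
        # Undo the move so the folder on disk and the index still agree.
        os.rename(new_path, old_path)
        raise
```

In a Django version the `with db:` block would presumably become `transaction.atomic()` and the UPDATE would become `snap.data_dir = newpath; snap.save()`, as in the roadmap item.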
v0.8: Security
Add CSRF/CSP/XSS protection to rendered archive pages
Provide secure reverse proxy in front of archivebox server in docker-compose.yml
Create UX flow for users to set up session cookies / auth for archiving private sites
cookies for wget, curl, etc. low-level commands
localStorage, cookies, IndexedDB setup for Chrome archiving methods
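For the wget/curl part of the cookie setup, one possible approach is to write the user's session cookies to a Netscape-format cookies.txt file, which wget can read via --load-cookies and curl via --cookie. The sketch below uses Python's stdlib http.cookiejar; the write_cookies_txt helper, its parameters, and the one-day default lifetime are assumptions made for illustration, not an existing ArchiveBox API.

```python
import http.cookiejar
import time


def write_cookies_txt(path: str, domain: str, cookies: dict, ttl_seconds: int = 86400) -> None:
    """Write name/value pairs for one domain to a Netscape-format cookies.txt file."""
    jar = http.cookiejar.MozillaCookieJar(path)
    expires = int(time.time()) + ttl_seconds  # assumed lifetime, adjust as needed
    for name, value in cookies.items():
        jar.set_cookie(http.cookiejar.Cookie(
            version=0, name=name, value=value,
            port=None, port_specified=False,
            domain=domain, domain_specified=True,
            domain_initial_dot=domain.startswith("."),
            path="/", path_specified=True,
            secure=True, expires=expires, discard=False,
            comment=None, comment_url=None, rest={},
        ))
    # Keep session-style and expired cookies too, so nothing is silently dropped.
    jar.save(ignore_discard=True, ignore_expires=True)
```

The resulting file could then be passed to the low-level commands, e.g. `wget --load-cookies cookies.txt ...` or `curl --cookie cookies.txt ...`.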
v0.9: Performance
set up huey, break up archiving process into tasks on a queue that a worker pool executes
set up pyppeteer2 to wrap Chrome so that it’s not opened and closed for each extractor
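To illustrate the "tasks on a queue that a worker pool executes" idea, here is a minimal sketch that uses the stdlib concurrent.futures thread pool as a stand-in for huey (it does not use huey's API). The extract_title and extract_pdf functions are hypothetical placeholders for real extractors.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Hypothetical stand-ins for real extractors; each one would normally shell out
# to wget, Chrome, etc. and write its output into the snapshot folder.
def extract_title(url: str) -> str:
    return f"title:{url}"


def extract_pdf(url: str) -> str:
    return f"pdf:{url}"


EXTRACTORS = {"title": extract_title, "pdf": extract_pdf}


def archive(urls, workers: int = 4) -> dict:
    """Run every (url, extractor) pair as an independent task on a worker pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each submitted future back to the (url, extractor name) it belongs to.
        futures = {
            pool.submit(func, url): (url, name)
            for url in urls
            for name, func in EXTRACTORS.items()
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Splitting the work per (url, extractor) pair rather than per URL means one slow extractor does not block the others for the same page. A real queue such as huey would add persistence and retries on top of this.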
v1.0: Full headless browser control
run user-scripts / extensions in the context of the page during archiving
community userscripts for unrolling twitter threads, reddit threads, youtube comment sections, etc.
pywb-based headless browser session recording and warc replay
archive proxy support
support sending upstream requests through an external proxy
support for exposing a proxy that archives all downstream traffic
…
v2.0: Federated or distributed archiving + paid hosted service offering
ZFS / Merkle tree for storing archive output subresource hashes
DHT for assigning Merkle tree hash:file shards to nodes
tag system for tagging certain hashes with human-readable names, e.g. title, url, tags, filetype etc.
distributed tag lookup system
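As a rough illustration of the Merkle tree idea for subresource hashes, the sketch below folds a list of per-file SHA-256 hashes into a single root hash. The choice of SHA-256 and the convention of pairing an odd last node with itself are assumptions made here for simplicity; the roadmap does not specify either.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves) -> bytes:
    """Fold a list of leaf hashes pairwise until a single root hash remains."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself
            level.append(level[-1])
        # Each parent is the hash of its two children concatenated.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Each leaf would be the hash of one archived output file, so changing any single file changes the root, and nodes could verify a shard against the root without holding the whole archive.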
Major long-term changes
✅ release pip, apt, pkg, and brew packaged distributions for installing ArchiveBox
✅ add an optional web GUI for managing sources, adding new links, and viewing the archive
✅ switch to django + sqlite db with migrations system & json/html export for managing archive schema changes and persistence
modularize internals to allow importing individual components
switch to sha256 of URL as unique link ID
support storing multiple snapshots of pages over time
support custom user puppeteer scripts to run while archiving (e.g. for expanding reddit threads, scrolling thread on twitter, etc)
support named collections of archived content with different user access permissions
support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system
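The "sha256 of URL as unique link ID" item could look something like this minimal sketch. The link_id name is hypothetical, and the sketch hashes the URL exactly as given, so any normalization (trailing slashes, case, query ordering) would have to be decided separately.

```python
import hashlib


def link_id(url: str) -> str:
    """Return a stable hex ID derived from the URL's UTF-8 bytes."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

Because the ID depends only on the URL, it identifies a link rather than a particular capture; storing multiple snapshots of the same page over time would need an extra component such as a capture time or version number.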
Smaller planned features
support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
✅ body text extraction to markdown (using ~~fathom~~ readability and mercury)
featured image / thumbnail extraction
auto-tagging links based on important/frequent keywords in extracted text (like pocket)
automatic article summary paragraphs from extracted text with nlp summarization library
✅ full-text search of extracted text with ~~elasticsearch/elasticlunr/ag~~ sonic and ripgrep
✅ download closed-caption subtitles from YouTube and other video sites (TODO: submit the subtitle files to the full-text search index)
try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
And more in the issues list…
IMPORTANT: Please don’t work on any of these major long-term tasks without contacting me first. Work is already in progress for many of these, and I may have to reject your PR if it doesn’t align with the existing work!
Past Releases
To see how this spec has been scheduled / implemented / released so far, read these pull requests:
UI / UX Improvements Planned
New Extractors Planned
gallery-dl: https://github.com/ArchiveBox/ArchiveBox/issues/564
forum-dl: https://github.com/ArchiveBox/ArchiveBox/issues/1368
scihub-dl: https://github.com/ArchiveBox/ArchiveBox/issues/720
podcast-archiver: https://github.com/ArchiveBox/ArchiveBox/issues/1357
cutycapt screenshots: https://github.com/ArchiveBox/ArchiveBox/issues/253
sourcemap downloader: https://github.com/ArchiveBox/ArchiveBox/issues/1291
ArchiveBox Developer Documentation: Contributing a New Extractor
And others we’re considering for the future:
Video/Streams
https://github.com/iawia002/lux (generic video/audio downloader)
https://github.com/wukko/cobalt (generic video/audio downloader)
https://github.com/jaysonlong/webvideo-downloader (Bilibili, iQIYI, Tencent Video, MGTV and WeTV)
https://github.com/spaam/svtplay-dl (comedy central, twitch, HBO, etc. video downloader)
https://github.com/aajanki/yle-dl (Yle Areena Finnish broadcasting video downloader)
https://github.com/WHTJEON/widevine-dl (encrypted widevine video downloader)
Audio/Music
https://github.com/nathom/streamrip (Qobuz, Tidal, Deezer and SoundCloud)
https://github.com/SathyaBhat/spotify-dl / https://github.com/SwapnilSoni1999/spotify-dl / https://github.com/dhruv-ahuja/spoti-dl
https://github.com/vitiko98/qobuz-dl (Qobuz music downloader)
https://github.com/yaronzz/Tidal-Media-Downloader-PRO (stale)
https://github.com/ravishi/rdio-dl (stale, Rdio song downloader)
https://github.com/carlosflorencio/laracasts-downloader (stale?)
Photos/Images/Comics
https://github.com/Bionus/imgbrd-grabber (generic image board downloader like gallery-dl)
https://github.com/Xonshiz/comic-dl (comic, anime, manga, etc. downloader)
https://github.com/justfoolingaround/animdl (anime downloader)
https://github.com/metafates/mangal (manga downloader)
https://github.com/boredazfcuk/docker-icloudpd (iCloud Photos downloader)
https://github.com/Oshan96/monkey-dl (stale? anime downloader)
https://github.com/Xonshiz/anime-dl (stale?)
Text/Forums
https://github.com/AAndyProgram/SCrawler (Twitter, Reddit, Instagram, Threads, Facebook, Pinterest, nsfw sites downloader)
https://github.com/aliparlakci/bulk-downloader-for-reddit (stale?)
MOOC/Educational Content
https://github.com/yann0917/dedao-dl (stale, MOOC course downloader)
https://github.com/SigureMo/mooc-dl (stale?)
https://github.com/r0oth3x49/lynda-dl (stale, Lynda.com course downloader)
Re-Archiving / WARC Creation
https://github.com/AhmadIbrahiim/Website-downloader (wget wrapper)
https://github.com/igrigorik/gharchive.org (stale? Github downloader)
Other
https://github.com/matlink/gplaycli (Google Play store Android app downloader)
https://github.com/AlphaSlayer1964/kemono-dl (Patreon, gumroad, etc. archiver)
https://github.com/cancerian0684/dli-downloader (Digital Library of India ebook downloader)
https://github.com/tusharbabbar/gaana-dl (gaana.com bollywood song downloader)
https://github.com/rebane2001/matterport-dl (stale? virtual house tour downloader)
Social Media
Instagram
https://github.com/instaloader/instaloader (instagram downloader)
https://github.com/althonos/InstaLooter (stale)
Telegram
https://github.com/iyear/tdl (telegram downloader)
TikTok
https://github.com/charmparticle/tiktokget (tiktok downloader using yt-dlp)
https://github.com/TerminalWarlord/TikTok-Downloader-Bot
https://github.com/n0l3r/tiktok-downloader
https://github.com/hansputera/tiktok-dl
https://github.com/naseif/tiktok-scraper
https://github.com/irevenko/tiktik
https://github.com/samirelanduk/tiktok-save
https://github.com/Dinoosauro/tiktok-to-ytdlp
https://github.com/krypton-byte/tiktok-downloader
Twitter
https://github.com/HoloArchivists/twspace-dl (stale, twitter spaces archiver)