Roadmap
▶️ Comment here to discuss the contribution roadmap: Official Roadmap Discussion.
Planned Specification
(this is not set in stone, just a rough estimate)
v0.5: Remove live-updated JSON & HTML index in favor of `archivebox export`
- use SQLite as the main DB and export static file indexes once at the end of the whole process instead of live-updating them during each extractor run (i.e. remove `patch_main_index`)
- create an `archivebox export` command (sketched below)
- create a public view to replace the `index.html`/`old.html` pages used for non-logged-in users
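A rough sketch of what a one-shot `archivebox export` could look like, reading snapshots out of the SQLite index and writing the static `index.json`/`index.html` files once at the end (the table and column names below are illustrative, not the real schema):

```python
import json
import sqlite3
from pathlib import Path

def export_static_index(db_path: Path, out_dir: Path) -> None:
    """Write index.json and index.html once, from the SQLite source of truth."""
    db = sqlite3.connect(str(db_path))
    try:
        # hypothetical table/columns, stand-ins for the real Snapshot schema
        rows = db.execute('SELECT url, title, timestamp FROM snapshots').fetchall()
    finally:
        db.close()

    snapshots = [{'url': u, 'title': t, 'timestamp': ts} for u, t, ts in rows]
    (out_dir / 'index.json').write_text(json.dumps(snapshots, indent=4))

    items = '\n'.join(
        f'<li><a href="archive/{s["timestamp"]}/">{s["title"] or s["url"]}</a></li>'
        for s in snapshots
    )
    (out_dir / 'index.html').write_text(f'<ul>\n{items}\n</ul>')
```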
v0.6: Code cleanup / refactor
- move config loading logic into `settings.py`
- move all the extractors into “plugin” style folders that register their own config (see the sketch after this list)
- right now, the paths of the extractor output are scattered all over the codebase, e.g. `output.pdf` (these should be moved to constants at the top of each plugin's config file)
- make `out_dir`, `link_dir`, and `extractor_dir` naming consistent across the codebase
- convert all `os.path` calls and raw string paths to `pathlib`
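As a sketch of the “plugin” style layout, each extractor could own a small config module that declares its output paths as `pathlib` constants (the module path and names below are hypothetical):

```python
# hypothetical extractors/pdf/config.py: the plugin declares its own output
# path constant instead of scattering 'output.pdf' strings around the codebase
from pathlib import Path

OUTPUT_FILENAME = 'output.pdf'

def get_output_path(snapshot_dir: Path) -> Path:
    # pathlib join instead of os.path.join(snapshot_dir, 'output.pdf')
    return Path(snapshot_dir) / OUTPUT_FILENAME
```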
v0.7: Schema improvements
- remove `timestamps` as primary keys in favor of hashes, UUIDs, or some other slug
- create a migration system for folder layout independent of the index (`mv` is atomic at the FS level, so we just need a `transaction.atomic()` block that moves the folder, sets `snap.data_dir = newpath`, and calls `snap.save()`; see the sketch below)
- make `Tag` a real model, `ManyToMany` with Snapshots
- allow multiple Snapshots of the same site over time, + CLI / UI to manage those, + migration from the old-style `#2020-01-01` hack to proper versioned snapshots
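The folder-move migration pseudocode above expands to roughly this, assuming a Django `Snapshot` model with a `data_dir` field (illustrative names, not the final schema):

```python
import shutil
from pathlib import Path

from django.db import transaction

def move_snapshot_dir(snap, new_path: Path) -> None:
    """Atomically relocate a snapshot folder and update the index to match."""
    old_path = Path(snap.data_dir)
    try:
        with transaction.atomic():
            # mv is atomic at the FS level (on the same filesystem)
            shutil.move(str(old_path), str(new_path))
            snap.data_dir = str(new_path)
            snap.save(update_fields=['data_dir'])
    except Exception:
        # the DB transaction rolled back, so undo the filesystem move too
        if new_path.exists() and not old_path.exists():
            shutil.move(str(new_path), str(old_path))
        raise
```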
v0.8: Security
- Add CSRF/CSP/XSS protection to rendered archive pages
- Provide a secure reverse proxy in front of the archivebox server in docker-compose.yml
- Create a UX flow for users to set up session cookies / auth for archiving private sites
    - cookies for wget, curl, and other low-level commands (see the sketch after this list)
    - localStorage, cookies, and IndexedDB setup for Chrome archiving methods
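For the low-level commands, passing cookies is mostly a matter of threading a Netscape-format cookie jar through to the right flags; a minimal sketch (the cookies.txt path is an assumption):

```python
import subprocess

COOKIES_FILE = 'cookies.txt'  # Netscape-format cookie jar exported by the user

def wget_with_cookies(url: str, out_dir: str) -> None:
    # wget reads Netscape-format cookie files via --load-cookies
    subprocess.run(
        ['wget', '--load-cookies', COOKIES_FILE, '--directory-prefix', out_dir, url],
        check=True,
    )

def curl_with_cookies(url: str, out_file: str) -> None:
    # curl's --cookie flag accepts a cookie file path
    subprocess.run(
        ['curl', '--cookie', COOKIES_FILE, '--output', out_file, url],
        check=True,
    )
```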
v0.9: Performance
- set up huey, and break the archiving process into tasks on a queue that a worker pool executes (see the sketch after this list)
- set up pyppeteer2 to wrap Chrome so that it's not opened and closed during each extractor run
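A minimal sketch of what the queue could look like with huey (the task body and names are placeholders, not the planned implementation):

```python
from huey import SqliteHuey

huey = SqliteHuey(filename='/data/queue.sqlite3')

@huey.task(retries=2, retry_delay=10)
def run_extractor(extractor_name: str, snapshot_id: str) -> None:
    # placeholder body; each extractor would run here as an independent,
    # retryable task instead of serially inside one big archiving loop
    print(f'running {extractor_name} for snapshot {snapshot_id}')

# enqueue one task per extractor, then let a worker pool drain the queue:
#   huey_consumer.py tasks.huey -w 4
# for name in ('wget', 'pdf', 'screenshot'):
#     run_extractor(name, snapshot_id)
```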
v1.0: Full headless browser control
- run user-scripts / extensions in the context of the page during archiving (see the sketch after this list)
- community userscripts for unrolling Twitter threads, Reddit threads, YouTube comment sections, etc.
- pywb-based headless browser session recording and WARC replay
- archive proxy support
    - support sending upstream requests through an external proxy
    - support for exposing a proxy that archives all downstream traffic
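A rough sketch of running a userscript in page context during archiving, shown here with pyppeteer's API (the thread-unrolling script is a placeholder for a real community userscript):

```python
import asyncio
from pyppeteer import launch

# placeholder userscript: click everything that looks collapsed before capture
UNROLL_SCRIPT = '''
document.querySelectorAll('[aria-expanded="false"]').forEach(el => el.click())
'''

async def archive_with_userscript(url: str, out_path: str) -> None:
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url, {'waitUntil': 'networkidle2'})
        await page.evaluate(UNROLL_SCRIPT)  # runs in the page's JS context
        await page.screenshot({'path': out_path, 'fullPage': True})
    finally:
        await browser.close()

# asyncio.get_event_loop().run_until_complete(
#     archive_with_userscript('https://example.com', 'screenshot.png'))
```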
…
v2.0: Federated or distributed archiving + paid hosted service offering
- merkle tree for storing archive output subresource hashes (see the sketch after this list)
- DHT for assigning merkle tree hash:file shards to nodes
- tag system for tagging certain hashes with human-readable names, e.g. title, url, tags, filetype, etc.
- distributed tag lookup system
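The merkle tree idea reduces each snapshot to a single verifiable root hash over all of its subresources; a minimal sketch:

```python
import hashlib
from pathlib import Path

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(snapshot_dir: Path) -> str:
    """Fold the hashes of every archived subresource into one root hash."""
    # leaf hashes, sorted so the root is stable across directory walk order
    hashes = sorted(
        sha256(p.read_bytes()) for p in snapshot_dir.rglob('*') if p.is_file()
    )
    if not hashes:
        return sha256(b'')
    while len(hashes) > 1:
        if len(hashes) % 2:
            hashes.append(hashes[-1])  # duplicate the last hash on odd levels
        hashes = [
            sha256((hashes[i] + hashes[i + 1]).encode())
            for i in range(0, len(hashes), 2)
        ]
    return hashes[0]
```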
Major long-term changes
- release `pip`, `apt`, `pkg`, and `brew` packaged distributions for installing ArchiveBox
- add an optional web GUI for managing sources, adding new links, and viewing the archive
- switch to Django + SQLite DB with a migrations system & JSON/HTML export for managing archive schema changes and persistence
- modularize internals to allow importing individual components
- switch to sha256 of URL as the unique link ID (sketched after this list)
- support storing multiple snapshots of pages over time
- support custom user puppeteer scripts to run while archiving (e.g. for expanding Reddit threads, scrolling through Twitter threads, etc.)
- support named collections of archived content with different user access permissions
- support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system
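The sha256-of-URL link ID is simple enough to sketch directly; unlike a timestamp, it is stable and derivable from the URL alone:

```python
import hashlib

def link_id(url: str) -> str:
    # deterministic: the same URL always maps to the same ID
    return hashlib.sha256(url.encode('utf-8')).hexdigest()

print(link_id('https://example.com'))  # 64-char hex digest
```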
Smaller planned features
- support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
- body text extraction to markdown (using fathom?)
- featured image / thumbnail extraction
- auto-tagging links based on important/frequent keywords in extracted text (like Pocket)
- automatic article summary paragraphs from extracted text with an NLP summarization library
- full-text search of extracted text with elasticsearch/elasticlunr/ag (see the sketch after this list)
- download closed-caption subtitles from YouTube and other video sites for full-text indexing of video content
- try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
- And more in the issues list…
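As a stand-in for the Elasticsearch/elasticlunr option above, here is a minimal full-text index over extracted page text using SQLite's built-in FTS5 (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect('search.sqlite3')
db.execute('CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)')
db.execute(
    'INSERT INTO pages VALUES (?, ?)',
    ('https://example.com', 'example body text extracted from the page'),
)
db.commit()

# FTS5 MATCH does tokenized full-text search over the indexed columns
for (url,) in db.execute("SELECT url FROM pages WHERE pages MATCH 'example'"):
    print(url)
```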
IMPORTANT: Please don't work on any of these major long-term tasks without contacting me first. Work is already in progress for many of them, and I may have to reject your PR if it doesn't align with the existing work!