Web Archiving Community¶
🔢 Just getting started and want to learn more about why Web Archiving is important?
Check out this article: On the Importance of Web Archiving.
The internet archiving community is surprisingly far-reaching and almost universally friendly!
Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open source tool for your web archiving need, or just want to see where archivists hang out online, this is my attempt at an index of the entire web archiving community.
- The Master Lists: Community-maintained indexes of web archiving tools and groups by IIPC, COPTR, ArchiveTeam, Wikipedia, & the SAA.
- Web Archiving Software: Open source tools and projects in the internet archiving space.
- Reading List: Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.
- Communities: A collection of the most active internet archiving communities and initiatives.
The Master Lists¶
Indexes of archiving institutions and software maintained by other people. If there’s anything archivists love doing, it’s making lists.
- COPTR Wiki of Web Archiving Tools (COPTR)
- Awesome Web Archiving Tools (IIPC)
- Awesome Web Crawling Tools
- Awesome Web Scraping Tools
- ArchiveTeam’s List of Software (ArchiveTeam.org)
- List of Web Archiving Initiatives (Wikipedia.org)
- Directory of Archiving Organizations (Society of American Archivists)
Web Archiving Projects¶
Bookmarking Services¶
- Pocket Premium Bookmarking tool that provides an archiving service in their paid version, run by Mozilla
- Pinboard Bookmarking tool that provides archiving in a paid version, run by a single independent developer
- Instapaper Bookmarking alternative to Pocket/Pinboard (with no archiving)
- Wallabag / Wallabag.it Self-hostable web archiving server that can import via RSS
- Shaarli Self-hostable bookmark tagging, archiving, and sharing service
From the Archive.org & Archive-It teams¶
- Archive.org The O.G. wayback machine provided publicly by the Internet Archive (Archive.org)
- Archive-It Commercial Wayback Machine solution
- Heritrix The king of internet archiving crawlers, powers the Wayback Machine
- Brozzler Headless Chrome crawler + WARC archiver maintained by Archive.org
- WarcProx WARC proxy recording and playback utility
- WarcTools utilities for dealing with WARCs
- Grab-Site An easy preconfigured web crawler designed for backing up websites
- WPull A pure python implementation of wget with WARC saving
- More on their Github…
From the Rhizome.org/WebRecorder.io/Conifer team¶
- Conifer by Rhizome.org An open-source personal archiving server, previously known as Webrecorder.io, that uses pywb under the hood
- Webrecorder.net Suite of open source projects and tools, led by Ilya Kreymer, to capture interactive websites and replay them at a later time as accurately as possible
- pywb The Python Wayback Machine, the codebase forked off Archive.org that powers Webrecorder
- warcit Create a WARC file out of a folder full of assets
- WebArchivePlayer A tool for replaying web archives
- warcio fast streaming asynchronous WARC reader and writer
- node-warc Parse And Create Web ARChive (WARC) files with node.js
- WAIL Web archiver GUI using Heritrix and OpenWayback
- squidwarc User-scriptable, archival crawler using Chrome
- More on their Github…
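Several of the tools above (warcit, warcio, node-warc) read or write WARC files. As a rough illustration of what a single WARC record looks like on disk, here is a minimal sketch that serializes one uncompressed WARC/1.0 `resource` record using only the Python standard library. The function name `build_warc_resource_record` is hypothetical, and this is not how warcit or warcio are implemented; real tools typically also handle per-record gzip compression, payload digests, and other record types.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one uncompressed WARC/1.0 'resource' record (illustrative helper)."""
    headers = [
        ("WARC-Type", "resource"),
        # Record IDs are URIs wrapped in angle brackets; a random UUID URN is a common choice.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        # WARC-Date is a UTC timestamp in ISO 8601 form.
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, header fields, then a blank line; every line ends in CRLF.
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers) + "\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_resource_record("http://example.com/hello.txt", b"hello", "text/plain")
    print(record.decode("utf-8"))
```

Concatenating several such records into one file gives a readable, uncompressed WARC; for anything beyond experimentation, one of the libraries listed above is the safer choice.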
From the Old Dominion University: Web Science Team¶
- ipwb A distributed web archiving solution using pywb with ipfs for storage
- archivenow Tool that pushes URLs into online archive services like Archive.is and Archive.org
- WAIL Electron app version of the original wail for creating and interacting with web archives
- warcreate a Chrome extension for creating WARCs from any webpage
- More on their Github…
From the Archives Unleashed Team¶
- AUT Archives Unleashed Toolkit for analyzing web archives (formerly WarcBase)
- Warclight A Rails engine for finding and searching web archives
- More on their Github…
From the IIPC team¶
- OpenWayback Open source project developing core Wayback-Machine components
- awesome-web-archiving Large list of archiving projects and orgs
- JWARC A Java library for reading and writing WARC files.
- More on their Github…
Other Public Archiving Services¶
- https://archive.is / https://archive.today
- https://archive.st
- http://theoldnet.com
- https://timetravel.mementoweb.org/
- https://freezepage.com/
- https://webcitation.org/archive
- https://archiveofourown.org/
- https://megalodon.jp/
- https://github.com/HelloZeroNet/ZeroNet (super cool project)
- Google, Bing, DuckDuckGo, and other search engine caches
Other ArchiveBox Alternatives¶
- Memex by Worldbrain.io a beautiful, user-friendly browser extension that archives all history with full-text search, annotation support, and more
- Hypothes.is a web/pdf/ebook annotation tool that also archives content
- Reminiscence extremely similar to ArchiveBox, uses a Django backend + UI and provides auto-tagging and summary features with NLTK
- Shaarchiver very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
- Polarized a desktop application for bookmarking, annotating, and archiving articles offline
- Photon a fast crawler with archiving and asset extraction support
- Archivy Python-based self-hosted knowledge base embedded into your filesystem
- ReadableWebProxy A proxying archiver that downloads content from sites and can snapshot multiple versions of sites over time
- Perkeep “Perkeep lets you permanently keep your stuff, for life.”
- Fetching.io A personal search engine/archiver that lets you search through all archived websites that you’ve bookmarked
- Fossilo A commercial archiving solution that appears to be very similar to ArchiveBox
- Archivematica web GUI for institutional long-term archiving of web and other content
- Headless Chrome Crawler distributed web crawler built on puppeteer with screenshots
- WWWOFFLE Old proxying recorder software similar to ArchiveBox
- Erised Super simple CLI utility to bookmark and archive webpages
- Zotero collect, organize, cite, and share research (mainly for technical/scientific papers & citations)
- TiddlyWiki Non-linear bookmark and note-taking tool with archiving support
Smaller Utilities¶
Random helpful utilities for web archiving, WARC creation and replay, and more…
- https://github.com/gildas-lormeau/SingleFile/ Web Extension for Firefox and Chrome to save a web page as a single HTML file
- https://github.com/vrtdev/save-page-state A Chrome extension for saving the state of a page in multiple formats
- https://github.com/jsvine/waybackpack command-line tool that lets you download the entire Wayback Machine archive for a given URL
- https://github.com/hartator/wayback-machine-downloader Download an entire website from the Internet Archive Wayback Machine.
- https://github.com/Lifesgood123/prevent-link-rot Replace any broken URLs in some content with Wayback Machine URL equivalents
- https://en.archivarix.com download an archived page or entire site from the Wayback Machine
- https://proofofexistence.com prove that a certain file existed at a given time using the blockchain
- https://github.com/chfoo/warcat for merging, extracting, and verifying WARC files
- https://github.com/mozilla/readability tool for extracting article contents and text
- https://github.com/mholt/timeliner All your digital life on a single timeline, stored locally
- https://github.com/wkhtmltopdf/wkhtmltopdf Webkit HTML to PDF archiver/saver
- Sheetsee-Pocket project that provides a pretty auto-updating index of your Pocket links (without archiving them)
- Pocket -> IFTTT -> Dropbox Post by Christopher Su on his Pocket saving IFTTT recipe
- http://squidman.net/squidman/index.html
- https://wordpress.org/plugins/broken-link-checker/
- https://github.com/ArchiveTeam/wpull
- http://freedup.org/
- https://en.wikipedia.org/wiki/Furl
- And many more on the other lists…
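Some of the utilities above replace dead links with Wayback Machine equivalents. The idea can be sketched in a few lines of Python, assuming you already know which URLs are broken (link checking is out of scope here). The helper names `wayback_url` and `replace_broken_links` are hypothetical and this is not the implementation used by prevent-link-rot or any other listed tool; it only relies on the public `https://web.archive.org/web/<timestamp>/<url>` replay URL form, where the timestamp may be omitted or partial.

```python
import re
from typing import Iterable

WAYBACK_PREFIX = "https://web.archive.org/web/"

# Deliberately simple URL matcher: stops at whitespace, angle brackets,
# parentheses, and square brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def wayback_url(url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine replay URL, optionally pinned near a timestamp."""
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{url}"
    return f"{WAYBACK_PREFIX}{url}"


def replace_broken_links(text: str, broken: Iterable[str], timestamp: str = "") -> str:
    """Rewrite every URL in `text` that appears in `broken` to its Wayback form."""
    broken_set = set(broken)

    def swap(match: "re.Match[str]") -> str:
        raw = match.group(0)
        # Sentence punctuation right after a URL is not part of the URL itself.
        url = raw.rstrip(".,;:!?")
        trailing = raw[len(url):]
        if url in broken_set:
            return wayback_url(url, timestamp) + trailing
        return raw

    return URL_PATTERN.sub(swap, text)


if __name__ == "__main__":
    sample = "See http://dead.example/page, and https://alive.example/."
    print(replace_broken_links(sample, {"http://dead.example/page"}, "2019"))
```

Whether a capture actually exists for a rewritten URL is not checked here; a fuller tool would confirm that before swapping the link.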
Reading List¶
A collection of blog posts and articles about internet archiving. Contact me or open an issue if you want to add a link here!
Blogs¶
- https://blog.archive.org
- https://netpreserveblog.wordpress.com
- https://blog.webrecorder.io/
- https://ws-dl.blogspot.com
- https://siarchives.si.edu/blog
- https://parameters.ssrc.org
- https://sr.ithaka.org/publications
- https://ait.blog.archive.org
- https://brewster.kahle.org
- https://ianmilligan.ca
- https://medium.com/@giovannidamiola
Articles¶
- https://parameters.ssrc.org/2018/09/on-the-importance-of-web-archiving/
- https://theconversation.com/your-internet-data-is-rotting-115891
- https://www.bbc.com/future/story/20190401-why-theres-so-little-left-of-the-early-internet
- https://sr.ithaka.org/publications/the-state-of-digital-preservation-in-2018/
- https://gizmodo.com/delete-never-the-digital-hoarders-who-collect-tumblrs-1832900423
- https://siarchives.si.edu/blog/we-are-not-alone-progress-digital-preservation-community
- https://www.gwern.net/Archiving-URLs
- http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-call-for-a-distributed-web-2/
- https://lwn.net/Articles/766374/
- https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
- https://medium.com/@giovannidamiola/making-the-internet-archives-full-text-search-faster-30fb11574ea9
- https://xkcd.com/1909/
- https://samsaffron.com/archive/2012/06/07/testing-3-million-hyperlinks-lessons-learned#comment-31366
- https://www.gwern.net/docs/linkrot/2011-muflax-backup.pdf
- https://thoughtstreams.io/higgins/permalinking-vs-transience/
- http://ait.blog.archive.org/files/2014/04/archiveit_life_cycle_model.pdf
- https://blog.archive.org/2016/05/26/web-archiving-with-national-libraries/
- https://blog.archive.org/2014/10/28/building-libraries-together/
- https://ianmilligan.ca/2018/03/27/ethics-and-the-archived-web-presentation-the-ethics-of-studying-geocities/
- https://ianmilligan.ca/2018/05/22/new-article-if-these-crawls-could-talk-studying-and-documenting-web-archives-provenance/
- https://ws-dl.blogspot.com/2019/02/2019-02-08-google-is-being-shuttered.html
If any of these links are dead, you can find an archived version on https://archive.sweeting.me.
ArchiveBox-Specific Posts, Tutorials, and Guides¶
- “How to install ArchiveBox to preserve websites you care about” https://blog.sleeplessbeastie.eu/2019/06/19/how-to-install-archivebox-to-preserve-websites-you-care-about/
- “How to remotely archive websites using ArchiveBox” https://blog.sleeplessbeastie.eu/2019/06/26/how-to-remotely-archive-websites-using-archivebox/
- “How to use CutyCapt inside ArchiveBox” https://blog.sleeplessbeastie.eu/2019/07/10/how-to-use-cutycapt-inside-archivebox/
- “Automate ArchiveBox with Google Spreadsheet to Backup your internet” https://manfred.life/archivebox
- “【デモ有♪】ConoHaのArchiveBoxアプリケーションを使ってみたよ” https://qiita.com/CloudRemix/items/691caf91efa3ef19a7ad
- “WEB-ARCHIV TEIL 8: WALLABAG UND ARCHIVEBOX” http://webermartin.net/blog/web-archiv-teil-8-wallabag-und-archivebox/
- https://metaxyntax.neocities.org/entries/7.html
ArchiveBox Discussions in News & Social Media¶
- Aggregators: ProductHunt, AlternativeTo, SteemHunt, Recurse Center: The Joy of Computing, Github Changelog, Dev.To Ultra List, O’Reilly 4 Short Links, JaxEnter
- Blog Posts & Podcasts: Korben.info, Defining Desktop Linux Podcast #296 (0:55:00), Binärgewitter Podcast #221, Schrankmonster.de, La Ferme Du Web
- Hacker News: #1, #2, #3, #4
- Reddit r/DataHoarder: #1, #2, #3, #4, #5, #6
- Reddit r/SelfHosted: #1, #2
- Twitter: Python Trending, PyCoder’s Weekly, Python Hub, Smashing Magazine
- More on: Twitter, Reddit, HN, Google…
Communities¶
Most Active Communities¶
- The Internet Archive (Archive.org) (USA)
- International Internet Preservation Consortium (IIPC) (International)
- The Archive Team, URL Team, r/ArchiveTeam (International)
- Rhizome.org The digital preservation group that works on Webrecorder.io (USA)
- r/DataHoarder, r/Archivists, r/DHExchange (International)
- The Eye Non-profit working on content archival and long-term preservation (Europe)
- Digital Preservation Coalition & their Software Tool Registry (COPTR) (UK)
- Archives Unleashed Project and UAP Github (Canada)
- Old Dominion University: Web Science and Digital Libraries (WS-DL @ ODU) (Virginia, USA)
Web Archiving Communities¶
- Canadian Web Archiving Coalition (Canada)
- Web Archives for Historical Research Group (Canada)
- Smithsonian Institution Archives: Digital Curation (Washington D.C., USA)
- National Digital Stewardship Alliance (NDSA) (USA)
- Digital Library Federation (DLF) (USA)
- Council on Library and Information Resources (CLIR) (USA)
- Digital Curation Centre (DCC) (UK)
- Archivematica & their Community Wiki (International)
- Professional Development Institutes for Digital Preservation (POWRR) (USA)
- Institute of Museum and Library Services (IMLS) (USA)
- Stanford Libraries Web Archiving (USA)
- Society of American Archivists: Electronic Records (SAA) (USA)
- BitCurator Consortium (BCC) (USA)
- Ethics & Archiving the Web Conference (Rhizome/Webrecorder.io) (USA)
- Archivists Round Table of NYC (USA)
General Archiving Foundations, Coalitions, Initiatives, and Institutes¶
- Community Archives and Heritage Group (UK & Ireland)
- Open Preservation Foundation (OPF) (UK & Europe)
- Software Preservation Network (International)
- ITHAKA, Portico, JSTOR, ARTSTOR, S+R (USA)
- Archives and Records Association (UK & Ireland)
- Arkivrådet AAS (Sweden)
- Asociación Española de Archiveros, Bibliotecarios, Museologos y Documentalistas (ANABAD) (Spain)
- Associação dos Arquivistas Brasileiros (AAB) (Brazil)
- Associação Portuguesa de Bibliotecários, Arquivistas e Documentalistas (BAD) (Portugal)
- Association des archivistes français (AAF) (France)
- Associazione Nazionale Archivistica Italiana (ANAI) (Italy)
- Australian Society of Archivists Inc. (Australia)
- International Council on Archives (ICA)
- International Records Management Trust (IRMT)
- Irish Society for Archives (Ireland)
- Koninklijke Vereniging van Archivarissen in Nederland (Netherlands)
- State Archives Administration of the People’s Republic of China (China)
- Academy of Certified Archivists
- Archivists and Librarians in the History of the Health Sciences
- Archivists for Congregations of Women Religious
- Archivists of Religious Institutions
- Association of Catholic Diocesan Archivists
- Association of Moving Image Archivists
- Council of State Archivists
- National Association of Government Archives and Records Administrators
- National Episcopal Historians and Archivists
- Archival Education and Research Institute
- Archives Leadership Institute
- Georgia Archives Institute
- Modern Archives Institute
- Western Archives Institute
- Association des archivistes du Québec
- Association of Canadian Archivists
- Canadian Council of Archives/Conseil canadien des archives
- Archives Association of British Columbia
- Archives Association of Ontario
- Archives Council of Prince Edward Island
- Archives Society of Alberta
- Association for Manitoba Archives
- Association of Newfoundland and Labrador Archives
- Council of Nova Scotia Archives
- Réseau des services d’archives du Québec
- Saskatchewan Council for Archives and Archivists
You can find more organizations and initiatives on these other lists:
- Wikipedia.org List of Web Archiving Initiatives
- SAA List of USA & Canada Based Archiving Organizations
- SAA List of International Archiving Organizations
- Digital Preservation Coalition’s Member List