archivebox.extractors package

Submodules

archivebox.extractors.archive_org module

archivebox.extractors.archive_org.should_save_archive_dot_org(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.archive_org.save_archive_dot_org(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Submit the site to archive.org for archiving via their service and save the returned archive URL.

archivebox.extractors.archive_org.parse_archive_dot_org_response(response: bytes) → Tuple[List[str], List[str]]
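
The signature above takes raw response bytes and returns two lists of strings. The following is a minimal sketch of one way such a parser could work. It assumes the bytes are HTTP response headers and that the two lists hold the `Content-Location` values and any `X-Archive-Wayback-Runtime-Error` values. Both header names and the function name `parse_headers_sketch` are assumptions made for illustration and are not confirmed by this page.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_headers_sketch(response: bytes) -> Tuple[List[str], List[str]]:
    """Hypothetical stand-in: split raw header bytes into (locations, errors)."""
    headers: Dict[str, List[str]] = defaultdict(list)
    for line in response.decode(errors="replace").splitlines():
        if ":" not in line:
            continue  # skip the status line and blank lines
        name, value = line.split(":", 1)
        headers[name.strip().lower()].append(value.strip())
    # Header names below are assumptions, chosen only for illustration.
    return headers["content-location"], headers["x-archive-wayback-runtime-error"]
```

Returning both lists lets a caller decide whether an empty location list combined with a non-empty error list should count as a failed submission.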

archivebox.extractors.dom module

archivebox.extractors.dom.should_save_dom(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.dom.save_dom(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Print the HTML of the site to a file using chrome --dump-dom.

archivebox.extractors.favicon module

archivebox.extractors.favicon.should_save_favicon(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.favicon.save_favicon(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Download the site’s favicon from Google’s favicon API.
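
As a rough illustration of the lookup described above, the sketch below builds a favicon request URL from a page URL. The endpoint `https://www.google.com/s2/favicons` and the helper name `favicon_url_sketch` are assumptions; this page only says that Google’s favicon API is used.

```python
from urllib.parse import urlencode, urlparse


def favicon_url_sketch(page_url: str) -> str:
    """Hypothetical helper: build a favicon lookup URL for the page's domain."""
    domain = urlparse(page_url).netloc
    # The endpoint is an assumption; it is not stated in this reference page.
    return "https://www.google.com/s2/favicons?" + urlencode({"domain": domain})
```

Only the domain is sent, so every page on the same host would resolve to the same icon under this sketch.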

archivebox.extractors.git module

archivebox.extractors.git.should_save_git(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.git.save_git(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Download the full site using git.

archivebox.extractors.media module

archivebox.extractors.media.should_save_media(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.media.save_media(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 3600) → archivebox.index.schema.ArchiveResult

Download playlists or individual videos, audio, and subtitles using youtube-dl.

archivebox.extractors.pdf module

archivebox.extractors.pdf.should_save_pdf(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.pdf.save_pdf(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Print a PDF of the site to a file using chrome --headless.

archivebox.extractors.screenshot module

archivebox.extractors.screenshot.should_save_screenshot(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.screenshot.save_screenshot(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Take a screenshot of the site using chrome --headless.

archivebox.extractors.title module

archivebox.extractors.title.should_save_title(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.title.save_title(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Try to guess the page’s title from its content.
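
One simple way to guess a title from page content is to look for the first `<title>` element. The sketch below shows that idea only; the regex approach and the name `guess_title_sketch` are assumptions, and the actual extractor may use a different strategy.

```python
import re
from html import unescape
from typing import Optional

# Match the first <title>...</title> pair, ignoring case and spanning newlines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def guess_title_sketch(html: str) -> Optional[str]:
    """Hypothetical helper: return the page title, or None if none is found."""
    match = TITLE_RE.search(html)
    if not match:
        return None
    # Decode HTML entities and collapse runs of whitespace into single spaces.
    title = " ".join(unescape(match.group(1)).split())
    return title or None
```

A regex is fragile against malformed markup, so a real implementation would likely want a fallback when no title is found.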

archivebox.extractors.wget module

archivebox.extractors.wget.should_save_wget(link: archivebox.index.schema.Link, out_dir: Optional[str] = None) → bool
archivebox.extractors.wget.save_wget(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, timeout: int = 60) → archivebox.index.schema.ArchiveResult

Download the full site using wget.

archivebox.extractors.wget.wget_output_path(link: archivebox.index.schema.Link) → Optional[str]

Calculate the path to the wgetted .html file, since wget may adjust some paths so that they differ from the base_url path.

See the wget docs on --adjust-extension (-E).
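
To make the path adjustment concrete, here is a simplified sketch of how a URL might map to the file wget writes when --adjust-extension is on. It assumes the response is HTML, ignores query strings and redirects, and uses a hypothetical helper name `wget_path_sketch`; it is not the logic of `wget_output_path` itself.

```python
from urllib.parse import urlparse


def wget_path_sketch(url: str) -> str:
    """Hypothetical helper: approximate the local path wget -E would produce."""
    parsed = urlparse(url)
    path = parsed.path or "/"
    if path.endswith("/"):
        # Directory-style URLs are saved as index.html inside that directory.
        path += "index.html"
    elif not path.lower().endswith((".html", ".htm")):
        # -E appends .html to HTML documents that lack an HTML extension.
        path += ".html"
    return parsed.netloc + path
```

Because the on-disk name can differ from the URL path in this way, the saved file has to be located by calculation rather than by reusing the URL path directly.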

Module contents

Download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp.
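
Every submodule above exposes a `should_save_*` / `save_*` pair, which suggests a simple dispatch loop. The sketch below shows that pattern under stated assumptions: `LinkSketch` is a hypothetical stand-in for `archivebox.index.schema.Link`, the extractor callables are placeholders that return strings rather than `ArchiveResult` objects, and `archive_link_sketch` is an invented name.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Sequence, Tuple


@dataclass
class LinkSketch:
    """Hypothetical stand-in for archivebox.index.schema.Link."""
    url: str
    timestamp: str


# Each extractor is a (should_save, save) pair taking the link and output dir.
ExtractorPair = Tuple[
    Callable[[LinkSketch, str], bool],
    Callable[[LinkSketch, str], str],
]


def archive_link_sketch(
    link: LinkSketch, extractors: Sequence[ExtractorPair], base_dir: str
) -> List[str]:
    """Run each extractor whose should_save check passes; collect the results."""
    # The output folder is named after the link's timestamp.
    out_dir = Path(base_dir) / link.timestamp
    out_dir.mkdir(parents=True, exist_ok=True)
    results: List[str] = []
    for should_save, save in extractors:
        if should_save(link, str(out_dir)):
            results.append(save(link, str(out_dir)))
    return results
```

Keeping the check separate from the save step lets a rerun skip outputs that already exist without invoking the slower external tools.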