archivebox.extractors

Package Contents

Classes

ExtractorModuleProtocol

Type interface for an Extractor Module (WIP)

Functions

get_default_archive_methods

get_archive_methods_for_link

ignore_methods

archive_link

download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp

archive_links

get_extractors

iterate through archivebox/extractors/*.py and load extractor modules

Data

ShouldSaveFunction

SaveFunction

ArchiveMethodEntry

ARCHIVE_METHODS_INDEXING_PRECEDENCE

EXTRACTORS_DIR

EXTRACTORS

API

archivebox.extractors.ShouldSaveFunction

None

archivebox.extractors.SaveFunction

None

archivebox.extractors.ArchiveMethodEntry

None

archivebox.extractors.get_default_archive_methods() -> List[archivebox.extractors.ArchiveMethodEntry]

archivebox.extractors.ARCHIVE_METHODS_INDEXING_PRECEDENCE

[('readability', 1), ('mercury', 2), ('htmltotext', 3), ('singlefile', 4), ('dom', 5), ('wget', 6)]
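
As a rough illustration of how a precedence list like this might be consumed, the sketch below orders a set of extractor names by their rank. The helper name `sort_by_precedence` and the choice to sort unlisted names last are assumptions made for this example, not part of archivebox.

```python
from typing import Iterable, List, Tuple

# Copied from the documented value above.
ARCHIVE_METHODS_INDEXING_PRECEDENCE: List[Tuple[str, int]] = [
    ('readability', 1), ('mercury', 2), ('htmltotext', 3),
    ('singlefile', 4), ('dom', 5), ('wget', 6),
]


def sort_by_precedence(names: Iterable[str]) -> List[str]:
    """Hypothetical helper: order extractor names by their precedence rank."""
    rank = dict(ARCHIVE_METHODS_INDEXING_PRECEDENCE)
    # Names missing from the list sort last (an assumption for this sketch).
    return sorted(names, key=lambda name: rank.get(name, float('inf')))


ordered = sort_by_precedence(['wget', 'dom', 'readability', 'title'])
print(ordered)
```

A lower number is treated here as higher priority, which matches the ascending order in which the documented list is written; whether archivebox interprets the ranks the same way is not stated in this page.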

archivebox.extractors.ignore_methods(to_ignore: List[str]) -> Iterable[str]

archivebox.extractors.archive_link

download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp

archivebox.extractors.EXTRACTORS_DIR

None

class archivebox.extractors.ExtractorModuleProtocol

Bases: typing.Protocol

Type interface for an Extractor Module (WIP)

get_output_path: Callable

None
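
To show what a `typing.Protocol` with a single `get_output_path` member implies in practice, here is a minimal sketch. `ExtractorModuleLike` is a stand-in defined only for this example (the real class is marked WIP and may differ), and the zero-argument signature of `fake_get_output_path` is an assumption, since the page documents the member only as `Callable`.

```python
from types import SimpleNamespace
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class ExtractorModuleLike(Protocol):
    """Stand-in protocol; get_output_path is the only member the docs list."""
    get_output_path: Callable


def fake_get_output_path() -> str:
    # Hypothetical implementation; the real signature is not documented here.
    return 'example-output.html'


# Any object exposing a get_output_path attribute satisfies the protocol.
candidate = SimpleNamespace(get_output_path=fake_get_output_path)
incomplete = SimpleNamespace()

has_interface = isinstance(candidate, ExtractorModuleLike)
lacks_interface = isinstance(incomplete, ExtractorModuleLike)
print(has_interface, lacks_interface)
```

The `isinstance` check only confirms that the attribute exists; it does not validate the callable's signature, so a static type checker remains the stricter tool for this kind of interface.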

archivebox.extractors.get_extractors(dir: pathlib.Path = EXTRACTORS_DIR) -> Dict[str, archivebox.extractors.ExtractorModuleProtocol]

iterate through archivebox/extractors/*.py and load extractor modules
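
The general technique of scanning a directory for `*.py` files and loading each as a module can be sketched with the standard library's `importlib`. This is not the archivebox implementation: `load_modules_from_dir`, the rule of skipping underscore-prefixed files, and the temporary `demo.py` file are all assumptions made for the example.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType
from typing import Dict


def load_modules_from_dir(directory: Path) -> Dict[str, ModuleType]:
    """Hypothetical loader: import every non-underscore *.py file in a directory."""
    modules: Dict[str, ModuleType] = {}
    for path in sorted(directory.glob('*.py')):
        if path.name.startswith('_'):
            continue  # skip __init__.py and private helpers (an assumption)
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue  # file could not be turned into an import spec
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # run the file's top-level code
        modules[path.stem] = module
    return modules


# Demonstrate on a throwaway directory rather than a real package.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / '__init__.py').write_text('')
    (tmp_dir / 'demo.py').write_text(
        "def get_output_path():\n    return 'demo.html'\n"
    )
    loaded = load_modules_from_dir(tmp_dir)
    demo_output = loaded['demo'].get_output_path()

print(sorted(loaded), demo_output)
```

Keying the result by file stem mirrors the documented return type, a dictionary from extractor name to module, though the exact naming rule archivebox uses is not shown on this page.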

archivebox.extractors.EXTRACTORS

'get_extractors(…)'