archivebox.extractors

Submodules

Package Contents

Classes

ExtractorModuleProtocol

Type interface for an Extractor Module (WIP)

Functions

get_default_archive_methods

get_archive_methods_for_link

ignore_methods

archive_link

download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp

archive_links

get_extractors

iterate through archivebox/extractors/*.py and load extractor modules

Data

__package__

ShouldSaveFunction

SaveFunction

ArchiveMethodEntry

ARCHIVE_METHODS_INDEXING_PRECEDENCE

EXTRACTORS_DIR

EXTRACTORS

API

archivebox.extractors.__package__

‘archivebox.extractors’

archivebox.extractors.ShouldSaveFunction

None

archivebox.extractors.SaveFunction

None

archivebox.extractors.ArchiveMethodEntry

None

archivebox.extractors.get_default_archive_methods() → List[archivebox.extractors.ArchiveMethodEntry]

archivebox.extractors.ARCHIVE_METHODS_INDEXING_PRECEDENCE

[(‘readability’, 1), (‘mercury’, 2), (‘htmltotext’, 3), (‘singlefile’, 4), (‘dom’, 5), (‘wget’, 6)]
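A minimal sketch of how a precedence list of this shape could be consumed. The helper name pick_preferred_method is hypothetical, and the reading that a lower number means higher precedence is an assumption, not something this page states.

```python
from typing import List, Optional, Tuple

# Value copied from the entry documented above.
ARCHIVE_METHODS_INDEXING_PRECEDENCE: List[Tuple[str, int]] = [
    ("readability", 1), ("mercury", 2), ("htmltotext", 3),
    ("singlefile", 4), ("dom", 5), ("wget", 6),
]


def pick_preferred_method(available: List[str]) -> Optional[str]:
    """Hypothetical helper: return the available method with the best rank.

    Assumes a lower number means higher precedence.
    """
    rank = dict(ARCHIVE_METHODS_INDEXING_PRECEDENCE)
    # Methods that are not listed in the precedence table are skipped.
    ranked = [name for name in available if name in rank]
    if not ranked:
        return None
    return min(ranked, key=rank.__getitem__)
```

Under that assumption, a link with "wget", "dom", and "mercury" outputs would be indexed from the "mercury" output first.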

archivebox.extractors.ignore_methods(to_ignore: List[str]) → Iterable[str]
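The signature suggests a filter over method names. The following is a sketch of that idea only: the entry shape (name, should_save, save), the sample list, and the function name ignore_methods_sketch are all assumptions made for illustration, not the package's actual implementation.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for ArchiveMethodEntry: (name, should_save, save).
MethodEntry = Tuple[str, Callable[..., bool], Callable[..., object]]


def _always(*args, **kwargs) -> bool:
    """Placeholder should-save check that always says yes."""
    return True


def _noop(*args, **kwargs) -> None:
    """Placeholder save function that does nothing."""
    return None


# Illustrative entries only; this is not the real default method list.
SAMPLE_METHODS: List[MethodEntry] = [
    ("wget", _always, _noop),
    ("dom", _always, _noop),
    ("pdf", _always, _noop),
]


def ignore_methods_sketch(
    to_ignore: List[str],
    methods: List[MethodEntry] = SAMPLE_METHODS,
) -> Iterable[str]:
    """Return the names of all methods except those listed in to_ignore."""
    skip = set(to_ignore)
    return [name for name, _should_save, _save in methods if name not in skip]
```

Returning names rather than full entries matches the Iterable[str] return annotation shown in the signature.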

archivebox.extractors.archive_link

download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp

archivebox.extractors.EXTRACTORS_DIR

None

class archivebox.extractors.ExtractorModuleProtocol

Bases: typing.Protocol

Type interface for an Extractor Module (WIP)

get_output_path: Callable

None
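As a rough illustration of how a typing.Protocol with a single callable attribute can be checked at runtime, consider the sketch below. The names ExtractorModuleSketch and conforms are hypothetical, and since the real protocol is marked WIP it may declare more members than the one listed here.

```python
from types import SimpleNamespace
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class ExtractorModuleSketch(Protocol):
    """Hypothetical mirror of the documented protocol's single member."""

    get_output_path: Callable


def conforms(candidate: object) -> bool:
    """Check structurally that the candidate exposes get_output_path.

    isinstance() on a runtime-checkable protocol only tests that the
    attribute exists; it does not validate the callable's signature.
    """
    return isinstance(candidate, ExtractorModuleSketch)


# A SimpleNamespace stands in for an imported extractor module here.
fake_module = SimpleNamespace(get_output_path=lambda: "output.html")
empty_module = SimpleNamespace()
```

Here conforms(fake_module) is true and conforms(empty_module) is false, because only the first object has a get_output_path attribute.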

archivebox.extractors.get_extractors(dir: pathlib.Path = EXTRACTORS_DIR) → Dict[str, archivebox.extractors.ExtractorModuleProtocol]

iterate through archivebox/extractors/*.py and load extractor modules
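One common way to load every *.py file in a directory as a module is shown below. This is a sketch under assumptions: the function name load_modules_from_dir is hypothetical, skipping underscore-prefixed files is a choice made here, and the real get_extractors may use a different import mechanism.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import Dict


def load_modules_from_dir(directory: Path) -> Dict[str, ModuleType]:
    """Import each *.py file in directory and map file stem to module."""
    modules: Dict[str, ModuleType] = {}
    for path in sorted(directory.glob("*.py")):
        # Assumption: files such as __init__.py are not extractor modules.
        if path.name.startswith("_"):
            continue
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        # Executes the file's top-level code, like a normal import would.
        spec.loader.exec_module(module)
        modules[path.stem] = module
    return modules
```

Keying the result by file stem gives a name-to-module mapping of the same general shape as the Dict[str, ...] return type in the signature.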

archivebox.extractors.EXTRACTORS

‘get_extractors(…)’