archivebox.extractors
Package Contents
Classes
ExtractorModuleProtocol | Type interface for an Extractor Module (WIP)
Functions
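get_default_archive_methods |
get_archive_methods_for_link |
archive_link | download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp
archive_links |
get_extractors | iterate through archivebox/extractors/*.py and load extractor modules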
Data
ARCHIVE_METHODS_INDEXING_PRECEDENCE |
API
- archivebox.extractors.get_default_archive_methods() -> List[archivebox.extractors.ArchiveMethodEntry]
- archivebox.extractors.ARCHIVE_METHODS_INDEXING_PRECEDENCE
[('readability', 1), ('mercury', 2), ('htmltotext', 3), ('singlefile', 4), ('dom', 5), ('wget', 6)]
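The list above pairs each extractor name with a rank. As a minimal sketch of how such a list could be used, the snippet below sorts method names by rank. It assumes that a lower number means higher precedence, which this page does not state; `order_by_precedence` is a hypothetical helper, not part of ArchiveBox, and the constant is copied locally so the snippet runs without ArchiveBox installed.

```python
from typing import Dict, List, Tuple

# Copied from the documented value of ARCHIVE_METHODS_INDEXING_PRECEDENCE.
ARCHIVE_METHODS_INDEXING_PRECEDENCE: List[Tuple[str, int]] = [
    ("readability", 1), ("mercury", 2), ("htmltotext", 3),
    ("singlefile", 4), ("dom", 5), ("wget", 6),
]


def order_by_precedence(method_names: List[str]) -> List[str]:
    """Hypothetical helper: sort names by rank (assumes lower = preferred).

    Names missing from the precedence list are placed last.
    """
    rank: Dict[str, int] = dict(ARCHIVE_METHODS_INDEXING_PRECEDENCE)
    fallback = len(rank) + 1  # rank given to unlisted names
    return sorted(method_names, key=lambda name: rank.get(name, fallback))


print(order_by_precedence(["wget", "pdf", "dom", "readability"]))
# → ['readability', 'dom', 'wget', 'pdf']
```

Converting the list of pairs to a dict gives constant-time rank lookups, and `sorted` is stable, so unlisted names keep their relative order.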
- archivebox.extractors.get_archive_methods_for_link(link: archivebox.index.schema.Link) -> Iterable[archivebox.extractors.ArchiveMethodEntry]
- archivebox.extractors.archive_link(link: archivebox.index.schema.Link, overwrite: bool = False, methods: Optional[Iterable[str]] = None, out_dir: Optional[pathlib.Path] = None, created_by_id: int | None = None) -> archivebox.index.schema.Link
download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp
- archivebox.extractors.archive_links(all_links: Union[Iterable[archivebox.index.schema.Link], django.db.models.QuerySet], overwrite: bool = False, methods: Optional[Iterable[str]] = None, out_dir: Optional[pathlib.Path] = None, created_by_id: int | None = None) -> List[archivebox.index.schema.Link]
- class archivebox.extractors.ExtractorModuleProtocol
Bases: typing.Protocol
Type interface for an Extractor Module (WIP)
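This page does not list the members of `ExtractorModuleProtocol`, so the following is only a generic illustration of how a `typing.Protocol` describes an interface structurally. The class `ExtractorLike` and its methods `should_save` and `save` are hypothetical names chosen for the example and should not be read as ArchiveBox's actual interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ExtractorLike(Protocol):
    """Hypothetical interface; member names are illustrative only."""

    def should_save(self, url: str) -> bool: ...

    def save(self, url: str) -> str: ...


class EchoExtractor:
    """Satisfies ExtractorLike structurally, without inheriting from it."""

    def should_save(self, url: str) -> bool:
        return url.startswith("http")

    def save(self, url: str) -> str:
        return f"saved:{url}"


def run(extractor: ExtractorLike, url: str) -> str:
    # Works with any object that provides both methods.
    if extractor.should_save(url):
        return extractor.save(url)
    return "skipped"


print(isinstance(EchoExtractor(), ExtractorLike))  # → True
print(run(EchoExtractor(), "https://example.com"))  # → saved:https://example.com
```

A protocol suits plugin-style code because each implementation only has to provide the expected names; no shared base class is required.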
- archivebox.extractors.get_extractors(dir: pathlib.Path = EXTRACTORS_DIR) -> Dict[str, archivebox.extractors.ExtractorModuleProtocol]
iterate through archivebox/extractors/*.py and load extractor modules
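The general technique of loading every `*.py` file in a directory as a module can be sketched with the standard library's `importlib`. This is not ArchiveBox's implementation: `load_modules_from_dir` is a hypothetical function, and skipping files whose names start with an underscore (such as `__init__.py`) is an assumption made for the example.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType
from typing import Dict


def load_modules_from_dir(directory: Path) -> Dict[str, ModuleType]:
    """Hypothetical loader: import each *.py file, keyed by file stem."""
    modules: Dict[str, ModuleType] = {}
    for path in sorted(directory.glob("*.py")):
        if path.name.startswith("_"):
            continue  # assumption: skip __init__.py and private files
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue  # not importable as a source file
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module's top-level code
        modules[path.stem] = module
    return modules


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.py").write_text("NAME = 'demo'\n")
    (Path(tmp) / "__init__.py").write_text("")
    loaded = load_modules_from_dir(Path(tmp))

print(sorted(loaded))  # → ['demo']
```

Loading from file paths keeps discovery independent of `sys.path`, at the cost of executing each file's top-level code at load time.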