archivebox.extractors
Submodules
archivebox.extractors.htmltotext
archivebox.extractors.media
archivebox.extractors.archive_org
archivebox.extractors.git
archivebox.extractors.mercury
archivebox.extractors.wget
archivebox.extractors.readability
archivebox.extractors.favicon
archivebox.extractors.headers
archivebox.extractors.pdf
archivebox.extractors.dom
archivebox.extractors.title
archivebox.extractors.screenshot
archivebox.extractors.singlefile
Package Contents
Classes
ExtractorModuleProtocol | Type interface for an Extractor Module (WIP)
Functions
archive_link | download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp
get_extractors | iterate through archivebox/extractors/*.py and load extractor modules
Data
API
- archivebox.extractors.__package__
'archivebox.extractors'
- archivebox.extractors.ShouldSaveFunction
None
- archivebox.extractors.SaveFunction
None
- archivebox.extractors.ArchiveMethodEntry
None
- archivebox.extractors.get_default_archive_methods() -> List[archivebox.extractors.ArchiveMethodEntry]
- archivebox.extractors.ARCHIVE_METHODS_INDEXING_PRECEDENCE
[('readability', 1), ('mercury', 2), ('htmltotext', 3), ('singlefile', 4), ('dom', 5), ('wget', 6)]
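The precedence list above pairs each text-producing method with a rank. A minimal sketch of how such a list could be used to choose one output to index, assuming a lower number means higher priority (this page does not state that), with `pick_preferred_method` being a hypothetical helper that is not part of ArchiveBox:

```python
# Pairs copied from ARCHIVE_METHODS_INDEXING_PRECEDENCE as listed above.
ARCHIVE_METHODS_INDEXING_PRECEDENCE = [
    ('readability', 1), ('mercury', 2), ('htmltotext', 3),
    ('singlefile', 4), ('dom', 5), ('wget', 6),
]


def pick_preferred_method(available):
    """Hypothetical helper: return the available method with the lowest rank, or None."""
    rank = dict(ARCHIVE_METHODS_INDEXING_PRECEDENCE)
    # Ignore methods that have no entry in the precedence list (e.g. 'pdf').
    candidates = [name for name in available if name in rank]
    # ASSUMPTION: a lower number wins.
    return min(candidates, key=rank.__getitem__, default=None)


preferred = pick_preferred_method(['wget', 'dom', 'mercury'])
print(preferred)
```

Under that assumption, a snapshot that has `wget`, `dom`, and `mercury` outputs would be indexed from the `mercury` output.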
- archivebox.extractors.get_archive_methods_for_link(link: archivebox.index.schema.Link) -> Iterable[archivebox.extractors.ArchiveMethodEntry]
- archivebox.extractors.ignore_methods(to_ignore: List[str]) -> Iterable[str]
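The signature of `ignore_methods` suggests a name filter: it takes method names to skip and returns the names that remain. The sketch below illustrates that reading only. `DEFAULT_METHOD_NAMES` and `ignore_methods_sketch` are hypothetical stand-ins, and the real function may behave differently.

```python
from typing import Iterable, List

# Hypothetical stand-in for the names that get_default_archive_methods() would yield.
DEFAULT_METHOD_NAMES = ['favicon', 'headers', 'singlefile', 'pdf', 'screenshot', 'dom', 'wget']


def ignore_methods_sketch(to_ignore: List[str]) -> Iterable[str]:
    """Return every default method name that is not listed in to_ignore."""
    # ASSUMPTION: filtering is by exact name match and preserves the default order.
    return [name for name in DEFAULT_METHOD_NAMES if name not in to_ignore]


remaining = ignore_methods_sketch(['pdf', 'screenshot'])
print(remaining)
```

A result of this shape could then be passed as the `methods` argument of `archive_link` or `archive_links`, which accept an optional iterable of method names.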
- archivebox.extractors.archive_link(link: archivebox.index.schema.Link, overwrite: bool = False, methods: Optional[Iterable[str]] = None, out_dir: Optional[pathlib.Path] = None, created_by_id: int | None = None) -> archivebox.index.schema.Link
download the DOM, PDF, and a screenshot into a folder named after the link’s timestamp
- archivebox.extractors.archive_links(all_links: Union[Iterable[archivebox.index.schema.Link], django.db.models.QuerySet], overwrite: bool = False, methods: Optional[Iterable[str]] = None, out_dir: Optional[pathlib.Path] = None, created_by_id: int | None = None) -> List[archivebox.index.schema.Link]
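The `overwrite` and `methods` parameters, together with the `ShouldSaveFunction` and `SaveFunction` names, hint at a loop that asks each method whether it should run and then runs it. The following is a rough sketch of that pattern, not ArchiveBox's implementation. It assumes each `ArchiveMethodEntry` is a `(name, should_save, save)` triple, which this page does not confirm, and it uses plain URL strings instead of `Link` objects. `archive_link_sketch` and the callable signatures are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# ASSUMED shape of an ArchiveMethodEntry: (name, should_save, save).
Entry = Tuple[str, Callable[[str, bool], bool], Callable[[str], str]]


def archive_link_sketch(
    url: str,
    entries: List[Entry],
    overwrite: bool = False,
    methods: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Run each selected method whose should_save check passes; map name to output."""
    wanted = set(methods) if methods is not None else None
    results: Dict[str, str] = {}
    for name, should_save, save in entries:
        if wanted is not None and name not in wanted:
            continue  # the caller restricted the run to specific methods
        if not should_save(url, overwrite):
            continue  # e.g. output already exists and overwrite is False
        results[name] = save(url)
    return results


# Toy entries: 'title' always runs, 'pdf' only runs when overwrite is True.
entries: List[Entry] = [
    ('title', lambda url, overwrite: True, lambda url: f'title of {url}'),
    ('pdf', lambda url, overwrite: overwrite, lambda url: 'output.pdf'),
]

print(archive_link_sketch('https://example.com', entries))
print(archive_link_sketch('https://example.com', entries, overwrite=True, methods=['pdf']))
```

In this sketch, `methods=None` means "run everything", mirroring the `Optional[Iterable[str]] = None` default in the documented signatures.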
- archivebox.extractors.EXTRACTORS_DIR
None
- class archivebox.extractors.ExtractorModuleProtocol
Bases: typing.Protocol
Type interface for an Extractor Module (WIP)
- get_output_path: Callable
None
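Because the class derives from `typing.Protocol`, any object that exposes a `get_output_path` attribute satisfies it structurally, without inheriting from it. A small sketch of that idea follows. `ExtractorModuleSketch` is a hypothetical local copy of the interface, `@runtime_checkable` is added here only so that `isinstance` works in the demo, and the fake module is a stand-in for a real extractor module.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class ExtractorModuleSketch(Protocol):
    """Hypothetical mirror of the documented interface: one callable attribute."""
    get_output_path: Callable


# Stand-ins for imported modules: one provides get_output_path, one does not.
fake_module = SimpleNamespace(get_output_path=lambda: Path('output.pdf'))
empty_module = SimpleNamespace()

# The runtime check only verifies that the attribute exists, not its signature.
print(isinstance(fake_module, ExtractorModuleSketch))
print(isinstance(empty_module, ExtractorModuleSketch))
```

Static type checkers perform the stricter comparison; the runtime check is attribute presence only.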
- archivebox.extractors.get_extractors(dir: pathlib.Path = EXTRACTORS_DIR) -> Dict[str, archivebox.extractors.ExtractorModuleProtocol]
iterate through archivebox/extractors/*.py and load extractor modules
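Loading every `*.py` file in a directory as a module can be done with the standard library's `importlib`. The sketch below shows one way to do it, run against a temporary directory rather than the real `EXTRACTORS_DIR`. `load_modules_from_dir` is a hypothetical name, and skipping underscore-prefixed files is an assumption, not necessarily how `get_extractors` filters.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType
from typing import Dict


def load_modules_from_dir(directory: Path) -> Dict[str, ModuleType]:
    """Import each *.py file in directory and return {module_name: module}."""
    modules: Dict[str, ModuleType] = {}
    for path in sorted(directory.glob('*.py')):
        if path.name.startswith('_'):
            continue  # ASSUMPTION: skip __init__.py and other private files
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue  # the file could not be described as an importable module
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file's top-level code
        modules[path.stem] = module
    return modules


# Demo with a throwaway directory that contains one toy extractor module.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / '__init__.py').write_text('')
    (tmp_dir / 'pdf.py').write_text("def get_output_path():\n    return 'output.pdf'\n")
    loaded = load_modules_from_dir(tmp_dir)

print(sorted(loaded))
```

Keying the result by file stem matches the documented return type, a `Dict[str, ...]` of extractor modules.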
- archivebox.extractors.EXTRACTORS
'get_extractors(…)'