abx_plugin_title.extractor

Module Contents

Classes

TitleParser

Functions

get_html

Look for existing wget, singlefile, and dom output files, in that order. If none is found, download the URL again.

get_output_path

should_save_title

extract_title_with_regex

save_title

Try to guess the page’s title from its content.

Data

HTML_TITLE_REGEX

API

abx_plugin_title.extractor.HTML_TITLE_REGEX

‘compile(…)’

class abx_plugin_title.extractor.TitleParser(*args, **kwargs)

Bases: html.parser.HTMLParser

property title
handle_starttag(tag, attrs)
handle_data(data)
handle_endtag(tag)
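
The three handlers and the title property listed above follow the usual html.parser.HTMLParser callback pattern. As a minimal sketch of how such a parser might collect the text of the first title element (the class name SketchTitleParser and its attributes are hypothetical, and this is not the plugin's actual implementation):

```python
from html.parser import HTMLParser


class SketchTitleParser(HTMLParser):
    """Hypothetical stand-in for TitleParser; not the plugin's actual code."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.title_text = ""       # text accumulated while inside <title>
        self.inside_title = False  # True between <title> and </title>

    @property
    def title(self):
        # Return the collected text, or None if no usable <title> was seen.
        return self.title_text.strip() or None

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no title text has been captured yet.
        if tag.lower() == "title" and not self.title_text:
            self.inside_title = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self.inside_title and data:
            self.title_text += data

    def handle_endtag(self, tag):
        if tag.lower() == "title":
            self.inside_title = False


parser = SketchTitleParser()
parser.feed("<html><head><title> Example Domain </title></head></html>")
print(parser.title)  # Example Domain
```

Keeping the state in two plain attributes means the parser can be fed HTML in pieces, and the title property can be read at any point without raising.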
abx_plugin_title.extractor.get_html(link: archivebox.index.schema.Link, path: pathlib.Path, timeout: int = CURL_CONFIG.CURL_TIMEOUT) -> str

Look for existing wget, singlefile, and dom output files, in that order. If none is found, download the URL again.
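
The fallback order described above can be sketched as a loop over candidate files with a download as the last resort. This is only an illustration under stated assumptions: the helper read_first_available, the file names, and the download callback are all hypothetical and do not reflect the real get_html signature or the paths ArchiveBox actually uses.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def read_first_available(
    out_dir: Path,
    candidates: Iterable[str],
    download: Callable[[], str],
) -> str:
    """Return the first readable, non-empty candidate file, else call download()."""
    for name in candidates:
        path = out_dir / name
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # missing or unreadable: try the next candidate
        if text.strip():
            return text
    # No saved copy was usable, so fetch the page again.
    return download()


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Hypothetical file names standing in for the wget, singlefile and dom outputs.
    (out_dir / "singlefile.html").write_text("<title>Saved copy</title>", encoding="utf-8")
    html = read_first_available(
        out_dir,
        ["wget.html", "singlefile.html", "dom.html"],
        download=lambda: "<title>Fresh download</title>",
    )
    print(html)  # <title>Saved copy</title>
```

Passing the download step in as a callback keeps the sketch free of network access, so the ordering logic can be checked on its own.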

abx_plugin_title.extractor.get_output_path()
abx_plugin_title.extractor.should_save_title(link: archivebox.index.schema.Link, out_dir: Optional[str] = None, overwrite: Optional[bool] = False) -> bool
abx_plugin_title.extractor.extract_title_with_regex(html)
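
The value of HTML_TITLE_REGEX is elided on this page, so the following is only a sketch of what a regex-based title extractor could look like. The pattern SKETCH_TITLE_REGEX and the function sketch_extract_title are hypothetical and should not be read as the plugin's actual pattern or behavior.

```python
import re
from typing import Optional

# Hypothetical pattern: matches <title ...>text</title>, case-insensitively,
# and lets the text span multiple lines.
SKETCH_TITLE_REGEX = re.compile(
    r"<title(?:\s[^>]*)?>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def sketch_extract_title(html: str) -> Optional[str]:
    """Return the first <title> text with whitespace collapsed, or None."""
    match = SKETCH_TITLE_REGEX.search(html)
    if not match:
        return None
    # Collapse newlines and repeated spaces into single spaces.
    title = " ".join(match.group(1).split())
    return title or None


print(sketch_extract_title("<TITLE>\n  Example\n  Domain </TITLE>"))  # Example Domain
```

A regex like this is a reasonable fallback when the markup is too malformed for a parser, but it does not decode character references or skip title-like text inside comments and scripts.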
abx_plugin_title.extractor.save_title(link: archivebox.index.schema.Link, out_dir: Optional[pathlib.Path] = None, timeout: int = CURL_CONFIG.CURL_TIMEOUT) -> archivebox.index.schema.ArchiveResult

Try to guess the page’s title from its content.