archivebox.parsers.generic_html

Module Contents

Classes

HrefParser

Functions

parse_generic_html_export

Parse generic HTML for href attributes and use only the URL (support for titles coming later).

did_urljoin_misbehave

Handle the urljoin edge-case bug where multiple slashes get turned into a single slash.

fix_urljoin_bug

Recursively replace broken sub-URLs …/http:/… with …/http://…

Data

__package__

KEY

NAME

PARSER

API

archivebox.parsers.generic_html.__package__

'archivebox.parsers'

class archivebox.parsers.generic_html.HrefParser

Bases: html.parser.HTMLParser

handle_starttag(tag, attrs)

archivebox.parsers.generic_html.parse_generic_html_export(html_file: IO[str], root_url: Optional[str] = None, **_kwargs) → Iterable[archivebox.index.schema.Link]

Parse generic HTML for href attributes and use only the URL (support for titles coming later).
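The general technique of collecting href values with an html.parser.HTMLParser subclass can be sketched as follows. This is a minimal illustration, not the module's actual code: the names LinkCollector and extract_urls are hypothetical, restricting the search to <a> tags is an assumption, and the sketch returns plain strings rather than archivebox.index.schema.Link objects.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for HrefParser: records every non-empty href on <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def extract_urls(html: str, root_url: Optional[str] = None) -> List[str]:
    """Hypothetical helper: return hrefs, resolved against root_url when one is given."""
    collector = LinkCollector()
    collector.feed(html)
    if root_url is None:
        return collector.urls
    return [urljoin(root_url, url) for url in collector.urls]
```

Without a root_url, relative links are returned as written; with one, each link is resolved using urljoin, which is where the edge case handled by did_urljoin_misbehave and fix_urljoin_bug can arise.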

archivebox.parsers.generic_html.KEY

'html'

archivebox.parsers.generic_html.NAME

'Generic HTML'

archivebox.parsers.generic_html.PARSER

None

archivebox.parsers.generic_html.did_urljoin_misbehave(root_url: str, relative_path: str, final_url: str) → bool

Handle the urljoin edge-case bug where multiple slashes get turned into a single slash.

This workaround only fixes the most common case of a sub-URL inside an outer URL, e.g.: https://web.archive.org/web/https://example.com/some/inner/url

But there are other valid URLs containing // that are not fixed by this workaround, e.g.: https://example.com/drives/C//some/file
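One way such a check could work is to compare whether an embedded "://" existed before the join and is gone afterwards. The sketch below is an assumption about the approach, not the module's implementation; suburl_was_collapsed and strip_outer_scheme are hypothetical names. It also shows the limitation described above: a plain // with no scheme in front of it is not detected.

```python
from urllib.parse import urljoin


def strip_outer_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so that only embedded sub-URLs are inspected."""
    for prefix in ("http://", "https://"):
        if url.lower().startswith(prefix):
            return url[len(prefix):]
    return url


def suburl_was_collapsed(root_url: str, relative_path: str, final_url: str) -> bool:
    """Hypothetical check: True if an embedded '://' was present before joining but not after."""
    had_suburl = (
        "://" in strip_outer_scheme(root_url)
        or "://" in strip_outer_scheme(relative_path)
    )
    still_has_suburl = "://" in strip_outer_scheme(final_url)
    return had_suburl and not still_has_suburl


# Demonstration only: the joined result may depend on the Python version in use.
root = "https://web.archive.org/"
path = "web/https://example.com/some/inner/url"
joined = urljoin(root, path)
print(joined, suburl_was_collapsed(root, path, joined))
```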

archivebox.parsers.generic_html.fix_urljoin_bug(url: str, nesting_limit=5)

Recursively replace broken sub-URLs …/http:/… with …/http://…

Basically equivalent to this for 99.9% of cases:

    url = url.replace('/http:/', '/http://')
    url = url.replace('/https:/', '/https://')

except this handles:

- other schemes besides http/https (e.g. https://example.com/link/git+ssh://github.com/example)
- other preceding separators besides / (e.g. https://example.com/login/?next=https://example.com/home)
- fixing multiple sub-URLs recursively
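A repair along these lines can be sketched with a regular expression and a bounded loop. This is an illustrative assumption rather than the function's real body: repair_suburls and BROKEN_SUBURL are hypothetical names, the set of separator characters is a guess, and the sketch iterates instead of recursing.

```python
import re

# Assumed pattern: a separator, then a scheme, then ":/" that is NOT followed by a second "/".
BROKEN_SUBURL = re.compile(
    r"(?P<sep>[/=&?])"                      # character preceding the sub-URL
    r"(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*)"  # scheme such as http, https, git+ssh
    r":/(?!/)"                              # collapsed "://"
)


def repair_suburls(url: str, nesting_limit: int = 5) -> str:
    """Hypothetical sketch: restore '://' in embedded sub-URLs, one per pass."""
    for _ in range(nesting_limit):
        fixed = BROKEN_SUBURL.sub(r"\g<sep>\g<scheme>://", url, count=1)
        if fixed == url:
            break  # nothing left to repair
        url = fixed
    return url
```

Fixing one sub-URL per pass means nesting_limit caps how many broken sub-URLs are repaired in a single call, and a URL that is already well formed is returned unchanged.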