archivebox.parsers.generic_html
Module Contents
Classes
Functions
- parse_generic_html_export: Parse Generic HTML for href tags and use only the url (support for title coming later)
- did_urljoin_misbehave: Handle urljoin edge case bug where multiple slashes get turned into a single slash
- fix_urljoin_bug: recursively replace broken suburls …/http:/… with …/http://…
Data
API
- archivebox.parsers.generic_html.__package__
'archivebox.parsers'
- class archivebox.parsers.generic_html.HrefParser
Bases: html.parser.HTMLParser
- handle_starttag(tag, attrs)
- archivebox.parsers.generic_html.parse_generic_html_export(html_file: IO[str], root_url: Optional[str] = None, **_kwargs) -> Iterable[archivebox.index.schema.Link]
Parse Generic HTML for href tags and use only the url (support for title coming later)
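The flow described above (collect href values with an HTMLParser subclass, then resolve them against root_url) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not ArchiveBox's actual implementation: `SketchHrefParser` and `iter_href_urls` are hypothetical names, the sketch yields plain URL strings instead of Link objects, and it does not apply the urljoin workaround documented further down.

```python
import io
from html.parser import HTMLParser
from typing import IO, Iterator, List, Optional
from urllib.parse import urljoin


class SketchHrefParser(HTMLParser):
    """Hypothetical stand-in for HrefParser: collects every href attribute value."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value:
                self.urls.append(value)


def iter_href_urls(html_file: IO[str], root_url: Optional[str] = None) -> Iterator[str]:
    """Yield each href found in html_file, resolved against root_url when one is given."""
    html_file.seek(0)
    parser = SketchHrefParser()
    parser.feed(html_file.read())
    for href in parser.urls:
        yield urljoin(root_url, href) if root_url else href


sample = io.StringIO('<a href="/about">About</a> <a href="https://example.org/x">X</a>')
urls = list(iter_href_urls(sample, root_url="https://example.com/blog/"))
```

Only the URL is extracted here; link text is ignored, in line with the note that title support is coming later.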
- archivebox.parsers.generic_html.KEY
'html'
- archivebox.parsers.generic_html.NAME
'Generic HTML'
- archivebox.parsers.generic_html.PARSER
None
- archivebox.parsers.generic_html.did_urljoin_misbehave(root_url: str, relative_path: str, final_url: str) -> bool
Handle urljoin edge case bug where multiple slashes get turned into a single slash:
This workaround only fixes the most common case of a sub-URL inside an outer URL, e.g.: https://web.archive.org/web/https://example.com/some/inner/url
But there are other valid URLs containing // that are not fixed by this workaround, e.g.: https://example.com/drives/C//some/file
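One plausible way to detect the sub-URL case is to check whether an embedded `://` was present in the inputs but is missing from the joined result. The sketch below is an assumption about how such a check could look, not the library's actual logic; `suburl_was_collapsed` and `has_inner_suburl` are hypothetical names.

```python
import re

# Matches a URL's own leading scheme (e.g. "https://") so it can be ignored
_LEADING_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def has_inner_suburl(url: str) -> bool:
    """True if '://' appears anywhere after the URL's own leading scheme."""
    return "://" in _LEADING_SCHEME.sub("", url, count=1)


def suburl_was_collapsed(root_url: str, relative_path: str, final_url: str) -> bool:
    """Hypothetical check: an inner '://' existed before joining but not after."""
    expected_suburl = has_inner_suburl(root_url) or has_inner_suburl(relative_path)
    return expected_suburl and not has_inner_suburl(final_url)


root = "https://web.archive.org/web/https://example.com/"
broken = suburl_was_collapsed(root, "page", "https://web.archive.org/web/https:/example.com/page")
intact = suburl_was_collapsed(root, "page", "https://web.archive.org/web/https://example.com/page")
# The plain double-slash case is not detected, matching the limitation noted above
missed = suburl_was_collapsed(
    "https://example.com/drives/C//some/", "file", "https://example.com/drives/C/some/file"
)
```

The joined URLs are passed in as literal strings so the example does not depend on the urljoin behavior of any particular Python version.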
- archivebox.parsers.generic_html.fix_urljoin_bug(url: str, nesting_limit=5)
recursively replace broken suburls …/http:/… with …/http://…
basically equivalent to this for 99.9% of cases:
url = url.replace('/http:/', '/http://')
url = url.replace('/https:/', '/https://')
except this handles:
- other schemes besides http/https (e.g. https://example.com/link/git+ssh://github.com/example)
- other preceding separators besides / (e.g. https://example.com/login/?next=https://example.com/home)
- fixing multiple suburls recursively
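A regex-based repair along these lines might look like the sketch below. It is a simplified illustration, not the actual implementation: `repair_suburls` and its pattern are hypothetical, only `/` and `=` are treated as preceding separators, and the heuristic can misfire on paths that legitimately contain `name:/`.

```python
import re

# A separator ('/' or '='), then a scheme, then ':/' NOT followed by a second '/'
_BROKEN_SUBURL = re.compile(r"(?<=[/=])([a-zA-Z][a-zA-Z0-9+.-]{0,31}):/(?=[^/])")


def repair_suburls(url: str, nesting_limit: int = 5) -> str:
    """Restore '://' in collapsed sub-URLs, fixing one occurrence per pass."""
    for _ in range(nesting_limit):
        fixed = _BROKEN_SUBURL.sub(r"\1://", url, count=1)
        if fixed == url:
            break  # nothing left to repair
        url = fixed
    return url


archive = repair_suburls("https://web.archive.org/web/https:/example.com/page")
git_ssh = repair_suburls("https://example.com/link/git+ssh:/github.com/example")
next_qs = repair_suburls("https://example.com/login/?next=https:/example.com/home")
nested = repair_suburls("https://a.example/x/http:/b.example/y/https:/c.example/z")
```

Repairing one occurrence per pass keeps the `nesting_limit` parameter meaningful: it caps how many sub-URLs are fixed in a single URL.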