archivebox.parsers package

Submodules

archivebox.parsers.generic_json module

archivebox.parsers.generic_json.parse_generic_json_export(json_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse JSON-format bookmarks export files (produced by pinboard.in/export/ or wallabag)
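
To illustrate the shape of a parser like this one, here is a minimal sketch and not the ArchiveBox implementation. It assumes Pinboard-style keys (`href`, `description`, `tags`, `time`) in a top-level JSON array. `SketchLink` and `sketch_parse_json_export` are hypothetical names standing in for `archivebox.index.schema.Link` and the real parser.

```python
import io
import json
from datetime import datetime, timezone
from typing import IO, Iterable, NamedTuple


class SketchLink(NamedTuple):
    """Hypothetical stand-in for archivebox.index.schema.Link."""
    url: str
    title: str
    tags: str
    timestamp: str


def sketch_parse_json_export(json_file: IO[str]) -> Iterable[SketchLink]:
    """Yield one SketchLink per bookmark object in a JSON array."""
    json_file.seek(0)
    for entry in json.load(json_file):
        # Assumed Pinboard-style keys: href, description, tags, time.
        url = entry.get("href")
        if not url:
            continue  # skip entries that carry no URL
        saved = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ")
        saved = saved.replace(tzinfo=timezone.utc)
        yield SketchLink(
            url=url,
            title=entry.get("description", ""),
            tags=entry.get("tags", ""),
            timestamp=str(saved.timestamp()),
        )


# Usage: parse an in-memory export with one valid and one URL-less entry.
sample = io.StringIO(json.dumps([
    {"href": "https://example.com/", "description": "Example",
     "tags": "demo test", "time": "2020-01-01T00:00:00Z"},
    {"description": "no url here", "time": "2020-01-02T00:00:00Z"},
]))
links = list(sketch_parse_json_export(sample))
```

Writing the parser as a generator matches the documented `Iterable[...]` return type and lets callers stream large exports without building a full list first.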

archivebox.parsers.generic_rss module

archivebox.parsers.generic_rss.parse_generic_rss_export(rss_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse RSS XML-format files into links
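
As a rough illustration of RSS-to-link extraction, the sketch below uses the standard library's `xml.etree.ElementTree`. It is an assumption about how such a parser could look, not the code ArchiveBox ships. The function name `sketch_parse_rss` and the `(url, title, timestamp)` tuple shape are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import IO, Iterator, Tuple


def sketch_parse_rss(rss_file: IO[str]) -> Iterator[Tuple[str, str, float]]:
    """Yield (url, title, unix_timestamp) for every <item> in an RSS 2.0 feed."""
    rss_file.seek(0)
    root = ET.fromstring(rss_file.read())
    for item in root.iter("item"):
        url = item.findtext("link")
        if not url:
            continue  # an item without a <link> cannot become a link record
        title = item.findtext("title", default="")
        pub_date = item.findtext("pubDate")
        # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
        ts = parsedate_to_datetime(pub_date).timestamp() if pub_date else 0.0
        yield url.strip(), title.strip(), ts


# Usage: a tiny in-memory feed with a single item.
SAMPLE_FEED = (
    '<rss version="2.0"><channel>'
    "<item><title>Example</title><link>https://example.com/a</link>"
    "<pubDate>Wed, 01 Jan 2020 00:00:00 +0000</pubDate></item>"
    "</channel></rss>"
)
items = list(sketch_parse_rss(io.StringIO(SAMPLE_FEED)))
```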

archivebox.parsers.generic_txt module

archivebox.parsers.generic_txt.parse_generic_txt_export(text_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse raw links from each line in a text file
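
The line-by-line approach can be sketched with a regular expression. The pattern below is a deliberately simple assumption and is not the regex ArchiveBox uses. `URL_PATTERN` and `sketch_parse_txt` are hypothetical names.

```python
import io
import re
from typing import IO, Iterator

# Assumed, simplified pattern: http(s) scheme, then anything up to
# whitespace, angle brackets, or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def sketch_parse_txt(text_file: IO[str]) -> Iterator[str]:
    """Yield every URL found on each line of a plain-text file."""
    text_file.seek(0)
    for line in text_file:
        for match in URL_PATTERN.finditer(line):
            # Trim punctuation that usually belongs to the sentence, not the URL.
            yield match.group(0).rstrip(".,;:)")


# Usage: two URLs embedded in prose, plus a line with no links.
sample_text = io.StringIO(
    "see https://example.com/page, and http://test.org/x?y=1.\n"
    "no links here\n"
)
urls = list(sketch_parse_txt(sample_text))
```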

archivebox.parsers.medium_rss module

archivebox.parsers.medium_rss.parse_medium_rss_export(rss_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse Medium RSS feed files into links

archivebox.parsers.netscape_html module

archivebox.parsers.netscape_html.parse_netscape_html_export(html_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse Netscape-format bookmarks export files (produced by all browsers)
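
Netscape bookmark files store each bookmark as an `<A HREF="..." ADD_DATE="...">` anchor, usually one per line. The sketch below shows one way to pull those fields out with a regex. It assumes `HREF` comes directly before `ADD_DATE`, which real exports do not always guarantee, and `ANCHOR_PATTERN` and `sketch_parse_netscape_html` are hypothetical names rather than ArchiveBox internals.

```python
import io
import re
from typing import IO, Iterator, Tuple

# Assumed attribute order: HREF first, ADD_DATE second, anything else after.
ANCHOR_PATTERN = re.compile(
    r'<a\s+href="([^"]+)"\s+add_date="(\d+)"[^>]*>([^<]*)</a>',
    re.IGNORECASE,
)


def sketch_parse_netscape_html(html_file: IO[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (url, add_date, title) for each bookmark anchor, one per line."""
    html_file.seek(0)
    for line in html_file:
        match = ANCHOR_PATTERN.search(line)
        if match:
            url, added, title = match.groups()
            yield url, added, title.strip()


# Usage: a minimal export containing a single bookmark.
sample_html = io.StringIO(
    "<!DOCTYPE NETSCAPE-Bookmark-file-1>\n"
    "<DL><p>\n"
    '<DT><A HREF="https://example.com/" ADD_DATE="1577836800" TAGS="demo">Example</A>\n'
    "</DL><p>\n"
)
bookmarks = list(sketch_parse_netscape_html(sample_html))
```

A regex keeps the sketch short. An HTML parser such as the standard library's `html.parser` would cope better with reordered attributes or anchors that span several lines.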

archivebox.parsers.pinboard_rss module

archivebox.parsers.pinboard_rss.parse_pinboard_rss_export(rss_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse Pinboard RSS feed files into links

archivebox.parsers.pocket_html module

archivebox.parsers.pocket_html.parse_pocket_html_export(html_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse Pocket-format bookmarks export files (produced by getpocket.com/export/)

archivebox.parsers.shaarli_rss module

archivebox.parsers.shaarli_rss.parse_shaarli_rss_export(rss_file: IO[str]) → Iterable[archivebox.index.schema.Link]

Parse Shaarli-specific RSS XML-format files into links

Module contents

Everything related to parsing links from input sources.

For a list of supported services, see the README.md. For examples of supported import formats, see tests/.

Parse a list of URLs with their metadata from an RSS feed, bookmarks export, or text file

archivebox.parsers.save_stdin_to_sources(raw_text: str, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → str
archivebox.parsers.save_file_to_sources(path: str, timeout: int = 60, out_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/archivebox/checkouts/v0.4.0/docs') → str

Download a given URL’s content into output/sources/domain-<timestamp>.txt
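
The `domain-<timestamp>.txt` naming scheme can be sketched as follows. The exact timestamp format is not stated in this reference, so the `%Y%m%d-%H%M%S` layout here is an assumption, and `sketch_source_path` is a hypothetical helper that only builds the path and performs no download.

```python
from datetime import datetime
from pathlib import Path
from urllib.parse import urlparse


def sketch_source_path(url: str, out_dir: str, now: datetime) -> Path:
    """Build <out_dir>/sources/<domain>-<timestamp>.txt for a given URL."""
    domain = urlparse(url).netloc
    stamp = now.strftime("%Y%m%d-%H%M%S")  # assumed timestamp layout
    return Path(out_dir) / "sources" / f"{domain}-{stamp}.txt"


# Usage: the path a feed fetched at noon on 2020-01-01 would be saved under.
target = sketch_source_path(
    "https://example.com/feed.xml", "output", datetime(2020, 1, 1, 12, 0, 0)
)
```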

archivebox.parsers.check_url_parsing_invariants() → None

Check that plain text regex URL parsing works as expected
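
An invariant check of this kind typically runs the URL regex against a fixed sample and asserts on the result. The sketch below shows the idea under assumed inputs. The pattern, the sample text, and the name `sketch_check_url_invariants` are all hypothetical and do not reproduce ArchiveBox's actual test string.

```python
import re

# Assumed, simplified URL pattern (not ArchiveBox's real regex).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

SAMPLE_TEXT = (
    "plain https://example.com/a\n"
    '<a href="http://example.org/b?c=d">link</a>\n'
    "<https://example.net/e>\n"
)

EXPECTED_URLS = [
    "https://example.com/a",
    "http://example.org/b?c=d",
    "https://example.net/e",
]


def sketch_check_url_invariants(pattern: "re.Pattern[str]" = URL_PATTERN) -> None:
    """Raise AssertionError if the pattern stops extracting the expected URLs."""
    found = pattern.findall(SAMPLE_TEXT)
    assert found == EXPECTED_URLS, f"URL parsing regressed: {found!r}"


# Usage: passes silently with the default pattern.
sketch_check_url_invariants()
```

Running a check like this at import or startup time catches a regex change that silently drops or truncates URLs before any archive run begins.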