archivebox.misc.util

Module Contents

Classes

ExtendedEncoder

Extended JSON serializer that supports several model fields and objects.

Functions

filter_queryset_by_uuid_substring

Filter a queryset to UUID-column matches by prefix or suffix (case-insensitive).

ts_to_date_str

sanitize_extracted_url

Trim quote garbage and dangling prose punctuation from an extracted URL candidate.

validate_url_length

validate_url

parens_are_matched

Check that all parentheses in a string are balanced and properly nested.

fix_url_from_markdown

Clean up a regex-parsed URL that may contain dangling trailing parentheses from Markdown link syntax. Useful for fixing URLs parsed from Markdown, e.g. input: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).somemoretext result: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def

split_comma_separated_urls

find_all_urls

parse_filesize_to_bytes

Parse a byte count from an integer or human-readable string like 45mb or 2 GB.

enforce_types

Enforce a function's arg and kwarg types at runtime using its Python 3 type hints. A simpler version of pydantic's @validate_call decorator.

docstring

Attach the given docstring to the decorated function.

parse_date

Parse Unix timestamps, ISO-format strings, and human-readable date strings.

download_url

Download the contents of a remote url and return the text

ansi_to_html

Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html. A simple way to render colored CLI stdout/stderr in HTML properly, though Textual/rich is probably a better choice.

to_json

Serialize object to JSON string with extended type support

Data

scheme

without_scheme

without_query

without_fragment

without_path

path

basename

domain

query

fragment

extension

base_url

urlencode

urldecode

htmlencode

htmldecode

COLOR_REGEX

URL_REGEX

MAX_URL_LENGTH

QUOTE_DELIMITERS

QUOTE_ENTITY_DELIMITERS

URL_ENTITY_REPLACEMENTS

FILESIZE_UNITS

API

archivebox.misc.util.filter_queryset_by_uuid_substring(queryset, slug: str, field: str = 'id')

Filter a queryset to UUID-column matches by prefix or suffix (case-insensitive).

Avoids id__icontains (an unindexed full-table scan over the UUID column) by stripping non-hex chars from slug and matching with istartswith / iendswith. Returns an empty queryset for inputs with fewer than 8 hex chars to avoid overly broad matches. A full 32-char hex string falls back to an exact-equality lookup.
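The matching rules described above can be illustrated without a database. The following is a minimal sketch, assuming the UUIDs are available as an in-memory list; `match_uuids_by_substring` and `MIN_HEX_CHARS` are hypothetical names, and the real function applies the equivalent istartswith / iendswith lookups to a queryset instead.

```python
import re
import uuid

MIN_HEX_CHARS = 8  # threshold taken from the description above


def match_uuids_by_substring(candidates, slug):
    """Return the UUIDs whose hex form starts or ends with the hex chars in slug."""
    # Strip everything that is not a hex digit (dashes, spaces, stray letters).
    needle = re.sub(r"[^0-9a-f]", "", slug.lower())
    if len(needle) < MIN_HEX_CHARS:
        return []  # too short: refuse overly broad matches
    if len(needle) == 32:
        # A full UUID: exact comparison instead of prefix/suffix matching.
        return [u for u in candidates if u.hex == needle]
    return [u for u in candidates if u.hex.startswith(needle) or u.hex.endswith(needle)]


ids = [
    uuid.UUID("12345678-1234-5678-1234-567812345678"),
    uuid.UUID("abcdefab-cdef-abcd-efab-cdefabcdefab"),
]
prefix_hits = match_uuids_by_substring(ids, "1234-5678")
suffix_hits = match_uuids_by_substring(ids, "CDEFABCDEFAB")
```

Stripping non-hex characters first means a slug copied with or without dashes behaves the same way.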

archivebox.misc.util.scheme

None

archivebox.misc.util.without_scheme

None

archivebox.misc.util.without_query

None

archivebox.misc.util.without_fragment

None

archivebox.misc.util.without_path

None

archivebox.misc.util.path

None

archivebox.misc.util.basename

None

archivebox.misc.util.domain

None

archivebox.misc.util.query

None

archivebox.misc.util.fragment

None

archivebox.misc.util.extension

None

archivebox.misc.util.base_url

None

archivebox.misc.util.urlencode

None

archivebox.misc.util.urldecode

None

archivebox.misc.util.htmlencode

None

archivebox.misc.util.htmldecode

None

archivebox.misc.util.ts_to_date_str(ts: Any) → str | None

archivebox.misc.util.COLOR_REGEX

'compile(…)'

archivebox.misc.util.URL_REGEX

'compile(…)'

archivebox.misc.util.MAX_URL_LENGTH

65535

archivebox.misc.util.QUOTE_DELIMITERS

('"', "'", '`', '“', '”', '‘', '’')

archivebox.misc.util.QUOTE_ENTITY_DELIMITERS

('"', '"', '"', "'", "'", "'")

archivebox.misc.util.URL_ENTITY_REPLACEMENTS

(('&', '&'), ('&', '&'), ('&', '&'))

archivebox.misc.util.FILESIZE_UNITS: dict[str, int]

None

archivebox.misc.util.sanitize_extracted_url(url: str) → str

Trim quote garbage and dangling prose punctuation from an extracted URL candidate.

archivebox.misc.util.validate_url_length(url: str) → str

archivebox.misc.util.validate_url(url: str) → str

archivebox.misc.util.parens_are_matched(string: str, open_char='(', close_char=')')

Check that all parentheses in a string are balanced and properly nested.
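A balance check of this kind is usually a single pass with a depth counter. The sketch below is one way to do it, not necessarily how the library does; `parens_balanced` is a hypothetical name that mirrors the documented parameters.

```python
def parens_balanced(string, open_char="(", close_char=")"):
    """Return True if every close_char closes an earlier, still-open open_char."""
    depth = 0
    for char in string:
        if char == open_char:
            depth += 1
        elif char == close_char:
            depth -= 1
            if depth < 0:
                # A closer appeared before its opener, e.g. "a)(b".
                return False
    # Anything left open at the end is also unbalanced, e.g. "(a".
    return depth == 0


nested_ok = parens_balanced("/wiki/Foo_(bar_(baz))")
wrong_order = parens_balanced("a)(b")
```

With a single pair of delimiters a counter is enough; a stack only becomes necessary when several bracket types must nest inside each other.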

archivebox.misc.util.fix_url_from_markdown(url_str: str) → str

Clean up a regex-parsed URL that may contain dangling trailing parentheses from Markdown link syntax. Useful for fixing URLs parsed from Markdown, e.g.

input: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).somemoretext
result: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def

IMPORTANT ASSUMPTION: valid URLs won't have unbalanced or incorrectly nested parentheses. For example, this will fail if the user actually wants to ingest a URL like 'https://example.com/some_wei)(rd_url'; in that case it will return https://example.com/some_wei (truncated at the first unbalanced paren). This assumption holds 99.9999% of the time, and for the rare edge case the user can use the url_list parser.
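The truncation behaviour described above can be reproduced with a short sketch that trims characters from the end until the parentheses balance. This is an illustration under that assumption, not necessarily the library's implementation; `parens_balanced` and `trim_to_balanced_parens` are hypothetical names.

```python
def parens_balanced(text, open_char="(", close_char=")"):
    """Return True if every close_char closes an earlier, still-open open_char."""
    depth = 0
    for char in text:
        if char == open_char:
            depth += 1
        elif char == close_char:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def trim_to_balanced_parens(url):
    """Drop trailing characters until the parentheses in url are balanced."""
    trimmed = url
    while trimmed and not parens_balanced(trimmed):
        trimmed = trimmed[:-1]
    # If nothing balanced is left, fall back to the untouched input.
    return trimmed or url


fixed = trim_to_balanced_parens(
    "https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).somemoretext"
)
edge_case = trim_to_balanced_parens("https://example.com/some_wei)(rd_url")
```

The second call shows the documented edge case: a URL that really does contain `)(` is cut at the first unbalanced paren.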

archivebox.misc.util.split_comma_separated_urls(url: str)

archivebox.misc.util.find_all_urls(urls_str: str)

archivebox.misc.util.parse_filesize_to_bytes(value: str | int | float | None) → int

Parse a byte count from an integer or human-readable string like 45mb or 2 GB.
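As a rough illustration of this kind of parsing, the sketch below splits the input into a number and a unit suffix. The unit table uses binary multiples (1 kb = 1024 bytes) and None is treated as 0; both are assumptions, since the actual FILESIZE_UNITS values and None handling are not shown here. `filesize_to_bytes` and `ASSUMED_UNITS` are hypothetical names.

```python
import re

# Assumed unit table (binary multiples); the real FILESIZE_UNITS may differ.
ASSUMED_UNITS = {"": 1, "b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

# A number, optional whitespace, then an optional alphabetic unit suffix.
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*$", re.IGNORECASE)


def filesize_to_bytes(value):
    """Convert 1024, '45mb' or '2 GB' into an integer byte count."""
    if value is None:
        return 0  # assumption: a missing size counts as zero bytes
    if isinstance(value, (int, float)):
        return int(value)
    match = SIZE_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized file size: {value!r}")
    number, unit = match.groups()
    if unit.lower() not in ASSUMED_UNITS:
        raise ValueError(f"unknown file size unit: {unit!r}")
    return int(float(number) * ASSUMED_UNITS[unit.lower()])


forty_five_mb = filesize_to_bytes("45mb")
two_gb = filesize_to_bytes("2 GB")
```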

archivebox.misc.util.enforce_types(func)

Enforce a function's arg and kwarg types at runtime using its Python 3 type hints. A simpler version of pydantic's @validate_call decorator.
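A decorator of this kind can be sketched with the standard library alone. The version below is a simplified illustration, not the library's code: `check_types` is a hypothetical name, and it only checks hints that are plain classes, skipping generics such as `list[int]` and `*args` / `**kwargs` parameters.

```python
import functools
import inspect
from typing import get_origin, get_type_hints


def check_types(func):
    """Raise TypeError when an argument does not match its plain-class type hint."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            param = signature.parameters[name]
            if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
                continue  # *args / **kwargs are not checked in this sketch
            expected = hints.get(name)
            # Skip missing hints and parameterized generics like list[int].
            if expected is None or get_origin(expected) is not None:
                continue
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}() argument {name!r} must be "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def repeat(text: str, times: int = 2) -> str:
    return text * times


repeated = repeat("ab", times=3)
```

Calling `repeat("ab", times="3")` raises a TypeError before the function body runs.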

archivebox.misc.util.docstring(text: str | None)

Attach the given docstring to the decorated function.
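A decorator like this is typically a few lines long. The sketch below shows the general pattern under the hypothetical name `with_docstring`; it is useful when the same help text is shared between, say, a CLI command and the function behind it.

```python
def with_docstring(text):
    """Return a decorator that sets func.__doc__ to text."""
    def decorator(func):
        func.__doc__ = text
        return func  # the function itself is returned unchanged
    return decorator


SHARED_HELP = "Add two numbers."  # hypothetical shared help text


@with_docstring(SHARED_HELP)
def add(a, b):
    return a + b


attached = add.__doc__
```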

archivebox.misc.util.parse_date(date: Any) → datetime.datetime | None

Parse Unix timestamps, ISO-format strings, and human-readable date strings.
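The first two input kinds can be handled with the standard library, as in the sketch below. It is an illustration only: `parse_timestamp` is a hypothetical name, treating naive values as UTC and returning None for None are assumptions, and human-readable strings such as "3 days ago" are not covered because they would need a third-party parser.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a datetime, a Unix timestamp, or an ISO-format string into an aware datetime."""
    if value is None:
        return None
    if isinstance(value, datetime):
        # Assumption: naive datetimes are interpreted as UTC.
        return value if value.tzinfo else value.replace(tzinfo=timezone.utc)
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    text = str(value).strip()
    try:
        # Numeric strings such as "1700000000" are Unix timestamps.
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        pass
    parsed = datetime.fromisoformat(text)  # raises ValueError for anything else
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


epoch = parse_timestamp(0)
from_iso = parse_timestamp("2024-01-02T03:04:05")
```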

archivebox.misc.util.download_url(url: str, timeout: int | None = None, config=None, **config_kwargs) → str

Download the contents of a remote url and return the text

archivebox.misc.util.ansi_to_html(text: str) → str

Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html. A simple way to render colored CLI stdout/stderr in HTML properly, though Textual/rich is probably a better choice.
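To make the idea concrete, here is a deliberately small sketch of ANSI-to-HTML conversion. It only understands single-number color codes such as `\x1b[31m` and treats every other code as a reset; `SGR_RE`, `ANSI_COLORS`, and `ansi_colors_to_html` are hypothetical names, and the library's own COLOR_REGEX is not shown here and may cover more cases.

```python
import html
import re

# Hypothetical pattern: matches single-number SGR sequences like "\x1b[31m".
SGR_RE = re.compile(r"\x1b\[(\d+)m")
ANSI_COLORS = {
    30: "black", 31: "red", 32: "green", 33: "yellow",
    34: "blue", 35: "magenta", 36: "cyan", 37: "white",
}


def ansi_colors_to_html(text):
    """Wrap colored runs in <span> tags and HTML-escape everything else."""
    out = []
    span_open = False
    pos = 0
    for match in SGR_RE.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        pos = match.end()
        if span_open:
            out.append("</span>")  # any new code ends the current colored run
            span_open = False
        color = ANSI_COLORS.get(int(match.group(1)))
        if color:
            out.append(f'<span style="color: {color}">')
            span_open = True
    out.append(html.escape(text[pos:]))
    if span_open:
        out.append("</span>")
    return "".join(out)


rendered = ansi_colors_to_html("\x1b[31mfail\x1b[0m ok")
```

Escaping the text between codes matters because CLI output may itself contain `<` or `&`.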

class archivebox.misc.util.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.JSONEncoder

Extended JSON serializer that supports several model fields and objects.

Initialization

Constructor for JSONEncoder, with sensible defaults.

If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float, bool or None. If skipkeys is True, such items are simply skipped.

If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.

If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause a RecursionError). Otherwise, no such check takes place.

If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.

If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.

If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.

If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.

default(o)

archivebox.misc.util.to_json(obj: Any, indent: int | None = 4, sort_keys: bool = True) → str

Serialize object to JSON string with extended type support
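The usual way to build such a serializer is to subclass json.JSONEncoder, override default(), and pass the subclass to json.dumps. The sketch below shows that pattern with hypothetical names (`SketchEncoder`, `dumps_extended`); the set of extra types it handles is an assumption chosen for illustration, not the list that ExtendedEncoder actually supports.

```python
import json
from datetime import datetime
from pathlib import PurePath, PurePosixPath


class SketchEncoder(json.JSONEncoder):
    """JSONEncoder subclass that also accepts a few non-JSON types."""

    def default(self, o):
        # default() is only called for objects json cannot serialize natively.
        if isinstance(o, datetime):
            return o.isoformat()
        if isinstance(o, PurePath):
            return str(o)
        if isinstance(o, (set, frozenset)):
            return sorted(o)
        if isinstance(o, bytes):
            return o.decode("utf-8", errors="replace")
        return super().default(o)  # raises TypeError for unsupported types


def dumps_extended(obj, indent=4, sort_keys=True):
    """Serialize obj to a JSON string using SketchEncoder."""
    return json.dumps(obj, indent=indent, sort_keys=sort_keys, cls=SketchEncoder)


record = {"when": datetime(2024, 1, 2, 3, 4, 5), "path": PurePosixPath("/tmp/x")}
compact = dumps_extended(record, indent=None)
```

The defaults of `indent=4` and `sort_keys=True` mirror the to_json signature above and give stable, diff-friendly output.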