abx_plugin_wget.wget_util

Module Contents

Functions

unsafe_wget_output_path

wget_output_path

Calculate the path to the .html file that wget saved, since wget may adjust some paths so that they differ from the base_url path.

API

abx_plugin_wget.wget_util.unsafe_wget_output_path(link) → Optional[str]
abx_plugin_wget.wget_util.wget_output_path(link, nocache: bool = False) → Optional[str]

Calculate the path to the .html file that wget saved, since wget may adjust some paths so that they differ from the base_url path.

See docs on: wget --adjust-extension (-E), --restrict-file-names=windows|unix|ascii, --convert-links
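As a rough illustration of the options listed above, the sketch below guesses the relative path wget might write for an HTML page. It is not the body of wget_output_path: guess_wget_html_path and its windows_names parameter are hypothetical names, and the rules it applies (index.html for directory URLs, the query string kept in the filename with ? or @, .html appended when the name does not already end in .htm or .html) are a simplified reading of the wget manual that ignores character escaping, ports, redirects, and content types.

```python
import re
from urllib.parse import urlsplit

# Matches names that already end in .htm or .html (any case).
HTML_SUFFIX = re.compile(r"\.[Hh][Tt][Mm][Ll]?$")


def guess_wget_html_path(url: str, windows_names: bool = False) -> str:
    """Hypothetical helper: guess the relative path for an HTML page fetched with wget -E.

    Assumes the response is HTML; real wget also checks the content type.
    """
    parts = urlsplit(url)
    path = parts.path.lstrip("/")

    # Directory-style URLs are saved as index.html.
    if path == "" or path.endswith("/"):
        path += "index.html"

    # The query string stays in the filename; the separator depends on
    # --restrict-file-names (assumed: "@" in windows mode, "?" otherwise).
    if parts.query:
        path += ("@" if windows_names else "?") + parts.query

    # --adjust-extension: append .html unless the name already ends in .htm/.html.
    if not HTML_SUFFIX.search(path):
        path += ".html"

    return f"{parts.netloc}/{path}"
```

Under these assumptions, guess_wget_html_path("https://example.com/about") gives example.com/about.html, while a URL with a query string picks up a second .html after the query, which is one reason the saved path can differ from the base_url path.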

WARNING: this function is extremely error prone because mapping URLs to filesystem paths deterministically is basically impossible. Every OS and filesystem has different requirements on what special characters are allowed, and URLs are full of all kinds of special characters, illegal unicode, and generally unsafe strings that you don’t want anywhere near your filesystem. Also, URLs can be obscenely long, but most filesystems don’t accept paths longer than 250 characters. On top of all that, this function only exists to try to reverse engineer wget’s approach to solving this problem, so this is a shittier, less tested version of their already insanely complicated attempt to do this.

Here be dragons:

- https://github.com/ArchiveBox/ArchiveBox/issues/549
- https://github.com/ArchiveBox/ArchiveBox/issues/1373
- https://stackoverflow.com/questions/9532499/check-whether-a-path-is-valid-in-python-without-creating-a-file-at-the-paths-ta
- and probably many more that I didn’t realize were caused by this…

The only constructive thing we could possibly do to this function is to figure out how to remove it.

Preach loudly to anyone who will listen: never attempt to map URLs to filesystem paths, and pray you never have to deal with the aftermath of someone else’s attempt to do so…
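To make the failure modes in the warning concrete, here is a minimal sketch that flags a candidate path for a few of them. The name path_problems is hypothetical, the 250-byte default simply mirrors the figure quoted in the warning, and the reserved-character set is the one Windows rejects in filenames; real filesystems differ, so treat this as an illustration and not as a validity check.

```python
from typing import List

# Characters Windows does not allow in file names (path separator "/" excluded here).
WINDOWS_RESERVED = set('<>:"\\|?*')


def path_problems(rel_path: str, max_len: int = 250) -> List[str]:
    """Hypothetical helper: list reasons a relative path may be unsafe to create."""
    problems = []

    # Length limits are usually in bytes, so measure the UTF-8 encoding.
    if len(rel_path.encode("utf-8")) > max_len:
        problems.append("too long")

    # Characters that are fine in URLs but rejected by some filesystems.
    bad = sorted(set(rel_path) & WINDOWS_RESERVED)
    if bad:
        problems.append("reserved characters: " + "".join(bad))

    # Control characters (code points below 32) are unsafe almost everywhere.
    if any(ord(ch) < 32 for ch in rel_path):
        problems.append("control characters")

    return problems
```

An empty list only means none of these three checks fired; it says nothing about whether the path matches what wget actually wrote.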