add_soup(): Don't match Content-Type with `in` #374

Draft · wants to merge 1 commit into main from 2021.05.29.Content-Type_parsing

Conversation
@johnhawkinson (Contributor) commented May 29, 2021

Don't use Python's in operator to match Content-Types, since that is
a simple substring match.

It's obviously not correct since a Content-Type string can be
relatively complicated, like

Content-Type: application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"

Although that's rather contrived, the prior test
"text/html" in response.headers.get("Content-Type", "")
would return True here, incorrectly.

Also, the existence of subtypes with `+`s means that the prior
test for "application/xhtml" would match the above example when it
probably shouldn't.

Instead, leverage requests's code, which comes from the Python
Standard Library's cgi.py.

Clarify that we don't implement MIME sniffing, nor honor
X-Content-Type-Options: nosniff;
instead we do our own thing.
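To make the failure concrete, here is a minimal stdlib sketch (using `email.message.Message`, which supersedes the deprecated `cgi.parse_header`; this is illustration, not code from the PR) contrasting the substring test with actual header parsing:

```python
from email.message import Message

header = 'application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"'

# Naive substring test: matches inside the boundary parameter,
# not the media type, and so draws the wrong conclusion.
print("text/html" in header)   # True

# Proper parsing: compare the parsed media type instead.
msg = Message()
msg["Content-Type"] = header
print(msg.get_content_type())      # application/xhtml+xml
print(msg.get_param("charset"))    # utf-8
```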


I was looking at this code because of #373.

I've marked this as a draft because I'm not quite sure this is the way to go, both because of the long discursive comment and because of the use of a private (`_`-prefixed) function from requests (versus `cgi.py`'s `parse_header()`).

Also, I'm kind of perplexed what's going on here:

            http_encoding = (
                response.encoding
                if 'charset' in parameters
                else None
            )

Like…why does the presence of charset=utf-8 in the Content-Type header mean that we should trust requests's encoding field? Oh, I see, it's because sometimes requests does some sniffing-ish-stuff and sometimes it doesn't (in which case it parses the Content-Type) and we need to know which, and we're backing out a conclusion about its heuristics? Probably seems like maybe we should parse it ourselves if so. idk.
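If we did parse it ourselves, it could be as simple as pulling the charset parameter straight from the header instead of inferring it back out of `response.encoding` (a hypothetical helper, not MechanicalSoup or requests code):

```python
from email.message import Message

def charset_from_content_type(content_type: str):
    """Return the charset parameter of a Content-Type header, or None.

    Hypothetical sketch; not code from this PR.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")

print(charset_from_content_type('text/html; charset="utf-8"'))  # utf-8
print(charset_from_content_type('text/html'))                   # None
```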

Maybe we should be doing more formal mime sniffing. And maybe we should be honoring X-Content-Type-Options: nosniff. And… … …

I'm also not sure what kind of test coverage is really appropriate here, if anything additional. Seems like the answer shouldn't be "zero," so…

@johnhawkinson force-pushed the 2021.05.29.Content-Type_parsing branch from 7b2e991 to 323fc77 on May 29, 2021 22:01
@moy (Collaborator) left a comment
Looks good to me.

I'd be really reluctant to implement advanced MIME sniffing in MechanicalSoup (essentially, this is our `__looks_like_html`, but far more elaborate), but why not add this if we find a suitable library to do the job for us.

@johnhawkinson (Contributor, Author)
Thanks.

I agree caution is warranted before implementing MIME sniffing, and I don't suggest we do it.
That said, there exists https://pypi.org/project/sniffpy/, but I don't know how mature it is.

This is mostly in draft state for feedback on:

  • calling `requests.utils._parse_content_type_header`
  • test coverage
  • whether our http_encoding is appropriate (but probably that's another PR/issue)
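For reference, a stdlib-based sketch of the kind of exact-media-type check under discussion (using `email.message` rather than requests' private helper; the function name and the list of HTML types here are my assumptions, not the PR's diff):

```python
from email.message import Message

# Assumed list of media types to treat as HTML; not taken from the PR.
HTML_TYPES = ("text/html", "application/xhtml+xml")

def is_html_content_type(content_type_header: str) -> bool:
    # Compare the parsed media type exactly, rather than substring-matching
    # the raw header, which the boundary example above defeats.
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type() in HTML_TYPES

print(is_html_content_type('application/xhtml+xml; boundary="This is not text/html"'))  # True
print(is_html_content_type('application/octet-stream'))  # False
```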

@moy (Collaborator) commented Oct 2, 2022

Actually, this looks very much like spam, so @almaney4, if you are a human, please say so!
