add_soup(): Don't match Content-Type with `in` #374

Draft · wants to merge 1 commit into main from 2021.05.29.Content-Type_parsing

Conversation
@johnhawkinson (Contributor) commented May 29, 2021

Don't use Python's in operator to match Content-Types, since that is
a simple substring match.

It's obviously not correct since a Content-Type string can be
relatively complicated, like

Content-Type: application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"

Although that's rather contrived, the prior test
"text/html" in response.headers.get("Content-Type", "")
would return True here, incorrectly.

Also, the existence of subtypes with `+`s means that the prior
test for "application/xhtml" would match the above example when it
probably shouldn't.

Instead, leverage requests's code, which comes from the Python
Standard Library's cgi.py.

Clarify that we don't implement MIME sniffing, nor honor
X-Content-Type-Options: nosniff;
instead we do our own thing.
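To make the failure concrete, here is a minimal stdlib sketch (using `email.message.Message`, which supersedes the deprecated `cgi.parse_header`; this is illustration, not code from the PR) contrasting the substring test with actual header parsing:

```python
from email.message import Message

header = 'application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"'

# Naive substring test: matches inside the boundary parameter,
# not the media type, and so draws the wrong conclusion.
print("text/html" in header)   # True

# Proper parsing: compare the parsed media type instead.
msg = Message()
msg["Content-Type"] = header
print(msg.get_content_type())      # application/xhtml+xml
print(msg.get_param("charset"))    # utf-8
```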


I was looking at this code because of #373.

I've marked this as a draft because I'm not quite sure this is the way to go, both because of the long discursive comment and because of the use of a private (`_`-prefixed) function from requests (versus `cgi.py`'s `parse_header()`).

Also, I'm kind of perplexed what's going on here:

            http_encoding = (
                response.encoding
                if 'charset' in parameters
                else None
            )

Like…why does the presence of charset=utf-8 in the Content-Type header mean that we should trust requests's encoding field? Oh, I see, it's because sometimes requests does some sniffing-ish-stuff and sometimes it doesn't (in which case it parses the Content-Type) and we need to know which, and we're backing out a conclusion about its heuristics? Probably seems like maybe we should parse it ourselves if so. idk.
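If we did parse it ourselves, it could be as simple as pulling the charset parameter straight from the header instead of inferring it back out of `response.encoding` (a hypothetical helper, not MechanicalSoup or requests code):

```python
from email.message import Message

def charset_from_content_type(content_type: str):
    """Return the charset parameter of a Content-Type header, or None.

    Hypothetical sketch; not code from this PR.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")

print(charset_from_content_type('text/html; charset="utf-8"'))  # utf-8
print(charset_from_content_type('text/html'))                   # None
```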

Maybe we should be doing more formal mime sniffing. And maybe we should be honoring X-Content-Type-Options: nosniff. And… … …

I'm also not sure what kind of test coverage is really appropriate here, if anything additional. Seems like the answer shouldn't be "zero," so…

@johnhawkinson force-pushed the 2021.05.29.Content-Type_parsing branch from 7b2e991 to 323fc77 on May 29, 2021 22:01
@moy (Collaborator) left a comment
Looks good to me.

I'd be really reluctant to implement advanced MIME sniffing in MechanicalSoup (essentially, this is our `__looks_like_html`, but far more elaborate), but why not add this if we find a suitable library to do the job for us.

@johnhawkinson (Contributor, Author)
Thanks.

I agree caution is warranted before implementing MIME sniffing, and I don't suggest we do it.
That said, there exists https://pypi.org/project/sniffpy/, but I don't know how mature it is.

This is mostly in draft state for feedback on:

  • calling `requests.utils._parse_content_type_header`
  • test coverage
  • whether our http_encoding is appropriate (but probably that's another PR/issue)
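For reference, a stdlib-based sketch of the kind of exact-media-type check under discussion (using `email.message` rather than requests' private helper; the function name and the list of HTML types here are my assumptions, not the PR's diff):

```python
from email.message import Message

# Assumed list of media types to treat as HTML; not taken from the PR.
HTML_TYPES = ("text/html", "application/xhtml+xml")

def is_html_content_type(content_type_header: str) -> bool:
    # Compare the parsed media type exactly, rather than substring-matching
    # the raw header, which the boundary example above defeats.
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type() in HTML_TYPES

print(is_html_content_type('application/xhtml+xml; boundary="This is not text/html"'))  # True
print(is_html_content_type('application/octet-stream'))  # False
```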

@moy (Collaborator) commented Oct 2, 2022

Actually, this looks very much like spam, so @almaney4, if you are a human, please say so!
