
Replace chardet with plain UTF-8 detection #1018

Closed
Tronic opened this issue Jun 9, 2020 · 14 comments · Fixed by #1269
Labels
enhancement New feature or request
Milestone
v1.0

Comments

@Tronic

Tronic commented Jun 9, 2020

I'd suggest replacing chardet with dead-simple decoding logic:

encoding = "UTF-8"
try:
    text = databytes.decode(encoding)
except UnicodeDecodeError:
    encoding = "ISO-8859-1"
    text = databytes.decode(encoding)

It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such.

Otherwise ISO-8859-1 (as specified in the HTTP RFC) should be used instead of trying to detect languages, which is always quite unreliable. Further, ISO-8859-1 maps all possible byte values directly to Unicode code points 0-255, so the final decoding step in the above code will never raise UnicodeDecodeError.
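For example, decoding all 256 possible byte values as ISO-8859-1 never fails:

# ISO-8859-1 maps every byte 0x00-0xFF to the code point with the same value,
# so decoding arbitrary bytes can never raise UnicodeDecodeError.
all_bytes = bytes(range(256))
text = all_bytes.decode("ISO-8859-1")
assert all(ord(char) == byte for byte, char in zip(all_bytes, text))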

Since the chardet module has been around a long time and is known as the standard go-to solution for charset detection, many modules depend on it, and this is a frequent source of dependency conflicts such as the one reported here: sanic-org/sanic#1858

A more serious problem is that chardet is notoriously unreliable. In particular, its tendency to misdetect almost anything as Turkish is well known:

>>> chardet.detect("Testiä".encode("UTF-8"))   # "Testing" in Finnish
{'encoding': 'ISO-8859-9',
 'confidence': 0.5521177026603239,
 'language': 'Turkish'}

>>> "Testiä".encode("UTF-8").decode("ISO-8859-9")  # Yields gibberish, not Turkish
'TestiÃ¤'
@tomchristie
Member

Hiya. Okay, this is well pitched, though it's quite hard to get a handle on whether switching away would be a net positive or a net negative for us.

We use chardet in cases where the charset hasn't been included in the Content-Type, because one of the initial design goals for httpx was requests compatibility and that's historically been their behaviour. However, I'm open to reassessing this. I expect the landscape has kinda changed since then, with the prevalence of UTF-8 having risen dramatically in that time.

  • It'd be interesting to collate some issues from requests to see what their prior experiences are here.
  • It'd be interesting to get an insight into browser behaviours wrt. fuzzily determining content types where it's unset in the headers.
  • We might well do a better job by using some fuzzy pattern matching for HTML meta tags as a fallback when headers are unset, with utf-8, and then iso-8859-1 as absolute fallbacks.

I guess the upshot of this is likely to be...

  • Start by allowing the "apparent encoding" mechanism to be configurable.
  • Switch chardet out from being the default, and instead fall back on (1) a fuzzy pattern match for meta tags, (2) utf-8, (3) iso-8859-1, in that order. (Rough sketch of that below.)
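Roughly something like the following — just a sketch of that ordering; the regex and the guess_text name are made up for illustration, not proposed API:

import re

META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def guess_text(content: bytes) -> str:
    # 1. Fuzzy pattern match for an HTML meta charset declaration.
    match = META_CHARSET_RE.search(content[:4096])
    if match:
        try:
            return content.decode(match.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or incorrectly declared charset; keep falling back
    # 2. utf-8, then 3. iso-8859-1 as absolute fallbacks.
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        return content.decode("iso-8859-1")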

@tomchristie tomchristie added this to the v1.0 milestone Jun 9, 2020
@tomchristie
Member

tomchristie commented Jun 9, 2020

Hrm, so it looks to me like chardet is what Mozilla use if both Content-Type and Meta tags fail... https://www-archive.mozilla.org/projects/intl/chardet.html

As such I think we ought to be a bit careful about the assumption in "It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such." - it's not necessarily a given that'd match up with actual real-world behaviours, or with current browser behaviours.

One thing that'd be helpful would be some examples of URLs which are resolving with undesired content-types. Having a dig into real-world cases would help give a better overall orientation of where HTML pages are omitting encoding information.

@tomchristie
Member

Useful context in psf/requests#1737

@tomchristie
Member

Also useful context aio-libs/aiohttp#1811 (aiohttp diverges from requests here)

@Tronic
Author

Tronic commented Jul 15, 2020

Benchmark of the OP code against chardet.detect, using 100 kB of ASCII and 100 kB of ISO-8859-1 (only 'ä' repeated):

>>> %timeit auto_decode(ascii_bytes)
7.66 µs ± 84.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit auto_decode(latin1_bytes)
12.1 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit chardet.detect(ascii_bytes)
1.32 ms ± 8.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit chardet.detect(latin1_bytes)
2.89 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There is a huge difference even with ASCII, and non-ASCII characters totally devastate chardet performance (tried with UTF-8 too, with similar timing results as latin1 on both implementations).
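(Here auto_decode is just the snippet from the first post wrapped in a function, and the test data is built roughly as described above:)

def auto_decode(databytes: bytes) -> str:
    # Try strict UTF-8 first; fall back to ISO-8859-1, which accepts any bytes.
    try:
        return databytes.decode("UTF-8")
    except UnicodeDecodeError:
        return databytes.decode("ISO-8859-1")

ascii_bytes = b"a" * 100_000
latin1_bytes = "ä".encode("ISO-8859-1") * 100_000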

@Tronic
Author

Tronic commented Jul 15, 2020

It might be tempting to try to optimise by only using the initial few kilobytes for detection, but doing so is likely to catch only HTML markup and miss the actual native characters that would appear somewhere in the middle of the document. Fortunately, Python's built-in decoding is extremely fast and breaks on the first error, so such optimisations are not needed, especially when the decoding result is actually used afterwards.

@tomchristie
Member

Okay, so given that the landscape has moved on since requests opted for chardet for determining unknown character encodings, and that UTF-8 is so much more prevalent now, I'm feeling pretty open to us considering a change in behaviour here.

There are two practical things we'd need to consider to make that change...

  1. A pull request updating TextDecoder to instead use UTF-8 with an ISO-8859-1 fallback. It needs to support iterative decoding, so probably using IncrementalDecoder, and initially waiting until at least (say) 4096 bytes have been buffered before the initial decode pass? (See the sketch after this list.)
  2. We might(?) want to consider an API for customizing the text decoder, to allow users to switch to alternate behaviours such as a ChardetDecoder or an HTMLMetaTagDecoder?
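For the first point, a rough sketch only — the class name, buffer size and structure are illustrative, not httpx's actual TextDecoder:

import codecs

class UTF8FallbackDecoder:
    def __init__(self, buffer_size: int = 4096) -> None:
        self.decoder = codecs.getincrementaldecoder("utf-8")()
        self.buffer = b""
        self.buffer_size = buffer_size
        self.decided = False

    def decode(self, data: bytes, final: bool = False) -> str:
        if self.decided:
            return self.decoder.decode(data, final)
        self.buffer += data
        if len(self.buffer) < self.buffer_size and not final:
            return ""  # keep buffering until there is enough to decide on
        buffered, self.buffer = self.buffer, b""
        self.decided = True
        try:
            return self.decoder.decode(buffered, final)
        except UnicodeDecodeError:
            # Not valid UTF-8 so far: restart as ISO-8859-1, which never fails.
            self.decoder = codecs.getincrementaldecoder("iso-8859-1")()
            return self.decoder.decode(buffered, final)

A chunk arriving after that initial decision could of course still fail to decode as UTF-8, so a real implementation would need to decide whether to keep a copy and re-decode, or surface the error.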

@voegtlel

Another note on this issue: psf/requests#4848 mentions that chardet is problematic when bundling it with an application in production deployments. I also like the suggestion to make it pluggable with https://github.com/Ousret/charset_normalizer (MIT license) or https://github.com/PyYoshi/cChardet (which is MPL, still easier to comply with than LGPL). It'd be great if you could consider making the dependency optional :)

@Tronic
Author

Tronic commented Jul 17, 2020

Buffering causes issues for applications that expect partial messages to come through without delay (long polling, logging and other such hacks). Thus, maybe it would be best to just always use ISO-8859-1 when nothing else is explicitly specified?

Section 3.7.1 of https://tools.ietf.org/html/rfc2616 says:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.

I wonder where the autodetection is actually used because I just tried serving a UTF-8 document as text/html with no charset or meta tags, and Chrome, Firefox, httpx and requests all decoded it as 8859-1 or similar, not as UTF-8, although chardet detected the content as UTF-8.

@tomchristie
Member

The autodetection is only used for media types that are not text/*, and that don't have a charset set... https://github.com/encode/httpx/blob/master/httpx/_models.py#L777

And yes, I'm more inclined to diverge from requests behaviour here, since it's only applied to a pretty narrow subset of cases anyway, and the landscape wrt. utf-8 has moved on since their initial design decision.

@alexchamberlain

I found this issue as I was reporting an issue with response.text taking 10 seconds, which I tracked down to chardet.detect. For what it's worth, in my case, I'm parsing a sitemap, which is XML, so the algorithm for guessing the encoding should be something like: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch12s11.html.
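For reference, the sort of thing I mean — heavily simplified, BOM sniffing plus the XML declaration, falling back to the XML default of UTF-8; the guess_xml_encoding name is just for this sketch:

import re

def guess_xml_encoding(content: bytes) -> str:
    # Byte-order marks take precedence over everything else.
    if content.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if content.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # Otherwise look for the encoding attribute in the XML declaration.
    match = re.match(rb'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']', content)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"  # the XML default when nothing is declared

# e.g. text = response.content.decode(guess_xml_encoding(response.content))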

I've implemented this outside of httpx via response.content. Do you see supporting the XML encoding guessing or a custom guessing function fitting in here?

@tomchristie
Member

Not really, no, although if we could provide a good enough API to allow it as a third-party option, then that'd be a good route.

@alexchamberlain

There is a setter for encoding exposed on Response, which I think is sufficient now that I've thought about it a little more.

Would you be open to being able to set the encoding in __init__? I'd use it when reconstructing Response from an on-disk cache.
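i.e. roughly this, with the encoding set before .text is first accessed (the URL is just an example):

import httpx

response = httpx.get("https://example.org/sitemap.xml")
# Override whatever encoding httpx would otherwise guess; the same would apply
# to a Response reconstructed from an on-disk cache.
response.encoding = "utf-8"
text = response.text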

@tomchristie
Member

@alexchamberlain Possibly, but it'd need some careful thinking around how to make the Response API nicely coherent in that area. (Eg. see #1091)

I probably wouldn't want us to accept any API changes on it at the moment.
