Response.text returns improperly decoded text (requests 1.2.3, python 2.7) #1604

lavr · 2013-09-15T18:15:15Z

If an HTTP server returns Content-Type: text/* without a charset, Response.text always decodes the body as 'ISO-8859-1'.

This may be valid per RFC 2616, section 3.7.1, but it is wrong in real life in 2013.

I made an example page with Chinese text:
http://lavr.github.io/python-emails/tests/requests/some-utf8-text.html
All browsers render this page properly.
But requests.get returns invalid text.

And here is a simple test with that URL:
https://gist.github.com/lavr/6572927
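
The symptom can be reproduced without any network access. The following is a minimal sketch, assuming the body is UTF-8; the two characters used are an illustrative stand-in, not the content of the page linked above. Decoding the bytes as ISO-8859-1 raises no error, because every byte maps to some Latin-1 character, but the result is mojibake.

```python
# Illustrative bytes: two Chinese characters encoded as UTF-8
# (a stand-in for the real page body, which is not reproduced here).
raw = "\u4e2d\u6587".encode("utf-8")

# Decoding with the ISO-8859-1 fallback: one character per byte, no error.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the encoding the page author actually used.
as_utf8 = raw.decode("utf-8")

print(len(as_latin1), len(as_utf8))  # 6 characters of garbage vs. 2 real ones
```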

Lukasa · 2013-09-15T18:18:08Z

Thanks for this @lavr!

This is a deliberate design decision for Requests. We're following the spec unless we find ourselves in a position where the specification diverges so wildly from real world behaviour that it becomes a problem (e.g. GET after 302 response to a POST).

If the upstream server knows what the correct encoding is, it should signal it. Otherwise, we're going to follow what the spec says. =)

If you think the spec default is a bad one, I highly encourage you to get involved with the RFC process for HTTP/2.0 in order to get this default changed. =)

sigmavirus24 · 2013-09-15T18:21:21Z

What @Lukasa said + the fact that if the encoding retrieved from the headers is non-existent we rely on charade to guess at the encoding. With so few characters, charade will not return anything definitive because it uses statistical data to guess at what the right encoding is.

Frankly, the year makes no difference and does not change specification either.

If you know what encoding you're expecting you can also do the decoding yourself like so:

text = r.content.decode('<ENCODING>', 'replace')

There is nothing wrong with requests as far as I'm concerned and this is not a bug in charade either. Since @Lukasa seems to agree with me, I'm closing this.
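
As a small illustration of the manual-decoding suggestion, here is a sketch that works on plain bytes. The helper name `decode_body` and the sample bytes are assumptions made for the example; with a real response the bytes would come from `r.content`.

```python
def decode_body(content, encoding, errors="replace"):
    """Decode raw response bytes with an encoding the caller already knows.

    Hypothetical helper for illustration; not part of Requests.
    """
    return content.decode(encoding, errors)


# Valid UTF-8 ("caf" + e-acute + space) followed by one byte that is not valid UTF-8.
body = b"caf\xc3\xa9 \xff"

# With errors="replace", the bad byte becomes U+FFFD instead of raising.
text = decode_body(body, "utf-8")
```

Passing `errors="strict"` instead would raise `UnicodeDecodeError` on the same input, which may be preferable when silent data loss is not acceptable.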

kennethreitz · 2013-09-15T18:26:16Z

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

sigmavirus24 · 2013-09-15T18:27:29Z

@kennethreitz that's disappointing. Why are we making that easy for people? =P

kennethreitz · 2013-09-15T18:31:02Z

Absolutely :)

kennethreitz · 2013-09-15T18:31:24Z

Mostly for Japanese websites. They all lie about their encoding.

lavr · 2013-09-15T18:58:19Z

@sigmavirus24
Please note that utils.get_encoding_from_headers always returns 'ISO-8859-1' in this case, so charade never gets a chance to be called.
So the bug is: we expect charade to be used to guess the encoding, but it is not.

lavr · 2013-09-15T19:05:07Z

The patch above fixes the bug but still follows the RFC.
Please consider reviewing it.

Lukasa · 2013-09-16T09:18:48Z

@lavr Sorry, we didn't make this very clear. We do not expect charade to be called in this case. The RFC is very clear: if you don't specify a charset, and the MIME type is text/*, the encoding must be assumed to be ISO-8859-1. That means "don't guess". =)
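
The rule described here can be sketched as a small function. This is an illustration of the RFC 2616 default, not the Requests source: the name `rfc2616_encoding` is hypothetical, and the header parsing is delegated to the standard library's `email.message.Message`.

```python
from email.message import Message


def rfc2616_encoding(content_type):
    """Return the encoding implied by a Content-Type header value.

    Hypothetical helper sketching the RFC 2616 rule; not part of Requests.
    """
    if not content_type:
        return None  # no header at all: nothing to infer

    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_content_charset()  # lower-cased charset, or None
    if charset:
        return charset  # the server said what it is: trust it

    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"  # text/* without charset: the RFC 2616 default

    return None  # non-text type without charset: no default applies
```

Under this rule a `text/html` response with no charset is never a candidate for guessing, which is the behaviour being defended in this comment.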

kennethreitz · 2013-09-16T09:20:14Z

@lavr: just set r.encoding to None, and it'll work as you expect (I think).

Lukasa · 2013-09-16T09:21:18Z

Or do r.encoding = r.apparent_encoding.

kennethreitz · 2013-09-16T09:21:39Z

Even better.

lavr · 2013-09-16T18:39:20Z

With r.encoding = None or r.encoding = r.apparent_encoding, we lose the server's charset information.
Totally ignoring the server header is not a good solution, I think.

The right solution is something like this:

import cgi
import requests

r = requests.get(...)
# parse_header returns (main_value, params); the charset lives in params
params = cgi.parse_header(r.headers.get('content-type', ''))[1]
server_encoding = params.get('charset', '').strip("'\"") or None
r.encoding = server_encoding or r.apparent_encoding
text = r.text

Looks weird :(

Lukasa · 2013-09-16T18:42:10Z

Or do this:

r = requests.get(...)

if r.encoding is None or r.encoding == 'ISO-8859-1':
    r.encoding = r.apparent_encoding

lavr · 2013-09-16T19:09:22Z

I don't think so :)

The condition r.encoding is None makes no sense, because r.encoding can never be None for Content-Type: text/*.

r.encoding == 'ISO-8859-1'... what does it mean? Did the server send charset='ISO-8859-1', or did it send no charset? If the former, I shouldn't guess the charset.

Lukasa · 2013-09-16T19:12:33Z

@lavr I was covering the non-text bases. You can rule out the charset possibility by using this condition instead:

r.encoding == 'ISO-8859-1' and 'ISO-8859-1' not in r.headers.get('Content-Type', '')
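
The distinction being discussed (server declared ISO-8859-1 versus server declared nothing) can also be made by parsing the header rather than substring matching, which avoids case-sensitivity surprises. The following is a sketch under assumptions: `declared_charset` and `resolve_encoding` are hypothetical names, and the `SimpleNamespace` objects are stand-ins exposing only the `headers`, `encoding`, and `apparent_encoding` attributes that a Requests response also has.

```python
from email.message import Message
from types import SimpleNamespace


def declared_charset(content_type):
    """Return the charset the server actually sent, or None if it sent none."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def resolve_encoding(response):
    """Pick an encoding: honour a declared charset, guess only when there is none.

    Hypothetical helper; `response` is any object with `headers`, `encoding`
    and `apparent_encoding` attributes.
    """
    declared = declared_charset(response.headers.get("Content-Type"))
    if declared is None and response.encoding == "ISO-8859-1":
        # ISO-8859-1 here is only the spec default, so a guess is allowed.
        return response.apparent_encoding or response.encoding
    return response.encoding


# Stand-in responses (assumed values, for illustration only).
no_charset = SimpleNamespace(
    headers={"Content-Type": "text/html"},
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)
explicit_latin1 = SimpleNamespace(
    headers={"Content-Type": "text/html; charset=iso-8859-1"},
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)

guessed = resolve_encoding(no_charset)
honoured = resolve_encoding(explicit_latin1)
```

With a real response the result would be assigned back, as in `r.encoding = resolve_encoding(r)`, before reading `r.text`.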

lavr · 2013-09-16T19:38:00Z

@Lukasa
Well, I can use this hack.
And everybody in Eastern Europe and Asia can use it.

But what if we fix it in requests? ;)
What if requests honestly set encoding=None on a response without a charset?

Lukasa · 2013-09-16T19:39:52Z

As we've discussed many times, Requests is following the HTTP specification to the letter. The current behaviour is not wrong. =)

Lukasa · 2013-09-16T19:40:42Z

The fact that it is not helpful for your use case is a whole other story. =)

kennethreitz · 2013-09-16T19:58:38Z

Alright, that's enough discussion on this. Thanks for the feedback.

lavr · 2014-06-08T20:02:48Z

The updated HTTP/1.1 spec obsoletes the ISO-8859-1 default charset: http://tools.ietf.org/html/rfc7231#appendix-B

Lukasa · 2014-06-08T20:03:37Z

We're already tracking this in #2086. =)

passos · 2017-05-19T07:28:05Z

To whom it may concern, here is a compatibility patch.

Create a file requests_patch.py with the following code and import it; the problem should then be solved.

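#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import chardet

def monkey_patch():
    # Keep a handle on the original Response.content property.
    prop = requests.models.Response.content
    def content(self):
        _content = prop.fget(self)
        # 'ISO-8859-1' is treated here as "no charset declared" (the RFC 2616 default).
        if self.encoding == 'ISO-8859-1':
            # Prefer a charset declared inside the body itself ...
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                # ... and otherwise fall back to chardet's statistical guess.
                self.encoding = chardet.detect(_content)['encoding']

            if self.encoding:
                # Re-encode the body as UTF-8 so later reads are consistent.
                _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
                self._content = _content
                self.encoding = 'utf8'

        return _content
    # Replace the property with the patched getter.
    requests.models.Response.content = property(content)

monkey_patch()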
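
The first step of the patch above, looking for a charset declared inside the body, can be illustrated on its own. This is a deliberately simplified sketch: `sniff_declared_charset` is a hypothetical name, and the single regular expression is an assumption made for the example, not the pattern Requests uses.

```python
import re

# Simplified pattern: a <meta ...> tag containing charset=VALUE, quoted or not.
_META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_declared_charset(body, limit=4096):
    """Return a charset named in a <meta> tag near the start of `body`, or None.

    Hypothetical helper for illustration; `body` is the raw response bytes.
    """
    match = _META_CHARSET.search(body[:limit])
    return match.group(1).decode("ascii") if match else None


html5 = b'<html><head><meta charset="utf-8"></head></html>'
html4 = b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
plain = b"<p>no declaration here</p>"

found = [sniff_declared_charset(doc) for doc in (html5, html4, plain)]
```

Only when this kind of lookup finds nothing does the patch fall back to statistical detection.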

…s it. See for details: - psf/requests#1604 - https://github.com/requests/requests/blob/9cfd292da33f3b8324ce2c20e01db0e2f8b9210b/requests/utils.py#L473

2af · 2018-12-26T09:36:05Z

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.
>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'
Then, proceed normally.

Thanks for this! Any idea on how to do it in one line?

sigmavirus24 closed this as completed Sep 15, 2013

itsadok mentioned this issue Nov 14, 2013

[Suggestion] Simplify charset handling #1737

Open

Lukasa mentioned this issue Dec 3, 2013

Response should not return 'ISO-8859-1' as default encoding #1774

Closed

Lukasa mentioned this issue Jul 9, 2014

Response encoding detect #2122

Closed

Lukasa mentioned this issue Aug 7, 2014

add auto detect charset from http body when http headers not seted #2161

Closed

sourcefilter mentioned this issue Feb 21, 2018

Use response.content instead of response.text.encode("utf-8")? mloesch/sickle#22

Closed

qcha0 mentioned this issue Mar 18, 2018

关于requests请求内容的Content-Type qcha0/blog#3

Open

psf locked as resolved and limited conversation to collaborators Dec 26, 2018
