Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Response.text returns improperly decoded text (requests 1.2.3, python 2.7) #1604

Closed
lavr opened this issue Sep 15, 2013 · 24 comments
Closed

Comments

@lavr
Copy link

lavr commented Sep 15, 2013

If http server returns Content-type: text/* without encoding, Response.text always decode it as 'ISO-8859-1' text.

It may be valid in RFC2616/3.7.1, but this is wrong in real life in 2013.

I made example page with chinese text:
http://lavr.github.io/python-emails/tests/requests/some-utf8-text.html
All browsers renders this page properly.
But reguests.get returns invalid text.

And here is simple test with that url:
https://gist.github.com/lavr/6572927

@Lukasa
Copy link
Member

Lukasa commented Sep 15, 2013

Thanks for this @lavr!

This is a deliberate design decision for Requests. We're following the spec unless we find ourselves in a position where the specification diverges so wildly from real world behaviour that it becomes a problem (e.g. GET after 302 response to a POST).

If the upstream server knows what the correct encoding is, it should signal it. Otherwise, we're going to follow what the spec says. =)

If you think the spec default is a bad one, I highly encourage you to get involved with the RFC process for HTTP/2.0 in order to get this default changed. =)

@sigmavirus24
Copy link
Contributor

What @Lukasa said + the fact that if the encoding retrieved from the headers is non-existent we rely on charade to guess at the encoding. With so few characters, charade will not return anything definitive because it uses statistical data to guess at what the right encoding is.

Frankly, the year makes no difference and does not change specification either.

If you know what encoding you're expecting you can also do the decoding yourself like so:

text = str(r.content, '<ENCODING>', errors='replace')

There is nothing wrong with requests as far as I'm concerned and this is not a bug in charade either. Since @Lukasa seems to agree with me, I'm closing this.

@kennethreitz
Copy link
Contributor

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

@sigmavirus24
Copy link
Contributor

@kennethreitz that's disappointing. Why are we making that easy for people? =P

@kennethreitz
Copy link
Contributor

Absolutely :)

@kennethreitz
Copy link
Contributor

Mostly for Japanese websites. They all lie about their encoding.

@lavr
Copy link
Author

lavr commented Sep 15, 2013

@sigmavirus24
please note, that utils.get_encoding_from_headers always returns 'ISO-8859-1', and charade has no chance to be called.
so bug is: we expect that charade is used to guess encoding, but it is not.

@lavr
Copy link
Author

lavr commented Sep 15, 2013

A patch above fixes a bug, but still follows RFC.
Please, consider to review it.

@Lukasa
Copy link
Member

Lukasa commented Sep 16, 2013

@lavr Sorry, we didn't make this very clear. We do not expect charade to be called in this case. The RFC is very clear: if you don't specify a charset, and the MIME type is text/*, the encoding must be assumed to be ISO-8859-1. That means "don't guess". =)

@kennethreitz
Copy link
Contributor

@lavr: just set r.encoding to None, and it'll work as you expect (I think).

@Lukasa
Copy link
Member

Lukasa commented Sep 16, 2013

Or do r.encoding = r.apparent_encoding.

@kennethreitz
Copy link
Contributor

Even better.

@lavr
Copy link
Author

lavr commented Sep 16, 2013

On r.encoding = None and r.encoding = r.apparent_encoding we lost server charset information.
Totally ignoring server header is not good solution, I think.

Right solution is something like this:

r = requests.get(...)
params = cgi.parse_header(r.headers.get('content-type'))[0]
server_encoding = ('charset' in params) and params['charset'].strip("'\"") or None
r.encoding = server_encoding or r.apparent_encoding
text = r.text

Looks weird :(

@Lukasa
Copy link
Member

Lukasa commented Sep 16, 2013

Or do this:

r = requests.get(...)

if r.encoding is None or r.encoding == 'ISO-8859-1':
    r.encoding = r.apparent_encoding

@lavr
Copy link
Author

lavr commented Sep 16, 2013

I don't think so :)

Condition r.encoding is None has no sense, because r.encoding can never be None for content-type=text/*.

r.encoding == 'ISO-8859-1'... what does it mean ? Server sent charset='ISO-8859-1' or server sent no charset? If first, I shouldn't guess charset.

@Lukasa
Copy link
Member

Lukasa commented Sep 16, 2013

@lavr I was covering the non-text bases. You can rule out the charset possibility by using this condition instead:

r.encoding == 'ISO-8859-1' and not 'ISO-8859-1' in r.headers.get('Content-Type', '')

@lavr
Copy link
Author

lavr commented Sep 16, 2013

@Lukasa
Well, I can use this hack.
And everybody in Eastern Europe and Asia can use it.

But what if we fix it in requests ? ;)
What if requests can honestly set enconding=None on response without charset ?

@Lukasa
Copy link
Member

Lukasa commented Sep 16, 2013

As we've discussed many times, Requests is following the HTTP specification to the letter. The current behaviour is not wrong. =)

@Lukasa
Copy link
Member

Lukasa commented Sep 16, 2013

The fact that it is not helpful for your use case is a whole other story. =)

@kennethreitz
Copy link
Contributor

Alright, that's enough discussion on this. Thanks for the feedback.

@lavr
Copy link
Author

lavr commented Jun 8, 2014

Updated HTTP 1.1 obsoletes ISO-8859-1 default charset: http://tools.ietf.org/html/rfc7231#appendix-B

@Lukasa
Copy link
Member

Lukasa commented Jun 8, 2014

We're already tracking this in #2086. =)

@passos
Copy link

passos commented May 19, 2017

To whom it may concern, here is a compatibility patch

create file requests_patch.py with following code and import it, then the problem should be solved.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import chardet

def monkey_patch():
    prop = requests.models.Response.content
    def content(self):
        _content = prop.fget(self)
        if self.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                self.encoding = chardet.detect(_content)['encoding']

            if self.encoding:
                _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
                self._content = _content
                self.encoding = 'utf8'

        return _content
    requests.models.Response.content = property(content)

monkey_patch()


@2af
Copy link

2af commented Dec 26, 2018

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

Thanks for this! Any idea on how to do it in one line?

@psf psf locked as resolved and limited conversation to collaborators Dec 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants