Response.text returns improperly decoded text (requests 1.2.3, python 2.7) #1604
Comments
Thanks for this @lavr! This is a deliberate design decision for Requests. We're following the spec unless we find ourselves in a position where the specification diverges so wildly from real world behaviour that it becomes a problem (e.g. GET after 302 response to a POST). If the upstream server knows what the correct encoding is, it should signal it. Otherwise, we're going to follow what the spec says. =) If you think the spec default is a bad one, I highly encourage you to get involved with the RFC process for HTTP/2.0 in order to get this default changed. =) |
What @Lukasa said + the fact that if the encoding retrieved from the headers is non-existent we rely on charade to guess at the encoding. With so few characters, charade will not return anything definitive because it uses statistical data to guess at what the right encoding is. Frankly, the year makes no difference and does not change the specification either. If you know what encoding you're expecting you can also do the decoding yourself like so:
text = r.content.decode('<ENCODING>', errors='replace')
There is nothing wrong with requests as far as I'm concerned and this is not a bug in charade either. Since @Lukasa seems to agree with me, I'm closing this. |
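A minimal sketch of the manual decoding suggested above, written for Python 3 and using only the built-in bytes.decode. The helper name decode_body and the sample bytes are assumptions for illustration; the sample stands in for r.content.

```python
def decode_body(content, encoding):
    """Decode raw response bytes; undecodable bytes become U+FFFD."""
    return content.decode(encoding, errors="replace")


# Hypothetical stand-in for r.content: valid UTF-8 plus one stray byte.
sample = "caf\u00e9".encode("utf-8") + b"\xff"
text = decode_body(sample, "utf-8")
```

With errors='replace' a bad byte never raises; it shows up in the text as the replacement character, so the damage stays visible.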
@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.
>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'
Then, proceed normally. |
@kennethreitz that's disappointing. Why are we making that easy for people? =P |
Absolutely :) |
Mostly for Japanese websites. They all lie about their encoding. |
@sigmavirus24 |
A patch above fixes a bug, but still follows RFC. |
@lavr Sorry, we didn't make this very clear. We do not expect charade to be called in this case. The RFC is very clear: if you don't specify a charset, and the MIME type is |
@lavr: just set |
Or do |
Even better. |
The right solution is something like this:
import cgi
r = requests.get(...)
# parse_header returns (main_value, params); the charset is in params
params = cgi.parse_header(r.headers.get('content-type', ''))[1]
server_encoding = params.get('charset', '').strip("'\"") or None
r.encoding = server_encoding or r.apparent_encoding
text = r.text
Looks weird :( |
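For readers on current Python 3, where the cgi module was removed in 3.13, the same header check can be sketched with the standard library's email.message.Message. The helper name charset_from_content_type is hypothetical, and this is one possible approach rather than anything Requests itself does.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() strips quotes and lowercases the result;
    # it returns None when no charset parameter is present.
    return msg.get_content_charset()
```

The result could then feed the same fallback as above, for example r.encoding = charset_from_content_type(r.headers.get('content-type')) or r.apparent_encoding.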
Or do this:
r = requests.get(...)
if r.encoding is None or r.encoding == 'ISO-8859-1':
    r.encoding = r.apparent_encoding
|
I don't think so :) Condition
|
@lavr I was covering the non-text bases. You can rule out the r.encoding == 'ISO-8859-1' and not 'ISO-8859-1' in r.headers.get('Content-Type', '') |
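The condition discussed in the last few comments can be written as a small pure function, which makes it easy to check without a network call. This is only a sketch: choose_encoding is a hypothetical name, and its three arguments stand in for r.encoding, the Content-Type header value, and r.apparent_encoding.

```python
def choose_encoding(current, content_type, detected):
    """Prefer the detected encoding only when the current one is a default.

    current      -- stand-in for r.encoding (may be None)
    content_type -- stand-in for r.headers.get('Content-Type', '')
    detected     -- stand-in for r.apparent_encoding
    """
    header = (content_type or "").lower()
    if current is None:
        return detected
    # ISO-8859-1 that the server did not name explicitly is the spec default.
    if current.lower() == "iso-8859-1" and "iso-8859-1" not in header:
        return detected
    return current
```

A server that really does declare ISO-8859-1 in its header keeps that encoding; only the implicit default is overridden.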
@Lukasa But what if we fix it in requests ? ;) |
As we've discussed many times, Requests is following the HTTP specification to the letter. The current behaviour is not wrong. =) |
The fact that it is not helpful for your use case is a whole other story. =) |
Alright, that's enough discussion on this. Thanks for the feedback. |
Updated HTTP 1.1 obsoletes ISO-8859-1 default charset: http://tools.ietf.org/html/rfc7231#appendix-B |
We're already tracking this in #2086. =) |
To whom it may concern, here is a compatibility patch create file
|
Thanks for this! Any idea on how to do it in one line? |
If an HTTP server returns Content-Type: text/* without an encoding, Response.text always decodes it as 'ISO-8859-1' text.
This may be valid per RFC 2616, section 3.7.1, but it is wrong in real life in 2013.
I made an example page with Chinese text:
http://lavr.github.io/python-emails/tests/requests/some-utf8-text.html
All browsers render this page properly.
But requests.get returns invalid text.
And here is a simple test with that URL:
https://gist.github.com/lavr/6572927
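The symptom can be reproduced without any server. The sketch below (Python 3, standard library only, with a two-character sample string chosen for illustration) decodes UTF-8 bytes as ISO-8859-1, which is what happens when the default charset is applied to a UTF-8 page.

```python
# Two Chinese characters, encoded the way the example page serves them.
original = "\u4e2d\u6587"
raw = original.encode("utf-8")

# ISO-8859-1 maps every byte to a character, so this never raises;
# it silently yields one wrong character per byte.
wrong = raw.decode("iso-8859-1")
right = raw.decode("utf-8")
```

Because the wrong decoding succeeds without an error, the only sign of trouble is garbled output.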