Response should not return 'ISO-8859-1' as default encoding #1774

Closed
weiqiyiji opened this issue Dec 3, 2013 · 6 comments

Comments

@weiqiyiji

Hi. When fetching http://lianxu.me/blog/2012/11/14/10-cocoa-objc-newbie-problems/, the code that determines the encoding returns the default 'ISO-8859-1', because the page's Content-Type is text/html rather than text/html; charset=utf-8.

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

The encoding is then 'ISO-8859-1', so text calls unicode(content, 'ISO-8859-1'). The content is actually UTF-8 encoded, so this returns a garbled unicode string on which I cannot call unicode.decode('utf-8').

@property
def text(self):
    """Content of the response, in unicode.

    if Response.encoding is None and chardet module is available, encoding
    will be guessed.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
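To make the mismatch concrete, here is a minimal sketch in Python 3 that does not touch the network. The sample string is an assumption chosen for illustration, not the content of the page above. It shows what decoding UTF-8 bytes as ISO-8859-1 produces, and that the original bytes are still recoverable because ISO-8859-1 maps every byte value to exactly one code point.

```python
# Sample text, chosen only for illustration.
original = "中文 café"

# The bytes a server would send if the page is UTF-8 encoded.
raw = original.encode("utf-8")

# Decoding with the ISO-8859-1 default never fails, but every byte
# becomes its own character, so multi-byte sequences turn into mojibake.
garbled = raw.decode("iso-8859-1")

# Encoding back with ISO-8859-1 restores the exact original bytes,
# which can then be decoded with the correct codec.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(recovered == original)
```

The round trip works only because no bytes were replaced or dropped during the first decode. If the wrong codec had been a lossy one, the text could not be repaired this way.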

Here is code that reproduces it:

import requests

resp = requests.get('http://lianxu.me/blog/2012/11/14/10-cocoa-objc-newbie-problems/')
resp.apparent_encoding # == 'utf-8'
resp.encoding # == 'ISO-8859-1'
resp.content # bytes encoded as 'utf-8'
resp.text # unicode string produced by decoding content as 'ISO-8859-1'

I think requests should return None when no encoding is found; otherwise the resulting text is wrong and the user cannot decode it.
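The behaviour being requested can be sketched as a standalone helper. This is not the requests implementation. The name `charset_from_content_type` is hypothetical, and the sketch uses the standard library's `email.message.Message` to parse the header instead of `cgi.parse_header`. It returns None whenever the header carries no charset parameter, leaving the fallback decision to the caller.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the declared charset in lower case, or None if none is declared.

    Hypothetical helper for illustration only; it never guesses a default.
    """
    if not content_type:
        return None

    # Let the stdlib parse the header value and its parameters.
    msg = Message()
    msg["Content-Type"] = content_type

    # get_content_charset() returns None when there is no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=UTF-8"))
```

A caller that receives None could then fall back to detection, or to ISO-8859-1 if it wants to follow RFC 2616.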

@sigmavirus24
Contributor

I'm 90% sure we just received a similar issue. Let me find it before I give you what I remember as the conclusion.

@sigmavirus24
Contributor

And the issue I was thinking of is still open. I'm closing this to centralize discussion over there with the added request that you please look at the open issues before you open a new one. Look closely and maybe even read some of the issue bodies because we're keeping a bunch of "discussion" issues open as well as already fixed issues.

@Lukasa
Member

Lukasa commented Dec 3, 2013

This issue has been raised many times in the past (please see #1737, #1604, #1589, #1588, #1546; there are others, but this list should be sufficient). The issue @sigmavirus24 is looking for is #1604.

RFC 2616 is very clear here: if no encoding is declared in the Content-Type header, the encoding for text/html is assumed to be ISO-8859-1. If you know better, you are encouraged to either decode Response.content yourself or to set Response.encoding to the relevant encoding.
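Both workarounds can be sketched as follows. The helper name `prefer_detected_encoding` is hypothetical, and the substring check for "charset" is a deliberately crude assumption, not how requests parses headers. The helper works on any object exposing `headers`, `encoding`, and `apparent_encoding`, so a stub stands in for a real Response here to keep the example free of network access.

```python
from types import SimpleNamespace


def prefer_detected_encoding(resp):
    """Override resp.encoding only when the server declared no charset.

    Hypothetical helper: `resp` is expected to look like a requests
    Response (headers, encoding, apparent_encoding attributes).
    """
    content_type = resp.headers.get("content-type", "")
    if "charset" not in content_type.lower():
        # No declared charset: use the detected encoding instead of
        # the ISO-8859-1 default.
        resp.encoding = resp.apparent_encoding
    return resp


# Stub mimicking the response from the report (values are assumptions).
stub = SimpleNamespace(
    headers={"content-type": "text/html"},
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
    content="中文".encode("utf-8"),
)

# Workaround 1: set the encoding before reading the text.
prefer_detected_encoding(stub)

# Workaround 2: decode the raw bytes yourself.
text = stub.content.decode(stub.encoding)
print(text)
```

With a real response, calling `prefer_detected_encoding(resp)` before reading `resp.text` should have the same effect, since `Response.text` decodes with whatever `Response.encoding` is set to at that point.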

@sigmavirus24
Contributor

As usual @Lukasa is 100% correct.

@weiqiyiji
Author

@Lukasa thanks for your explanation! I think not every user knows the details defined in RFC 2616, so could you add a note about this to the documentation for Response.text?

@sigmavirus24
Contributor

Adding to the documentation never hurts. It also doesn't hurt to check the
standards surrounding what web clients are supposed to do before making a
feature request. The standards are well defined and well documented.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021