
Unicode issues in text view #224

Closed

acdha opened this issue Apr 29, 2013 · 6 comments

@acdha

acdha commented Apr 29, 2013

Using the feed http://www.loc.gov/rss/pao/events.xml and the linked article http://www.loc.gov/today/pr/2013/13-081.html, the text view displays the Unicode characters incorrectly (i.e. WaldseemÃ¼ller rather than Waldseemüller - see e.g. http://blog.lumino.so/2012/08/20/fix-unicode-mistakes-with-python/).

The original & story view work as expected.

@jhecking

jhecking commented Apr 6, 2014

I noticed character encoding issues in the text view as well, for a German feed I subscribe to. Quite a number of users complain about this in the forums too, e.g. https://getsatisfaction.com/newsblur/topics/german_umlaute_are_wrong_coded_with_www_spiegel_de. (My issue is with the same feed from www.spiegel.de.)

I was looking around for the code that handles the text view and found PageImporter#fetch_page (https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/page_importer.py#L57). Now I'm no Python expert, but I looked through the documentation for the requests library, and it seems to me that NewsBlur is using the response.text and response.encoding attributes incorrectly:

response = requests.get(feed_link, headers=self.headers)
data = response.text  # decoded to unicode using response.encoding
if response.encoding and response.encoding != 'utf-8':
    data = data.encode(response.encoding)  # re-encoded into the declared charset

response.encoding will return the correct character set of the web site only if the site specifies it in the Content-Type HTTP header. Otherwise response.encoding will return ISO-8859-1 (which is apparently correct according to the relevant RFC). response.text always returns Unicode. If the encoding is specified only in the HTML body, then you have to set response.encoding before reading response.text, or decode response.content yourself. Setting it from response.apparent_encoding seems to be the right way to do so.
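For illustration, here is a minimal sketch of that pattern (not NewsBlur code; url is just a placeholder): only trust response.encoding when the server actually declared a charset, otherwise fall back to apparent_encoding before reading response.text:

response = requests.get(url)
content_type = response.headers.get('content-type', '')
if 'charset' not in content_type.lower():
    # No charset declared in the headers: response.encoding is just the
    # ISO-8859-1 default, so guess the encoding from the body instead.
    response.encoding = response.apparent_encoding
data = response.text  # unicode, decoded with the chosen encoding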

Here are two examples where I think the character decoding goes wrong in slightly different ways:

So in the case of www.spiegel.de, what I think happens is that the web site returns the correct encoding (ISO-8859-1) in the Content-Type header. Therefore response.text is able to decode the content correctly and returns valid Unicode. But then data.encode("ISO-8859-1") is applied, and the content ends up back in ISO-8859-1 encoding.

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> response = requests.get("http://www.spiegel.de/schulspiegel/dresscode-in-us-schule-maedchen-wehren-sich-gegen-leggingsverbot-a-962800.html")
>>> response.encoding
'ISO-8859-1'
>>> data = response.text
>>> type(data)
<type 'unicode'>
>>> data
...<strong>Weil sie die Jungs angeblich ablenken, d\xfcrfen M\xe4dchen an einer Schule in den USA keine Leggings mehr tragen. Die finden das Verbot unfair und wehren sich auf Facebook und mit einer Petition gegen den Dresscode.</strong>...
>>> data = data.encode(response.encoding)
>>> type(data)
<type 'str'>
>>> data
...<strong>Weil sie die Jungs angeblich ablenken, d\xfcrfen M\xe4dchen an einer Schule in den USA keine Leggings mehr tragen. Die finden das Verbot unfair und wehren sich auf Facebook und mit einer Petition gegen den Dresscode.</strong>...

\xfc is the correct Unicode code point for the German umlaut "ü" in the word "dürfen". But after the final encode operation the content is no longer a Unicode string but a byte string (str) in ISO-8859-1 encoding.

In the case of the loc.gov feed that @acdha had (has?) an issue with, the web site does not declare the correct encoding (UTF-8) in the Content-Type header; it declares it only in the HTML body. So response.text is not able to decode the content correctly. Furthermore, data.encode("ISO-8859-1") is then applied to the result as well, which doesn't help either.

>>> response = requests.get("http://www.loc.gov/today/pr/2013/13-081.html")
>>> response.encoding
'ISO-8859-1'
>>> response.apparent_encoding
'utf-8'
>>> data = response.text
>>> type(data)
<type 'unicode'>
>>> data
...<title>Conference on Cartography of Martin Waldseem\xc3\xbcller, May 17-18 | News Releases - Library of Congress</title>...
>>> data = data.encode(response.encoding)
>>> data
<title>Conference on Cartography of Martin Waldseem\xc3\xbcller, May 17-18 | News Releases - Library of Congress</title>
>>> type(data)
<type 'str'>
>>> response.encoding = response.apparent_encoding
>>> data = response.text
>>> data
...<title>Conference on Cartography of Martin Waldseem\xfcller, May 17-18 | News Releases - Library of Congress</title>...

Only when setting the right (i.e. 'apparent') encoding before reading response.text do we get the correct Unicode code point for the umlaut "ü" in the name "Waldseemüller".

This discussion on the requests issue tracker provides some good background on requests.text and requests.encoding: kennethreitz/requests#1737.
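Putting that together, the decoding part of PageImporter#fetch_page might look roughly like this (just a sketch, assuming the goal is to end up with UTF-8 bytes; feed_link and self.headers are kept from the original snippet and I haven't tested this against NewsBlur itself):

response = requests.get(feed_link, headers=self.headers)
if 'charset' not in response.headers.get('content-type', '').lower():
    # Nothing declared in the headers: fall back to detection instead of
    # the ISO-8859-1 default.
    response.encoding = response.apparent_encoding
data = response.text         # unicode, decoded with the chosen encoding
data = data.encode('utf-8')  # byte string in one known encoding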

@jhecking

jhecking commented Apr 6, 2014

Forgot to link to the relevant docs for the requests.Response class:

http://docs.python-requests.org/en/latest/api/#requests.Response

@linnet
Contributor

linnet commented Apr 7, 2014

If more test examples are necessary, the Danish ComputerWorld has similar problems in the Text view with the Danish characters from this feed: http://www.computerworld.dk/rss/all. This feed also does not declare an encoding in its HTTP headers.

@samuelclay
Owner

When I run apparent_encoding on that URL, I get MacCyrillic, and it takes about 1 second to process. So the timing makes apparent_encoding a non-starter, and the fact that it was wrong is also a problem.

I wish I didn't have to encode the data to massage it into utf-8, but that works for a number of other URLs that would otherwise have a mistaken encoding.

@samuelclay
Owner

So that should fix it.

@jhecking

jhecking commented Apr 8, 2014

I think your fix addresses the issue with the spiegel.de feed/web site, but not the issue with the loc.gov feed/site. In the former case the site does return a valid content encoding in the HTTP headers, and response.text will use that information to correctly decode the content into Unicode text. In the latter case there is no content encoding specified in the headers. In that case response.encoding defaults to ISO-8859-1, and response.text will not be able to correctly decode the content. Calling text = text.encode(resp.encoding) will not fix the issue in that case either. Basically, if no charset was declared in the headers, the value of response.encoding is pretty much useless, if I understand the requests documentation correctly.

If apparent_encoding is too slow and/or unreliable, a better approach might be to parse the head of the HTML document to see if the encoding is specified in the meta tags (or the XML declaration for XHTML documents). Again, I'm no Python expert, but a bit of searching turned up a relevant Stack Overflow question about encoding detection libraries in Python and a recommendation for the Beautiful Soup / Unicode, Dammit library.
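Something along these lines could work (just a sketch, assuming the bs4 package is available, and again keeping feed_link and self.headers from the original snippet): Unicode, Dammit gets the raw bytes plus the charset declared in the HTTP headers (if any), checks the meta tags / XML declaration in the document itself, and only then falls back to byte-level guessing:

from bs4 import UnicodeDammit

response = requests.get(feed_link, headers=self.headers)
content_type = response.headers.get('content-type', '').lower()
declared = response.encoding if 'charset' in content_type else None
dammit = UnicodeDammit(response.content,
                       override_encodings=[declared] if declared else [],
                       is_html=True)
data = dammit.unicode_markup           # unicode text
detected = dammit.original_encoding    # the encoding Unicode, Dammit settled on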
