
Unicode issues in text view #224

Closed

acdha opened this issue Apr 29, 2013 · 6 comments

@acdha

acdha commented Apr 29, 2013

Using the feed http://www.loc.gov/rss/pao/events.xml and the linked article http://www.loc.gov/today/pr/2013/13-081.html, the text view displays the Unicode characters incorrectly (i.e. WaldseemÃ¼ller rather than Waldseemüller - see e.g. http://blog.lumino.so/2012/08/20/fix-unicode-mistakes-with-python/).

The original & story view work as expected.

@jhecking

jhecking commented Apr 6, 2014

I noticed character encoding issues in the text view as well, for a German feed I subscribe to. Quite a number of users complain about this in the forums too, e.g. https://getsatisfaction.com/newsblur/topics/german_umlaute_are_wrong_coded_with_www_spiegel_de. (My issue is with the same feed from www.spiegel.de.)

I was looking around for the code that handles the text view and found PageImporter#fetch_page (https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/page_importer.py#L57). Now I'm no Python expert, but I looked through the documentation for the requests library, and it seems to me that NewsBlur is using the response.text and response.encoding attributes incorrectly:

response = requests.get(feed_link, headers=self.headers)
data = response.text  # decoded to unicode using response.encoding
if response.encoding and response.encoding != 'utf-8':
    data = data.encode(response.encoding)  # re-encoded into the declared charset

response.encoding will return the correct character set of the web site only if the site specifies it in the Content-Type HTTP header. Otherwise response.encoding will return ISO-8859-1 (which is apparently correct according to the relevant RFC). response.text always returns Unicode. If the encoding is specified only in the HTML body, then you have to set response.encoding before reading response.text, or decode response.content yourself. Setting it from response.apparent_encoding seems to be the right way to do so.
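For illustration, here is a minimal sketch of that pattern (not NewsBlur code; url is just a placeholder): only trust response.encoding when the server actually declared a charset, otherwise fall back to apparent_encoding before reading response.text:

response = requests.get(url)
content_type = response.headers.get('content-type', '')
if 'charset' not in content_type.lower():
    # No charset declared in the headers: response.encoding is just the
    # ISO-8859-1 default, so guess the encoding from the body instead.
    response.encoding = response.apparent_encoding
data = response.text  # unicode, decoded with the chosen encoding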

Here are two examples where I think the character decoding goes wrong in slightly different ways:

So in the case of www.spiegel.de, what I think happens is that the web site returns the correct encoding (ISO-8859-1) in the Content-Type header. Therefore response.text is able to decode the content correctly and returns valid Unicode. But then data.encode("ISO-8859-1") is applied, and the content ends up back in ISO-8859-1 encoding.

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> response = requests.get("http://www.spiegel.de/schulspiegel/dresscode-in-us-schule-maedchen-wehren-sich-gegen-leggingsverbot-a-962800.html")
>>> response.encoding
'ISO-8859-1'
>>> data = response.text
>>> type(data)
<type 'unicode'>
>>> data
...<strong>Weil sie die Jungs angeblich ablenken, d\xfcrfen M\xe4dchen an einer Schule in den USA keine Leggings mehr tragen. Die finden das Verbot unfair und wehren sich auf Facebook und mit einer Petition gegen den Dresscode.</strong>...
>>> data = data.encode(response.encoding)
>>> type(data)
<type 'str'>
>>> data
...<strong>Weil sie die Jungs angeblich ablenken, d\xfcrfen M\xe4dchen an einer Schule in den USA keine Leggings mehr tragen. Die finden das Verbot unfair und wehren sich auf Facebook und mit einer Petition gegen den Dresscode.</strong>...

\xfc is the correct Unicode code point for the German umlaut "ü" in the word "dürfen". But after the final encode operation the content is no longer a Unicode string but a byte string (str) in ISO-8859-1 encoding.

In the case of the loc.gov feed that @acdha had (has?) an issue with, the web site does not declare the correct encoding (UTF-8) in the Content-Type header; it declares it only in the HTML body. So response.text is not able to decode the content correctly. Furthermore, data.encode("ISO-8859-1") is then applied to the result as well, which doesn't help either.

>>> response = requests.get("http://www.loc.gov/today/pr/2013/13-081.html")
>>> response.encoding
'ISO-8859-1'
>>> response.apparent_encoding
'utf-8'
>>> data = response.text
>>> type(data)
<type 'unicode'>
>>> data
...<title>Conference on Cartography of Martin Waldseem\xc3\xbcller, May 17-18 | News Releases - Library of Congress</title>...
>>> data = data.encode(response.encoding)
>>> data
<title>Conference on Cartography of Martin Waldseem\xc3\xbcller, May 17-18 | News Releases - Library of Congress</title>
>>> type(data)
<type 'str'>
>>> response.encoding = response.apparent_encoding
>>> data = response.text
>>> data
...<title>Conference on Cartography of Martin Waldseem\xfcller, May 17-18 | News Releases - Library of Congress</title>...

Only when setting the right (i.e. 'apparent') encoding before reading response.text do we get the correct Unicode code point for the umlaut "ü" in the name "Waldseemüller".

This discussion on the requests issue tracker provides some good background on requests.text and requests.encoding: kennethreitz/requests#1737.
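Putting that together, the decoding part of PageImporter#fetch_page might look roughly like this (just a sketch, assuming the goal is to end up with UTF-8 bytes; feed_link and self.headers are kept from the original snippet and I haven't tested this against NewsBlur itself):

response = requests.get(feed_link, headers=self.headers)
if 'charset' not in response.headers.get('content-type', '').lower():
    # Nothing declared in the headers: fall back to detection instead of
    # the ISO-8859-1 default.
    response.encoding = response.apparent_encoding
data = response.text         # unicode, decoded with the chosen encoding
data = data.encode('utf-8')  # byte string in one known encoding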

@jhecking

jhecking commented Apr 6, 2014

Forgot to link to the relevant docs for the requests.Response class:

http://docs.python-requests.org/en/latest/api/#requests.Response

@linnet
Contributor

linnet commented Apr 7, 2014

If more test examples are necessary, the Danish ComputerWorld has similar problems in the Text view with the Danish characters from this feed: http://www.computerworld.dk/rss/all. This feed also does not declare an encoding in its HTTP headers.

@samuelclay
Owner

When I run apparent_encoding on that URL, I get MacCyrillic, and it takes about 1 second to process. So the timing makes apparent_encoding a non-starter, and the fact that it was wrong is also a problem.

I wish I didn't have to encode the data to massage it into utf-8, but that works for a number of other URLs that would otherwise have a mistaken encoding.

@samuelclay
Owner

So that should fix it.

@jhecking

jhecking commented Apr 8, 2014

I think your fix addresses the issue with the spiegel.de feed/web site, but not the issue with the loc.gov feed/site. In the former case the site does return a valid content encoding in the HTTP headers, and response.text will use that information to correctly decode the content into Unicode text. In the latter case there is no content encoding specified in the headers. In that case response.encoding defaults to ISO-8859-1, and response.text will not be able to correctly decode the content. Calling text = text.encode(resp.encoding) will not fix the issue in that case either. Basically, if no charset was declared in the headers, the value of response.encoding is pretty much useless, if I understand the requests documentation correctly.

If apparent_encoding is too slow and/or unreliable, a better approach might be to parse the head of the HTML document to see if the encoding is specified in the meta tags (or the XML declaration for XHTML documents). Again, I'm no Python expert, but a bit of searching turned up a relevant Stack Overflow question about encoding detection libraries in Python and a recommendation for the Beautiful Soup / Unicode, Dammit library.
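Something along these lines could work (just a sketch, assuming the bs4 package is available, and again keeping feed_link and self.headers from the original snippet): Unicode, Dammit gets the raw bytes plus the charset declared in the HTTP headers (if any), checks the meta tags / XML declaration in the document itself, and only then falls back to byte-level guessing:

from bs4 import UnicodeDammit

response = requests.get(feed_link, headers=self.headers)
content_type = response.headers.get('content-type', '').lower()
declared = response.encoding if 'charset' in content_type else None
dammit = UnicodeDammit(response.content,
                       override_encodings=[declared] if declared else [],
                       is_html=True)
data = dammit.unicode_markup           # unicode text
detected = dammit.original_encoding    # the encoding Unicode, Dammit settled on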
