Unicode issues in text view #224
I noticed character encoding issues in the text view as well for a German feed I subscribe to. Quite a number of users complain about this in the forums as well, e.g. https://getsatisfaction.com/newsblur/topics/german_umlaute_are_wrong_coded_with_www_spiegel_de. (My issue is with the same feed from www.spiegel.de.) I was looking around for the code that handles the text views and found [PageImporter#fetch_page](https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/page_importer.py#L57). Now I'm no Python expert, but I looked through the documentation for the
Here are two examples where I think the character decoding goes wrong in slightly different ways:
In the case of www.spiegel.de, what I think happens is that the web site returns the correct encoding (ISO-8859-1) in the Content-Type header. Therefore
$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> response = requests.get("http://www.spiegel.de/schulspiegel/dresscode-in-us-schule-maedchen-wehren-sich-gegen-leggingsverbot-a-962800.html")
>>> response.encoding
'ISO-8859-1'
>>> data = response.text
>>> type(data)
<type 'unicode'>
>>> data
...<strong>Weil sie die Jungs angeblich ablenken, d\xfcrfen M\xe4dchen an einer Schule in den USA keine Leggings mehr tragen. Die finden das Verbot unfair und wehren sich auf Facebook und mit einer Petition gegen den Dresscode.</strong>...
>>> data = data.encode(response.encoding)
>>> type(data)
<type 'str'>
>>> data
...<strong>Weil sie die Jungs angeblich ablenken, d\xfcrfen M\xe4dchen an einer Schule in den USA keine Leggings mehr tragen. Die finden das Verbot unfair und wehren sich auf Facebook und mit einer Petition gegen den Dresscode.</strong>...
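To make the round trip in the session above concrete without hitting the network, here is a minimal Python 3 sketch (the session itself is Python 2.7). The sample string and the `is_valid_utf8` helper are made up for illustration, and the assumption is that whatever consumes `data` afterwards expects UTF-8.

```python
# Stand-in for the bytes spiegel.de sends (assumed sample, not the real page).
raw = "dürfen Mädchen".encode("iso-8859-1")

# Decoding with the header charset, roughly what response.text does.
text = raw.decode("iso-8859-1")

# Re-encoding with the same charset, as data.encode(response.encoding) does.
round_tripped = text.encode("iso-8859-1")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The round trip only reproduces the original ISO-8859-1 bytes.
latin1_is_utf8 = is_valid_utf8(round_tripped)

# Encoding the decoded text as UTF-8 is what a UTF-8 consumer would need.
utf8_bytes = text.encode("utf-8")
```

If that assumption holds, the bytes coming out of the round trip are exactly what went in, so any later step that treats them as UTF-8 will choke on or mangle the umlauts.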
In the case of the loc.gov feed that @acdha had (has?) an issue with, the web site does not return the correct encoding (UTF-8) in the Content-Type header. Instead it returns it only in the HTML body. So
>>> response = requests.get("http://www.loc.gov/today/pr/2013/13-081.html")
>>> response.encoding
'ISO-8859-1'
>>> response.apparent_encoding
'utf-8'
>>> data = response.text
>>> type(data)
<type 'unicode'>
>>> data
...<title>Conference on Cartography of Martin Waldseem\xc3\xbcller, May 17-18 | News Releases - Library of Congress</title>...
>>> data = data.encode(response.encoding)
>>> data
<title>Conference on Cartography of Martin Waldseem\xc3\xbcller, May 17-18 | News Releases - Library of Congress</title>
>>> type(data)
<type 'str'>
>>> response.encoding = response.apparent_encoding
>>> data = response.text
>>> data
...<title>Conference on Cartography of Martin Waldseem\xfcller, May 17-18 | News Releases - Library of Congress</title>...
Only when setting the right (i.e. 'apparent') encoding before calling
This discussion on the requests issue tracker provides some good background on
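The same mismatch can be reproduced offline. This is a minimal Python 3 sketch, assuming the body really is UTF-8 while the encoding taken from the headers is ISO-8859-1, as in the session above. The sample string stands in for the real page.

```python
# Stand-in for the loc.gov body: UTF-8 bytes (assumed sample).
raw = "Waldseemüller".encode("utf-8")

# Decoding with the header-derived encoding yields two wrong characters
# (U+00C3 and U+00BC) in place of the single u-umlaut.
wrong = raw.decode("iso-8859-1")

# Encoding that wrong text with the same encoding just gives the raw bytes
# back, which is why the str output in the session looks unchanged.
re_encoded = wrong.encode("iso-8859-1")

# Decoding with the 'apparent' encoding gives the intended text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```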
Forgot to link to the relevant docs for the http://docs.python-requests.org/en/latest/api/#requests.Response
If more test examples are necessary, the Danish ComputerWorld has similar problems in the Text view with the Danish characters from this feed: http://www.computerworld.dk/rss/all. This feed also does not return an HTTP encoding header.
When I run I wish I didn't have to encode the data to massage it into utf-8, but that works for a number of other URLs that don't have a mistaken encoding.
So that should fix it.
I think your fix addresses the issue with the spiegel.de feed/web site but not the issue with the loc.gov feed/site. In the former case the site does return a valid content encoding in the HTTP headers and If
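One way to cover both cases, as a rough Python 3 sketch rather than what NewsBlur or requests actually do: trust a charset that the Content-Type header states explicitly, and otherwise try strict UTF-8 before falling back to ISO-8859-1. `charset_from_content_type` and `decode_body` are hypothetical names, and the strict UTF-8 attempt is only a stand-in for the detection behind `apparent_encoding`, not the same thing.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body, preferring an explicitly declared charset."""
    charset = charset_from_content_type(content_type)
    if charset:
        try:
            return body.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset: fall through to guessing
    try:
        # Strict UTF-8 rarely succeeds by accident on non-UTF-8 input.
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte, so this never raises.
        return body.decode("iso-8859-1")


# spiegel.de-like case: charset declared in the header.
print(decode_body("dürfen".encode("iso-8859-1"), "text/html; charset=ISO-8859-1"))
# loc.gov-like case: no charset in the header, body is UTF-8.
print(decode_body("Waldseemüller".encode("utf-8"), "text/html"))
```

A header that declares the wrong charset would still win here, so this only helps when the header is either right or silent.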
Using the feed http://www.loc.gov/rss/pao/events.xml and the linked article http://www.loc.gov/today/pr/2013/13-081.html, the text view displays the Unicode characters incorrectly (i.e. WaldseemÃ¼ller rather than Waldseemüller; see e.g. http://blog.lumino.so/2012/08/20/fix-unicode-mistakes-with-python/). The original & story views work as expected.
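For text that has already been decoded the wrong way, the bad decode can sometimes be reversed. Below is a minimal Python 3 sketch of that kind of repair, assuming the text was UTF-8 that got decoded as ISO-8859-1. `fix_mojibake` is a hypothetical helper, not something from NewsBlur.

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-ISO-8859-1 mistake; return the input if that fails."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct): leave it alone.
        return text


garbled = "Waldseem\u00c3\u00bcller"  # how the text view shows the name
print(fix_mojibake(garbled))
print(fix_mojibake("Waldseemüller"))  # correct text passes through unchanged
```

This is a repair after the fact; decoding the page with the right encoding in the first place avoids the need for it.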