UnicodeDecodeError when crawling pages #20

Open
miso-belica opened this issue Jun 10, 2016 · 0 comments
miso-belica commented Jun 10, 2016

I've installed JusText on a Windows 2012 Server machine and it seems to be running fine overall. However, about 30-40% of the HTML files crash because of encoding issues. The error message I get is:

File "c:\python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 20410: character maps to <undefined>

and then the TXT output for the HTML file that I'm trying to process with JusText is empty.
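
The same error can be reproduced without JusText, which may help narrow things down. This is only a minimal sketch, and it assumes the affected pages are UTF-8 encoded (the traceback alone does not confirm that). Byte 0x81 has no mapping in Python's cp1252 codec, but it is a valid continuation byte in UTF-8, for example as the second byte of "Á":

```python
# Assumption: the input is UTF-8. The sample character is illustrative only.
sample = "Á".encode("utf-8")  # two bytes: 0xC3 0x81

try:
    # cp1252 leaves byte 0x81 undefined, so this raises UnicodeDecodeError.
    sample.decode("cp1252")
    cp1252_failed = False
    message = ""
except UnicodeDecodeError as exc:
    cp1252_failed = True
    message = str(exc)

# Decoding the same bytes as UTF-8 succeeds.
decoded = sample.decode("utf-8")
```

If that assumption holds, the traceback suggests the file is being decoded with the Windows default code page (cp1252) before the requested encoding is applied.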

An example of a page that causes the crash: http://www.democracynow.org/2012/7/6/peru_declares_state_of_emergency_as (byte position 20410, the word GONZÁLEZ). I've saved a copy of the file that I'm trying to process with JusText at:

I've tried every possible combination of

--encoding=...
--enc-force
--enc-errors=...

as well as every possible encoding on the files, and it's still crashing on these files. Any suggestions?
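
One possible workaround, sketched here under assumptions rather than as a confirmed fix: read the file as raw bytes and decode it explicitly, so that the platform default encoding (cp1252 on this machine) is never used. The helper name `read_html` and its defaults are hypothetical and not part of JusText.

```python
from pathlib import Path


def read_html(path, encoding="utf-8", errors="replace"):
    """Hypothetical helper: read raw bytes and decode them explicitly.

    Reading in binary mode bypasses the locale default encoding.
    With errors="replace", undecodable bytes become U+FFFD instead
    of raising UnicodeDecodeError.
    """
    raw = Path(path).read_bytes()
    return raw.decode(encoding, errors=errors)
```

The decoded string could then be handed on for processing. Whether the command-line tool itself opens files in text mode is something I have not verified.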

Thanks so much for your help.

Mark Davies, mark_davies (at) byu.edu
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **

miso-belica self-assigned this Jun 10, 2016