Crash with URLs containing non-ASCII characters #6

Open
Arduous opened this issue Jul 5, 2018 · 21 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@Arduous
Contributor

Arduous commented Jul 5, 2018

My interpreter, as seen below, is Python 2.7.
Here are the last URLs visited and the traceback of the exception:

INFO:root:Visiting http://www.kopimi.com/index.html
INFO:root:Visiting https://en.wikipedia.org/wiki/Piratbyr�n
Traceback (most recent call last):
 File "noisy.py", line 258, in <module>
   main()
 File "noisy.py", line 254, in main
   crawler.crawl()
 File "noisy.py", line 225, in crawl
   self._browse_from_links()
 File "noisy.py", line 166, in _browse_from_links
   self._browse_from_links(depth + 1)
 File "noisy.py", line 166, in _browse_from_links
   self._browse_from_links(depth + 1)
 File "noisy.py", line 147, in _browse_from_links
   sub_page = self._request(random_link).content
 File "noisy.py", line 41, in _request
   response = requests.get(url, headers=headers, timeout=5)
 File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
   return request('get', url, params=params, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
   return session.request(method=method, url=url, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 498, in request
   prep = self.prepare_request(req)
 File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 441, in prepare_request
   hooks=merge_hooks(request.hooks, self.hooks),
 File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 309, in prepare
   self.prepare_url(url, params)
 File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 359, in prepare_url
   url = url.decode('utf8')
 File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
   return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 38: unexpected end of data

The correct link is https://en.wikipedia.org/wiki/Piratbyr%C3%A5n

I was not able to confirm, but I think that the problem is coming from:

noisy/noisy.py, line 19 at commit ae70264:

sys.setdefaultencoding('latin-1')

where latin-1 is mandated. Wouldn't a standard utf-8 approach work better?

Adding

# -*- coding: utf-8 -*-

at the top of noisy.py should do the trick. It would tell Python 2 to work with UTF-8 and be transparent to Python 3.
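
For what it's worth, an alternative that avoids touching the interpreter-wide default encoding would be to percent-encode non-ASCII characters in each link before handing it to requests. A minimal sketch (the sanitize_url helper below is hypothetical, not part of noisy.py):

# Hypothetical sketch, not part of noisy.py: percent-encode non-ASCII
# characters in a link instead of changing the default encoding.
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def sanitize_url(url):
    # Decode raw bytes first, then re-encode as UTF-8 and percent-encode
    # everything that is not already legal in a URL.
    if isinstance(url, bytes):
        url = url.decode('utf-8', errors='replace')
    return quote(url.encode('utf-8'), safe="%/:=&?~#+!$,;'@()*[]")

print(sanitize_url(u"https://en.wikipedia.org/wiki/Piratbyr\u00e5n"))
# -> https://en.wikipedia.org/wiki/Piratbyr%C3%A5n

This would keep the Wikipedia link from the traceback above reachable, at the cost of one extra pass over every URL.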

@fanch317

Reproduced by crawling the URL https://en.wikipedia.org/wiki/Svenska_Antipiratbyrån

INFO:root:Visiting https://en.wikipedia.org/wiki/Svenska_Antipiratbyr
Traceback (most recent call last):
  File "noisy.py", line 265, in <module>
    main()
  File "noisy.py", line 261, in main
    crawler.crawl()
  File "noisy.py", line 232, in crawl
    self._browse_from_links()
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 154, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/sessions.py", line 498, in request
    prep = self.prepare_request(req)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/sessions.py", line 441, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/models.py", line 309, in prepare
    self.prepare_url(url, params)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/models.py", line 359, in prepare_url
    url = url.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 50: unexpected end of data

I'm on the latest master commit, 0a0d50f.

@1tayH 1tayH added bug Something isn't working help wanted Extra attention is needed labels Jul 19, 2018
@Random1984

Random1984 commented Jul 20, 2018

I added an "except UnicodeDecodeError" in _browse_from_links with some messages to Debug and Info logs and now it's running to check if I get some info.

@Arduous
Contributor Author

Arduous commented Jul 20, 2018

I dug further into this.

The problem is not the encoding of the file noisy.py itself (set with # -*- coding: utf-8 -*- at the beginning of the file, for instance, as good practice) but the default or detected encoding of the fetched pages.

The page at http://www.kopimi.com/ is in fact encoded in cp1252, as stated in its header:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <script type="text/javascript" src="/static/js/analytics.js" ></script>

The requests module is aware of encodings, but for some reason it fails to identify the page's encoding correctly:

>>> import requests
>>> test = requests.get("http://www.kopimi.com")
>>> test.encoding
'ISO-8859-1'
>>> test.apparent_encoding
'ISO-8859-1'

I guess a failure is expected when the content of the page is decoded from its falsely identified ISO-8859-1 encoding to Unicode prior to further transformation/analysis.

My suggestion is to detect decoding errors on links in _is_blacklisted (sketched below); a diff file is attached. The drawback is that some links will be dropped, but it might be an acceptable trade-off against a much more complex, encoding-resilient parser.

For the record, a clean guard around the decoding would be something like url = unicode(url.decode(errors="replace")), but at the end of the day the link will likely not work, and the call would itself need guarding since unicode() was dropped in Python 3.

For Python 3, I do not know whether the parsing handles this better or the conversion simply fails silently.
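
To make the filtering idea concrete, here is a rough sketch, assuming _is_blacklisted (or a helper called from it) sees every candidate link; the helper name is illustrative and this is not the attached diff:

import logging

# Illustrative helper, not the attached diff: a link whose bytes do not
# decode as UTF-8 is treated as blacklisted, so the crawler drops it
# instead of crashing later inside requests when preparing the URL.
def is_undecodable(link):
    if isinstance(link, bytes):
        try:
            link.decode('utf-8')
        except UnicodeDecodeError:
            logging.debug("Dropping undecodable link: %r", link)
            return True
    return False

# _is_blacklisted() would call this first and short-circuit to True,
# before running its usual blacklist comparison.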

@Ekultek

Ekultek commented Jul 21, 2018

Why not just skip URLs that cause errors? Since this program is designed to make noise in the HTTP world, it seems pointless to fix issues like this rather than just skipping to the next URL and continuing with the sequence.

@Arduous
Contributor Author

Arduous commented Jul 21, 2018

Dropping the troublesome links is indeed the approach taken in #10.

@fanch317

It seems to have worked for me since the last PR, #10.
Thanks for your contribution.

@Arduous
Contributor Author

Arduous commented Aug 12, 2018

My pleasure. It seems that I messed up the associated pull request(s). I'll resubmit one with the proper fix.

@wxdczd

wxdczd commented Sep 25, 2018

INFO:root:Visiting http://lady.163.com/special/photo-search/#q=陈学冬
Traceback (most recent call last):
  File "noisy.py", line 274, in <module>
    main()
  File "noisy.py", line 270, in main
    crawler.crawl()
  File "noisy.py", line 238, in crawl
    self._browse_from_links()
  File "noisy.py", line 179, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 160, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 49, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 454, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 388, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 293, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 332, in prepare_url
    url = url.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb3 in position 44: invalid start byte

@Arduous
Contributor Author

Arduous commented Sep 26, 2018

@wxdczd was your code manually patched with https://github.com/1tayH/noisy/pull/16/files or is it the vanilla head from https://github.com/1tayH/noisy ?

@wxdczd

wxdczd commented Sep 26, 2018

I am using the original: https://github.com/1tayH/noisy

@Arduous
Contributor Author

Arduous commented Oct 1, 2018

@wxdczd OK, so it is very likely that the issue you encountered is addressed by the proposed pull request. If it is urgent, you might want to give the master branch of my fork a try.

@wxdczd

wxdczd commented Oct 1, 2018

@Arduous Again, it crashed.

@Arduous
Contributor Author

Arduous commented Oct 1, 2018

@wxdczd
With the code at https://github.com/1tayH/noisy the crash is expected and documented.
If you encounter an exception on my branch or with the patched code, that is another story; logs would be welcome.

@wxdczd

wxdczd commented Oct 3, 2018

INFO:root:Visiting http://ent.163.com/photoview/00AJ0003/659065.html
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): ent.163.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /photoview/00AJ0003/659065.html HTTP/1.1" 301 0
INFO:requests.packages.urllib3.connectionpool:Resetting dropped connection: ent.163.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /photoview/00AJ0003/659065_mobile.html HTTP/1.1" 200 None
INFO:root:Visiting http://ent.163.com/special/photo-search/#q=??°?
Traceback (most recent call last):
  File "noisy.py", line 273, in <module>
    main()
  File "noisy.py", line 269, in main
    crawler.crawl()
  File "noisy.py", line 240, in crawl
    self._browse_from_links()
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 162, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 454, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 388, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 293, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 332, in prepare_url
    url = url.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd5 in position 43: invalid continuation byte

@DarknessStorm

DarknessStorm commented Feb 1, 2019

Greetings. I'm using the patched code from https://github.com/1tayH/noisy/pull/16/files (with Python 2.7; don't mind the paths in the traceback) and getting this:

INFO:root:Visiting https://www.guncelakademi.com/etiket/DЎnem
Traceback (most recent call last):
  File "noisy.py", line 273, in <module>
    main()
  File "noisy.py", line 269, in main
    crawler.crawl()
  File "noisy.py", line 240, in crawl
    self._browse_from_links()
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 162, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 75, in get
    return request('get', url, params=params, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 519, in request
    prep = self.prepare_request(req)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 313, in prepare
    self.prepare_url(url, params)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 363, in prepare_url
    url = url.decode('utf8')
  File "E:\Soft\python-3.6.0-embed-amd64\lib\encodings\utf_8.py", line 16, in de
code
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 38: invalid
start byte

@Arduous
Contributor Author

Arduous commented Feb 2, 2019

Hello, I see that the target webpage is not declaring its encoding properly. Unfortunately I am not able to reproduce. Could you indicate the version of the "requests" library by running "python -m requests.help"? The error masking was recently improved, and any exception from "requests" should get caught at line 177.

@DarknessStorm

DarknessStorm commented Feb 3, 2019

Greetings.
it's

{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "2.7.15"
  },
  "platform": {
    "release": "7",
    "system": "Windows"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.21.0"
  },
  "system_ssl": {
    "version": "100020ff"
  },
  "urllib3": {
    "version": "1.24.1"
  },
  "using_pyopenssl": false
}

@Arduous
Contributor Author

Arduous commented Feb 3, 2019

Thank you. It seems I should have asked you to run "python3 -m requests.help" so that the correct interpreter is used, but the error might not come from there, as your Python 2 install seems up to date.

I am unfortunately not able to reproduce. Are you able to reproduce the issue easily, or did it happen randomly? In the former case, could you share your config.json?

What is the output of
python3 -c "import requests; print(requests.get('https://www.guncelakademi.com/etiket/DЎnem').status_code)"
a 200 status code or an exception?

@DarknessStorm

DarknessStorm commented Feb 4, 2019

Seems I can't reproduce it easily, but here is another one:

INFO:root:Visiting https://www.proektor78.ru/╧/
Traceback (most recent call last):
  File "noisy.py", line 273, in <module>
    main()
  File "noisy.py", line 269, in main
    crawler.crawl()
  File "noisy.py", line 240, in crawl
    self._browse_from_links()
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 162, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 75, in get
    return request('get', url, params=params, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 519, in request
    prep = self.prepare_request(req)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 313, in prepare
    self.prepare_url(url, params)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 363, in prepare_url
    url = url.decode('utf8')
  File "E:\Soft\python-3.6.0-embed-amd64\lib\encodings\utf_8.py", line 16, in de
code
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcf in position 26: invalid
continuation byte

This is my config (it has 1 million links, so the packed archive is about 6 MB): https://sharemods.com/2xa67zpteow2/config.7z.html (6.2 MB)

@Arduous
Contributor Author

Arduous commented Feb 10, 2019

Thanks, I'll be back in a few days, but in the meanwhile...
I think this exception might, or even should, be caught and addressed, or masked, by requests itself. I still cannot reproduce the error with noisy or directly through requests; the latter would be needed to report the issue to them efficiently.
You might want this workaround: replace the line at noisy.py:177 from
except requests.exceptions.RequestException:
to
except (requests.exceptions.RequestException, UnicodeDecodeError):
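
For context, here is a minimal standalone sketch of what the widened except clause does; the fetch helper below is illustrative and not the actual code around noisy.py:177:

import logging
import requests

# Illustrative sketch of the workaround: a UnicodeDecodeError raised while
# requests prepares the URL is treated like any other failed request, and
# the offending link is simply skipped.
def fetch(url):
    try:
        return requests.get(url, timeout=5).content
    except (requests.exceptions.RequestException, UnicodeDecodeError):
        logging.info("Skipping problematic URL: %r", url)
        return None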

@DarknessStorm

It has been working non-stop for 4 days with no errors.
Thanks :)
I'll post if there is another crash.
