Crash with URLs containing non-ASCII characters #6

Open
Arduous opened this issue Jul 5, 2018 · 21 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@Arduous
Contributor

Arduous commented Jul 5, 2018

My interpreter, as seen below, is Python 2.7.
Here are the last URLs visited and the traceback of the exception:

INFO:root:Visiting http://www.kopimi.com/index.html
INFO:root:Visiting https://en.wikipedia.org/wiki/Piratbyr�n
Traceback (most recent call last):
 File "noisy.py", line 258, in <module>
   main()
 File "noisy.py", line 254, in main
   crawler.crawl()
 File "noisy.py", line 225, in crawl
   self._browse_from_links()
 File "noisy.py", line 166, in _browse_from_links
   self._browse_from_links(depth + 1)
 File "noisy.py", line 166, in _browse_from_links
   self._browse_from_links(depth + 1)
 File "noisy.py", line 147, in _browse_from_links
   sub_page = self._request(random_link).content
 File "noisy.py", line 41, in _request
   response = requests.get(url, headers=headers, timeout=5)
 File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
   return request('get', url, params=params, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
   return session.request(method=method, url=url, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 498, in request
   prep = self.prepare_request(req)
 File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 441, in prepare_request
   hooks=merge_hooks(request.hooks, self.hooks),
 File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 309, in prepare
   self.prepare_url(url, params)
 File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 359, in prepare_url
   url = url.decode('utf8')
 File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
   return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 38: unexpected end of data

The correct link is https://en.wikipedia.org/wiki/Piratbyr%C3%A5n

I was not able to confirm, but I think that the problem is coming from:

noisy/noisy.py, line 19 at commit ae70264:

sys.setdefaultencoding('latin-1')

where latin-1 is mandated. Wouldn't a standard utf-8 approach work better?

Adding

# -*- coding: utf-8 -*-

at the top of noisy.py should do the trick. It would tell Python 2 to work with UTF-8 and be transparent to Python 3.
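
For what it's worth, an alternative that avoids touching the interpreter-wide default encoding would be to percent-encode non-ASCII characters in each link before handing it to requests. A minimal sketch (the sanitize_url helper below is hypothetical, not part of noisy.py):

# Hypothetical sketch, not part of noisy.py: percent-encode non-ASCII
# characters in a link instead of changing the default encoding.
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def sanitize_url(url):
    # Decode raw bytes first, then re-encode as UTF-8 and percent-encode
    # everything that is not already legal in a URL.
    if isinstance(url, bytes):
        url = url.decode('utf-8', errors='replace')
    return quote(url.encode('utf-8'), safe="%/:=&?~#+!$,;'@()*[]")

print(sanitize_url(u"https://en.wikipedia.org/wiki/Piratbyr\u00e5n"))
# -> https://en.wikipedia.org/wiki/Piratbyr%C3%A5n

This would keep the Wikipedia link from the traceback above reachable, at the cost of one extra pass over every URL.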

@fanch317

Reproduced by crawling the URL https://en.wikipedia.org/wiki/Svenska_Antipiratbyrån

INFO:root:Visiting https://en.wikipedia.org/wiki/Svenska_Antipiratbyr
Traceback (most recent call last):
  File "noisy.py", line 265, in <module>
    main()
  File "noisy.py", line 261, in main
    crawler.crawl()
  File "noisy.py", line 232, in crawl
    self._browse_from_links()
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 173, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 154, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/sessions.py", line 498, in request
    prep = self.prepare_request(req)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/sessions.py", line 441, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/models.py", line 309, in prepare
    self.prepare_url(url, params)
  File "/home/noisy/.local/lib/python2.7/site-packages/requests/models.py", line 359, in prepare_url
    url = url.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 50: unexpected end of data

I'm on the latest master commit, 0a0d50f.

@1tayH 1tayH added bug Something isn't working help wanted Extra attention is needed labels Jul 19, 2018
@Random1984

Random1984 commented Jul 20, 2018

I added an "except UnicodeDecodeError" in _browse_from_links with some messages to Debug and Info logs and now it's running to check if I get some info.

@Arduous
Contributor Author

Arduous commented Jul 20, 2018

I dug further into this.

The problem is not the encoding of the file noisy.py itself (set with # -*- coding: utf-8 -*- at the beginning of the file, for instance, as good practice) but the default or detected encoding of the fetched pages.

The page at http://www.kopimi.com/ is in fact encoded in cp1252, as stated in its header:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <script type="text/javascript" src="/static/js/analytics.js" ></script>

The requests module is aware of encodings, but for some reason it fails to identify the page's encoding correctly:

>>> import requests
>>> test = requests.get("http://www.kopimi.com")
>>> test.encoding
'ISO-8859-1'
>>> test.apparent_encoding
'ISO-8859-1'

I guess a failure is expected when the content of the page is decoded from its falsely identified ISO-8859-1 encoding to Unicode prior to further transformation/analysis.

My suggestion is to detect decoding errors on links in _is_blacklisted (sketched below); a diff file is attached. The drawback is that some links will be dropped, but it might be an acceptable trade-off against a much more complex, encoding-resilient parser.

For the record, a clean guard around the decoding would be something like url = unicode(url.decode(errors="replace")), but at the end of the day the link will likely not work, and the call would itself need guarding since unicode() was dropped in Python 3.

For Python 3, I do not know whether the parsing handles this better or the conversion simply fails silently.
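
To make the filtering idea concrete, here is a rough sketch, assuming _is_blacklisted (or a helper called from it) sees every candidate link; the helper name is illustrative and this is not the attached diff:

import logging

# Illustrative helper, not the attached diff: a link whose bytes do not
# decode as UTF-8 is treated as blacklisted, so the crawler drops it
# instead of crashing later inside requests when preparing the URL.
def is_undecodable(link):
    if isinstance(link, bytes):
        try:
            link.decode('utf-8')
        except UnicodeDecodeError:
            logging.debug("Dropping undecodable link: %r", link)
            return True
    return False

# _is_blacklisted() would call this first and short-circuit to True,
# before running its usual blacklist comparison.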

@Ekultek

Ekultek commented Jul 21, 2018

Why not just skip URLs that cause errors? Since this program is designed to make noise in the HTTP world, it seems pointless to fix issues like this rather than just skipping to the next URL and continuing with the sequence.

@Arduous
Contributor Author

Arduous commented Jul 21, 2018

Dropping the troublesome links is indeed the approach taken in #10.

@fanch317

It seems to have worked for me since the last PR, #10.
Thanks for your contribution.

@Arduous
Contributor Author

Arduous commented Aug 12, 2018

My pleasure. It seems that I messed up the associated pull request(s). I'll resubmit one with the proper fix.

@wxdczd

wxdczd commented Sep 25, 2018

INFO:root:Visiting http://lady.163.com/special/photo-search/#q=陈学冬
Traceback (most recent call last):
  File "noisy.py", line 274, in <module>
    main()
  File "noisy.py", line 270, in main
    crawler.crawl()
  File "noisy.py", line 238, in crawl
    self._browse_from_links()
  File "noisy.py", line 179, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 160, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 49, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 454, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 388, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 293, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 332, in prepare_url
    url = url.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb3 in position 44: invalid start byte

@Arduous
Contributor Author

Arduous commented Sep 26, 2018

@wxdczd was your code manually patched with https://github.com/1tayH/noisy/pull/16/files or is it the vanilla head from https://github.com/1tayH/noisy ?

@wxdczd

wxdczd commented Sep 26, 2018

I am using the original: https://github.com/1tayH/noisy

@Arduous
Contributor Author

Arduous commented Oct 1, 2018

@wxdczd OK, so it is very likely that the issue you encountered is addressed by the proposed pull request. If it is urgent, you might want to give the master branch of my fork a try.

@wxdczd

wxdczd commented Oct 1, 2018

@Arduous Again, it crashed.

@Arduous
Contributor Author

Arduous commented Oct 1, 2018

@wxdczd
With the code at https://github.com/1tayH/noisy the crash is expected and documented.
If you encounter an exception on my branch or with the patched code, that is another story; logs would be welcome.

@wxdczd

wxdczd commented Oct 3, 2018

INFO:root:Visiting http://ent.163.com/photoview/00AJ0003/659065.html
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): ent.163.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /photoview/00AJ0003/659065.html HTTP/1.1" 301 0
INFO:requests.packages.urllib3.connectionpool:Resetting dropped connection: ent.163.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /photoview/00AJ0003/659065_mobile.html HTTP/1.1" 200 None
INFO:root:Visiting http://ent.163.com/special/photo-search/#q=??°?
Traceback (most recent call last):
  File "noisy.py", line 273, in <module>
    main()
  File "noisy.py", line 269, in main
    crawler.crawl()
  File "noisy.py", line 240, in crawl
    self._browse_from_links()
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 162, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 454, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 388, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 293, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 332, in prepare_url
    url = url.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd5 in position 43: invalid continuation byte

@DarknessStorm

DarknessStorm commented Feb 1, 2019

Greetings. I'm using the patched code from https://github.com/1tayH/noisy/pull/16/files (with Python 2.7; don't mind the paths in the traceback) and getting this:

INFO:root:Visiting https://www.guncelakademi.com/etiket/DЎnem
Traceback (most recent call last):
  File "noisy.py", line 273, in <module>
    main()
  File "noisy.py", line 269, in main
    crawler.crawl()
  File "noisy.py", line 240, in crawl
    self._browse_from_links()
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 162, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 75, in get
    return request('get', url, params=params, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 519, in request
    prep = self.prepare_request(req)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 313, in prepare
    self.prepare_url(url, params)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 363, in prepare_url
    url = url.decode('utf8')
  File "E:\Soft\python-3.6.0-embed-amd64\lib\encodings\utf_8.py", line 16, in de
code
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 38: invalid
start byte

@Arduous
Contributor Author

Arduous commented Feb 2, 2019

Hello, I see that the target webpage is not declaring its encoding properly. Unfortunately I am not able to reproduce. Could you indicate the version of the "requests" library by running "python -m requests.help"? The error masking was recently improved, and any exception from "requests" should get caught at line 177.

@DarknessStorm

DarknessStorm commented Feb 3, 2019

Greetings.
it's

{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "2.7.15"
  },
  "platform": {
    "release": "7",
    "system": "Windows"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.21.0"
  },
  "system_ssl": {
    "version": "100020ff"
  },
  "urllib3": {
    "version": "1.24.1"
  },
  "using_pyopenssl": false
}

@Arduous
Contributor Author

Arduous commented Feb 3, 2019

Thank you. It seems I should have asked you to run "python3 -m requests.help" so that the correct interpreter is used, but the error might not come from there, as your Python 2 install seems up to date.

I am unfortunately not able to reproduce. Are you able to reproduce the issue easily, or did it happen randomly? In the former case, could you share your config.json?

What is the output of
python3 -c "import requests; print(requests.get('https://www.guncelakademi.com/etiket/DЎnem').status_code)"
a 200 status code or an exception?

@DarknessStorm

DarknessStorm commented Feb 4, 2019

Seems I can't reproduce it easily, but here is another one:

INFO:root:Visiting https://www.proektor78.ru/╧/
Traceback (most recent call last):
  File "noisy.py", line 273, in <module>
    main()
  File "noisy.py", line 269, in main
    crawler.crawl()
  File "noisy.py", line 240, in crawl
    self._browse_from_links()
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 181, in _browse_from_links
    self._browse_from_links(depth + 1)
  File "noisy.py", line 162, in _browse_from_links
    sub_page = self._request(random_link).content
  File "noisy.py", line 48, in _request
    response = requests.get(url, headers=headers, timeout=5)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 75, in get
    return request('get', url, params=params, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\api.py", lin
e 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 519, in request
    prep = self.prepare_request(req)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\sessions.py"
, line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 313, in prepare
    self.prepare_url(url, params)
  File "E:\Soft\python-3.6.0-embed-amd64\lib\site-packages\requests\models.py",
line 363, in prepare_url
    url = url.decode('utf8')
  File "E:\Soft\python-3.6.0-embed-amd64\lib\encodings\utf_8.py", line 16, in de
code
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcf in position 26: invalid
continuation byte

This is my config (it has 1 million links, so the packed archive is about 6 MB): https://sharemods.com/2xa67zpteow2/config.7z.html (6.2 MB)

@Arduous
Contributor Author

Arduous commented Feb 10, 2019

Thanks, I'll be back in a few days, but in the meanwhile...
I think this exception might, or even should, be caught and addressed, or masked, by requests itself. I still cannot reproduce the error with noisy or directly through requests; the latter would be needed to report the issue to them efficiently.
You might want this workaround: replace the line at noisy.py:177 from
except requests.exceptions.RequestException:
to
except (requests.exceptions.RequestException, UnicodeDecodeError):
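
For context, here is a minimal standalone sketch of what the widened except clause does; the fetch helper below is illustrative and not the actual code around noisy.py:177:

import logging
import requests

# Illustrative sketch of the workaround: a UnicodeDecodeError raised while
# requests prepares the URL is treated like any other failed request, and
# the offending link is simply skipped.
def fetch(url):
    try:
        return requests.get(url, timeout=5).content
    except (requests.exceptions.RequestException, UnicodeDecodeError):
        logging.info("Skipping problematic URL: %r", url)
        return None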

@DarknessStorm

It has been working non-stop for 4 days with no errors.
Thanks :)
I'll post if there is another crash.
