
safe_url_string handling IPv6 URLs #193

Open
Tracked by #221
Cash111 opened this issue Oct 9, 2022 · 5 comments · May be fixed by #221
Comments


Cash111 commented Oct 9, 2022

Description

Demo spider with settings:

DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo_spider'
    start_urls = ['https://[2402:4e00:40:40::2:3b6]']

    def parse(self, response, **kwargs):
        print(response.body)
        print(response)

Command to start the spider:

scrapy crawl demo_spider -s JOBDIR=./jobs/run-1

When I use the JOBDIR parameter, it causes an exception:

Traceback (most recent call last):
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/pqueues.py", line 99, in pop
    m = q.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/squeues.py", line 78, in pop
    return request_from_dict(request, spider=self.spider)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/request.py", line 124, in request_from_dict
    return request_cls(**kwargs)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 60, in __init__
    self._set_url(url)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 100, in _set_url
    s = safe_url_string(url, self.encoding)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/w3lib/url.py", line 103, in safe_url_string
    parts.port,
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 178, in port
    raise ValueError(message) from None
builtins.ValueError: Port could not be cast to integer value as '4e00:40:40::2:3b6'


I debugged and found that the problem is in `urllib.parse` around line 202, in the `_hostinfo` property, where the netloc `[2402:4e00:40:40::2:3b6]` is split into host and port:

[debugger screenshot, not recoverable from this page]

And when I stopped using the JOBDIR parameter and debugged again, I found that the problem still existed. At this point the problem shows up in middlewares such as the cookie middleware (`hosts = potential_domain_matches(req_host)`), RetryMiddleware, RobotsTxtMiddleware and so on:

[debugger screenshot, not recoverable from this page]

The problem seems to be in the creation of the Request instance: it calls self._set_url, which turns the URL https://[2402:4e00:40:40::2:3b6] into https://2402:4e00:40:40::2:3b6, losing the brackets around the IPv6 host.

When the middlewares create another Request instance based on Request.url, calling self._set_url then yields the wrong hostname and port.
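The round trip described above can be reproduced with `urllib.parse` alone (a minimal sketch using the address from this report):

```python
from urllib.parse import urlsplit

# urlsplit keeps the brackets in netloc but strips them from hostname:
parts = urlsplit("https://[2402:4e00:40:40::2:3b6]:80")
print(parts.netloc)    # '[2402:4e00:40:40::2:3b6]:80'
print(parts.hostname)  # '2402:4e00:40:40::2:3b6'
print(parts.port)      # 80

# Rebuilding a URL from the bare hostname produces a netloc that
# parses incorrectly the second time around: everything after the
# first ':' is taken to be the port.
broken = urlsplit("https://2402:4e00:40:40::2:3b6")
print(broken.hostname)  # '2402'
try:
    broken.port
except ValueError as e:
    print(e)  # Port could not be cast to integer value as '4e00:40:40::2:3b6'
```

This matches both tracebacks above: the ValueError only surfaces once the bracket-less URL is parsed a second time, e.g. when a request is deserialized from the JOBDIR queue.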

Versions

$ scrapy version --verbose
Scrapy       : 2.6.3
lxml         : 4.9.1.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 2.0.1
Twisted      : 22.8.0
Python       : 3.9.6 (default, Sep 13 2022, 22:03:16) - [Clang 14.0.0 (clang-1400.0.29.102)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-12.6-arm64-arm-64bit

Cash111 commented Oct 10, 2022

Temporarily solved this problem by downgrading w3lib to 1.22.0.

wRAR transferred this issue from scrapy/scrapy on Oct 10, 2022

wRAR commented Oct 10, 2022

In [5]: safe_url_string('https://[2402:4e00:40:40::2:3b6]')
Out[5]: 'https://2402:4e00:40:40::2:3b6'

In [6]: safe_url_string('https://[2402:4e00:40:40::2:3b6]:80')
Out[6]: 'https://2402:4e00:40:40::2:3b6:80'

This indeed looks like a bug.

wRAR changed the title from “Error when requesting ipv6 address directly with JOBDIR parameter” to “safe_url_string handling IPv6 URLs” on Oct 10, 2022

wRAR commented Oct 10, 2022

urlsplit returns '[2402:4e00:40:40::2:3b6]:80' in netloc but 2402:4e00:40:40::2:3b6 in hostname, and safe_url_string uses this hostname value directly without putting it in brackets again. There may be some code in urllib that should be used here instead.

@himanshu007-creator

Hi, i would like to work on this issue


wRAR commented Oct 17, 2022

@himanshu007-creator sure, no problem with that
