
safe_url_string handling IPv6 URLs #193

Open
Tracked by #221
Cash111 opened this issue Oct 9, 2022 · 5 comments · May be fixed by #221
Comments


Cash111 commented Oct 9, 2022

Description

Demo spider with settings:

DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo_spider'
    start_urls = ['https://[2402:4e00:40:40::2:3b6]']

    def parse(self, response, **kwargs):
        print(response.body)
        print(response)

Command to start the spider:

scrapy crawl demo_spider -s JOBDIR=./jobs/run-1

When I use the JOBDIR parameter, it causes an exception:

Traceback (most recent call last):
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/pqueues.py", line 99, in pop
    m = q.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/squeues.py", line 78, in pop
    return request_from_dict(request, spider=self.spider)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/request.py", line 124, in request_from_dict
    return request_cls(**kwargs)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 60, in __init__
    self._set_url(url)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 100, in _set_url
    s = safe_url_string(url, self.encoding)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/w3lib/url.py", line 103, in safe_url_string
    parts.port,
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 178, in port
    raise ValueError(message) from None
builtins.ValueError: Port could not be cast to integer value as '4e00:40:40::2:3b6'


I debugged and found that the problem is in `urllib.parse` around line 202, in the `_hostinfo` property, where the netloc `[2402:4e00:40:40::2:3b6]` is split into host and port:

[debugger screenshot, not recoverable from this page]

And when I stopped using the JOBDIR parameter and debugged again, I found that the problem still existed. At this point the problem shows up in middlewares such as the cookie middleware (`hosts = potential_domain_matches(req_host)`), RetryMiddleware, RobotsTxtMiddleware and so on:

[debugger screenshot, not recoverable from this page]

The problem seems to be in the creation of the Request instance: it calls self._set_url, which turns the URL https://[2402:4e00:40:40::2:3b6] into https://2402:4e00:40:40::2:3b6, losing the brackets around the IPv6 host.

When the middlewares create another Request instance based on Request.url, calling self._set_url then yields the wrong hostname and port.
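The round trip described above can be reproduced with `urllib.parse` alone (a minimal sketch using the address from this report):

```python
from urllib.parse import urlsplit

# urlsplit keeps the brackets in netloc but strips them from hostname:
parts = urlsplit("https://[2402:4e00:40:40::2:3b6]:80")
print(parts.netloc)    # '[2402:4e00:40:40::2:3b6]:80'
print(parts.hostname)  # '2402:4e00:40:40::2:3b6'
print(parts.port)      # 80

# Rebuilding a URL from the bare hostname produces a netloc that
# parses incorrectly the second time around: everything after the
# first ':' is taken to be the port.
broken = urlsplit("https://2402:4e00:40:40::2:3b6")
print(broken.hostname)  # '2402'
try:
    broken.port
except ValueError as e:
    print(e)  # Port could not be cast to integer value as '4e00:40:40::2:3b6'
```

This matches both tracebacks above: the ValueError only surfaces once the bracket-less URL is parsed a second time, e.g. when a request is deserialized from the JOBDIR queue.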

Versions

$ scrapy version --verbose
Scrapy       : 2.6.3
lxml         : 4.9.1.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 2.0.1
Twisted      : 22.8.0
Python       : 3.9.6 (default, Sep 13 2022, 22:03:16) - [Clang 14.0.0 (clang-1400.0.29.102)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-12.6-arm64-arm-64bit

Cash111 commented Oct 10, 2022

Temporarily solved this problem by downgrading w3lib to 1.22.0.

wRAR transferred this issue from scrapy/scrapy on Oct 10, 2022

wRAR commented Oct 10, 2022

In [5]: safe_url_string('https://[2402:4e00:40:40::2:3b6]')
Out[5]: 'https://2402:4e00:40:40::2:3b6'

In [6]: safe_url_string('https://[2402:4e00:40:40::2:3b6]:80')
Out[6]: 'https://2402:4e00:40:40::2:3b6:80'

This indeed looks like a bug.

wRAR changed the title from “Error when requesting ipv6 address directly with JOBDIR parameter” to “safe_url_string handling IPv6 URLs” on Oct 10, 2022

wRAR commented Oct 10, 2022

urlsplit returns '[2402:4e00:40:40::2:3b6]:80' in netloc but 2402:4e00:40:40::2:3b6 in hostname, and safe_url_string uses this hostname value directly without putting it in brackets again. There may be some code in urllib that should be used here instead.

@himanshu007-creator

Hi, i would like to work on this issue


wRAR commented Oct 17, 2022

@himanshu007-creator sure, no problem with that
