I debugged and found that the problem was in urllib.parse#L202, as shown below:
When I stopped using the JOBDIR parameter and debugged again, the problem still existed. In that case it shows up in middlewares such as CookieJar, RetryMiddleware, RobotsTxtMiddleware and so on.
The root cause appears to be in the creation of the Request instance. It calls self._set_url, which turns the URL https://[2402:4e00:40:40::2:3b6] into https://2402:4e00:40:40::2:3b6, dropping the brackets around the IPv6 address.
When the middlewares then create another Request instance based on Request.url, self._set_url is called on the bracket-less URL, and parsing it yields the wrong hostname and port.
In [5]: safe_url_string('https://[2402:4e00:40:40::2:3b6]')
Out[5]: 'https://2402:4e00:40:40::2:3b6'
In [6]: safe_url_string('https://[2402:4e00:40:40::2:3b6]:80')
Out[6]: 'https://2402:4e00:40:40::2:3b6:80'
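To illustrate why the bracket-less output breaks any later parsing, here is a minimal sketch that uses only the standard library's urllib.parse.urlsplit. The variable names are illustrative and not taken from Scrapy or w3lib.

```python
from urllib.parse import urlsplit

# URL exactly as produced once the brackets have been dropped (see Out[5]).
broken = urlsplit("https://2402:4e00:40:40::2:3b6")

# Without brackets, the parser splits host and port at the first colon,
# so only the first group of the IPv6 address is reported as the hostname.
broken_hostname = broken.hostname

# The rest of the address is treated as the port and cannot be converted
# to an integer, so accessing .port raises ValueError.
try:
    broken.port
    port_error = None
except ValueError as exc:
    port_error = exc

print(broken_hostname)
print(type(port_error).__name__)
```

This is consistent with the failures described above: any code that re-parses Request.url sees a truncated hostname and an unparsable port.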
This indeed looks like a bug.
wRAR changed the title from "Error when requesting ipv6 address directly with JOBDIR parameter" to "safe_url_string handling IPv6 URLs" on Oct 10, 2022
urlsplit returns '[2402:4e00:40:40::2:3b6]:80' in netloc but 2402:4e00:40:40::2:3b6 in hostname, and safe_url_string uses this hostname value directly without putting it in brackets again. There may be some code in urllib that should be used here instead.
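The netloc/hostname difference can be reproduced with the standard library alone. The sketch below also includes rebracket_netloc, a hypothetical helper (not part of urllib, w3lib or Scrapy) that shows one possible way to restore the brackets when rebuilding the host part. It assumes there is no userinfo in the URL and skips any IDNA encoding, so it is not a drop-in fix.

```python
from urllib.parse import urlsplit


def rebracket_netloc(url: str) -> str:
    """Hypothetical helper: rebuild host[:port], bracketing IPv6 literals.

    Assumption: the URL carries no userinfo (user:password@) component.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # A colon inside the hostname can only come from an IPv6 literal,
        # which must be wrapped in brackets inside a URL.
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return host


url = "https://[2402:4e00:40:40::2:3b6]:80"
parts = urlsplit(url)

netloc = parts.netloc      # brackets and port are preserved here
hostname = parts.hostname  # brackets are stripped here
rebuilt = rebracket_netloc(url)

print(netloc)
print(hostname)
print(rebuilt)
```

Using netloc directly, or re-adding the brackets as sketched here, keeps the address and port distinguishable when the URL is parsed again.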
Description
Demo spider with settings:
Command to start the spider:
When I use the JOBDIR parameter, it causes an exception:
Versions