Web pages "down" when crawling (Crawler_AIL exception) #580

Closed · alexisteleco opened this issue Apr 19, 2022 · 1 comment

@alexisteleco

(Not sure whether this is related to issue #490)

Context

I installed and configured AIL, AIL-Splash-Manager, Tor, I2P, and Squid, but every crawled domain is reported as "down" when it really isn't.
I use the Tor proxy provided by the Tor Project.

As far as I know, Tor is working as expected:

curl -s --socks5-hostname localhost:9050 https://check.torproject.org/ | grep -m 1 Congratulations | xargs
# Congratulations. This browser is configured to use Tor.

The Squid proxy also works:

curl --proxy "http://172.17.0.1:3128" "http://google.it"                                                                          
#<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
#<TITLE>301 Moved</TITLE></HEAD><BODY>
#<H1>301 Moved</H1>
#The document has moved
#<A HREF="http://www.google.it/">here</A>.
#</BODY></HTML>
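
One can also hit a Splash container's HTTP API directly to rule Splash itself out. A minimal sketch in Python, assuming a Splash instance listening on 127.0.0.1:8050 (see containers.cfg below; adjust host/port to your setup):

# Hypothetical Splash health check; host/port are assumptions from containers.cfg.
# Splash's HTTP API exposes /render.html, which fetches and renders a page.
import urllib.request

url = "http://127.0.0.1:8050/render.html?url=http://example.com&timeout=10"
with urllib.request.urlopen(url, timeout=30) as resp:
    print(resp.status)  # 200 means the Splash container itself can render pages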

The configuration files are shown below:

ail-splash-manager/config/proxies_profiles.cfg
# Tor: torrc default proxy
# use The torproject proxy https://2019.www.torproject.org/docs/debian
# (up to date, solve issues with v3 onion addresses)

# proxy name
[default_tor]
# proxy host
host=172.17.0.1
# proxy port
port=9050
# proxy type
type=SOCKS5
# proxy description
description=tor default proxy
# crawler type (tor or i2p or web)
crawler_type=tor

# SQUID proxy
[web_proxy]
host=172.17.0.1
port=3128
type=HTTP
description=web proxy
crawler_type=web

# I2P proxy
[default_i2p]
host=127.0.0.1
port=4444
type=HTTP
description=i2p default proxy
crawler_type=i2p
ail-splash-manager/config/containers.cfg
# proxy_name: proxy name (defined in proxies_profiles.cfg)
# port: single port or port range (ex: 8050 or 8050-8052)
# cpu: max number of cpu allocated
# memory: RAM (G) allocated
# maxrss: max unbound in-memory cache (Mb, Restart Splash when full)
# description: docker description
[default_splash_tor]
proxy_name=default_tor
port=8050-8052
cpu=1
memory=1
maxrss=2000
description= default splash tor
net=bridge

# Splash with SQUID proxy
[web_splash]
proxy_name=web_proxy
port=8060
cpu=1
memory=1
maxrss=2000
description= web splash (squid)
net=bridge

# Splash with I2P proxy
[default_splash_i2p] # section name: splash name
proxy_name=default_i2p
port=8053-8055
cpu=1
memory=1
maxrss=2000
description=default splash i2p
net=host
/etc/squid/conf.d/ail.conf
#
# Squid configuration settings for Ail
#
acl localnet src 172.17.0.0/16 # Docker IP range
http_access allow localnet

Errors

With onion sites (using Tor)

This appears on the Crawler_AIL screen when I run a crawl through the web GUI:

------------------START CRAWLER------------------
crawler type:     onion
-------------------------------------------------
url:         http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
domain:      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
domain_url:  http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion

Launching Crawler: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
2022-04-19 15:17:43 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-19 15:17:43 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n  15 Mar 2022), cryptography 36.0.2, Platform Linux-5.13.0-1021-aws-x86_64-with-glibc2.29
2022-04-19 15:17:43 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_PAGECOUNT': 1,
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 '
               'Firefox/78.0'}
2022-04-19 15:17:43 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxxxxxxxxxxxx
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-19 15:17:43 [scrapy.core.engine] INFO: Spider opened
2022-04-19 15:17:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-19 15:17:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-19 15:17:43 [py.warnings] WARNING: /opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/request.py:42: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)

2022-04-19 15:18:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion via http://127.0.0.1:8051/execute> (referer: None)
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-04-19 15:18:10 [scrapy.core.scraper] ERROR: Spider error processing <GET http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion via http://127.0.0.1:8051/execute> (referer: None)
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
    yield next(it)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/middleware.py", line 162, in process_spider_output
    for el in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/bin/torcrawler/TorSplashCrawler.py", line 251, in parse
    crawlers.add_domain_root_item(item_id, self.domain_type, self.domains[0], self.date_epoch, self.port)
  File "/opt/ail-framework/bin/lib/crawlers.py", line 808, in add_domain_root_item
    r_serv_onion.zadd('crawler_history_{}:{}:{}'.format(domain_type, domain, port), epoch_date, root_item)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/redis/commands/core.py", line 3920, in zadd
    for pair in mapping.items():
AttributeError: 'int' object has no attribute 'items'
2022-04-19 15:18:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2194,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 335359,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 26.840496,
 'finish_reason': 'closespider_pagecount',
 'finish_time': datetime.datetime(2022, 4, 19, 15, 18, 10, 312840),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 84561920,
 'memusage/startup': 84561920,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/AttributeError': 1,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/200': 1,
 'start_time': datetime.datetime(2022, 4, 19, 15, 17, 43, 472344)}
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Error while scheduling new request
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/engine.py", line 187, in <lambda>
    d.addBoth(lambda _: self.slot.nextcall.schedule())
AttributeError: 'NoneType' object has no attribute 'nextcall'
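
Side note on the first traceback: the zadd call fails because redis-py 3.0 changed ZADD to take a mapping of member to score instead of positional score/member arguments, so the int epoch_date lands where a dict is expected. A minimal illustration (not AIL's actual code path), assuming a local Redis:

# Hypothetical illustration of the redis-py ZADD signature change.
import redis

r = redis.Redis()  # assumes a local Redis on the default port 6379
# Old positional form, raises "'int' object has no attribute 'items'"
# on redis-py >= 3.0:
#   r.zadd('crawler_history_demo', 1650000000, 'root_item_id')
r.zadd('crawler_history_demo', {'root_item_id': 1650000000})  # mapping form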

With regular sites (using Squid)

On the Crawler_AIL screen:
SPLASH Name:   web_splash
Proxy Name:    web_proxy
Crawler Type:  web
deque(['web_splash', 'regular'])
splash url: 127.0.0.1:8060


------------------START CRAWLER------------------
crawler type:     regular
-------------------------------------------------
url:         http://google.co.jp
domain:      google.co.jp
domain_url:  http://google.co.jp

Launching Crawler: http://google.co.jp
2022-04-19 15:09:24 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-19 15:09:24 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n  15 Mar 2022), cryptography 36.0.2, Platform Linux-5.13.0-1021-aws-x86_64-with-glibc2.29
2022-04-19 15:09:24 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_PAGECOUNT': 1,
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 '
               'Firefox/78.0'}
2022-04-19 15:09:24 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxxxxxxxxxxxx
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-19 15:09:24 [scrapy.core.engine] INFO: Spider opened
2022-04-19 15:09:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-19 15:09:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-19 15:09:24 [py.warnings] WARNING: /opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/request.py:42: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)

2022-04-19 15:09:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://google.co.jp via http://127.0.0.1:8060/execute> (referer: None)
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-04-19 15:09:36 [scrapy.core.scraper] ERROR: Spider error processing <GET http://google.co.jp via http://127.0.0.1:8060/execute> (referer: None)
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
    yield next(it)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/middleware.py", line 162, in process_spider_output
    for el in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/bin/torcrawler/TorSplashCrawler.py", line 251, in parse
    crawlers.add_domain_root_item(item_id, self.domain_type, self.domains[0], self.date_epoch, self.port)
  File "/opt/ail-framework/bin/lib/crawlers.py", line 808, in add_domain_root_item
    r_serv_onion.zadd('crawler_history_{}:{}:{}'.format(domain_type, domain, port), epoch_date, root_item)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/redis/commands/core.py", line 3920, in zadd
    for pair in mapping.items():
AttributeError: 'int' object has no attribute 'items'
2022-04-19 15:09:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2144,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 3220010,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 11.858315,
 'finish_reason': 'closespider_pagecount',
 'finish_time': datetime.datetime(2022, 4, 19, 15, 9, 36, 401577),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 84520960,
 'memusage/startup': 84520960,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/AttributeError': 1,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/200': 1,
 'start_time': datetime.datetime(2022, 4, 19, 15, 9, 24, 543262)}
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Error while scheduling new request
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/engine.py", line 187, in <lambda>
    d.addBoth(lambda _: self.slot.nextcall.schedule())
AttributeError: 'NoneType' object has no attribute 'nextcall'

How can I solve these errors, or at least continue debugging?

Thanks in advance

@alexisteleco (Author)

It is solved!

The problem was a bug in Scrapy introduced after version 2.5.1; it should be fixed in version 2.6.2, as mentioned here.

As of right now (22 days after that comment), version 2.6.2 is not yet available via pip install scrapy==2.6.2. Until it is released, the workaround is either to install directly from the branch or to downgrade to version 2.5.1. The downgrade procedure is shown below.

# Don't forget to activate the AIL virtual environment
cd ail-framework/AILENV/bin/ && source activate
pip install scrapy==2.5.1
# Just in case you had the previous one
pip install scrapy-splash==0.8.0
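
To confirm the downgrade took effect, a quick sanity check from inside the virtualenv:

# Verify the pinned Scrapy version from within the AIL virtualenv.
import scrapy

print(scrapy.__version__)  # expect '2.5.1' after the downgrade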
