When crawling, all domains appear to be DOWN #490

Open
sunil3590 opened this issue Apr 10, 2020 · 5 comments

Comments

sunil3590 (Contributor) commented Apr 10, 2020

ISSUE
I tried to crawl a regular domain (not .onion) and the status of the domain comes up as DOWN. I've tried this with multiple domains, including .onion domains, but the result is the same: all domains are DOWN.

SETUP
I have AIL, Tor, and Splash installed and running on a single machine, with one Docker instance of Splash listening on port 8050 and Tor on port 9050:

tcp        0      0 127.0.0.1:9050          0.0.0.0:*               LISTEN      18298/tor           
tcp6       0      0 :::8050                 :::*                    LISTEN      22611/docker-proxy 
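Before digging into Splash itself, it can help to confirm that both listeners shown in the netstat output above are actually reachable from the machine running AIL. This is a minimal sketch (standard library only, not part of AIL) that probes the two ports:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports taken from the netstat output above: Splash on 8050, Tor SOCKS on 9050.
for name, port in [("Splash", 8050), ("Tor", 9050)]:
    state = "UP" if port_open("127.0.0.1", port) else "DOWN"
    print(f"{name} on 127.0.0.1:{port}: {state}")
```

If either port reports DOWN here, the problem is connectivity rather than the crawler logic.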

Logs from Splash Docker

2020-04-10 08:56:20.300419 [-] "X.X.X.X" - - [10/Apr/2020:08:56:19 +0000] "GET / HTTP/1.1" 200 7679 "-" "python-requests/2.22.0"
2020-04-10 08:56:20.859058 [render] [140342956635136] loadFinished: unknown error
2020-04-10 08:56:20.860248 [events] {"path": "/execute", "rendertime": 0.007615327835083008, "maxrss": 176844, "load": [0.05, 0.19, 0.18], "fds": 60, "active": 0, "qsize": 0, "_id": 140342956635136, "method": "POST", "timestamp": 1586508980, "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0", "args": {"cookies": [], "headers": {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0"}, "lua_source": "\nfunction main(splash, args)\n    -- Default values\n    splash.js_enabled = true\n    splash.private_mode_enabled = true\n    splash.images_enabled = true\n    splash.webgl_enabled = true\n    splash.media_source_enabled = true\n\n    -- Force enable things\n    splash.plugins_enabled = true\n    splash.request_body_enabled = true\n    splash.response_body_enabled = true\n\n    splash.indexeddb_enabled = true\n    splash.html5_media_enabled = true\n    splash.http2_enabled = true\n\n    -- User defined\n    splash.resource_timeout = args.resource_timeout\n    splash.timeout = args.timeout\n\n    -- Allow to pass cookies\n    splash:init_cookies(args.cookies)\n\n    -- Run\n    ok, reason = splash:go{args.url}\n    if not ok and not reason:find(\"http\") then\n        return {\n            error = reason,\n            last_url = splash:url()\n        }\n    end\n    if reason == \"http504\" then\n        splash:set_result_status_code(504)\n        return ''\n    end\n\n    splash:wait{args.wait}\n    -- Page instrumentation\n    -- splash.scroll_position = {y=1000}\n    splash:wait{args.wait}\n    -- Response\n    return {\n        har = splash:har(),\n        html = splash:html(),\n        png = splash:png{render_all=true},\n        cookies = splash:get_cookies(),\n        last_url = splash:url()\n    }\nend\n", "resource_timeout": 30, "timeout": 30, "url": "http://somedomain.onion", "wait": 10, "uid": 
140342956635136}, "status_code": 200, "client_ip": "172.17.0.1"}
2020-04-10 08:56:20.860431 [-] "172.17.0.1" - - [10/Apr/2020:08:56:19 +0000] "POST /execute HTTP/1.1" 200 68 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0"

The line of code in Splash that generates the error message above:
https://github.com/scrapinghub/splash/blob/9fda128b8485dd5f67eb103cd30df8f325a90bb0/splash/engines/webkit/browser_tab.py#L446
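Note that Splash returned HTTP 200 on the `/execute` call even though the render failed ("loadFinished: unknown error"), so AIL interprets the result as the domain being DOWN. One way to isolate whether Splash or AIL is at fault is to POST to `/execute` by hand with the same argument names seen in the log above. This is a sketch, not AIL's actual request code; the Lua script is abbreviated from the one in the log:

```python
import json

# Abbreviated stand-in for the full lua_source logged above.
LUA = """
function main(splash, args)
    ok, reason = splash:go{args.url}
    if not ok then
        return {error = reason, last_url = splash:url()}
    end
    return {html = splash:html(), last_url = splash:url()}
end
"""

def build_execute_payload(url, timeout=30, wait=10):
    """Build a JSON body mirroring the args AIL sends to /execute."""
    return {
        "url": url,
        "wait": wait,
        "timeout": timeout,
        "resource_timeout": timeout,
        "cookies": [],
        "lua_source": LUA,
    }

payload = build_execute_payload("http://somedomain.onion")
print(json.dumps(payload)[:60])
# Then POST it yourself, e.g.:
#   requests.post("http://127.0.0.1:8050/execute", json=payload)
# If this also fails for a clearnet URL, Splash's network setup (not AIL)
# is the likely culprit.
```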

@GaganBhat

Were you able to fix this? @sunil3590
I'm experiencing the same issue: Splash is down and all domains are down.

@GaganBhat

@Terrtia
I'm having a similar issue with Tor links where I get a "SPLASH DOWN" error but only with onion links.
[screenshot of the "SPLASH DOWN" error]

The regular crawler, however, works.
[screenshot of the regular crawler working]

@TheFausap

Hello, I have the same issue. Is there any update? Thanks.

@TheFausap

I may have found the error in the screen logs (screen -r Crawlers_AIL):

 File "/opt/AIL/bin/torcrawler/TorSplashCrawler.py", line 181, in parse
    error_retry = request.meta.get('error_retry', 0)
NameError: name 'request' is not defined
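That traceback points at a genuine bug: inside a Scrapy parse callback only `response` is in scope, and the request that produced it is reachable as `response.request` (with its meta dict mirrored on `response.meta`), so referencing a bare `request` raises NameError. This sketch, using stand-in classes rather than Scrapy itself, shows the plausible fix:

```python
# Minimal stand-ins for Scrapy's Request/Response, just to illustrate scope.
class FakeRequest:
    def __init__(self, meta=None):
        self.meta = meta or {}

class FakeResponse:
    def __init__(self, request):
        self.request = request
        self.meta = request.meta  # Scrapy mirrors request.meta here

def parse(response):
    # Buggy line in TorSplashCrawler.py referenced a bare `request`,
    # which is undefined in the callback; use response.meta instead:
    error_retry = response.meta.get('error_retry', 0)
    return error_retry

print(parse(FakeResponse(FakeRequest({'error_retry': 2}))))  # 2
print(parse(FakeResponse(FakeRequest())))                     # 0
```

Whether this is the same fix the maintainers applied is not confirmed in this thread; it follows standard Scrapy callback conventions.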

@matriceria

@TheFausap @Terrtia did you find a fix for this? I also can't crawl any onion domain; they all appear to be down.
