How to implement Scrapy Splash in Virtual Machine #301

Open
lime-n opened this issue Feb 16, 2022 · 1 comment

lime-n commented Feb 16, 2022

How do I run scrapy-splash on a Linux virtual machine? I have a Lua script that sends keys to a site to log in, and I then scrape the site.

I have installed Docker, but I cannot get the scraper to work because it won't connect to the server.

Are there any simple steps I can follow to get this working on a VM? For example, what should I install, and what should I do before running scrapy crawl spider?

As for Docker, I ran the following with admin privileges:

docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600

However, this runs in the foreground, and I'd like it to run in the background. I cannot figure out how; I have tried:

docker run -d 8050:8050 scrapinghub/splash --max-timeout 3600

But I just get the error:

Unable to find image '8050:8050' locally

Fixing this may solve my issue, or I may need some further installations. Please let me know! I really need expert guidance to figure this out.

I opened another instance while Docker was running in the first one.

I get the following error when running the scrapy crawler:

2022-02-16 02:55:26 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'type': 'ScriptError', 'description': 'Error happened while executing Lua script', 'info': {'type': 'JS_ERROR', 'js_error_type': 'TypeError', 'js_error_message': 'null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'js_error': 'TypeError: null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'message': '[string "..."]:12: error during JS function call: \'TypeError: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\'', 'source': '[string "..."]', 'line_number': 12, 'error': 'error during JS function call: \'TypeError: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\''}}
2022-02-16 02:55:26 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://instagram.com/ via http://localhost:8050/execute> (referer: None)
2022-02-16 02:55:26 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://instagram.com/>: HTTP status code is not handled or not allowed

The scraper works perfectly fine on my Mac, so there's definitely an installation that I am missing somewhere.
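
The log above shows that Scrapy did reach Splash at localhost:8050. The 400 comes from the Lua script itself: document.querySelector("button:nth-child(2)") returned null at line 12, so the button was not in the page when the script looked for it. The original Lua script is not shown, so the following is only a sketch of one way to guard against that. The names LUA_WAIT_FOR_BUTTON and build_splash_args, and the custom arguments selector and max_wait, are hypothetical. The script polls with splash:select until the element appears, and returns the page HTML instead of raising if it never does, which makes it easier to see what the VM was actually served.

```python
# Hypothetical helper: a Lua script that waits for an element before clicking it.
LUA_WAIT_FOR_BUTTON = """
function main(splash, args)
  assert(splash:go(args.url))
  -- Poll until the selector matches or max_wait seconds have passed.
  local waited = 0
  local button = splash:select(args.selector)
  while not button and waited < args.max_wait do
    splash:wait(0.5)
    waited = waited + 0.5
    button = splash:select(args.selector)
  end
  if not button then
    -- Do not raise: hand the HTML back so the spider can inspect the page.
    return {found = false, html = splash:html()}
  end
  button:mouse_click()
  splash:wait(1)
  return {found = true, html = splash:html()}
end
"""


def build_splash_args(selector, max_wait=10.0, timeout=60):
    """Build the args dict for a request to Splash's execute endpoint.

    selector and max_wait are custom arguments read by the Lua script above;
    lua_source and timeout are standard Splash arguments.
    """
    return {
        "lua_source": LUA_WAIT_FOR_BUTTON,
        "selector": selector,
        "max_wait": max_wait,
        "timeout": timeout,
    }


# In a spider this dict would be passed along the lines of:
#   SplashRequest(url, self.parse, endpoint="execute", args=build_splash_args(...))
```

If the response comes back with found set to false, the returned HTML should show whether the site served a different page to the VM (for example a consent or login wall), which would explain why the same spider works on another machine.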


Ehsan-U commented Mar 9, 2023

To run the container in the background, use the -d flag, which stands for detached mode. Your second command dropped -p, so Docker treated 8050:8050 as the image name. Here's the updated command:

docker run -d -p 8050:8050 scrapinghub/splash --max-timeout 3600
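
Once the container is detached there is no console output to watch, so it can help to confirm Splash is reachable before running the crawl. The sketch below is one way to do that from Python with only the standard library. It assumes Splash is published on localhost:8050 as in the command above and that its /_ping endpoint answers with a JSON object containing "status": "ok". The function names are hypothetical.

```python
import json
import urllib.request


def splash_run_command(port=8050, max_timeout=3600):
    """Return the argv for the detached docker run command shown above."""
    return [
        "docker", "run",
        "-d",                   # detached: run in the background
        "-p", f"{port}:8050",   # host_port:container_port
        "scrapinghub/splash",   # the image name must come after the options
        "--max-timeout", str(max_timeout),
    ]


def ping_url(base_url):
    """Build the URL of Splash's ping endpoint from its base URL."""
    return base_url.rstrip("/") + "/_ping"


def splash_is_up(base_url="http://localhost:8050", timeout=3.0):
    """Return True if Splash answers the ping, False on any failure."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

Calling splash_is_up() on the VM after starting the container gives a quick yes or no. If it returns False, running docker ps shows whether the container is still up, and docker logs with the container ID shows why it stopped.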
