
500 Internal Server Error with scrapy/splash/scrapoxy #222

Open
devitdc opened this issue Feb 14, 2024 · 1 comment

Comments

devitdc commented Feb 14, 2024

Current Behavior

Hi, I use Scrapy (2.8.0), Scrapoxy (with the Docker image fabienvauchelles/scrapoxy:latest) and Splash (3.5) to scrape data, but I get a 500 Internal Server Error when Splash is running. To illustrate the error I use the website https://quotes.toscrape.com/login

Scrapy is running on macOS on host 192.168.0.12.
Scrapoxy is running as a Docker image on Debian 11.9 on host 192.168.0.103.
Splash is running as a Docker image on Debian 11.9 on host 192.168.0.102.

Scrapy settings.py configuration:

# Scrapoxy setup
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0

SCRAPOXY_MASTER = "http://192.168.0.103:8888"
SCRAPOXY_API = "http://192.168.0.103:8890/api"
SCRAPOXY_USERNAME = "username"
SCRAPOXY_PASSWORD = "password"

SCRAPOXY_BLACKLIST_HTTP_STATUS_CODES = [400, 429, 503]
SCRAPOXY_SLEEP_MIN = 60
SCRAPOXY_SLEEP_MAX = 180
# End Scrapoxy setup

# Splash setup
SPLASH_URL = 'http://192.168.0.102:8050'
# End Splash setup

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

ROBOTSTXT_OBEY = False

SPIDER_MIDDLEWARES = {
    "scrapoxy.StickySpiderMiddleware": 101,
}

DOWNLOADER_MIDDLEWARES = {
    # scrapoxy middleware
    'scrapoxy.ProxyDownloaderMiddleware': 100,
    'scrapoxy.BlacklistDownloaderMiddleware': 101,
    ###################
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 300,
    ###################
    # splash middleware
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    ###################
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

Scrapy spider:

import scrapy
from scrapy_splash import SplashRequest

class SplashloginquotesSpider(scrapy.Spider):
    name = "splashLoginQuotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_url = "http://quotes.toscrape.com/login"
    lua_code = '''
    function main(splash, args)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(args.url))
        assert(splash:wait(2))
        assert(splash:set_viewport_full())
        
        form = splash:select('form[action="/login"]')
        token = splash:select('input[name="csrf_token"]').value
        values = {
            csrf_token = token,
            username = 'demo',
            password = 'demo'
        }
        assert(form:fill(values))
        assert(form:submit())
        assert(splash:wait(2))
        
        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
            cookies = splash:get_cookies(),
        }
    end
    '''

    def start_requests(self):
        yield SplashRequest(
            url=self.start_url,
            callback=self.parse,
            endpoint="execute",
            args={
                'width': 1000,
                'lua_source': self.lua_code,
                'url': self.start_url,
            },
        )
    
    def parse(self, response):
        quotes = response.xpath("//div[@class='quote']") 

        for quote in quotes:
            quote_text = quote.xpath(".//span[@class='text']/text()").get()
            yield {
                'quote': quote_text
            }
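For comparison, the Splash log further down shows a variant of this script that sets the proxy from inside Lua with `splash:on_request` / `request:set_proxy`. A minimal sketch of that approach, pointing at the Scrapoxy master address given in settings.py (192.168.0.103:8888); the username/password values are placeholders, not real credentials:

```python
# Hypothetical sketch: route Splash's own outgoing browser requests through
# the Scrapoxy master by setting the proxy inside the Lua script, instead of
# proxying the Scrapy -> Splash connection. Host/port come from the
# SCRAPOXY_MASTER setting; credentials are placeholders.
lua_with_proxy = '''
function main(splash, args)
    splash:on_request(function(request)
        request:set_proxy{
            host = "192.168.0.103",
            port = 8888,
            username = "username",
            password = "password",
            type = "HTTP",
        }
    end)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    return { html = splash:html() }
end
'''
```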

Expected Behavior

Everything works with Scrapy and Scrapoxy alone.
Everything works with Scrapy and Splash alone.

But the aim is to be able to use Scrapy, Scrapoxy and Splash together in the same Scrapy project.
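One documented way to combine the three is to hand the Scrapoxy master to Splash through the Splash HTTP API's `proxy` argument, so that Splash's outgoing requests (rather than the Scrapy → Splash connection) go through the proxy. A minimal sketch, assuming the `proxy` argument is honored by the `execute` endpoint and using the placeholder credentials from settings.py:

```python
# Hypothetical sketch: build the proxy URL Splash expects in its "proxy"
# argument ([protocol://][user:password@]host[:port]) and pass it alongside
# the usual SplashRequest args. Credentials are the settings.py placeholders.

def scrapoxy_proxy_url(host: str, port: int, username: str, password: str) -> str:
    """Build a proxy URL in the format Splash's 'proxy' argument accepts."""
    return f"http://{username}:{password}@{host}:{port}"

# Example args for a SplashRequest(endpoint="execute", args=splash_args):
splash_args = {
    "lua_source": "...",  # the spider's lua_code
    "url": "http://quotes.toscrape.com/login",
    "proxy": scrapoxy_proxy_url("192.168.0.103", 8888, "username", "password"),
}
```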

Steps to Reproduce

I use OVH Public Cloud with 6 proxies.

Failure Logs

Scrapoxy log:
ERROR [MasterService] request_error: socket hang up from proxy 133cbcd6-f593-4853-8469-14525945484c:5283b824-e59c-4bb0-b701-c4b291dad8ae (POST http://192.168.0.102:8050/execute)

Scrapy log:
2024-02-14 23:12:33 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://quotes.toscrape.com/login via http://192.168.0.102:8050/execute> (failed 1 times): 500 Internal Server Error
2024-02-14 23:12:33 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://quotes.toscrape.com/login via http://192.168.0.102:8050/execute> (referer: None)
2024-02-14 23:12:33 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://192.168.0.102:8050/execute>: HTTP status code is not handled or not allowed

Splash log:
2024-02-14 19:23:46.858428 [-] "192.168.0.12" - - [14/Feb/2024:19:23:45 +0000] "POST /execute HTTP/1.1" 400 311 "http://192.168.0.102:8050/info?wait=0.5&images=1&expand=1&timeout=90.0&url=http%3A%2F%2Fquotes.toscrape.com%2Flogin&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%281%29%29%0D%0A++assert%28splash%3Aset_viewport_full%28%29%29%0D%0A++%0D%0A++splash%3Aon_request%28function%28request%29%0D%0A++++++request%3Aset_proxy%7B%0D%0A++++++++host+%3D+%22192.168.0.102%22%2C%0D%0A++++++++port+%3D+8888%2C%0D%0A++++++++username+%3D+%27mik4rrlmtlfz8o4q0wk6y%27%2C%0D%0A++++++++password+%3D+%27xa49grzmg4qewtbknxxs17%27%2C%0D%0A++++++++type+%3D+%27http%27%0D%0A++++++%7D%0D%0A++end%29%0D%0A++%0D%0A++--+On+r%C3%A9cup%C3%A8re+le+formulaire+--%0D%0A++form+%3D+splash%3Aselect%28%27form%5Baction%3D%22%2Flogin%22%5D%27%29%0D%0A++--+On+r%C3%A9cup%C3%A8re+la+valeur+du+token+csrf+--%0D%0A++token+%3D+splash%3Aselect%28%27input%5Bname%3D%22csrf_token%22%5D%27%29.value%0D%0A++--+On+d%C3%A9finit+les+%C3%A9l%C3%A9ments+%C3%A0+soumettre+au+formulaire+--%0D%0A++values+%3D+%7B%0D%0A++++csrf_token+%3D+token%2C%0D%0A++++username+%3D+%27demo%27%2C%0D%0A++++password+%3D+%27demo%27%0D%0A++%7D%0D%0A++--+On+remplit+le+formulaire+avec+les+donn%C3%A9es+--%0D%0A++assert%28form%3Afill%28values%29%29%0D%0A++--+On+envoie+le+formulaire+au+serveur+pour+se+connecter+--%0D%0A++assert%28form%3Asubmit%28%29%29%0D%0A%0D%0A++assert%28splash%3Await%282%29%29%0D%0A++%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
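To see more than the bare 500 on the Scrapy side, one option while debugging is to let error responses reach the spider callback and log Splash's error body. This is a sketch using the standard Scrapy `HTTPERROR_ALLOWED_CODES` setting; the JSON shape of Splash's `/execute` error body (an object with a "description" field) is an assumption here:

```python
import json

# Debugging sketch: allow 400/500 responses through HttpErrorMiddleware so the
# callback can log Splash's error body instead of Scrapy silently ignoring it.
HTTPERROR_ALLOWED_CODES = [400, 500]  # add to settings.py while debugging

def describe_splash_error(body: str) -> str:
    """Extract a readable message from a Splash JSON error body, if present."""
    try:
        data = json.loads(body)
    except ValueError:
        return body  # not JSON; return the raw body
    return data.get("description") or data.get("error") or body
```

In the spider's callback, something like `self.logger.error(describe_splash_error(response.text))` would then surface the underlying Lua or HTTP error.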

Scrapoxy Version

docker version

Custom Version

  • No
  • Yes

Deployment

  • Docker
  • Docker Compose
  • Kubernetes
  • NPM
  • Other (Specify in Additional Information)

Operating System

  • Linux
  • Windows
  • macOS
  • Other (Specify in Additional Information)

Storage

  • File (default)
  • MongoDB & RabbitMQ
  • Other (Specify in Additional Information)

Additional Information

No response

@fabienvauchelles (Owner) commented:

Ok thanks. I will try to reproduce.
