
500 Internal Server Error with scrapy/splash/scrapoxy #222

Open
devitdc opened this issue Feb 14, 2024 · 1 comment

Comments

devitdc commented Feb 14, 2024

Current Behavior

Hi, I use Scrapy (2.8.0), Scrapoxy (with the Docker image fabienvauchelles/scrapoxy:latest) and Splash (3.5) to scrape data, but I get a 500 Internal Server Error when Splash is running. To illustrate the error I use the website https://quotes.toscrape.com/login

Scrapy is running on macOS on host 192.168.0.12.
Scrapoxy is running as a Docker image on Debian 11.9 on host 192.168.0.103.
Splash is running as a Docker image on Debian 11.9 on host 192.168.0.102.

Scrapy settings.py configuration:

# Scrapoxy setup
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0

SCRAPOXY_MASTER = "http://192.168.0.103:8888"
SCRAPOXY_API = "http://192.168.0.103:8890/api"
SCRAPOXY_USERNAME = "username"
SCRAPOXY_PASSWORD = "password"

SCRAPOXY_BLACKLIST_HTTP_STATUS_CODES = [400, 429, 503]
SCRAPOXY_SLEEP_MIN = 60
SCRAPOXY_SLEEP_MAX = 180
# End Scrapoxy setup

# Splash setup
SPLASH_URL = 'http://192.168.0.102:8050'
# End Splash setup

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

ROBOTSTXT_OBEY = False

SPIDER_MIDDLEWARES = {
    "scrapoxy.StickySpiderMiddleware": 101,
}

DOWNLOADER_MIDDLEWARES = {
    # scrapoxy middleware
    'scrapoxy.ProxyDownloaderMiddleware': 100,
    'scrapoxy.BlacklistDownloaderMiddleware': 101,
    ###################
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 300,
    ###################
    # splash middleware
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    ###################
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

Scrapy spider:

import scrapy
from scrapy_splash import SplashRequest

class SplashloginquotesSpider(scrapy.Spider):
    name = "splashLoginQuotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_url = "http://quotes.toscrape.com/login"
    lua_code = '''
    function main(splash, args)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(args.url))
        assert(splash:wait(2))
        assert(splash:set_viewport_full())
        
        form = splash:select('form[action="/login"]')
        token = splash:select('input[name="csrf_token"]').value
        values = {
            csrf_token = token,
            username = 'demo',
            password = 'demo'
        }
        assert(form:fill(values))
        assert(form:submit())
        assert(splash:wait(2))
        
        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
            cookies = splash:get_cookies(),
        }
    end
    '''

    def start_requests(self):
        yield SplashRequest(
            url=self.start_url,
            callback=self.parse,
            endpoint="execute",
            args={
                'width': 1000,
                'lua_source': self.lua_code,
                'url': self.start_url,
            },
        )
    
    def parse(self, response):
        quotes = response.xpath("//div[@class='quote']") 

        for quote in quotes:
            quote_text = quote.xpath(".//span[@class='text']/text()").get()
            yield {
                'quote': quote_text
            }
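For comparison, the Splash log further down shows a variant of this script that sets the proxy from inside Lua with `splash:on_request` / `request:set_proxy`. A minimal sketch of that approach, pointing at the Scrapoxy master address given in settings.py (192.168.0.103:8888); the username/password values are placeholders, not real credentials:

```python
# Hypothetical sketch: route Splash's own outgoing browser requests through
# the Scrapoxy master by setting the proxy inside the Lua script, instead of
# proxying the Scrapy -> Splash connection. Host/port come from the
# SCRAPOXY_MASTER setting; credentials are placeholders.
lua_with_proxy = '''
function main(splash, args)
    splash:on_request(function(request)
        request:set_proxy{
            host = "192.168.0.103",
            port = 8888,
            username = "username",
            password = "password",
            type = "HTTP",
        }
    end)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    return { html = splash:html() }
end
'''
```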

Expected Behavior

Everything works with Scrapy and Scrapoxy alone.
Everything works with Scrapy and Splash alone.

But the aim is to be able to use Scrapy, Scrapoxy and Splash together in the same Scrapy project.
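One documented way to combine the three is to hand the Scrapoxy master to Splash through the Splash HTTP API's `proxy` argument, so that Splash's outgoing requests (rather than the Scrapy → Splash connection) go through the proxy. A minimal sketch, assuming the `proxy` argument is honored by the `execute` endpoint and using the placeholder credentials from settings.py:

```python
# Hypothetical sketch: build the proxy URL Splash expects in its "proxy"
# argument ([protocol://][user:password@]host[:port]) and pass it alongside
# the usual SplashRequest args. Credentials are the settings.py placeholders.

def scrapoxy_proxy_url(host: str, port: int, username: str, password: str) -> str:
    """Build a proxy URL in the format Splash's 'proxy' argument accepts."""
    return f"http://{username}:{password}@{host}:{port}"

# Example args for a SplashRequest(endpoint="execute", args=splash_args):
splash_args = {
    "lua_source": "...",  # the spider's lua_code
    "url": "http://quotes.toscrape.com/login",
    "proxy": scrapoxy_proxy_url("192.168.0.103", 8888, "username", "password"),
}
```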

Steps to Reproduce

I use OVH Public Cloud with 6 proxies.

Failure Logs

Scrapoxy log:
ERROR [MasterService] request_error: socket hang up from proxy 133cbcd6-f593-4853-8469-14525945484c:5283b824-e59c-4bb0-b701-c4b291dad8ae (POST http://192.168.0.102:8050/execute)

Scrapy log:
2024-02-14 23:12:33 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://quotes.toscrape.com/login via http://192.168.0.102:8050/execute> (failed 1 times): 500 Internal Server Error
2024-02-14 23:12:33 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://quotes.toscrape.com/login via http://192.168.0.102:8050/execute> (referer: None)
2024-02-14 23:12:33 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://192.168.0.102:8050/execute>: HTTP status code is not handled or not allowed

Splash log:
2024-02-14 19:23:46.858428 [-] "192.168.0.12" - - [14/Feb/2024:19:23:45 +0000] "POST /execute HTTP/1.1" 400 311 "http://192.168.0.102:8050/info?wait=0.5&images=1&expand=1&timeout=90.0&url=http%3A%2F%2Fquotes.toscrape.com%2Flogin&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%281%29%29%0D%0A++assert%28splash%3Aset_viewport_full%28%29%29%0D%0A++%0D%0A++splash%3Aon_request%28function%28request%29%0D%0A++++++request%3Aset_proxy%7B%0D%0A++++++++host+%3D+%22192.168.0.102%22%2C%0D%0A++++++++port+%3D+8888%2C%0D%0A++++++++username+%3D+%27mik4rrlmtlfz8o4q0wk6y%27%2C%0D%0A++++++++password+%3D+%27xa49grzmg4qewtbknxxs17%27%2C%0D%0A++++++++type+%3D+%27http%27%0D%0A++++++%7D%0D%0A++end%29%0D%0A++%0D%0A++--+On+r%C3%A9cup%C3%A8re+le+formulaire+--%0D%0A++form+%3D+splash%3Aselect%28%27form%5Baction%3D%22%2Flogin%22%5D%27%29%0D%0A++--+On+r%C3%A9cup%C3%A8re+la+valeur+du+token+csrf+--%0D%0A++token+%3D+splash%3Aselect%28%27input%5Bname%3D%22csrf_token%22%5D%27%29.value%0D%0A++--+On+d%C3%A9finit+les+%C3%A9l%C3%A9ments+%C3%A0+soumettre+au+formulaire+--%0D%0A++values+%3D+%7B%0D%0A++++csrf_token+%3D+token%2C%0D%0A++++username+%3D+%27demo%27%2C%0D%0A++++password+%3D+%27demo%27%0D%0A++%7D%0D%0A++--+On+remplit+le+formulaire+avec+les+donn%C3%A9es+--%0D%0A++assert%28form%3Afill%28values%29%29%0D%0A++--+On+envoie+le+formulaire+au+serveur+pour+se+connecter+--%0D%0A++assert%28form%3Asubmit%28%29%29%0D%0A%0D%0A++assert%28splash%3Await%282%29%29%0D%0A++%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
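To see more than the bare 500 on the Scrapy side, one option while debugging is to let error responses reach the spider callback and log Splash's error body. This is a sketch using the standard Scrapy `HTTPERROR_ALLOWED_CODES` setting; the JSON shape of Splash's `/execute` error body (an object with a "description" field) is an assumption here:

```python
import json

# Debugging sketch: allow 400/500 responses through HttpErrorMiddleware so the
# callback can log Splash's error body instead of Scrapy silently ignoring it.
HTTPERROR_ALLOWED_CODES = [400, 500]  # add to settings.py while debugging

def describe_splash_error(body: str) -> str:
    """Extract a readable message from a Splash JSON error body, if present."""
    try:
        data = json.loads(body)
    except ValueError:
        return body  # not JSON; return the raw body
    return data.get("description") or data.get("error") or body
```

In the spider's callback, something like `self.logger.error(describe_splash_error(response.text))` would then surface the underlying Lua or HTTP error.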

Scrapoxy Version

docker version

Custom Version

  • No
  • Yes

Deployment

  • Docker
  • Docker Compose
  • Kubernetes
  • NPM
  • Other (Specify in Additional Information)

Operating System

  • Linux
  • Windows
  • macOS
  • Other (Specify in Additional Information)

Storage

  • File (default)
  • MongoDB & RabbitMQ
  • Other (Specify in Additional Information)

Additional Information

No response

@fabienvauchelles (Owner) commented:

Ok thanks. I will try to reproduce.
