Chromium engine not working with lua #1186

Open
kolumbyt opened this issue Dec 10, 2023 · 0 comments

kolumbyt commented Dec 10, 2023

Hello,
I'm trying to build a scraping bot for a site that relies on JavaScript. I have about 20 URLs from the site and would like to scale to hundreds. The URLs need to be scraped quite often, so I tried using a Lua script to make the waiting times "dynamic", that is, to wait only until the content appears instead of a fixed time. With the default WebKit engine, the HTML output is just a message saying that the site doesn't support this browser, which is why I'm using the Chromium engine. Without the Lua script, scraping worked and returned items, but only with the Chromium engine. Once I added the Lua script, I got errors with the Chromium engine; with WebKit it ran without errors but returned no items because, as I said, the site doesn't support it. This is the start request I'm using with the Lua script:

# start_requests method of the spider
# (needs `from scrapy_splash import SplashRequest` at module level)
def start_requests(self):
    lua_script = """
    function main(splash, args)
        splash:set_user_agent(args.user_agent)
        assert(splash:go(args.url))
        local try_count = 0
        local max_tries = 10
        while try_count < max_tries do
            splash:wait(1)
            local match_rows = splash:select_all('.o-matchRow')
            if #match_rows > 0 then
                break
            end
            try_count = try_count + 1
        end
        return {html = splash:html()}
    end
    """

    # Chrome user agent
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'

    for url in self.start_urls:
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='execute',
            args={
                'lua_source': lua_script,
                'user_agent': user_agent,
                'engine': 'chromium'
            }
        )
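
For readers less familiar with Lua, the polling loop in the script above corresponds roughly to the following Python sketch. The name `wait_until` and its parameters are hypothetical and exist only for illustration; the real waiting happens inside Splash, not in the spider.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    max_tries: int = 10,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `condition` up to `max_tries` times, sleeping `delay` seconds first.

    Hypothetical helper that mirrors the Lua loop: wait, check, stop early
    once the condition holds. Returns True if the condition was met.
    """
    for _ in range(max_tries):
        sleep(delay)       # corresponds to splash:wait(1)
        if condition():    # corresponds to `#match_rows > 0`
            return True
    return False
```

As in the Lua version, the worst case costs `max_tries * delay` seconds, while a page that renders quickly returns after the first check.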

It's something simple I wanted to test out. Does anyone know what the problem is with Lua and the Chromium engine, or how I can use WebKit when the site doesn't support it? (Sorry for my English, I'm not a native speaker.) These are the errors with the Chromium engine:

2023-12-10 09:49:45 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tipsport.cz/kurzy/fotbal-16?limit=1000 via http://localhost:8050/execute>
Traceback (most recent call last):
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 68, in process_response
    method(request=request, response=response, spider=spider)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 412, in process_response
    response = self._change_response_class(request, response)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 433, in _change_response_class
    response = response.replace(cls=respcls, request=request)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\http\response\__init__.py", line 125, in replace
    return cls(*args, **kwargs)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 120, in __init__
    self._load_from_json()
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 174, in _load_from_json
    error = self.data['info']['error']
TypeError: string indices must be integers, not 'str'
2023-12-10 09:49:45 [scrapy.core.engine] INFO: Closing spider (finished)
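
The last frame fails on `self.data['info']['error']`, which is consistent with either the decoded response or its `info` field being a plain string rather than a dict. What Splash actually returned here is not shown, so the sketch below is only an assumption-laden illustration of how such a body could be read defensively. The helper name `describe_splash_error` is hypothetical, and the `description` key is assumed from the usual shape of Splash error responses.

```python
import json
from typing import Any


def describe_splash_error(body: str) -> str:
    """Return a readable message from a (possibly malformed) error body.

    Hypothetical helper: tolerates `info` being a dict, a string, or absent,
    and tolerates a body that is not JSON at all.
    """
    try:
        data: Any = json.loads(body)
    except ValueError:
        return body  # not JSON; show the raw text
    if not isinstance(data, dict):
        return str(data)  # e.g. the whole body was a JSON string
    info = data.get("info")
    if isinstance(info, dict):
        return str(info.get("error", info))
    if info is not None:
        return str(info)  # the case that would trigger the TypeError above
    return str(data.get("description", data))  # `description` is an assumed key
```

Indexing a string with `'error'` raises exactly the `TypeError` shown in the traceback, so a check like `isinstance(info, dict)` is what separates the two cases.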

I've been trying to set this up correctly for the past few days, but I'm not getting anywhere. It seemed that I should build a custom Splash image, so I did, but it doesn't really work. The element I'm checking for is on the page; it worked before without the Lua script. Changing the user agent didn't help either, so it seems I do need the Chromium engine. The data handling should also be correct, because it produced items before. What should I try next? The issue seems to be just that Lua doesn't work with the Chromium engine. Are there other ways to do the "dynamic" waits? Or can I use WebKit on a site that doesn't support it?
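
One possible next step is to bypass scrapy_splash and look at the raw body Splash returns, since the traceback only shows that the body could not be parsed. The sketch below posts the same arguments used in the spider (`url`, `lua_source`, `user_agent`, `engine`) as JSON to the `execute` endpoint from the log. The function names are hypothetical, and the 90-second timeout is an arbitrary assumption.

```python
import json
import urllib.error
import urllib.request

# Endpoint taken from the log line above; adjust if Splash runs elsewhere.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"


def build_execute_request(url, lua_source, user_agent, engine="chromium"):
    """Build a POST request carrying the same args the spider sends."""
    payload = {
        "url": url,
        "lua_source": lua_source,
        "user_agent": user_agent,
        "engine": engine,
    }
    return urllib.request.Request(
        SPLASH_EXECUTE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_raw(request, timeout=90.0):
    """Return (status, body text), including for HTTP error responses."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Error responses still carry a body worth reading.
        return err.code, err.read().decode("utf-8", errors="replace")
```

With Splash running, calling `fetch_raw(build_execute_request(...))` with the spider's URL and Lua script should show the status code and the unparsed body, which may say more about why the Chromium engine rejects the script.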
