Bug? splash 3.0+ instances locking up on certain SSL requests. Does not happen on 2.3.3 #1164

Open
minispeck opened this issue Oct 27, 2022 · 23 comments


@minispeck

minispeck commented Oct 27, 2022

My issue happens on splash 3.0 and 3.5 but NOT on 2.3.3. i am currently running prod on 2.3.3 as a workaround and would like a permanent solution to run 3.x

i have been running splash + HAProxy set up by aquarium for years before experiencing this issue, including successfully rendering the sites in question without issue prior to the day before yesterday

here is a url that consistently produces the issue, even simply using render.html from [host]:8050
https://www.schooljobs.com/careers/kirkwoodcc/jobs/3776251/adjunct-dental-hygiene
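A minimal sketch of reproducing this from Python instead of the browser. The host `localhost:8050` and both timeout values are assumptions, not taken from the report. The client-side timeout is there so the caller does not hang together with the Splash instance.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is reachable here; substitute your own host.
SPLASH = "http://localhost:8050"


def render_html_url(target, splash=SPLASH, timeout=30):
    """Build a render.html URL; `timeout` is Splash's own render timeout in seconds."""
    query = urlencode({"url": target, "timeout": timeout})
    return f"{splash}/render.html?{query}"


def fetch(target, client_timeout=40):
    """Return the rendered HTML, or None if Splash does not answer in time."""
    try:
        with urlopen(render_html_url(target), timeout=client_timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return None
```

Calling `fetch(...)` with the URL above should return `None` when the instance locks up, which makes the failure easy to tell apart from a slow but successful render.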

happens with aquarium default configuration

this happens in both dev (mac OS 15+) and prod (ubuntu) environments, and i did try wiping all my containers and starting over with aquarium. splash works fine for other urls but the above and some others kills it. every time, it locks up the entire docker container (immediately) and the HAPROXY stats shows a level 7 timeout (splash 3.5) or Level 4 timeout (3.0).

image

image

i cannot attach to a splash docker instance that hangs in this way - if i try, my terminal hangs.

thanks to docker-compose with aquarium i can watch splash output live. on 3.5 i often don't even get to see output of the request starting. sometimes i just see the request and then no more output as the instance hangs

image

on 3.0 only i get the following info

image

i have googled the network issue and found a bunch of issues right here in this repo with no clear answers about what is going on.

happy to be very responsive. please let me know if more info is needed. I want to get back to splash 3.x

@minispeck changed the title from "scrapy-splash via aquarium, splash instances locking up on request" to "Bug? splash 3.0+ instances locking up on certain SSL requests. Does not happen on 2.3.3" on Oct 27, 2022

@rodrigosfelix

Same problem

@Gallaecio
Member

Since you say the issue started happening recently, without Splash itself changing, and assuming it is not something that has changed on the target websites, it means something other than Splash itself changed on your end. I assume some newer version of a dependency is at fault here.

My best guess would be Twisted, as Splash 2.3.3 caps it at 16.3.0, while 3.0+ do not cap it, and there have been recent releases. It would be great if someone could check whether pinning Twisted at 16.3.0 works. If it does, we could then find the specific version where the issue starts happening, and that would help identify the issue. I would not rule out that the problem is not Twisted itself, but some indirect dependency that Splash gets through its dependency on Twisted.

@minispeck
Author

minispeck commented Nov 10, 2022

@Gallaecio i'll give it a try today and report back

edit: day got away from me, shooting for monday

@minispeck
Author

minispeck commented Nov 14, 2022

@Gallaecio forcing twisted to 16.3.0 in a splash 3.5 docker container did not resolve the issue. the symptoms are the same.

for clarity in case i did something wrong, i did

docker exec -ti container_name /bin/bash

and once connected, ran

pip install twisted==16.3.0

afterward i ran pip freeze and confirmed the twisted version was indeed 16.3.0

then i ran my scraper that is known to cause the issue and observed the same symptoms

@Gallaecio
Member

Did running pip install twisted==16.3.0 output any warning about existing dependencies being incompatible?

@minispeck
Author

minispeck commented Nov 15, 2022

@Gallaecio one more piece of context, for these tests on my dev environment i'm running one splash 3.5 instance on twisted 16.3.0 and two on default (twisted 19 something)

although i did get the compatibility warning, the instance using twisted 16.3.0 works fine with sites that don't cause this issue, and exhibits the exact same failure behavior with the site that does cause the issue.

edit: i noticed my (working) splash 2.3.3 on prod is actually running twisted 16.1.1 - so i tried that version with splash 3.5 and observed the same issue. so i do not think the twisted version is the problem

@Gallaecio
Member

> i did get the compatibility warning

Which packages was it about? It is possible the issue is not Twisted, but an indirect dependency.

If the issue is neither Twisted nor an indirect dependency, and it is actually an upstream change that is incompatible with newer Splash (i.e. with the WebKit version upgrade Splash 3.0 got), fixing the issue may be rather hard, and unlikely to be done any time soon, if ever.

@minispeck
Author

@Gallaecio the only warning was about splash incompatibility
image

@Gallaecio
Member

Then I don’t think Twisted is the issue :(

@minispeck
Author

@Gallaecio are there any more verbose logs i can produce for splash somehow, or from some directory? there is a splash verbosity setting that defaults to 1 during aquarium setup. I will try messing with that along with anything else you suggest

@Gallaecio
Member

I am not familiar enough with Splash to help much further.

> and assuming it is not something that has changed on the target websites

I might have been wrong here, given dependencies are not an issue. Maybe those websites somehow stopped working with the version of WebKit that Splash 3.x uses.

@minispeck
Author

> I might have been wrong here, given dependencies are not an issue. Maybe those websites somehow stopped working with the version of WebKit that Splash 3.x uses.

this might be true, but splash silently locking up and dying is not good behavior in this case

@minispeck
Author

bump. any ideas, anyone?

@gtsupport-com

Recaptcha introduced code that breaks Splash 3.X in October, confirmed with 3.2 and 3.5. For simply reading a site, adding an on_request() hook at the beginning of your script that blocks any attempts to access a URL that contains "recaptcha/releases" will prevent it from locking up.

I'm not aware of any workarounds or any root-cause information as to what that Javascript is doing that is breaking Splash.

@minispeck
Author

@gtsupport-com thank you for the answer - and my apologies, i'm using the built in splash render.html - are you talking about the lua script? I never did learn lua, could you spell this out for me?

thanks

@gtsupport-com

@minispeck
The methods I've used involved this: splash-on-request

All of my experience has been via /execute and lua scripts thus I'm not familiar with the options for the built in renderers. My first guess would be to place your own proxy in front of your splash instance and block it via that proxy. I don't see an option in the splash documentation to auto-blacklist certain urls; if you're dependent on render.html I don't have an easy answer for you.

@minispeck
Author

minispeck commented Jan 26, 2023

@gtsupport-com oh sorry i meant, i'm happy to move to execute endpoint, just 0 lua knowledge, so assuming i start with a copy of the default script, could you toss me some sample code for on_request to kill those requests?

@gtsupport-com

This will grab that page - delete the "args.url= ..." line if you are passing the URL in externally.
Last line returns both a PNG and HTML, replace with "return splash:html()" if you only need the HTML back for data extraction.

There are a large number of examples on the Splash documentation site, it would be worth your while to dig into the tutorial so you can troubleshoot/tweak if necessary.

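function main(splash, args)
  -- Delete this line if the URL is passed in externally.
  args.url = [[https://www.schooljobs.com/careers/kirkwoodcc/jobs/3776251/adjunct-dental-hygiene]]
  -- Abort any request whose URL contains "recaptcha/releases" (plain substring match, no Lua patterns).
  splash:on_request(function(request)
    if string.find(request.url, "recaptcha/releases", 1, true) ~= nil then
      request:abort()
    end
  end)
  splash:go{args.url}
  splash:wait(2)
  -- Returns both a screenshot and the HTML; use "return splash:html()" for HTML only.
  return {png=splash:png(), html=splash:html()}
end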
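
For anyone calling Splash from Python rather than the web UI, here is a rough sketch of sending the same hook to the `/execute` endpoint as a JSON POST. The host `localhost:8050`, the helper names, and the timeout are assumptions. The embedded script is a variant of the one above that takes the URL from `args.url`, wraps `splash:go` in `assert`, and returns only the HTML.

```python
import json
from urllib.request import Request, urlopen

# Lua script with the same on_request hook: abort reCAPTCHA release bundles, return HTML only.
LUA_SOURCE = """
function main(splash, args)
  splash:on_request(function(request)
    if string.find(request.url, "recaptcha/releases", 1, true) ~= nil then
      request:abort()
    end
  end)
  assert(splash:go(args.url))
  splash:wait(2)
  return splash:html()
end
"""


def build_execute_request(target, splash="http://localhost:8050"):
    """Build a POST request for Splash's /execute endpoint (nothing is sent yet)."""
    body = json.dumps({"lua_source": LUA_SOURCE, "url": target}).encode("utf-8")
    return Request(
        f"{splash}/execute",
        data=body,
        headers={"Content-Type": "application/json"},
    )


def render(target, client_timeout=60):
    """Send the script and return the response body as text."""
    with urlopen(build_execute_request(target), timeout=client_timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With scrapy-splash, the equivalent would be a `SplashRequest` with `endpoint='execute'` and the script passed as `args={'lua_source': ...}`.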

@gtsupport-com

Note that it was also identified by @benreece in #1167 that not only Recaptcha but certain WP plugins cause this issue

@alosultan

@minispeck You should set the engine parameter to chromium instead of webkit (the default engine). In this case, Recaptcha will not disrupt Splash 3.X. However, it's important to note that the Splash documentation warns that the chromium engine is currently in the pre-alpha stage and could potentially lead to crashes in Splash.
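
A small sketch of what passing that parameter could look like from Python. It assumes Splash at `localhost:8050`, and it assumes the `engine` argument is accepted on the endpoint you call; check the documentation for your Splash version, since Chromium support is described there as experimental.

```python
from urllib.parse import urlencode


def render_html_url(target, engine="chromium", splash="http://localhost:8050"):
    """Build a render.html URL that requests a specific rendering engine."""
    if engine not in ("webkit", "chromium"):
        raise ValueError(f"unknown engine: {engine!r}")
    return f"{splash}/render.html?{urlencode({'url': target, 'engine': engine})}"
```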

Another issue arises from the fact that the webkit engine does not pass the check for whether JavaScript is enabled or not, which poses a problem for us even with basic websites that perform this verification.

Please take into consideration: @kmike | @immerrr | @Gallaecio

@alosultan

@minispeck If you insist on using the WebKit engine (it's lightweight and fast, but QtWebKit is awaiting updates - here I want to thank @annulen for his great efforts: thank you very much), you'll need to utilize the filters parameter, as recommended by @gtsupport-com, as a temporary solution.
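
A hedged sketch of the filters approach for `render.html` users. It assumes Splash was started with `--filters-path` pointing at a directory you can write to, and that filter files use Adblock Plus syntax, where a bare substring blocks any request URL containing it. The filter name `norecaptcha` and the host are made up for illustration.

```python
from pathlib import Path
from urllib.parse import urlencode

# Adblock Plus-style rule: a bare substring matches any URL that contains it.
RULE = "recaptcha/releases"


def write_filter(filters_dir, name="norecaptcha"):
    """Write <name>.txt into the directory Splash reads via --filters-path."""
    path = Path(filters_dir) / f"{name}.txt"
    path.write_text(RULE + "\n", encoding="utf-8")
    return path


def render_html_url(target, name="norecaptcha", splash="http://localhost:8050"):
    """Build a render.html URL that asks Splash to apply the named filter."""
    return f"{splash}/render.html?{urlencode({'url': target, 'filters': name})}"
```

As far as I understand the Splash docs, filter files are read when Splash starts, so a restart is likely needed after adding one, and a file named `default.txt` is applied even when no `filters` argument is given.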

@annulen

annulen commented Aug 5, 2023

FYI, you can get updated version of QtWebKit maintained by @mnutt at https://github.com/movableink/webkit/ — it's very close to WebKit's bleeding edge and should have much better compatibility with modern web content (though it's not polished at the moment and can have quite a few rough edges).

@alosultan

This is great & worth a try.
@annulen @mnutt Thank you for your great efforts.
