Ruby 3.3 and stuck workers. #1895
We've been experiencing similar issues with frozen workers ever since upgrading to ruby 3.3.0.

Versions:
Small update: historically we have had some workers die, where they freeze and also lose their heartbeat. However, with ruby 3.3 these workers never lose their heartbeat, and thus never get cleaned up. @rexmadden, @leboshi, have you had a similar experience? Seems like there is a historical issue here that has mutated with ruby 3.3.
We didn't get a chance to dive into a diagnosis as deeply as you did, but the symptoms are consistent with yours. Workers would run a few jobs then just hang forever.
We've been having this exact issue. Since upgrading Ruby from 3.1.2 to 3.3.0 about a month ago, resque workers have been going to ~0% CPU and stopping most logging, but not dying. They stopped working one by one over the course of a couple of days. Bouncing the service restored them for a while, but they eventually stop working again. In every pod's logs I've checked, the last two real things logged are similar to the following (always ends with …). Normally next we'd see something like …

Versions:
Update: We are no longer seeing lazy (0% CPU) workers after downgrading to Ruby 3.2.3. Obviously this is not a long-term solution though.
Just wanted to post an update here since I have been looking into this in the background, but I'm also not running Ruby 3.3 anywhere in production yet, so I haven't been able to triage at scale. I could use a little help in nailing down the exact behavior here, though, along with any logs or other info that people can provide. Specifically, are people seeing the actual forked workers die off/exit? Or are they actually running but just no longer receiving jobs? Lastly, what version of the redis server and gem are you running?
Thanks for taking a look! I just added some more version details to my comment above. We did not see the resqueworker service pods exit. They kept running but stopped processing jobs. If I remember correctly, we did continue to see some random logging (unrelated to processing jobs). Not sure if that answers your question.
I accidentally closed this, and reopened. @PatrickTulskie What we noticed was that jobs that should take a few seconds were taking 45+ minutes, and the worker simply never stopped working that job. We also saw no logs from those workers, and we have extensive logging around our jobs. The only logs we got were from the enqueuer. We often get workers similar to this, but those workers are missing a heartbeat; we have an outside task that detects these and cleans them up. However, in this case these workers still had active heartbeats and their state was working. When we unregistered these workers (by getting their ID and unregistering them) or restarted our worker dynos, we noticed that the failed jobs vanished; they did not end up in the failed queue as expected.
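For reference, the kind of manual cleanup described above might look roughly like this; this is a sketch assuming Resque's `Resque::Worker.find` and `#unregister_worker` APIs, and the worker ID shown is made up:

```ruby
# Sketch: manually unregister a worker that still heartbeats but is stuck.
# The worker ID below is hypothetical; in practice it comes from
# Resque::Worker.all or your own monitoring.
require "resque"

stuck_id = "example-host:12345:auto-order" # hypothetical host:pid:queues ID

if (worker = Resque::Worker.find(stuck_id))
  # Removes the worker's registration (and heartbeat) from Redis. As noted
  # above, the job it claimed to be working may vanish rather than end up
  # in the failed queue.
  worker.unregister_worker
end
```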
This. The jobs will not be marked as expired heartbeats and will claim to still be working, but they'll be idle... We can also see via the logs that a particular job for that worker has finished, but it will be in this idle state anyway, not picking anything else up and still "working".

```
irb(main)> Resque::Worker.all_workers_with_expired_heartbeats
[]
irb(main)> some_stalled_worker.working?
true
```

Unfortunately we've resorted to migrating to Sidekiq for some business-critical cron jobs.
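For anyone trying to detect this state from the outside, here is a rough sketch (our own idea, not a Resque feature) that flags workers whose heartbeat is fresh but whose reported job has been "running" far longer than expected, using the same worker/job data shown in this thread; the threshold is arbitrary:

```ruby
# Sketch: flag workers that still heartbeat but have reported the same job
# as running for longer than a chosen threshold. STUCK_AFTER is arbitrary.
require "resque"
require "time"

STUCK_AFTER = 30 * 60 # seconds; pick something well above your longest job

stuck = Resque.workers.select do |worker|
  job = worker.job
  next false if job.nil? || job.empty?

  run_at = Time.parse(job["run_at"]) rescue nil
  run_at && (Time.now - run_at) > STUCK_AFTER
end

stuck.each { |w| puts "possibly stuck: #{w} working since #{w.job['run_at']}" }
```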
I also haven't upgraded anything significant to ruby 3.3 in production yet (I've been waiting for the 3.3.1 release, which will fix some bugs that affect my codebases). So I can't really offer anything but wild guesses at this point, but...

Unfortunately, (if I remember correctly) the heartbeat runs in a separate thread in the worker process, so it's only good for knowing if the worker process itself has frozen. It can't tell you if the code in the main thread has hung indefinitely or entered an infinite loop. And (again, IIRC) the heartbeat only runs in the parent worker process, and not in the forked child process that works each job. The parent simply waits on the child.

If (hypothetically) in ruby 3.2 some system calls or other VM code could hang without a timeout while holding the global VM lock, and if ruby 3.3 improved that code to no longer hold the global VM lock but it still hangs without a timeout, then that might explain your symptoms. This is just a wild guess of a hypothesis.

Have you tried running with …? If you have the ability to point …
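To illustrate the point about the heartbeat thread, here is a minimal sketch of the pattern being described (not Resque's actual implementation; the Redis key, interval, and worker ID are made up): the heartbeat keeps updating even though the main thread is blocked waiting on the forked child.

```ruby
# Minimal sketch of a heartbeat-in-a-thread worker (not Resque's real code).
# The heartbeat stays fresh even if the main thread blocks forever.
require "redis"
require "time"

redis = Redis.new
worker_id = "example-worker:1:default" # hypothetical worker ID

heartbeat = Thread.new do
  loop do
    redis.hset("workers:heartbeat", worker_id, Time.now.utc.iso8601)
    sleep 60
  end
end

child = fork do
  sleep # stand-in for a job that hangs indefinitely in the child
end

# The parent blocks here; if the child never exits, the worker looks alive
# (heartbeat current, state "working") while doing nothing.
Process.waitpid(child)
heartbeat.kill
```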
Hey, is this still an issue being experienced on ruby 3.3.1? Has anyone tested it in production?
We have not tried to run 3.3.1 on any production apps. I assume it would still be an issue. I might have some time to test on a staging app soon.
We are hitting what appears to be this issue, running ruby 3.3.1. I haven't tried to instrument additional logging on the fork hooks yet, though from OP's comment it sounds like I won't see those logs anyway. We are getting workers that are registered as running job X, but have no logs from application code that we expect to see from that job, and are evidently idling, doing nothing. The last logs from the hung workers are from the previous job, which we have evidence came to a successful completion without errors.

In case it's worth any kind of hint, we previously had a similarly presenting issue of Resque workers hanging after failure, due to a combination of ddtrace and airbrake, and Airbrake's change in 6.2 to not use …
After adding logging to the various available hooks, I have a bit more info in response to @PatrickTulskie's and @nevans's questions.

Context:

Specific scenario:
We can see the worker in question has been running for hours.

```
{now: Time.zone.now.utc, state: worker.state, started: worker.started, hostname: worker.hostname, job: worker.job}
=>
{:now=>2024-05-16 02:18:18.731606145 UTC,
 :state=>:working,
 :started=>"2024-05-15 20:17:47 +0000",
 :hostname=>"worker-auto-order-5688d5ccf-jph4w",
 :job=>
  {"queue"=>"auto-order",
   "run_at"=>"2024-05-15T23:11:51Z",
   "payload"=>{"class"=>"AutoOrderVpi", "args"=>[671467342, 8846, 15]}}}
```

Logs from this worker filtered to the hook messages show the last two jobs (671467333, 671467337) forking, starting, and completing, but no logs whatsoever from the current job (671467342):
The unfiltered logs show that this …
Notice that the last log line from this worker is immediately before the current reported job's run_at.

The worker does not have an expired heartbeat:

```
Resque::Worker.all_workers_with_expired_heartbeats
=> []
```

And indeed it is still sending active heartbeats:

```
heartbeats = Resque::Worker.all_heartbeats.select { |h| h.split(":").first == ENV["POD_NAME"] }; {now: Time.zone.now.utc, heatbeats: heartbeats }
=>
{:now=>2024-05-16 02:21:28.461478023 UTC,
 :heatbeats=>{"worker-auto-order-5688d5ccf-jph4w:1:auto-order-fast,auto-order"=>"2024-05-16T02:20:48+00:00"}}
```

ps output:
Unfortunately I can't get …
That's about as much as I can get from this pod, which will likely get cycled before additional responses here. I'm not fully opposed to running a limited experiment with added capabilities on a limited set of pods to get rbspy and/or strace, though doing so is uncharted territory for me at this time.

To summarize:
@ozydingo thanks for that. Interesting that it just goes silent even though it's sending heartbeats. For those of you experiencing this issue, are you able to consistently reproduce it, or is it sporadic and doesn't always happen? I'm just trying to think of how to create a test setup for tracing where we can reproduce the bug: something like a worker that is constantly getting jobs where it sleeps for n seconds, or something, until it breaks.
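Along the lines of that test setup, a repro harness might look roughly like this (the job class, queue name, and timings are all made up):

```ruby
# Sketch of a possible repro: a trivial job that sleeps, enqueued in a loop
# so workers fork constantly. All names here are hypothetical.
require "resque"

class SleepyJob
  @queue = :repro

  def self.perform(seconds)
    sleep seconds
  end
end

# Run this enqueuer alongside workers on ruby 3.3, then watch for workers
# that stay "working" long after the sleep should have finished.
loop do
  Resque.enqueue(SleepyJob, 5)
  sleep 1
end
```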
Unfortunately it's inconsistent. Our system seems to go for hours without issue, then things start going south. And in the case of the worker I documented here, this is in a deployment of usually 5 worker pods processing a pretty constant churn of polling jobs (consistent with the previous job logs shown), so it doesn't seem to be related to any period of idleness. I wish I could give you more. And I'm definitely stumped, as from what I can tell from the above, it seems we have hit lines 271 and 272 of …
I was able to get strace and rbspy working, and got another example to look at. Since it's a bit buried, I'll highlight here the bit below about ps: the parent process is "forked" instead of "processing" as in the previous example.

rbspy:
Line 285 appears to be the … Is there some exception bubbling up here causing …?

strace:
Output hangs there with no closing paren and no newline. I don't know how to parse strace output, so I'm just parroting it. The parent process has more verbose but maybe not so informative output. Here's a sample -- it just keeps vomiting out more if I let it run:
There are some potentially more interesting snippets:

and

I'm not sure what those pids are, but as a reminder, this container is only running …
rbspy for the parent process:
Sorry, I don't have time to do more than skim all of the details you've shared. But, with that caveat, I would personally experiment with 1) disabling or removing telemetry (ddtrace, airbrake, etc) on a percentage of your workers, and 2) setting …

Obviously you'll still need whatever telemetry you disable, but if it is related to DataDog or Airbrake, at least you'll have another clue about where to debug. I've debugged similar issues in the past and tracked them down to airbrake and/or newrelic_rpm (IIRC).

I'm curious, for the others who had issues, what telemetry gems are you running (and what versions)?
Those are good suggestions to experiment with; unfortunately I think I've run out of time to keep this condition going in our prod application and will have to revert to ruby 3.2 for now to restore stability. FWIW:
We did not use airbrake or ddtrace, but did use newrelic_rpm 9.7.0 at the time the bug occurred.
We downgraded to Ruby 3.2 last week and haven't had any issues since. So FWIW, while telemetry may be closely involved here, this does appear to be uniquely a Resque x Ruby 3.3 issue. I'm happy to continue setting up tests using our stack in a staging / preview environment if I can assist in getting to the bottom of this issue.
Given that the ruby 3.4 preview was released, someone (or maybe I) can give it a try and see if the issue still exists.
We recently updated to ruby 3.3 and noticed that workers were becoming stuck processing a job. Sometimes a few workers became stuck right after a deploy; usually resque would clear these. As Resque processed jobs, more and more workers would get stuck, and the jobs they were processing should take seconds, not an hour or more. Upon investigating, we added logging to the before/after hooks; none of these logs were ever logged from a worker that got stuck.
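The hook logging described above might look roughly like this (logger destination and messages are ours; Resque's job-level before/after perform hooks could be instrumented the same way):

```ruby
# Sketch of logging added to Resque's fork hooks; messages are illustrative.
require "resque"
require "logger"

logger = Logger.new($stdout)

Resque.before_fork do |job|
  logger.info("before_fork (parent pid #{Process.pid}): #{job.payload.inspect}")
end

Resque.after_fork do |job|
  logger.info("after_fork (child pid #{Process.pid}): #{job.payload.inspect}")
end
```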
Versions
It is also possible this is a resque-pool gem issue.
Downgrading back to ruby 3.2 seems to be the answer; we have not noticed anything stuck since.