
Periodic High CPU with RQ 1.16.1 (not in RQ 1.12.0) #2078

Open
jaredbriskman opened this issue Apr 23, 2024 · 1 comment

Comments

@jaredbriskman

Hello RQ folks,

We recently upgraded from RQ 1.12.0 to RQ 1.16.1. Upon deploying this upgrade to our production environment, we noticed RQ having periodic very high CPU events: roughly (but not exactly) once every 6 hours at first, lasting ~30 minutes each, and gradually decreasing in frequency to once every 48-72 hours after ~2 weeks.

When we redeploy the RQ container, the periodicity seems to reset: the CPU spikes immediately return to roughly every 6 hours and then slowly decrease again, leading us to suspect this behavior is tied to the initial start time of our RQ workers.

After rolling back to RQ 1.12.0 (with no other related code changes or rollbacks), the problem disappears entirely. This leads us to suspect the issue is related to changes in RQ's internals rather than to our code (or at least to how those changes interact with our scenario).

Unfortunately, I don't have a good way to reproduce the behavior outside our production environment, as it seems related to the somewhat high throughput of RQ jobs. Our staging environment, with an identical setup but much lower job ingress, does not exhibit this behavior. I looked through the release notes, open issues, and closed PRs, but nothing particularly stood out as a possible culprit.

When looking into profiling snapshots, it seems that during the ~30 minutes of high CPU usage, our RQ workers' throughput slows (as they idle more, fighting for CPU time with whatever is consuming it), causing a backup of queued jobs, which they then successfully burn through after the mysterious CPU spike finishes. As far as we can tell, no jobs are being failed, and there are no changes in job influx during these spikes.

I realize it's a long shot, but does this behavior ring any bells as to what might be causing it somewhere between RQ 1.13.0 and RQ 1.16.1? (Or do you have any other suggestions for things to investigate?)

A few more details on our environment:

- RQ 1.12.0 / 1.16.1 running in Docker, managed via supervisord as the Docker entrypoint, per https://python-rq.org/patterns/supervisor/ (a rough config sketch follows this list)
- Redis version: 7.0.15
- Python 3.10.5
- We are also using flask-RQ2@18.3 and flask-scheduler@0.13.1
- We are running 10 worker processes (via `flask rq worker <queues>`) and 1 scheduler process (via `flask rq scheduler`) in the same container.
- A fairly constant load of ~25 jobs/second on average in production. CPU usage normally sits fairly consistently around 30%, rising to ~90% during these CPU events.
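
For context, the supervisord setup follows the linked pattern; the below is only an illustrative sketch (program names, paths, and queue names are placeholders rather than our actual config):

```ini
; Illustrative sketch only -- program names, paths, and queue names are placeholders.
[program:rqworker]
command=flask rq worker high default low
process_name=%(program_name)s-%(process_num)02d
numprocs=10
directory=/app
stopsignal=TERM
autostart=true
autorestart=true

[program:rqscheduler]
command=flask rq scheduler
directory=/app
autostart=true
autorestart=true
```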

Please let me know if there's anything else I can share that would be helpful. Thank you so much!

@selwin (Collaborator) commented Apr 27, 2024

Are you able to log commands sent to Redis and see if there’s anything abnormal during these periods of high CPU usage?
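
Something along these lines could capture that (a rough sketch using redis-py's MONITOR wrapper, equivalent to running redis-cli MONITOR; the connection details are placeholders, and MONITOR itself adds load, so keep the capture windows short):

```python
# Sketch: stream the commands Redis receives so traffic during a CPU spike
# can be compared against normal traffic. Connection details are placeholders.
import redis

r = redis.Redis(host="localhost", port=6379)

with r.monitor() as m:        # wraps the Redis MONITOR command
    for event in m.listen():  # yields dicts with 'time', 'client_address', 'command', ...
        print(event["time"], event["client_address"], event["command"])
```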

In addition, are you able to use htop to check whether the high CPU usage is caused by RQ’s worker processes?
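
If watching htop interactively inside the container is awkward, a small script like the one below could log which processes are hot during a spike (a rough psutil sketch; the command-line matching is only a guess at how the worker and scheduler processes appear in the process table):

```python
# Sketch: periodically log CPU usage of the rq worker / scheduler processes
# so spikes can be attributed to specific PIDs. The cmdline matching is a guess.
import time
import psutil

def snapshot(window=5):
    procs = []
    for p in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(p.info["cmdline"] or [])
        if "rq worker" in cmdline or "rq scheduler" in cmdline:
            try:
                p.cpu_percent(None)  # prime the per-process CPU counter
                procs.append((p, cmdline))
            except psutil.NoSuchProcess:
                continue
    time.sleep(window)               # measure over a short window
    for p, cmdline in procs:
        try:
            print(f"{p.pid:>7} {p.cpu_percent(None):6.1f}%  {cmdline}")
        except psutil.NoSuchProcess:
            pass

while True:
    snapshot()
    time.sleep(60)
```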
