Severe CPU degradation after vizier-pem update #1810
Comments
Hey @jack-hernandez, I don't believe this is a known issue. Could you please provide information on the types of services you're running, as well as PEM logs? Any other guidance on how to reproduce this issue would really help.
Hi @kpattaswamy, most of the services we run are PHP 7.4, using Apache, supervisord or crontab. We install this on top of an Ubuntu 20.04 base image. We don't have Pixie running on production anymore, but I captured some PEM logs (attached) from our old cluster, which does still have it running (though the cluster is not in use now, so I'm not sure how helpful this will be).

To provide additional context on the symptoms we saw: we noticed an increase in CPU on each individual process running within our application services, particularly when PHP code was being executed. It looks as though the CPU degradation happens very gradually over time and will continue to climb until everything comes to a halt. The recovery was just as gradual: when we removed Pixie, it took a few hours for everything to eventually settle back down. It's also worth noting that even a rollout restart of pods with high CPU utilisation didn't alleviate the problem; they would come straight back up with high CPU rather than climbing from zero again (they were also not being throttled by any CPU limits, or processing anything more compute heavy than we'd expect).

We carried out other troubleshooting steps, working with AWS and other support partners to investigate our nodes, application changes and metrics, networking (specifically DNS), and our Apache configuration, but the only thing that seems to have any real correlation with the behaviour we've seen is this change in the nri-bundle__vizier-pem-dhc2z__pem.log
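For anyone trying to reproduce the gradual per-process CPU climb described above, here is a rough Python sketch that samples a single process's CPU usage from `/proc/<pid>/stat` on a Linux node. The helper names (`cpu_ticks`, `cpu_percent`, `sample_pid`) and the 5-second interval are my own assumptions for illustration, not part of Pixie or any New Relic tooling.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' before tokenising the rest.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU usage over the interval, as a percentage of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_pid(pid, interval_s=5.0):
    """Sample one process twice and return its CPU percentage (Linux only)."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as f:
        before = cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(f"/proc/{pid}/stat") as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)
```

Calling `sample_pid(pid)` periodically for the same worker, with and without the PEM running on the node, would give comparable numbers to attach to the issue.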
Hi there, we've encountered severe degradation of all of our production services running in EKS, with the main symptom being incredibly high and persistent CPU usage across all processes, which looks as though it may have been caused by an update to the vizier-pem service we deploy as part of New Relic's nri-bundle for Pixie. This component seems to automatically update itself when a new image version is released, which for us happened on 13th December at 00:14 GMT with version 0.14.8. Immediately after this, the CPU across all of our services increased significantly:

[screenshot]

After combing through the logs on our EKS nodes, the only thing we noticed was a [pem] <defunct> zombie process running as a child of /app/src/vizier/services/agent/pem/pem on one node. As such, we decided to remove Pixie altogether and found all of the services within our cluster gradually returned to normal, with CPU levels across all pods and nodes drastically reducing (note the screenshot below here is a different cluster with more nodes as we migrated away from the one above):

[screenshot]

Is anyone able to advise on whether or not this is a known issue and what exactly might be causing this change in behaviour? We've been running Pixie for almost 2 years now without issue so it's very concerning that the latest image version update has caused such problems for us.

nri-bundle version 5.0.4
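As a rough aid for checking other nodes for the same kind of defunct child process, the following Python sketch parses the output of `ps -eo pid,ppid,stat,comm` and reports every zombie together with the command name of its parent. The function names `find_zombies` and `list_zombies` are hypothetical, and the sketch assumes a Linux node with the procps `ps` available; it is not part of Pixie.

```python
import subprocess


def find_zombies(ps_output):
    """Return zombie processes found in `ps -eo pid,ppid,stat,comm` output."""
    procs = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        procs[int(pid)] = (int(ppid), stat, comm.strip())

    zombies = []
    for pid, (ppid, stat, comm) in procs.items():
        # A STAT value starting with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z"):
            parent_comm = procs.get(ppid, (None, None, "?"))[2]
            zombies.append(
                {"pid": pid, "ppid": ppid, "comm": comm, "parent_comm": parent_comm}
            )
    return zombies


def list_zombies():
    """Run ps on the local machine and return its zombie processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return find_zombies(out)
```

Running `list_zombies()` on each node and comparing the `parent_comm` values would show whether the defunct process appears only under the PEM or elsewhere too.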