New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
profiler: pollProfilerService consumes 100% CPU when TLS validation fails #3158
Comments
I was able to redeploy with
which now makes sense: some recent efforts were made to strip down the image and the ca-certificates package was a casualty of that. I can fix that, but I'd think this is just one of many things that could go wrong in calling the API, and the client code should be written such that no fault can cause that loop to not sleep. |
Here's the issue: The implementation of |
The previous implementation relied on .createProfile() sleeping a bit, which in turn relies on gax.Invoke's retry behavior and the initial backoff in the retryer used. This works most times, except gax.Invoke has logic to ignore the retryer and immediately fail when the error is certificate validation. Consequently, .createProfile() returns immediately, and pollProfilerService() will consume 100% CPU. To avoid any possibility of the endless loop in pollProfilerService() from consuming 100% CPU, a ticker is added. This addresses this particular instance, and also any unanticipated or future cases that might arise with changes in gax, etc. It also seemed like a good idea to exit pollProfilerService() when the context is cancelled. In the current implementation the context is always ctx.Background() so I don't see how this could happen, but you never know how things might change in the future. Fixes googleapis#3158
@codyoss I'm afraid the direction #3178 is going does not address the issue, and to boot relies on some brittle behavior in parsing error strings and depending on some pretty specific behaviors of gax. Any chance someone else can take a look and weigh in here? It's important to me that the fix doesn't just fix this one thing that could result in spinning this loop at 100%, but all possible failures, since the likely result when this loop spins at 100% CPU is a complete production outage. It's just not acceptable to know the only thing preventing that from happening is the behavior of a remote server, some brittle error string parsing, and the hope that there are no other bugs or unanticipated behaviors elsewhere in the code. |
I will try to get #3178 in its current form merged to address the immediate issue. @bitglue We have to consider other language agents as well to ensure they all behave the same way. I will open a new issue to discuss the topic of improving agent reliability. I will ensure the discussion moves forward productively and will not be abandoned. We would like to ensure the agents are well behaved. |
Where is this new issue? How is the discussion moving forward? What is being done to safeguard against similar bugs which have not yet been discovered? |
The previous implementation relied on .createProfile() sleeping a bit, which in turn relies on gax.Invoke's retry behavior and the initial backoff in the retryer used. This works most times, except gax.Invoke has logic to ignore the retryer and immediately fail when the error is certificate validation. Consequently, .createProfile() returns immediately, and pollProfilerService() will consume 100% CPU. To avoid any possibility of the endless loop in pollProfilerService() from consuming 100% CPU, a ticker is added. This addresses this particular instance, and also any unanticipated or future cases that might arise with changes in gax, etc. It also seemed like a good idea to exit pollProfilerService() when the context is cancelled. In the current implementation the context is always ctx.Background() so I don't see how this could happen, but you never know how things might change in the future. Fixes googleapis#3158
Client
profiler
Environment
Debian buster, in Docker, in AWS
Go Environment
Code
e.g.
Expected behavior
profiling has negligible impact on performance
Actual behavior
application is 1300% slower
Screenshots
I was able to manually grab a cpu profile from a pprof http endpoint, which reveals the program is looping with great vigor in profiler.pollProfilerService():
With
DebugLogging: true
, the failure is obvious:Additional context
I discovered this by inadvertently removing the ca-certificates package from my container. Mea culpa.
But the bug here is the blocking operations in the loop which make it not normally consume 100% CPU are not guaranteed to happen under all conditions. The code in
pollProfilerService()
is effectively:Seems pretty dangerous. No failure (local configuration error, network problem, server outage, etc) should result in the profiler agent consuming 100% CPU.
The text was updated successfully, but these errors were encountered: