profiler: main loop can consume 100% CPU #3657
Labels
api: cloudprofiler - Issues related to the Cloud Profiler API.
type: feature request - ‘Nice-to-have’ improvement, new feature or different behavior or design.
Comments
product-auto-label bot added the api: cloudprofiler label on Feb 2, 2021.
codyoss added the type: feature request label and removed the "🚨 This issue needs some love." and "triage me" labels on Feb 8, 2021.
@bitglue I'll try to get to it in the next couple of weeks, sorry about the delay.

@aalexand How's it going? It's really quite a trivial PR; I'd appreciate some progress.

The original problem was fixed by #3178. This issue asked for a more general fix, but given that there are no other known conditions that cause the spin loop, the ROI is fairly low. I propose to consider this WAI (working as intended).
Client
profiler
Environment
applicable to any environment
Go Environment
applicable to any Go environment
Code
Expected behavior
Profiling has negligible overhead and produces a reasonable volume of log output.
Actual behavior
Profiling consumes significant CPU and spews an unreasonable volume to the logs under some conditions.
Additional context
pollProfilerService() contains an endless loop with no blocking operations directly in it. Under normal circumstances it calls profileAndUpload(), which blocks while profiling is performed, blocks again on network IO while uploading the results, and then sleeps for an amount of time determined by the Cloud Profiler server.

But under abnormal circumstances, it's possible for profileAndUpload() to return immediately, running a CPU up to 100% and potentially generating tremendous log volume as the loop body iterates as rapidly as possible. For many applications this means a failure of the application.

Many errors are subject to retry logic in gax.Retry, which backs off exponentially and thus avoids this scenario. But not all errors: TLS validation errors are not retried, so a failure to install the necessary CAs, or an attempted but failed attack on TLS, results in application failure as the profiler goroutine effectively executes a denial of service attack on the application. I opened #3158 when I discovered this issue, and #3178 addresses the very specific case of TLS validation failure. I would say the solution there is rather brittle: it relies on a string match to detect the error, and it assumes gax.Retry retries with backoff all errors not matching that string. While it seems to work now, it's not impossible that either of these things could change in the future, causing a regression.

Additionally, there are failure modes besides TLS validation failures which could result in a similar self-DoS scenario. For example, Cloud Profiler could return a response indicating a profile duration and sleep of 0 seconds. While I doubt Google would intentionally do such a thing, it's entirely possible it could happen as a result of a configuration or implementation error. There could likewise be bugs in the pprof module which cause it to return immediately, or a bug anywhere in the call stack which fails to propagate an error upwards, bypassing the gax.Retry backoff. This is not an exhaustive list of possibilities.

In other words, there is a very large surface area of things which could change that would invalidate the critical assumption that "profileAndUpload() will block for a while". The cost of violating that assumption is high (likely catastrophic) CPU usage and/or a DDoS on the Cloud Profiler API.

I proposed #3159 as a solution, which would guarantee that the main loop can't execute at unreasonable speeds regardless of what profileAndUpload() did or did not do. Unfortunately that seems to have been rejected for reasons I don't think are valid: I'm sure the cost of a ticker is negligible, and the patch does not remove the server's ability to control the profiling rate; it only enforces that the requested rate is reasonable.
I was told a new issue would be opened to consider more robust solutions, but three months later this does not seem to have happened. I am still uncomfortable running an unpatched Cloud Profiler in a production environment because, for the reasons explained, it is not feasible to show that it will not bring down our application the way it did before.

So I would like to ask that #3159 be reconsidered. This, or something like it which in a few lines makes it easy to see that the main loop in pollProfilerService() will not spin a CPU into the ground, would make me comfortable running Cloud Profiler in production.