
profiler: main loop can consume 100% CPU #3657

Closed
bitglue opened this issue Feb 2, 2021 · 6 comments

Labels
api: cloudprofiler (Issues related to the Cloud Profiler API)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design)

Comments


bitglue commented Feb 2, 2021

Client

profiler

Environment

applicable to any environment

Go Environment

applicable to any Go environment

Code

	cfg := profiler.Config{
		Service:        os.Getenv("EMPIRE_APPNAME") + "." + os.Getenv("EMPIRE_PROCESS"),
		ServiceVersion: os.Getenv("EMPIRE_RELEASE"),
		ProjectID:      projectID,
	}

	return profiler.Start(cfg)

Expected behavior

profiling has negligible overhead and produces a reasonable volume of log output

Actual behavior

profiling consumes significant CPU and emits an unreasonable volume of log output under some conditions

Additional context

pollProfilerService() contains an endless loop with no blocking operations directly in it. Under normal circumstances it calls profileAndUpload(), which blocks while profiling is performed, blocks again on network I/O while uploading the results, and then sleeps for an amount of time determined by the Cloud Profiler server.
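
For illustration, the loop described above has roughly this shape (a simplified sketch based on the description in this issue, not the exact library source; the profile and agent types are placeholders):

	import "context"

	// Placeholder types standing in for the library's internals; this is an
	// illustrative sketch only, not the actual profiler source.
	type profile struct{}
	type agent struct{}

	func (a *agent) createProfile(ctx context.Context) *profile       { return &profile{} }
	func (a *agent) profileAndUpload(ctx context.Context, p *profile) {}

	// Nothing in the loop body itself blocks: all pacing depends on
	// profileAndUpload blocking while it profiles, uploads, and sleeps.
	func pollProfilerService(ctx context.Context, a *agent) {
		for {
			p := a.createProfile(ctx)
			a.profileAndUpload(ctx, p)
		}
	}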

But under abnormal circumstances, it's possible for profileAndUpload() to return immediately, driving a CPU core to 100% and potentially generating a tremendous volume of log output as the loop body is executed as rapidly as possible. For many applications this amounts to a failure of the application.

Many errors are subject to the retry logic in gax.Retry, which backs off exponentially and thus avoids this scenario. But not all errors are: TLS validation errors are not retried, so a failure to install the necessary CAs, or an attempted but unsuccessful attack on TLS, results in application failure as the profiler goroutine effectively mounts a denial-of-service attack on its own application.
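
To illustrate why an error that isn't retried falls straight through, a gax.Retryer along these lines (a hypothetical sketch, not the profiler library's actual retryer) only pauses for the errors it chooses to retry; everything else returns to the polling loop immediately:

	import (
		"time"

		gax "github.com/googleapis/gax-go/v2"
		"google.golang.org/grpc/codes"
		"google.golang.org/grpc/status"
	)

	// hypotheticalRetryer sketches back-off-on-some-errors behavior:
	// retryable codes get an exponentially growing pause; anything else
	// (for example a TLS validation failure, which never surfaces as a
	// retryable gRPC status) is returned to the caller at once.
	type hypotheticalRetryer struct {
		backoff gax.Backoff
	}

	func (r *hypotheticalRetryer) Retry(err error) (time.Duration, bool) {
		if st, ok := status.FromError(err); ok && st.Code() == codes.Unavailable {
			return r.backoff.Pause(), true // back off exponentially and retry
		}
		return 0, false // not retried: the RPC fails, and the loop spins again
	}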

I opened #3158 when I discovered this issue, and #3178 addresses the very specific issue of TLS validation failure. I would say the solution there is rather brittle: it relies on a string match to detect the error, and it assumes gax.Retry retries all errors not matching that string with backoff. While it seems to work now, I think it's not impossible that either of these things could change in the future, causing a regression.

Additionally, there are failure modes besides TLS validation failures which could result in a similar self-DoS scenario. For example, the Cloud Profiler service could return a response indicating a profile duration and sleep of 0 seconds. While I doubt Google would intentionally do such a thing, it's entirely possible it could happen as a result of a configuration or implementation error. There could likewise be bugs in the pprof module which cause it to return immediately. There could be a bug anywhere in the call stack which fails to propagate an error upwards, thus bypassing the gax.Retry backoff. This is not an exhaustive list of possibilities.

In other words, there is a very large surface area of things which could change in a way that violates the very critical assumption: "profileAndUpload() will block for a while". The result of violating that assumption is high (likely catastrophic) CPU usage and/or a DDoS on the Cloud Profiler API.

I proposed #3159 as a solution, which would guarantee that the main loop can't execute at unreasonable speeds regardless of what profileAndUpload() did or did not do. Unfortunately that seems to have been rejected for reasons I don't think are valid:

I'm sure the cost of a ticker is negligible, and the patch does not remove the server's ability to control the profiling rate; it only enforces that the requested rate is reasonable.
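
For illustration, the guard being asked for could look roughly like this (a sketch of the idea behind #3159, not necessarily the exact patch; it reuses the placeholder agent type from the sketch above, and the 10-second floor is an assumed value):

	import (
		"context"
		"time"
	)

	// minPeriod is an assumed floor for illustration; server-requested
	// profiling durations and sleeps are normally much longer than this.
	const minPeriod = 10 * time.Second

	// A ticker enforces a minimum period per iteration, so even if
	// profileAndUpload returns immediately the loop cannot spin faster
	// than once per minPeriod. When profiling and uploading already take
	// longer than minPeriod, a tick is already pending and the receive
	// returns immediately, so the server keeps control of the pacing.
	func pollProfilerService(ctx context.Context, a *agent) {
		ticker := time.NewTicker(minPeriod)
		defer ticker.Stop()
		for {
			p := a.createProfile(ctx)
			a.profileAndUpload(ctx, p)
			select {
			case <-ticker.C:
			case <-ctx.Done():
				return
			}
		}
	}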

I was told a new issue would be opened to consider more robust solutions, but now, three months later, that does not seem to have happened. I am still uncomfortable running an unpatched Cloud Profiler in a production environment because, for the reasons explained above, it is not feasible to show that it will not bring down our application the way it did before.

So I would like to ask that #3159 be reconsidered. That change, or something like it which in a few lines makes it easy to see that the main loop in pollProfilerService() will not spin a CPU into the ground, would make me comfortable running Cloud Profiler in production.

bitglue added the "triage me" label on Feb 2, 2021
product-auto-label bot added the "api: cloudprofiler" label on Feb 2, 2021
yoshi-automation added the "🚨 This issue needs some love." label on Feb 7, 2021
codyoss added the "type: feature request" label and removed the "🚨 This issue needs some love." and "triage me" labels on Feb 8, 2021
nolanmar511 assigned aalexand and unassigned nolanmar511 on Feb 18, 2021

bitglue commented Mar 9, 2021

@codyoss @aalexand It's been 4 months...can I get some attention here please?

aalexand (Contributor) commented

@bitglue I'll try to get to it in the next couple of weeks, sorry about the delay.


bitglue commented Mar 29, 2021

@aalexand How's it going? It's really quite a trivial PR, I'd appreciate some progress.


bitglue commented May 19, 2021

@aalexand @codyoss rerererebump. Six months since I opened the original issue. If you're just going to keep ignoring this, please just be honest and close the issue.

timomitchel commented

@aalexand @codyoss Hey guys, @bitglue passed away, so I'm following up on this issue. Let me know if you have an update. Thanks.

aalexand removed their assignment on Aug 26, 2022
aalexand (Contributor) commented

The original problem was fixed by #3178. This issue asked for a more general fix, but given that there are no other conditions known to cause the spin loop, the ROI is fairly low. I propose to consider this working as intended (WAI).
