profiler: main loop can consume 100% CPU #3657
Labels
api: cloudprofiler - Issues related to the Cloud Profiler API.
type: feature request - ‘Nice-to-have’ improvement, new feature or different behavior or design.
Comments
product-auto-label bot added the api: cloudprofiler label on Feb 2, 2021.
codyoss added the type: feature request label and removed the "🚨 This issue needs some love." and "triage me" labels on Feb 8, 2021.
@bitglue I'll try to get to it in the next couple of weeks, sorry about the delay.

@aalexand How's it going? It's really quite a trivial PR; I'd appreciate some progress.

The original problem was fixed by #3178. This issue asked for a more general fix, but given that there are no other known conditions that cause the spin loop, the ROI is fairly low. I propose to consider this WAI (working as intended).
Client
profiler
Environment
applicable to any environment
Go Environment
applicable to any Go environment
Code
Expected behavior
Profiling has negligible overhead and produces a reasonable volume of log output.
Actual behavior
Profiling consumes significant CPU and spews an unreasonable volume to the logs under some conditions.
Additional context
pollProfilerService() contains an endless loop with no blocking operations directly in it. Under normal circumstances it calls profileAndUpload(), which blocks while profiling is performed, blocks again on network IO while uploading the results, and then sleeps for an amount of time determined by the Cloud Profiler server.

But under abnormal circumstances, it's possible for profileAndUpload() to return immediately, running a CPU up to 100% and potentially generating tremendous log volume as the loop body iterates as rapidly as possible. For many applications this means a failure of the application.

Many errors are subject to retry logic in gax.Retry, which backs off exponentially and thus avoids this scenario. But not all errors: TLS validation errors are not retried, so a failure to install the necessary CAs, or an attempted but failed attack on TLS, results in application failure as the profiler goroutine effectively executes a denial of service attack on the application. I opened #3158 when I discovered this issue, and #3178 addresses the very specific case of TLS validation failure. I would say the solution there is rather brittle: it relies on a string match to detect the error, and it assumes gax.Retry retries with backoff all errors not matching that string. While it seems to work now, it's not impossible that either of these things could change in the future, causing a regression.

Additionally, there are failure modes besides TLS validation failures which could result in a similar self-DoS scenario. For example, Cloud Profiler could return a response indicating a profile duration and sleep of 0 seconds. While I doubt Google would intentionally do such a thing, it's entirely possible it could happen as a result of a configuration or implementation error. There could likewise be bugs in the pprof module which cause it to return immediately, or a bug anywhere in the call stack which fails to propagate an error upwards, bypassing the gax.Retry backoff. This is not an exhaustive list of possibilities.

In other words, there is a very large surface area of things which could change that would invalidate the critical assumption that "profileAndUpload() will block for a while". The cost of violating that assumption is high (likely catastrophic) CPU usage and/or a DDoS on the Cloud Profiler API.

I proposed #3159 as a solution, which would guarantee that the main loop can't execute at unreasonable speeds regardless of what profileAndUpload() did or did not do. Unfortunately that seems to have been rejected for reasons I don't think are valid: I'm sure the cost of a ticker is negligible, and the patch does not remove the server's ability to control the profiling rate; it only enforces that the requested rate is reasonable.
I was told a new issue would be opened to consider more robust solutions, but three months later this does not seem to have happened. I am still uncomfortable running an unpatched Cloud Profiler in a production environment because, for the reasons explained, it is not feasible to show that it will not bring down our application the way it did before.

So I would like to ask that #3159 be reconsidered. This, or something like it which in a few lines makes it easy to see that the main loop in pollProfilerService() will not spin a CPU into the ground, would make me comfortable running Cloud Profiler in production.