
fix(profiler): Force gax to retry in case of certificate errors #3178

Merged
27 commits merged into googleapis:master on Jan 28, 2021

Conversation


@kalyanac kalyanac commented Nov 10, 2020

Fixes #3158

Tested:
Deleted certificates from a GCP Drawfork VM and ran the sample program:

~/go/src/github.com/GoogleCloudPlatform/golang-samples/profiler/profiler_quickstart$ go run main.go
2020/11/13 18:12:19 Cloud Profiler: Cloud Profiler Go Agent version: 20201104
2020/11/13 18:12:19 Cloud Profiler: profiler has started
2020/11/13 18:12:19 Cloud Profiler: creating a new profile via profiler service
2020/11/13 18:12:19 Cloud Profiler: failed to create profile, will retry: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"
2020/11/13 18:13:11 Cloud Profiler: creating a new profile via profiler service
2020/11/13 18:13:11 Cloud Profiler: failed to create profile, will retry: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"
2020/11/13 18:14:12 Cloud Profiler: creating a new profile via profiler service
2020/11/13 18:14:12 Cloud Profiler: failed to create profile, will retry: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"

Added the certificates and ran the sample program:

~/go/src/github.com/GoogleCloudPlatform/golang-samples/profiler/profiler_quickstart$ go run main.go
2020/11/13 18:11:42 Cloud Profiler: Cloud Profiler Go Agent version: 20201104
2020/11/13 18:11:42 Cloud Profiler: profiler has started
2020/11/13 18:11:42 Cloud Profiler: creating a new profile via profiler service
2020/11/13 18:11:42 Cloud Profiler: successfully created profile CPU
2020/11/13 18:11:52 Cloud Profiler: start uploading profile
2020/11/13 18:11:52 Cloud Profiler: creating a new profile via profiler service
2020/11/13 18:11:52 Cloud Profiler: successfully created profile CPU
2020/11/13 18:12:03 Cloud Profiler: start uploading profile
2020/11/13 18:12:03 Cloud Profiler: creating a new profile via profiler service
2020/11/13 18:12:03 Cloud Profiler: successfully created profile CPU

go test -v integration_test.go completed successfully.

@kalyanac kalyanac requested a review from a team as a code owner November 10, 2020 02:24
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Nov 10, 2020
@product-auto-label product-auto-label bot added the api: cloudprofiler Issues related to the Cloud Profiler API. label Nov 10, 2020
profiler/profiler.go
@@ -345,8 +352,12 @@ func (a *agent) createProfile(ctx context.Context) *pb.Profile {
}
}))

if err != nil {
Contributor

nit:

if err == nil {
  log
}
return p, err

profiler/profiler.go
p := a.createProfile(ctx)
p, err := a.createProfile(ctx)
if err != nil {
debugLog("Cloud Profiler Go Agent encountered permanent error, profiling will be disabled.")
Contributor

Include the error in the message?


bitglue commented Nov 10, 2020

I don't really like this solution.

Firstly, certificate errors aren't really permanent. An administrator, on seeing these errors, could install the appropriate certificates. There could be a certificate misconfiguration on the server side. Something funny with DNS or IP routing could be causing the request to go to the wrong host (a MITM attack, either malicious or part of a corporate firewall, an administrator error in using iptables, or some screw-up with BGP on the internet...). While these errors should be uncommon, they shouldn't be unrecoverable. In fact they should be expected, because TLS exists to mitigate them: that implies they can and will happen.

What this patch does is shut down profiling whenever TLS does its job. Some of these failure modes could shut down profiling for every Google customer until they restart their processes. Unless they've built retry behavior of their own (which they won't). And since "keep profiling until the process exits" is by far the most common use case, why would that retry behavior not be implemented in the library?

Secondly, I don't think the CPU spinning bug is because of the certificate error behavior in gax. The intention when a user calls .Start() is pretty clear:

  1. do profiling
  2. do it at a reasonable pace

Looking at the implementation of pollProfilerService(), it's clear that it captures the first requirement but not the second. Reasoning about whether the second requirement is met or not requires analyzing hundreds if not thousands of lines of code and a call stack several layers deep. Who knows what unanticipated behavior lurks in there?

A failure to anticipate that the call may return immediately if certificate validation fails is just one way this might go wrong. Is it the only way? Are the people making changes to gax aware of the implicit contract that's been created here and mindful of how they might break that contract, causing similar failures in the future?

Admittedly, exiting the loop when createProfile encounters an error is better -- given the choice between stopping profiling and pegging a CPU with an endless loop, I'd choose the former. But again, this relies on createProfile returning an error whenever it returns immediately, which is probably true but difficult to prove. And it opens the door to a new class of bugs that will probably be along the lines of "profiling just randomly stops working sometimes and I have no idea why".


bitglue commented Nov 10, 2020

In case you didn't see it, I've already proposed a solution here: #3159

A further improvement might be to retain the propagation of the error introduced in this patch, but instead of exiting the loop, log those errors or otherwise make them available. Right now they just go to the debug log, which is probably disabled, but these errors certainly rise to a level of importance above "debug". And with my patch included it's guaranteed these errors won't be logged at a rate of more than 1 per minute (which is more than I can say for the previous implementation, which generated 2 days' worth of log volume over the span of 10 minutes).

@kalyanac kalyanac changed the title fix(profiler): exit agent profiling loop on certificate errors fix(profiler): Use exponential backoff in pollProfilerService Nov 10, 2020
@kalyanac kalyanac marked this pull request as draft November 10, 2020 20:00
@kalyanac
Contributor Author

@bitglue I will address your concerns. I am implementing an exponential backoff on errors.

However, I think the application will need to be restarted in case of bad certificates. I ran a few simple experiments:

  1. Delete certificates, run the application, wait for the agent to back off on error, install certificates - In this case, the agent did not pick up the new certificates and continued to back off.
  2. Run the application, then delete certificates - In this case, the agent continued to work as if the certificates were available.

I did not investigate this further. I assume some component may be caching the certificates.


bitglue commented Nov 10, 2020

I don't think backoff on errors is quite the right solution, either. The problem is there's a loop which calls a function as fast as it can, and if that function returns immediately, the result is consuming 100% of a CPU.

That could be a catastrophic failure for many programs, so a little defensive programming seems appropriate here.

Now, it's probably the case that if the function is going to return immediately, it's going to return an error. But can you be 100% certain it's so? Evidently someone already thought so in the past, and they've been wrong at least once. What's to say the function won't return immediately without an error?

Currently, in the happy path, the createProfile API is called, and that takes a while to return a response. Usually the response is successful, and so the retry and backoff behavior in gax isn't triggered. Thus, the (intentionally high) latency of the createProfile API is the only thing preventing this loop from consuming 100% CPU.

Now think about that from my perspective: with one mistake, Google could change the behavior of the API to be much faster, instantly raising the CPU usage of my processes. This could easily bring down my application. Do I want to trust Google with the availability of my application? No, I do not.

The solution here is to have some simple code which very directly limits the rate of that loop under any circumstances. That's what the timer accomplishes.

@kalyanac
Contributor Author

In normal operating conditions, the agent cannot decide on its own how long to wait between createProfile calls. The backend controls the agent collection and backoff behavior to maintain the required sampling rate. We document the behavior at https://cloud.google.com/profiler/docs/concepts-profiling#collection

In cases where the agent is unable to connect to the backend, this PR will ensure the agent won't monopolize resources.

Please let us know if you find any other instances where the agent is misbehaving and we will fix it promptly.

@kalyanac kalyanac marked this pull request as ready for review November 10, 2020 20:50

bitglue commented Nov 10, 2020

In normal operating conditions, the agent cannot decide on its own how long to wait between createProfile calls.

That is not what my patch does. My patch instructs the agent to wait at a minimum 1 minute. Maybe that's a bit too long: in the case of just 1 process and say 6 profile types, 10 seconds per iteration should be sufficient.

The agent certainly can check what it's being told to do for reasonableness. Under no circumstances does it make sense to attempt to profile at a rate of millions per second.

If the server should desire a longer pause, my patch respects that. The call to .createProfile() can take as long as it likes. If it takes more than minimumIteration, then by the time it's returned the ticker will have already emitted a message on the channel, and the loop will proceed immediately to the next iteration, just as it did previously. In other words, as long as the server (and the code interpreting the server's response) is asking for a reasonable iteration rate, the behavior is unchanged by my patch.

Please let us know if you find any other instances where the agent is misbehaving and we will fix it promptly.

That's not the point. I do not want this agent blowing up a second time. I was very surprised when it blew up the first time, so I looked at the code and found it to be brittle because the implementation provides no guaranteed lower bound on iteration time. Your patch does not fully address this issue because it only addresses the case where .createProfile() returns an error.

Can I audit all the code to be reasonably confident there are no cases where .createProfile() will return very quickly without an error? No, I can not.

And like I said, I can think of a case where it would return too quickly: when the API is not behaving as currently intended. And while I hope there is nothing wrong in this package, in its dependencies, in all the code running on Google's servers, or in all the networking between, that is too much surface area to simply trust on the weak guarantee that "it was designed that way and we hope it works".

@kalyanac
Contributor Author

gax retries and uses exponential backoff. Now we have exponential backoff to cover cases where gax does not retry. I believe this sufficiently addresses the root cause of this issue and similar cases.

Profile collection is done over a 10-second interval. So in the case of only one agent, the maximum number of times it can call createProfile is ~6 per minute. So for successful API calls, the number of calls is limited by the number of profile types enabled.

To make this more robust and defensive against unexpected API behavior, we could limit the number of iterations in pollProfilerService() to ~10 per minute which would achieve the same outcome of limiting the calls while leaving some flexibility.

@aalexand WDYT? I can update the PR to enforce an upper limit on number of iterations over a one minute duration.

@kalyanac
Contributor Author

@bitglue Thanks for your detailed feedback and comments on this. I really appreciate your perspectives in making the agent more stable.

for i := 0; config.numProfiles == 0 || i < config.numProfiles; i++ {
p := a.createProfile(ctx)
p, err := a.createProfile(ctx)
if err != nil {
Contributor

I feel this got too complicated for the problem at hand. I'm not sure we even need to return err from createProfile; I think simply

p, err = a.client.CreateProfile(ctx, &req, grpc.Trailer(&md))
if err != nil {
  debugLog("failed to create profile, will retry: %v", err)
  if strings.Contains(err.Error(), "x509: certificate signed by unknown authority") {
    // gax.Invoke does not retry missing certificate error, but we want to retry it,
    // so pretend it's a different error.
    err = errors.New("retry the certificate error")
  }
}
return err

should be enough to fix the original problem. This is the only error that is not retried by gax.Invoke - see
https://github.com/googleapis/gax-go/blob/21fd469b63bc1a2bed9b99b614b4b9dc54a8aeae/v2/invoke.go#L83.

Sure, this might change tomorrow, but it's unlikely and I don't think this is worth the complexity of crafting another retry utility here.

bitglue

This is a false dichotomy. There are solutions which are simple and robust. For example, something as simple as a time.Sleep(time.Second) in the loop would go a long way towards addressing my concerns while being 24 times simpler than what's already proposed here.

bitglue

Also, I'd challenge "not worth it". For you, this is an annoying bug that you just want to close. For me, it's a bug that very nearly brought down a production website, and could do so again if any one of a very large number of things outside my control go wrong.

Contributor

"not worth it" == "not a lot of benefit over the proposed fix", not "the bug is unimportant".

bitglue

So what's the benefit of your proposal over my proposed fix?

bitglue

I'd be happy if you wanted to do both. The important thing to me is that by reading nothing more than the code in profiler.go, and ideally just one or two functions in there, I can convince myself that every iteration of the loop will have some non-zero amount of idle time. There are a lot of ways that could be accomplished.

But so far this PR guarantees some idle time only under some conditions. I can't predict when those conditions will occur since they depend on gax, the Go libraries, and the code running on Google servers, among other things. Since I can't verify the correctness of the code, I can't be sure the agent won't consume 100% CPU, and so I can't be comfortable running it in production.

Contributor Author

Apologies for the delay. I will refactor this PR to what aalexand@ suggests above as it will address the root cause while keeping the code changes minimal and simple.

Having agents self-manage delays is not something we are considering at this time, to ensure the backend can appropriately control the sampling rate as a whole.

bitglue

Let me be very clear.

My patch does not put the agent in control of the delay.

You do not seem to understand this point, so let me put it another way:

My patch retains the ability of the server to control the delay between profiles.

Profiles are always 10 seconds long, right? And only one profile can be taken at a time, right? So it should be impossible to profile more than 6 times per minute, right?

So if the agent finds that the main loop takes only 20 microseconds to execute, could this be anything other than a failure of the profiling implementation (and that could be the part that talks to the API, or the part that collects the profiling data)?

I don't see any reason to avoid checking the parameters to the agent for sanity. This is a normal thing to do, especially when those parameters are coming from a remote service.

@kalyanac kalyanac changed the title fix(profiler): Use exponential backoff in pollProfilerService fix(profiler): Force gax to retry in case of certificate errors Nov 13, 2020
@kalyanac
Contributor Author

@aalexand Ready for review.


bitglue commented Nov 13, 2020

This does not fix the issue.

[attached image: bitmap]

@@ -215,6 +217,7 @@ func (o *allowUntilSuccess) do(f func() error) (err error) {
// Config for details. Start should only be called once. Any
// additional calls will be ignored.
func Start(cfg Config, options ...option.ClientOption) error {
logger = log.New(os.Stderr, "Cloud Profiler: ", log.LstdFlags|log.Lmsgprefix)
Contributor

Move this into start().

Contributor Author

Done.

@kalyanac
Contributor Author

Removed use of the Lmsgprefix flag because it is only available in Go 1.14+.

@kalyanac
Contributor Author

Previous tests failed due to problems with GitHub Actions.

@kalyanac kalyanac merged commit 35dcd72 into googleapis:master Jan 28, 2021
Successfully merging this pull request may close these issues.

profiler: pollProfilerService consumes 100% CPU when TLS validation fails
3 participants