Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(profiler): Force gax to retry in case of certificate errors #3178

Merged
merged 27 commits into from Jan 28, 2021
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
b4bc13d
fix(profiler): exit profiler agent on certificate errors
kalyanac Nov 10, 2020
e5f42d1
chore(profiler): add code comments
kalyanac Nov 10, 2020
b94cacf
Fix typo
kalyanac Nov 10, 2020
3469efd
Employ exponential backoff on errors instead of exiting.
kalyanac Nov 10, 2020
a42dd12
update comments
kalyanac Nov 10, 2020
c9bb5e3
Merge branch 'master' into master
kalyanac Nov 10, 2020
ecee895
Use exponential backoff on all errors returned from createProfile.
kalyanac Nov 10, 2020
e07097d
Merge branch 'master' of https://github.com/kalyanac/google-cloud-go
kalyanac Nov 10, 2020
d627d00
Update comments.
kalyanac Nov 10, 2020
7acc10c
Merge branch 'master' into master
kalyanac Nov 10, 2020
1fa8405
Merge branch 'master' into master
kalyanac Nov 13, 2020
aae6c09
Force gax.Invoke() to retry cert errors
kalyanac Nov 13, 2020
08816b7
Merge branch 'master' of https://github.com/kalyanac/google-cloud-go
kalyanac Nov 13, 2020
ba22df5
revert changes in profiler_test.go
kalyanac Nov 13, 2020
ea017b1
use prefix in debugLog
kalyanac Nov 13, 2020
d5e7e8a
Merge branch 'master' into master
kalyanac Nov 13, 2020
a6958be
Merge branch 'master' into master
aalexand Nov 25, 2020
797c090
Merge branch 'master' into master
kalyanac Jan 22, 2021
5abeb75
Merge branch 'master' into master
kalyanac Jan 22, 2021
fd60f31
Init logger in start()
kalyanac Jan 22, 2021
113e641
Merge branch 'master' into master
aalexand Jan 24, 2021
4308d15
Merge branch 'master' into master
kalyanac Jan 25, 2021
85790f1
Merge branch 'master' into master
kalyanac Jan 25, 2021
092feb1
Merge branch 'master' into master
kalyanac Jan 28, 2021
6131922
Init logger in start()
kalyanac Jan 22, 2021
db1dfa8
Merge branch 'master' of https://github.com/kalyanac/google-cloud-go
kalyanac Jan 28, 2021
e768510
Merge branch 'master' into master
kalyanac Jan 28, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
37 changes: 29 additions & 8 deletions profiler/profiler.go
Expand Up @@ -261,6 +261,7 @@ func start(cfg Config, options ...option.ClientOption) error {

func debugLog(format string, e ...interface{}) {
if config.DebugLogging {
log.SetPrefix("Cloud Profiler Go Agent: ")
aalexand marked this conversation as resolved.
Show resolved Hide resolved
log.Printf(format, e...)
}
}
Expand Down Expand Up @@ -313,10 +314,12 @@ func (r *retryer) Retry(err error) (time.Duration, bool) {
}

// createProfile talks to the profiler server to create profile. In
// case of error, the goroutine will sleep and retry. Sleep duration may
// be specified by the server. Otherwise it will be an exponentially
// increasing value, bounded by maxBackoff.
func (a *agent) createProfile(ctx context.Context) *pb.Profile {
// case of errors except certificate errors, the goroutine will sleep
// and retry. Sleep duration may be specified by the server. Otherwise
// it will be an exponentially increasing value, bounded by maxBackoff.
// Certificate errors are not retried by gax.Invoke() and must be
// handled by the caller. See https://github.com/googleapis/google-cloud-go/issues/3158.
func (a *agent) createProfile(ctx context.Context) (*pb.Profile, error) {
req := pb.CreateProfileRequest{
Parent: "projects/" + a.deployment.ProjectId,
Deployment: a.deployment,
Expand All @@ -326,7 +329,7 @@ func (a *agent) createProfile(ctx context.Context) *pb.Profile {
var p *pb.Profile
md := grpcmd.New(map[string]string{})

gax.Invoke(ctx, func(ctx context.Context, settings gax.CallSettings) error {
err := gax.Invoke(ctx, func(ctx context.Context, settings gax.CallSettings) error {
debugLog("creating a new profile via profiler service")
var err error
p, err = a.client.CreateProfile(ctx, &req, grpc.Trailer(&md))
Expand All @@ -345,8 +348,12 @@ func (a *agent) createProfile(ctx context.Context) *pb.Profile {
}
}))

debugLog("successfully created profile %v", p.GetProfileType())
return p
if err == nil {
debugLog("successfully created profile %v", p.GetProfileType())
return p, nil
}

return nil, err
}

func (a *agent) profileAndUpload(ctx context.Context, p *pb.Profile) {
Expand Down Expand Up @@ -603,9 +610,23 @@ func initializeConfig(cfg Config) error {
func pollProfilerService(ctx context.Context, a *agent) {
debugLog("Cloud Profiler Go Agent version: %s", version.Repo)
debugLog("profiler has started")
backoffCount := 1
for i := 0; config.numProfiles == 0 || i < config.numProfiles; i++ {
p := a.createProfile(ctx)
p, err := a.createProfile(ctx)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this got too complicated for the problem in hand. I'm not sure we even need to return err from createProfile, I think simply

p, err = a.client.CreateProfile(ctx, &req, grpc.Trailer(&md))
if err != nil {
  debugLog("failed to create profile, will retry: %v", err)
  if strings.Contains(err.Error(), "x509: certificate signed by unknown authority") {
    // gax.Invoke does not retry missing certificate error, but we want to retry it,
    // so pretend it's a different error.
    err = errors.New("retry the certificate error")
  }
}
return err

should be enough to fix the original problem. This is the only error that is not retried by gax.Invoke - see
https://github.com/googleapis/gax-go/blob/21fd469b63bc1a2bed9b99b614b4b9dc54a8aeae/v2/invoke.go#L83.

Sure, this might change tomorrow, but it's unlikely and I don't think this is worth the complexity of crafting another retry utility here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a false dichotomy. There are solutions which are simple and robust. For example, something as simple as a time.Sleep(time.Second) in the loop would go a long way towards addressing my concerns while being 24 times simpler than what's already proposed here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'd challenge "not worth it". For you, this is an annoying bug that you just want to close. For me, it's a bug that very nearly brought down a production website, and could do so again if any one of a very large number of things outside my control go wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"not worth it" == "not a lot of benefit over the proposed fix", not "the bug is unimportant".

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what's the benefit of your proposal over my proposed fix?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy if you wanted to do both. The important thing to me is that by reading nothing more than the code in profiler.go, and ideally just one or two functions in there, I can convince myself that every iteration of the loop will have some non-zero amount of idle time. There are a lot of ways that could be accomplished.

But so far this PR guarantees some idle time only under some conditions. I can't predict when those conditions will occur since they depend on gax, the go libraries, and the code running on Google servers, among other things. Since I can't verify the correctness of the code I can't be sure there isn't a possibility that the agent won't consume 100% CPU, and so I can't be comfortable running it in production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay. I will refactor this PR to what aalexand@ suggests above as it will address the root cause while keeping the code changes minimal and simple.

Having agents self manage delays is not something we are considering at this time to ensure the backend can appropriately control the sampling rate as a whole.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me be very clear.

My patch does not put the agent in control of the delay.

You do not seem to understand this point, so let me put it another way:

My patch retains the ability of the server to control the delay between profiles.

Profiles are always 10 seconds long, right? And only one profile can be taken at a time, right? So it should be impossible to profile more than 6 times per minute, right?

So if the agent finds that the main loop takes only 20 microseconds to execute, could this be anything other than a failure of the profiling implementation (and that could be the part that talks to the API, or the part that collects the profiling data)?

I don't see any reason to avoid checking the parameters to the agent for sanity. This is a normal thing to do, especially when those parameters are coming from a remote service.

// On errors, use exponential backoff to prevent spinning on the CPU.
timeToSleep := initialBackoff * time.Duration(backoffCount)
if timeToSleep >= maxBackoff {
timeToSleep = initialBackoff
}
debugLog("profiler agent delaying retry for %v, encountered error: %v", timeToSleep, err)
sleep(ctx, timeToSleep)
backoffCount = backoffCount * 2
continue
}
a.profileAndUpload(ctx, p)
// Reset on every successful attempt.
backoffCount = 1
}

if profilingDone != nil {
Expand Down
5 changes: 4 additions & 1 deletion profiler/profiler_test.go
Expand Up @@ -104,7 +104,10 @@ func TestCreateProfile(t *testing.T) {

mpc.EXPECT().CreateProfile(ctx, gomock.Eq(&wantRequest), gomock.Any()).Times(1).Return(p, nil)

gotP := a.createProfile(ctx)
gotP, err := a.createProfile(ctx)
if err != nil {
t.Errorf("CreateProfile() got err, want no err")
}

if !testutil.Equal(gotP, p) {
t.Errorf("CreateProfile() got wrong profile, got %v, want %v", gotP, p)
Expand Down