
profiler: pollProfilerService consumes 100% CPU when TLS validation fails #3158

Closed
bitglue opened this issue Nov 7, 2020 · 5 comments · Fixed by #3178
Assignees
Labels
api: cloudprofiler (Issues related to the Cloud Profiler API)
status: investigating (The issue is under investigation, which is determined to be non-trivial)

Comments

@bitglue

bitglue commented Nov 7, 2020

Client

profiler

Environment

Debian buster, in Docker, in AWS

Go Environment

root@82812248ba78:/go/src/github.com/remind101/r101-experiments# go version
go version go1.14.9 linux/amd64
root@82812248ba78:/go/src/github.com/remind101/r101-experiments# go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/go/src/github.com/remind101/r101-experiments/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build695409103=/tmp/go-build -gno-record-gcc-switches"

Code

e.g.

	cfg := profiler.Config{
		Service:        os.Getenv("EMPIRE_APPNAME") + "." + os.Getenv("EMPIRE_PROCESS"),
		ServiceVersion: os.Getenv("EMPIRE_RELEASE"),
		ProjectID:      projectID,
	}

	return profiler.Start(cfg)
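
For context, a self-contained version of that snippet might look roughly like the following; the EMPIRE_* variables come from the report, while the package layout, the GOOGLE_CLOUD_PROJECT lookup, and the error handling are assumptions for illustration:

package main

import (
	"log"
	"os"

	"cloud.google.com/go/profiler"
)

func main() {
	// projectID is assumed to come from the environment here; the report
	// does not show where it is set.
	projectID := os.Getenv("GOOGLE_CLOUD_PROJECT")

	cfg := profiler.Config{
		Service:        os.Getenv("EMPIRE_APPNAME") + "." + os.Getenv("EMPIRE_PROCESS"),
		ServiceVersion: os.Getenv("EMPIRE_RELEASE"),
		ProjectID:      projectID,
		// DebugLogging: true, // surfaces errors like the one reported below
	}

	if err := profiler.Start(cfg); err != nil {
		log.Fatalf("profiler.Start: %v", err)
	}

	// ... rest of the application ...
	select {}
}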

Expected behavior

profiling has negligible impact on performance

Actual behavior

application is 1300% slower

Screenshots

I was able to manually grab a CPU profile from a pprof HTTP endpoint, which reveals the program is looping with great vigor in profiler.pollProfilerService():

(pprof) top
Showing nodes accounting for 6700ms, 54.03% of 12400ms total
Dropped 217 nodes (cum <= 62ms)
Showing top 10 nodes out of 160
      flat  flat%   sum%        cum   cum%
    1310ms 10.56% 10.56%     2950ms 23.79%  strconv.appendEscapedRune /usr/local/go/src/strconv/quote.go
     990ms  7.98% 18.55%     4070ms 32.82%  strconv.appendQuotedWith /usr/local/go/src/strconv/quote.go
     970ms  7.82% 26.37%     2860ms 23.06%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
     760ms  6.13% 32.50%      760ms  6.13%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
     560ms  4.52% 37.02%     1240ms 10.00%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
     540ms  4.35% 41.37%      540ms  4.35%  unicode/utf8.EncodeRune /usr/local/go/src/unicode/utf8/utf8.go
     440ms  3.55% 44.92%      440ms  3.55%  strconv.IsPrint /usr/local/go/src/strconv/quote.go
     380ms  3.06% 47.98%      550ms  4.44%  runtime.heapBitsSetType /usr/local/go/src/runtime/mbitmap.go
     380ms  3.06% 51.05%      380ms  3.06%  runtime.memclrNoHeapPointers /usr/local/go/src/runtime/memclr_amd64.s
     370ms  2.98% 54.03%      370ms  2.98%  runtime.nextFreeFast /usr/local/go/src/runtime/malloc.go (inline)
(pprof) top -cum
Showing nodes accounting for 0.41s, 3.31% of 12.40s total
Dropped 217 nodes (cum <= 0.06s)
Showing top 10 nodes out of 160
      flat  flat%   sum%        cum   cum%
     0.02s  0.16%  0.16%      9.78s 78.87%  cloud.google.com/go/profiler.pollProfilerService /go/pkg/mod/cloud.google.com/go@v0.71.0/profiler/profiler.go
         0     0%  0.16%      9.64s 77.74%  cloud.google.com/go/profiler.(*agent).createProfile /go/pkg/mod/cloud.google.com/go@v0.71.0/profiler/profiler.go
     0.05s   0.4%  0.56%      9.23s 74.44%  github.com/googleapis/gax-go/v2.Invoke /go/pkg/mod/github.com/googleapis/gax-go/v2@v2.0.5/invoke.go
     0.04s  0.32%  0.89%      9.08s 73.23%  github.com/googleapis/gax-go/v2.invoke /go/pkg/mod/github.com/googleapis/gax-go/v2@v2.0.5/invoke.go
     0.04s  0.32%  1.21%      8.20s 66.13%  cloud.google.com/go/profiler.(*agent).createProfile.func1 /go/pkg/mod/cloud.google.com/go@v0.71.0/profiler/profiler.go
     0.07s  0.56%  1.77%      8.13s 65.56%  google.golang.org/genproto/googleapis/devtools/cloudprofiler/v2.(*profilerServiceClient).CreateProfile /go/pkg/mod/google.golang.org/genproto@v0.0.0-20201030142918-24207fddd1c3/googleapis/devtools/cloudprofiler/v2/profiler.pb.go
     0.03s  0.24%  2.02%      7.96s 64.19%  google.golang.org/grpc.(*ClientConn).Invoke /go/pkg/mod/google.golang.org/grpc@v1.33.1/call.go
     0.01s 0.081%  2.10%      7.93s 63.95%  google.golang.org/grpc.invoke /go/pkg/mod/google.golang.org/grpc@v1.33.1/call.go
     0.11s  0.89%  2.98%      7.92s 63.87%  google.golang.org/grpc.newClientStream /go/pkg/mod/google.golang.org/grpc@v1.33.1/stream.go
     0.04s  0.32%  3.31%      6.50s 52.42%  google.golang.org/grpc.(*clientStream).newAttemptLocked /go/pkg/mod/google.golang.org/grpc@v1.33.1/stream.go

With DebugLogging: true, the failure is obvious:

2020/11/07 15:32:17 failed to create profile, will retry: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"

Additional context

I discovered this by inadvertently removing the ca-certificates package from my container. Mea culpa.

But the bug here is that the blocking operations in the loop, which are what normally keep it from consuming 100% CPU, are not guaranteed to happen under all conditions. The code in pollProfilerService() is effectively:

for {
    a_function_which_usually_sleeps_a_while_but_can_sometimes_return_immediately()
}

Seems pretty dangerous. No failure (local configuration error, network problem, server outage, etc.) should result in the profiler agent consuming 100% CPU.
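
To make the failure mode concrete, here is a minimal, self-contained toy (the names are stand-ins, not the library's actual code) showing how such a loop spins as soon as the inner call stops blocking:

package main

import (
	"errors"
	"log"
	"time"
)

// createProfile stands in for agent.createProfile: it normally blocks inside
// gax's retry/backoff, but a non-retryable error (such as the TLS validation
// failure in this report) makes it return immediately.
func createProfile(failFast bool) error {
	if failFast {
		return errors.New("transport: authentication handshake failed")
	}
	time.Sleep(time.Minute) // stands in for the retryer's backoff
	return nil
}

func main() {
	for {
		// With failFast set, nothing in this loop ever sleeps, so it pins a
		// core at 100% while logging the same error over and over.
		if err := createProfile(true); err != nil {
			log.Printf("failed to create profile, will retry: %v", err)
		}
	}
}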

@bitglue bitglue added the triage me (I really want to be triaged) label Nov 7, 2020
@product-auto-label product-auto-label bot added the api: cloudprofiler (Issues related to the Cloud Profiler API) label Nov 7, 2020
@bitglue
Author

bitglue commented Nov 7, 2020

I was able to redeploy with DebugLogging: true, and found the source of the error:

2020/11/07 15:32:17 failed to create profile, will retry: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"

which now makes sense: some recent efforts were made to strip down the image and the ca-certificates package was a casualty of that. I can fix that, but I'd think this is just one of many things that could go wrong in calling the API, and the client code should be written such that no fault can cause that loop to not sleep.

@bitglue bitglue changed the title profiler: pollProfilerService consuming nearly all CPU profiler: pollProfilerService consumes 100% CPU when TLS validation fails Nov 7, 2020
@bitglue
Author

bitglue commented Nov 7, 2020

Here's the issue:

https://github.com/googleapis/gax-go/blob/21fd469b63bc1a2bed9b99b614b4b9dc54a8aeae/v2/invoke.go#L77-L85

The implementation of pollProfilerService() assumes .createProfile(ctx) will always take some time off-CPU because the provided retryer has an initial backoff of 1 minute. But the retryer isn't used at all in the case of certificate errors, and thus .createProfile returns immediately.
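
Schematically, the retry wrapper behaves like the sketch below (an illustrative paraphrase, not gax's actual source): the backoff sleep only happens for errors the wrapper decides to retry, so an error it treats as permanent, like the certificate failure here, is returned to the caller with no delay at all.

package retrysketch

import (
	"context"
	"time"
)

// invoke paraphrases a gax-style retry loop. retry reports whether an error
// should be retried and, if so, how long to back off first.
func invoke(ctx context.Context, call func(context.Context) error, retry func(error) (time.Duration, bool)) error {
	for {
		err := call(ctx)
		if err == nil {
			return nil
		}
		// Errors treated as permanent (per the linked lines, gax special-cases
		// certificate validation failures) are returned immediately: the
		// backoff below is never reached, so the caller gets no sleep.
		pause, shouldRetry := retry(err)
		if !shouldRetry {
			return err
		}
		// Only retryable errors reach this point; the profiler's retryer
		// starts with a 1-minute initial backoff.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(pause):
		}
	}
}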

bitglue added a commit to remind101/google-cloud-go that referenced this issue Nov 7, 2020
The previous implementation relied on .createProfile() sleeping a bit,
which in turn relies on gax.Invoke's retry behavior and the initial
backoff in the retryer used. This works most of the time, except
gax.Invoke has logic to ignore the retryer and fail immediately when
the error is a certificate validation failure. Consequently,
.createProfile() returns immediately, and pollProfilerService() will
consume 100% CPU.

To remove any possibility of the endless loop in pollProfilerService()
consuming 100% CPU, a ticker is added. This addresses this particular
instance, and also any unanticipated or future cases that might arise
with changes in gax, etc.

It also seemed like a good idea to exit pollProfilerService() when the
context is cancelled. In the current implementation the context is
always context.Background(), so I don't see how this could happen, but
you never know how things might change in the future.

Fixes googleapis#3158
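
A sketch of the guard described in that commit message, under assumed names and an assumed interval (the actual patch may differ): every iteration waits on a ticker and checks the context, so no error path can make the loop spin.

package pollsketch

import (
	"context"
	"time"
)

// pollLoop sketches the guarded polling loop: the names, the 10s interval,
// and the body callback are illustrative, not the actual patch.
func pollLoop(ctx context.Context, createAndUploadProfile func(context.Context)) {
	// The ticker caps the loop at one iteration per tick, no matter how fast
	// createAndUploadProfile returns (e.g. on a TLS handshake failure).
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		createAndUploadProfile(ctx)

		select {
		case <-ctx.Done():
			// Exit when the context is cancelled instead of looping forever.
			return
		case <-ticker.C:
			// At most one iteration per tick.
		}
	}
}
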
@codyoss codyoss added the status: investigating (The issue is under investigation, which is determined to be non-trivial) label and removed the triage me (I really want to be triaged) label Nov 9, 2020
@bitglue
Author

bitglue commented Nov 13, 2020

@codyoss I'm afraid the direction #3178 is going does not address the issue, and to boot it relies on some brittle error-string parsing and on some pretty specific behaviors of gax. Any chance someone else can take a look and weigh in here?

It's important to me that the fix doesn't just address this one thing that could spin this loop at 100%, but all possible failures, since the likely result when this loop spins at 100% CPU is a complete production outage. It's just not acceptable that the only things preventing that from happening are the behavior of a remote server, some brittle error-string parsing, and the hope that there are no other bugs or unanticipated behaviors elsewhere in the code.

@kalyanac
Contributor

I will try to get #3178 in its current form merged to address the immediate issue.

@bitglue We have to consider other language agents as well to ensure they all behave the same way. I will open a new issue to discuss the topic of improving agent reliability. I will ensure the discussion moves forward productively and will not be abandoned. We would like to ensure the agents are well behaved.

@bitglue
Author

bitglue commented Feb 1, 2021

I will open a new issue to discuss the topic of improving agent reliability. I will ensure the discussion moves forward productively and will not be abandoned. We would like to ensure the agents are well behaved.

Where is this new issue? How is the discussion moving forward? What is being done to safeguard against similar bugs which have not yet been discovered?

bitglue added a commit to remind101/google-cloud-go that referenced this issue Feb 9, 2021