compute/metadata: metadata.OnGCE() leaks 2 http goroutines #5430

Closed
menghanl opened this issue Feb 2, 2022 · 18 comments · Fixed by #5439 or #7084
Assignees: codyoss
Labels: api: compute (Issues related to the Compute Engine API), priority: p2 (Moderately-important priority; fix may not be included in next release), type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

menghanl commented Feb 2, 2022

Client

N/A

Environment

This leak was detected in our GitHub Actions CI. I only managed to reproduce it on GitHub Actions; it never leaked on my local machine.

Ubuntu 20.04.3 LTS

Go Environment

$ go version
go version go1.17.6 linux/amd64

$ go env


  GO111MODULE=""
  GOARCH="amd64"
  GOBIN=""
  GOCACHE="/home/runner/.cache/go-build"
  GOENV="/home/runner/.config/go/env"
  GOEXE=""
  GOEXPERIMENT=""
  GOFLAGS=""
  GOHOSTARCH="amd64"
  GOHOSTOS="linux"
  GOINSECURE=""
  GOMODCACHE="/home/runner/go/pkg/mod"
  GONOPROXY=""
  GONOSUMDB=""
  GOOS="linux"
  GOPATH="/home/runner/go"
  GOPRIVATE=""
  GOPROXY="https://proxy.golang.org,direct"
  GOROOT="/opt/hostedtoolcache/go/1.[17](https://github.com/menghanl/cloud-compute-metadata-leak/runs/5043476328?check_suite_focus=true#step:3:17).6/x64"
  GOSUMDB="sum.golang.org"
  GOTMPDIR=""
  GOTOOLDIR="/opt/hostedtoolcache/go/1.17.6/x64/pkg/tool/linux_amd64"
  GOVCS=""
  GOVERSION="go1.17.6"
  GCCGO="gccgo"
  AR="ar"
  CC="gcc"
  CXX="g++"
  CGO_ENABLED="1"
  GOMOD="/home/runner/work/cloud-compute-metadata-leak/cloud-compute-metadata-leak/go.mod"
  CGO_CFLAGS="-g -O2"
  CGO_CPPFLAGS=""
  CGO_CXXFLAGS="-g -O2"
  CGO_FFLAGS="-g -O2"
  CGO_LDFLAGS="-g -O2"
  PKG_CONFIG="pkg-config"
  GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build371730[26](https://github.com/menghanl/cloud-compute-metadata-leak/runs/5043476328?check_suite_focus=true#step:3:26)67=/tmp/go-build -gno-record-gcc-switches"

Code

https://github.com/menghanl/cloud-compute-metadata-leak

func TestML(t *testing.T) {
	defer leakcheck.Check(t) // This will fail if there are still leaked goroutines after 10 seconds
	if metadata.OnGCE() {
		t.Logf("on gce")
	}
}

Additional context

This was detected in our GitHub Actions CI: grpc/grpc-go#5171
At first we thought it was caused by other dependencies of gRPC, but after many attempts we finally managed to reproduce it with just compute/metadata.

It's flaky (1 failure in ~100 runs).
One failed run: https://github.com/menghanl/cloud-compute-metadata-leak/runs/5043476328?check_suite_focus=true

@menghanl menghanl added the triage me I really want to be triaged. label Feb 2, 2022
@product-auto-label product-auto-label bot added the api: compute Issues related to the Compute Engine API. label Feb 2, 2022
@menghanl menghanl changed the title compute/metadata: metadata.OnGCE() leaks http2 goroutines compute/metadata: metadata.OnGCE() leaks 2 http goroutines Feb 2, 2022
@codyoss codyoss self-assigned this Feb 3, 2022
codyoss (Member) commented Feb 3, 2022

Thanks for the report. I will take a look and see if I can reproduce it.

@codyoss codyoss added status: investigating The issue is under investigation, which is determined to be non-trivial. and removed triage me I really want to be triaged. labels Feb 3, 2022
codyoss (Member) commented Feb 3, 2022

Looking at the panic trace you provided, I believe this is intentional behavior in Go's HTTP transport layer. I see net/http.(*Transport).dialConn as the culprit here. Like the standard library http package, the metadata package caches a default client. I assume what is happening is that keep-alive keeps the connection around in the background for up to 30 seconds. Because the client is package-global, its connections are not cleaned up/GC'ed the way a locally instantiated client's would be. I suppose we can scope a client to that function to resolve this issue, but I don't think we are truly leaking a goroutine here, at least not directly.

I am fine making this change for this check, but the default client will remain for other metadata operations, so this type of issue could still occur.
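
For illustration only (this is not the package's actual code): scoping the client to the check means its transport and any keep-alive connections can be torn down when the function returns, instead of living in a package global. The probeMetadataServer helper and its 5-second timeout below are made up for the sketch.

package metadataprobe

import (
	"context"
	"net/http"
	"time"
)

// probeMetadataServer is a hypothetical stand-in for the OnGCE HTTP check.
// Because the http.Client is local to the call, its cached connections can be
// released on return instead of being held by a package-level global.
func probeMetadataServer(ctx context.Context) bool {
	client := &http.Client{Timeout: 5 * time.Second} // illustrative timeout
	defer client.CloseIdleConnections()              // drop keep-alive connections before returning

	req, err := http.NewRequestWithContext(ctx, "GET", "http://169.254.169.254/", nil)
	if err != nil {
		return false
	}
	req.Header.Set("Metadata-Flavor", "Google")

	res, err := client.Do(req)
	if err != nil {
		return false
	}
	defer res.Body.Close()
	// The metadata server identifies itself via this response header.
	return res.Header.Get("Metadata-Flavor") == "Google"
}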

@codyoss codyoss added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed status: investigating The issue is under investigation, which is determined to be non-trivial. labels Feb 3, 2022
codyoss added a commit to codyoss/google-cloud-go that referenced this issue Feb 3, 2022
Initialize a resolver and http.Client locally in the function to force
GC to clean up background resources such as cached connections. This
can otherwise falsely report as leaked goroutines. This kind of issue
will still persist for calling methods like ProjectID, but we are
intentionally caching a client and its connections for these occasions
today. If this ends up causing users issues in the future we can
reevaluate at that time.

Fixes: googleapis#5430
codyoss added a commit that referenced this issue Feb 3, 2022
Initialize a resolver and http.Client locally in the function to force
GC to clean up background resources such as cached connections. This
can otherwise falsely report as leaked goroutines. This kind of issue
will still persist for calling methods like ProjectID, but we are
intentionally caching a client and its connections for these occasions
today. If this ends up causing users issues in the future we can
reevaluate at that time.

Fixes: #5430
menghanl (Author) commented Feb 3, 2022

Thanks for the explanation.

But it seems the fix didn't work.
I updated the package to v1.2.0, and the goroutines still were not closed:
https://github.com/menghanl/cloud-compute-metadata-leak/actions/runs/1791114171

menghanl (Author) commented Feb 3, 2022

We have already marked these goroutines as expected in our leakcheck package, so we are OK if this is intended behavior.

Another question: why is this flaky? If the metadata IP cannot be connected to, why does the connection rely on keepalive to get closed? (Or does it depend on the testing environment?)

codyoss (Member) commented Feb 3, 2022

Hmm, yeah, this is definitely from HTTP internals. We do launch two goroutines in this method, but we do so with a deferred context cancel, so those are not the issue. The flakiness might be because the GC has not run before the leakcheck runs, so the HTTP conns are still open? This is just a guess; someone who knows the HTTP package better than I do would need to chime in. But I have seen something like this in the past and this was the cause. Thanks for the report either way; sorry the change did not solve the issue for y'all.

dfawley (Contributor) commented Feb 3, 2022

IIUC, GC will never clean up your client, because it is a global variable, so it is always referenced. The change referenced still causes a global client to be created at init time.

I think there are three options:

  1. Change your API so that there is no longer a need for a global HTTP client. E.g. create a metadata.Client type that the user instantiates and that has a Close method on it to clean up properly. This could live alongside the global API as long as the global client is only lazily initialized. (A rough sketch of this option follows the list.)

  2. Create & destroy HTTP clients as needed instead of having a global.

  3. Do nothing. This isn't a "leak" in the sense of something that grows over time. However, it is a resource that exists for the life of the binary, which may only be needed for a single operation.
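
As a purely hypothetical illustration of option 1 (this is not an API the package exposes; the type and function names are made up), a user-owned client with explicit cleanup could look like:

package metadataprobe

import "net/http"

// Client is a hypothetical user-instantiated metadata client, shown only to
// illustrate option 1 above.
type Client struct {
	hc *http.Client
}

// NewClient gives each caller its own HTTP client and transport rather than
// sharing a package-level global.
func NewClient() *Client {
	return &Client{hc: &http.Client{Transport: &http.Transport{}}}
}

// Close releases background resources (idle keep-alive connections) so a leak
// checker sees nothing left behind once the caller is done.
func (c *Client) Close() {
	c.hc.CloseIdleConnections()
}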

codyoss (Member) commented Feb 3, 2022

I would not have guessed the global client is the culprit after the change I made, as it is not referenced in this method. But maybe the http package launches some goroutines when the client is created, so the global client is still leaking goroutines?! Since this is not a real leak, I think we will go with option 3 for now unless this becomes more of an issue.

dfawley (Contributor) commented Feb 3, 2022

But maybe the http package launches some goroutines when the client is created so the global client is still leaking goroutines?!

Hmm, no, I think you're right. I saw "dial" written so many times in its initialization that I assumed it was doing something, but it looks like nothing is actually dialing a connection due to the global's initialization.

In this case, I'm not sure. If nothing is using the global client for some reason, then you may be right that it's the local clients leaking instead. Our leak checker waits 10 seconds before failing, and I'm not sure whether there's a periodic GC sweep that should come along by then and clean things up, whether runtime.Stack will make a GC happen, or whether we need to trigger it manually via runtime.GC(). An easy way to rule out the global would be to run this test with the global left as nil. (Of course, it's flaky, so it needs to be run many, many times to have confidence in any solution.)

menghanl (Author) commented Feb 7, 2022

Update on this:

  1. The goroutines are leaked for more than 30 seconds. I increased the leakcheck timeout to 120 seconds, and the test still failed. So the guess that the connection stays around for 30 seconds due to keepalive is not correct.
  2. GC doesn't help. I added a manual runtime.GC() before each iteration (every 50 ms), and it didn't free the goroutines even after 120 seconds. (A simplified sketch of the check loop follows.)
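
For reference, the shape of such a check loop is roughly the following simplified sketch; it is not the actual leakcheck code, and the function name and the dialConn stack marker are illustrative:

package metadataprobe

import (
	"runtime"
	"strings"
	"testing"
	"time"
)

// checkNoLeakedHTTPGoroutines is a simplified stand-in for a goroutine leak
// checker: it repeatedly forces a GC and inspects all goroutine stacks,
// failing if HTTP transport goroutines are still alive at the deadline.
func checkNoLeakedHTTPGoroutines(t *testing.T, deadline time.Duration) {
	t.Helper()
	for end := time.Now().Add(deadline); time.Now().Before(end); time.Sleep(50 * time.Millisecond) {
		runtime.GC() // in the repro, even forcing GC did not free the goroutines
		buf := make([]byte, 1<<20)
		stacks := string(buf[:runtime.Stack(buf, true)])
		if !strings.Contains(stacks, "net/http.(*Transport).dialConn") {
			return // nothing left behind
		}
	}
	t.Errorf("HTTP transport goroutines still running after %v", deadline)
}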

@codyoss codyoss reopened this Feb 7, 2022
codyoss (Member) commented Feb 7, 2022

I will try to look into this issue some more.

@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels May 8, 2022
codyoss (Member) commented Jun 15, 2022

I still believe this to be related to keep-alives. I did some more testing today, and removing keep-alives altogether stops leakcheck from being tripped. I don't see a bug in our code though, which makes me believe there is something else possibly at play in the HTTP internals. Still going to investigate some more to make sure there is nothing we can do.
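
For context, one way to disable keep-alives on a custom transport (illustrative only, and not necessarily how the test above was done) is:

package metadataprobe

import "net/http"

// newNoKeepAliveClient returns a client whose transport never reuses
// connections: each response closes its underlying conn, so nothing idles in
// the background for a leak checker to find. Note this trades the "leak" for
// a fresh TCP connection per request.
func newNoKeepAliveClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DisableKeepAlives: true,
		},
	}
}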

codyoss (Member) commented Jun 15, 2022

Note: The only time this seems to happen is when the DNS lookup fails first and then the HTTP request also fails due to a context cancel. It seems that when execution gives that goroutine enough time to process the cancelation, which is when this error occurs, a resource is leaked.

I locally added some logging when either the DNS lookup or the HTTP request fails. All passing test cases showed the DNS lookup failing and execution ending. All failing cases showed the DNS lookup failing and the HTTP request subsequently failing due to a canceled context. It seems like the backing connection is leaked in these cases. If I had to guess, it is because another request is never made with that transport to clean up the underlying connection, so the keepalive timeout is not enforced for the living connection.
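
For readers following along, the check being described has roughly this shape: a DNS probe and an HTTP probe race, the first answer is used, and the deferred cancel tears down the loser. The sketch below is simplified (the real function has additional logic), but it shows the structure behind the failure mode above: when the DNS probe answers first with a failure, the function returns, the deferred cancel fires, and the in-flight HTTP request errors with a canceled context.

package metadataprobe

import (
	"context"
	"net"
	"net/http"
)

// onGCESketch mirrors the structure discussed above: two probes race and the
// deferred cancel stops whichever has not finished. Simplified sketch only.
func onGCESketch() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // cancels the slower probe once an answer is in hand

	resc := make(chan bool, 2) // buffered so neither goroutine blocks on send

	// HTTP probe against the metadata server's well-known IP.
	go func() {
		req, err := http.NewRequestWithContext(ctx, "GET", "http://169.254.169.254/", nil)
		if err != nil {
			resc <- false
			return
		}
		res, err := (&http.Client{}).Do(req)
		if err != nil {
			resc <- false // e.g. "context canceled" after the DNS probe answered first
			return
		}
		defer res.Body.Close()
		resc <- res.Header.Get("Metadata-Flavor") == "Google"
	}()

	// DNS probe for the metadata hostname.
	go func() {
		addrs, err := net.DefaultResolver.LookupHost(ctx, "metadata.google.internal")
		resc <- err == nil && len(addrs) > 0
	}()

	// Use the first probe's answer. In the failing runs above, the DNS probe
	// reports failure first, the function returns, and the canceled HTTP
	// request leaves its transport goroutines behind.
	return <-resc
}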

AdamMagaluk added a commit to viamrobotics/rdk that referenced this issue Aug 9, 2022
To avoid an issue with the Google Go SDK leaking goroutines,
we need to ensure it has a proper credential file, which avoids
it looking up metadata from the GCE metadata endpoints.

See: googleapis/google-cloud-go#5430
edaniels commented Nov 4, 2022

Any update on this? Can we just have OnGCE be properly canceled/closed so as to avoid a leak? It could reuse the same context that the client has and use a waitgroup/channel to wait for the check to stop. In the worst case, an env var or option could be passed to skip this check: given there's a metadataHostEnv check that trusts that the user is on GCE, we could have the opposite.
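
One possible shape of the waitgroup idea, purely as an illustration (not a proposed patch; the probe signatures here are hypothetical):

package metadataprobe

import (
	"context"
	"sync"
)

// onGCEWaitSketch illustrates the suggestion above with hypothetical probe
// functions: cancel the shared context once an answer is in hand, then wait
// for both probe goroutines to exit before returning.
func onGCEWaitSketch(probeHTTP, probeDNS func(context.Context) bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	resc := make(chan bool, 2)
	var wg sync.WaitGroup
	for _, probe := range []func(context.Context) bool{probeHTTP, probeDNS} {
		wg.Add(1)
		go func(p func(context.Context) bool) {
			defer wg.Done()
			resc <- p(ctx)
		}(probe)
	}

	onGCE := <-resc // take the first answer, as the current check does
	cancel()        // unblock the slower probe
	wg.Wait()       // do not return until both probe goroutines are gone
	return onGCE
}

Note that this only waits for the probe goroutines themselves; the transport's internal readLoop/writeLoop goroutines would still need the connection-level cleanup discussed elsewhere in this thread.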

codyoss (Member) commented Nov 8, 2022

@codyoss retest this

codyoss added a commit to codyoss/google-cloud-go that referenced this issue Nov 18, 2022
This helps clean up idled connections held by the keep-alive.
The default transport sets this to 90 seconds, but since we
declare our own, the default zero value means idle connections are
never closed out if they are under the threshold for the number of
connections to keep alive (2).

Fixes: googleapis#5430
gcf-merge-on-green bot pushed a commit that referenced this issue Nov 18, 2022
This helps clean up idled connections held by the keep-alive. The default transport sets this to 90 seconds, but since we declare our own, the default zero value means idle connections are never closed out if they are under the threshold for the number of connections to keep alive (2).

Fixes: #5430
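
The gist of that change, as a sketch: a hand-rolled http.Transport leaves IdleConnTimeout at its zero value, so idle keep-alive connections are never aged out unless it is set explicitly. The 90-second value below mirrors http.DefaultTransport and the dialer settings are illustrative; see the linked commit for the actual values.

package metadataprobe

import (
	"net"
	"net/http"
	"time"
)

// newMetadataTransport shows the shape of the fix described in the commit
// message above: set IdleConnTimeout so idle keep-alive connections are
// eventually closed instead of being kept forever.
func newMetadataTransport() *http.Transport {
	return &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   2 * time.Second,  // illustrative
			KeepAlive: 30 * time.Second, // illustrative
		}).DialContext,
		IdleConnTimeout: 90 * time.Second, // assumed value, matching http.DefaultTransport
	}
}
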
edaniels commented

@codyoss that doesn't feel like a sufficient solution for leaks when it comes to goroutine leak testing. How do you anticipate we get around this?

codyoss (Member) commented Nov 18, 2022

The "leak" was an idle conn from my testing. So the fix does make sure that eventually gets cleaned up. The original issue opened here I believe had a two minute wait for the leak check which this does get cleaned up before. If you have a tighter timeout than the underlying connection cleanup maybe it is worth skipping this case in your test.

I believe there is another open requesting the envvar you were asking for, but it has not gotten much traction since opening: #4920

edaniels commented

Our wait time is less than a second, and there's no case to skip for us. This GCE code is run no matter what environment you're on, unfortunately.
