
compute/metadata: OnGCE returns true if request to metadata server returns ECONNREFUSED #5381

Closed
wperron opened this issue Jan 19, 2022 · 13 comments
Assignees: codyoss
Labels: api: compute (Issues related to the Compute Engine API), type: cleanup (An internal cleanup or hygiene concern)

Comments

@wperron commented Jan 19, 2022

Client

Metadata & Profiler

Environment

Buildkite runners running on GCE. The metadata server responds to a ping, but not to a curl (the request fails with ECONNREFUSED). I can also reproduce locally by simply adding the hostname and IP to my /etc/hosts file (obviously pointing to nothing). I'm using the profiler package and didn't have any issue until #5063 was merged. When the profiler starts, the OnGCE call returns true; before the bugfix, the call to Zone() would simply return an empty string, whereas now it returns an error (as it should). I think the issue stems from the OnGCE call: since the metadata server can't be reached, it should probably return false.

Go Environment

$ go version
go version go1.17.3 linux/amd64

Expected behavior

I expect OnGCE to return false so that the profiler doesn't attempt to get instance metadata when it obviously can't.

Actual behavior

OnGCE returns true, causing the profiler to attempt to retrieve instance metadata.

Additional Context

This check here is likely the issue. The test above it fails not only if the IP doesn't resolve, but also if the connection is refused, whereas the LookupHost test only fails if the DNS lookup operation itself fails; it doesn't check whether the server responds correctly. Since it only takes one of the two tests to pass for this function to return true, we get into this case where the OnGCE call returns true even if the server doesn't respond.
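To illustrate the shape of the problem, here's a minimal sketch of that race (this is not the library's actual code, and onGCESketch is just for illustration): the DNS probe can report true while the HTTP probe is being refused, and a single true is enough.

package main

import (
    "context"
    "fmt"
    "net"
    "net/http"
    "time"
)

// onGCESketch mimics the structure described above: two probes race,
// and a single success is enough to report "on GCE".
func onGCESketch() bool {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    resc := make(chan bool, 2)

    // Probe 1: HTTP request to the metadata IP. A refused connection
    // makes this probe report false.
    go func() {
        req, _ := http.NewRequestWithContext(ctx, "GET", "http://169.254.169.254", nil)
        res, err := http.DefaultClient.Do(req)
        if err != nil {
            resc <- false
            return
        }
        defer res.Body.Close()
        resc <- res.Header.Get("Metadata-Flavor") == "Google"
    }()

    // Probe 2: DNS lookup only. It can report true even when the connection
    // above is refused, because it never talks to the server.
    go func() {
        addrs, err := net.DefaultResolver.LookupHost(ctx, "metadata.google.internal")
        resc <- err == nil && len(addrs) > 0
    }()

    // First true wins, so a refused connection on probe 1 is masked by probe 2.
    if <-resc {
        return true
    }
    return <-resc
}

func main() {
    fmt.Println(onGCESketch())
}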

@wperron added the "triage me" label on Jan 19, 2022
@product-auto-label bot added the "api: compute" label on Jan 19, 2022
@codyoss self-assigned this on Jan 19, 2022
@codyoss added the "status: investigating" label and removed the "triage me" label on Jan 19, 2022
@codyoss (Member) commented Jan 19, 2022

@wperron Thanks for the detailed report, I will try to take a look later today.

@wperron (Author) commented Jan 19, 2022

Hey @codyoss, thanks for the quick response! I've got the code open right next to me; if you agree with that assessment, just lemme know and I can have a PR up by the end of the day 👍

@codyoss (Member) commented Jan 19, 2022

@wperron I will have to take a deeper look later, but if you think you have a solution, feel free to open a PR. I will say we do intentionally race both checks because in some environments different options resolve faster; the code tries to limit the start-up cost associated with detecting environment details.

@wperron (Author) commented Jan 19, 2022

I was thinking of following up the DNS lookup with an HTTP request to see if it responds without errors, but you're right, it would add a step to that check.

I wish there were a way to globally say "hey, this isn't on GCE, trust me" and have the OnGCE function just trust that, kinda like the opposite of the GCE_METADATA_HOST env variable. It would make this so much easier and faster.
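For the record, an application-level wrapper is roughly what I'm picturing in the meantime; the GCE_METADATA_DISABLE variable name below is made up and isn't something the library reads today:

package main

import (
    "fmt"
    "os"

    "cloud.google.com/go/compute/metadata"
)

// onGCE short-circuits detection when a (hypothetical) kill switch is set,
// and otherwise falls back to the library's normal check.
func onGCE() bool {
    if os.Getenv("GCE_METADATA_DISABLE") != "" { // invented variable name
        return false
    }
    return metadata.OnGCE()
}

func main() {
    fmt.Println(onGCE())
}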

@codyoss (Member) commented Jan 20, 2022

@wperron Can you elaborate more on your environment? You mention that you are running on GCE? If so, what environment (VM, Functions, Cloud Run, App Engine)? I am curious as to why we detect that you are on GCE but you get a connection refused.

@wperron (Author) commented Jan 20, 2022

> @wperron Can you elaborate more on your environment? You mention that you are running on GCE? If so, what environment (VM, Functions, Cloud Run, App Engine)? I am curious as to why we detect that you are on GCE but you get a connection refused.

Our setup is a bit convoluted; the specific case causing an issue is during our CI jobs. We run Buildkite on top of GKE, so each job is a Docker container started by the Buildkite agent running in a k8s pod.

@codyoss (Member) commented Jan 20, 2022

In that case, is this perhaps just a misconfigured Docker network? (I am not an expert in Docker networking by any means.) I feel like, for the question "am I on GCE?", the answer should be yes in this case, but the networking is not allowing that traffic to flow. What do you think?

@wperron (Author) commented Jan 20, 2022

You probably have a lot more context than I do here, so I'll defer to your judgement, but I can provide a little more context on the issue we're facing:

We're running the profiler in parts of our code; on startup it checks whether it's running on GCE and, if it is, gets the instance name, zone, and project ID. Previously that wasn't an issue because the calls to Zone() and InstanceName() returned an empty string and a nil error when they hit a network error, allowing the program to continue, but that's no longer true since a recent bugfix started bubbling up the error.

Being able to successfully query the metadata server is a soft dependency for the profiler to start, so I would say that if that request results in a connection refused, the profiler probably shouldn't be trying to access it later.

If the behavior of OnGCE here is the expected one, then maybe we can instead fix the issue in the profiler somehow. Maybe allow it to continue on error?
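As a sketch of what I mean at the call site, treating metadata as a soft dependency and logging instead of failing (this is illustrative, not what the profiler currently does):

package main

import (
    "log"

    "cloud.google.com/go/compute/metadata"
)

func main() {
    if !metadata.OnGCE() {
        log.Println("not on GCE, skipping metadata lookups")
        return
    }

    // Soft dependency: if the metadata server can't actually be reached,
    // log the error and continue instead of aborting startup.
    zone, err := metadata.Zone()
    if err != nil {
        log.Printf("metadata server unreachable, continuing without zone: %v", err)
    } else {
        log.Printf("zone: %s", zone)
    }
}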

I'll try to see on my end if we can tweak the Docker networking to allow access to the metadata server, or blackhole the IP, but I don't know how much wiggle room I have here as this would impact every CI run in the org.

@codyoss (Member) commented Jan 20, 2022

I think the best solution for the time being is to either let traffic flow through the Docker network or, as you mentioned, just make sure things don't resolve. In the not-too-distant future we do plan on tweaking our OnGCE check in a way that I think would be beneficial to you. I know we have had at least one other request in the past for a variable to disable the check and force it to return false. (Can't seem to find the issue right now though.)

Let me know if you get it working, and please share if you find a solution, as others might have run into this too 😄

@wperron (Author) commented Jan 20, 2022

So for now I applied a replace directive in our dependencies to pull in the patch I opened a PR for, and it gets the job done. Obviously I don't want to keep a floating patch around for too long, so if there's a solution merged upstream that'd be ideal 😅 I'll try to see what can be done on our build system, but so far I don't think it'll yield much.
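For anyone curious, the replace is along these lines in our go.mod; the fork path and pseudo-version below are placeholders, and the exact module path depends on which module the metadata package lives in for the version you're on:

// go.mod (placeholder coordinates, not the real fork/revision)
replace cloud.google.com/go => github.com/wperron/google-cloud-go v0.0.0-20220119000000-000000000000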

@wperron (Author) commented Jan 21, 2022

For anyone else who might be reading this, adding a quick echo '127.0.0.1 metadata.google.internal' >> /etc/hosts at the beginning of our CI script also managed to fix the issue.

@codyoss (Member) commented Jan 21, 2022

Thanks for sharing; I will bring this back to my team and discuss. If we were to make a change here, I think we would want to make sure it happens for all languages. But as I mentioned in an earlier comment, this is all going to be refactored in the not-too-distant future, and at that point this should no longer be an issue.

@codyoss added the "type: cleanup" label and removed the "status: investigating" label on Jan 21, 2022
@codyoss (Member) commented Jun 29, 2022

Closing this issue in favor of #4920 as the solution would be the same if we implement it.

@codyoss closed this as not planned on Jun 29, 2022