Pod not able to connect to url login.microsoftonline.com:443 #3604

Open
mastrauckas opened this issue May 11, 2024 · 23 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/external upstream bugs

Comments

@mastrauckas

mastrauckas commented May 11, 2024

kind version: kind v0.22.0 go1.20.13 windows/amd64

docker version: 26.1.1

OS: Windows 11 23H2

Kubernetes version:

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2

My application is attempting to do OAuth when trying to get secrets from Azure Key Vault on startup. However, it is not able to connect to URL login.microsoftonline.com:443, so it crashes and goes into a CrashLoopBackOff loop. Eventually, it will be able to connect after many CrashLoopBackOff attempts.

@mastrauckas mastrauckas added the kind/bug Categorizes issue or PR as related to a bug. label May 11, 2024
@mastrauckas
Author

mastrauckas commented May 13, 2024

It appears login.microsoftonline.com is not resolving.

If I do:

nslookup login.microsoftonline.com

I get:

[screenshot: nslookup output, the default nameserver fails to resolve login.microsoftonline.com]

However, if I do

nslookup login.microsoftonline.com 9.9.9.9

I get:

[screenshot: nslookup output, 9.9.9.9 resolves login.microsoftonline.com successfully]

So the default DNS server can't resolve login.microsoftonline.com for whatever reason.

@BenTheElder
Member

Most of the questions in the bug template are missing from your post, and you don't specify where / how you're executing nslookup, but DNS within the cluster is either:

  • the in-cluster DNS server (coreDNS) as deployed by kubeadm, if you're inside a pod that isn't hostNetwork
  • Docker's embedded DNS resolver, which listens within the node containers (and any container not on the default "bridge" network) and then resolves in dockerd on the host with your host's resolver settings. https://docs.docker.com/network/#dns-services
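
A quick way to check which of the two you're hitting is to compare resolv.conf in both places. Roughly (a sketch, assuming the cluster/node names from this thread):

# Inside a (non-hostNetwork) pod: the nameserver should be the in-cluster
# coreDNS service IP (10.96.0.10 with kubeadm defaults).
kubectl run resolver-check --rm -i --image busybox --restart=Never -- cat /etc/resolv.conf

# Inside a node container: the nameserver should be Docker's embedded
# resolver (127.0.0.11), which forwards to dockerd / the host's settings.
docker exec experimental-cluster-worker cat /etc/resolv.conf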

@mastrauckas
Author

mastrauckas commented May 14, 2024

Thank you for your response @BenTheElder. My fault for not filling out all the information.

The command I ran to get into a pod to troubleshoot was:

 kubectl run troubleshooting --rm -i --tty --image nicolaka/netshoot -- /bin/bash

Once in the pod, I ran nslookup login.microsoftonline.com and nslookup login.microsoftonline.com 9.9.9.9 as shown above, where the Quad9 nameserver worked and the internal nameserver didn't.

I also tried

docker run --rm -it nicolaka/netshoot /bin/bash

Once in the container, I did nslookup login.microsoftonline.com and curl -I login.microsoftonline.com and both worked just fine.

I also did docker exec -it experimental-cluster-worker bash, where experimental-cluster-worker is one of my nodes.

Once in the node container, I did

apt update
apt install dnsutils
nslookup login.microsoftonline.com

Once again, inside the node itself I was able to do an nslookup on login.microsoftonline.com without an issue.

It appears the problem only occurs when using the default name server for the pod.

@BenTheElder
Member

docker run --rm -it nicolaka/netshoot /bin/bash

This would need to be with --net=kind (or similar; the default bridge network does not have docker's embedded DNS, non-default networks do by default).

The internal nameserver is coreDNS with configurable search paths; if your host has a lot of search paths, you might try this config option:

dnsSearch: []
(We're considering making this default)
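
For example, a minimal kind config with the search paths emptied would look roughly like this (a sketch, assuming dnsSearch sits under networking):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: experimental-cluster
networking:
  # Give pods an empty DNS search list instead of inheriting the host's.
  dnsSearch: []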

@mastrauckas
Author

mastrauckas commented May 15, 2024

docker run --rm -it nicolaka/netshoot /bin/bash

This would need to be with --net=kind (or similar; the default bridge network does not have docker's embedded DNS, non-default networks do by default).

The internal nameserver is coreDNS with configurable search paths; if your host has a lot of search paths, you might try this config option:

dnsSearch: []

(We're considering making this default)

I shouldn't have to do anything special; login.microsoftonline.com is a normal public domain hosted by Microsoft, so coreDNS should just forward the DNS request to its forwarder. I tried other domains, and they work just fine. I wonder if this is just an odd coreDNS issue because the subdomain is login instead of www?

@BenTheElder
Member

the DNS resolvers shouldn't see "www" vs "login" differently

I shouldn’t have to do anything special,

Er, I'm not saying anything special to this particular domain; I'm saying your host environment may be resulting in a large number of search paths, which can create flakiness.

You can also test this by testing for login.microsoftonline.com. (FQDN) instead.
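
Concretely, from inside the netshoot pod, something like this (a sketch):

# Trailing dot = fully qualified, so the resolver skips search-path
# expansion entirely.
nslookup login.microsoftonline.com.

# No trailing dot: with ndots:5, every search suffix is tried first.
nslookup login.microsoftonline.com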

@mastrauckas
Author

mastrauckas commented May 15, 2024

the DNS resolvers shouldn't see "www" vs "login" differently

I agree, it shouldn't.

I shouldn’t have to do anything special,

Er, I'm not saying anything special to this particular domain; I'm saying your host environment may be resulting in a large number of search paths, which can create flakiness.

You can also test this by testing for login.microsoftonline.com. (FQDN) instead.

It's a plain deployment with 1 control-plane and 5 or 6 workers; the only pod is the test pod created with kubectl run troubleshooting --rm -i --tty --image nicolaka/netshoot -- /bin/bash.

I haven't even added Secrets or ConfigMaps.

@stmcginnis
Contributor

It's a plain deployment with 1 control-plane and 5 or 6 workers

I believe the point wasn't about the kind pod configuration, but rather the host environment it is running in: your dev machine, the network you are connected to, any proxy relays, etc.

@BenTheElder
Member

BenTheElder commented May 15, 2024

Right, the behavior of the DNS is related to the host machine's network, the docker install, the resolver configuration on the host, etc.

I've made some suggestions about how to test for excluding some of these (namely the search paths).

If I run the same pod in kind on my host, resolving the domain works fine.

@mastrauckas
Author

mastrauckas commented May 16, 2024

It's a plain deployment with 1 control-plane and 5 or 6 workers

I believe the point wasn't about the kind pod configuration, but rather the host environment it is running in: your dev machine, the network you are connected to, any proxy relays, etc.

This command:

nslookup login.microsoftonline.com

works on my machine both inside and outside a docker container, it works on VMs on my machine, and it works on multiple machines on my network. All machines use the same DNS server, which is 9.9.9.9.

I also have a very basic home network. I even connected my computer straight to my cable modem, and also connected my computer to my cell phone, all with the same behavior: everything works just fine, except when using the pod's default DNS server.

It even works in the K8s pod if I use nslookup login.microsoftonline.com 9.9.9.9 instead of just nslookup login.microsoftonline.com.

So I really don't get why you believe it's my computer/network.

@mastrauckas
Author

I've made some suggestions about how to test for excluding some of these (namely the search paths).

I appreciate your suggestions, but I'm not sure they helped.

Your suggestions

This would need to be with --net=kind (or similar; the default bridge network does not have docker's embedded DNS, non-default networks do by default).

I tried

docker run -it --rm --net=kind nicolaka/netshoot bash

The nslookup command worked completely fine in the docker container with --net=kind.

I tried

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: experimental-cluster
networking:
  dnsSearch:
  - 9.9.9.9

This also didn't work. Then again, I am not sure I'm doing the dnsSearch correctly.

I also tried everything this morning on a different computer, all with the same result.

@mastrauckas
Author

mastrauckas commented May 16, 2024

I also tried doing the same test with k3d, and it has the same results. So k3d and kind have the same issue when using the default DNS server, coreDNS.

@stmcginnis
Contributor

stmcginnis commented May 16, 2024

This is definitely a kubernetes configuration issue and not anything kind is doing. What that configuration issue is though, still not clear. :)

That is indeed not correct for dnsSearch. These entries are DNS search suffixes, so that configuration tells Kubernetes, when looking for the host name foo, to also try to find it as foo.9.9.9.9.

Rather than running a container to test this, it would be good to run a pod within the context of the k8s cluster to make sure it's getting the same environment. This page has some useful tips for troubleshooting DNS issues: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
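
The first steps from that page look roughly like this (a sketch; the manifest URL is the one the docs use):

# Deploy the dnsutils test pod from the Kubernetes docs.
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml

# Inspect the resolver config the pod actually received.
kubectl exec -ti dnsutils -- cat /etc/resolv.conf

# Try the failing lookup from inside the cluster's DNS environment.
kubectl exec -i -t dnsutils -- nslookup login.microsoftonline.com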

@BenTheElder
Member

So I really don’t get why you believe it’s my computer/network.

Because it doesn't replicate on my computer/network and it's a networking issue. It's not an inherent coreDNS behavior.

This also didn't work. Then again, I am not even sure if I'm even doing the dnsSearch correct.

This isn't a valid dnsSearch setting; my suggestion was to configure an empty list instead, to avoid passing in any from the host. DNS search is the list of suffixes attempted for domains that are not fully qualified.

The better test mentioned here #3604 (comment) is to avoid search paths entirely by using a fully qualified domain name and seeing if that works. https://en.wikipedia.org/wiki/Fully_qualified_domain_name

So try login.microsoftonline.com. (note the trailing dot) where login.microsoftonline.com was not working.

Also, if by any chance you're using alpine / muslc for your application image, consider using a glibc based base image (e.g. debian); there have historically been DNS issues with muslc's resolver in Kubernetes clusters, which are not specific to kind.

@mastrauckas
Author

This is definitely a kubernetes configuration issue and not anything kind is doing. What that configuration issue is though, still not clear. :)

At first, I thought it was a kind issue, which is why this issue was created. Then I started to retract that, thinking it may be a coreDNS issue.

I believe the evidence shows it's unlikely to be just an issue on my end, since I have been able to recreate it on multiple computers both on and off my network. With that being said, until the cause is known, it could be a configuration issue on my network.

Rather than running a container to test this, it would be good to run a pod within the context of the k8s cluster to make sure it's getting the same environment. This page has some useful tips for troubleshooting DNS issues: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

So, I did everything in the post with no issues.

So running

kubectl exec -ti dnsutils -- cat /etc/resolv.conf

gave me:

search default.svc.cluster.local svc.cluster.local cluster.local 9.9.9.9
nameserver 10.96.0.10
options ndots:5

I even ran:

kubectl exec -i -t dnsutils -- nslookup login.microsoftonline.com

and it worked as expected.

I will continue to do more debugging.

@mastrauckas
Author

After looking a second time over the post @stmcginnis mentioned, I thought I would look at the coreDNS logs after getting the error when trying to do a lookup on login.microsoftonline.com.

I went into the pod as I did before with

kubectl run troubleshooting --rm -i --tty --image nicolaka/netshoot -- /bin/bash

and then I went to view the logs for coreDNS, which I hadn't done previously, with the command

kubectl logs --namespace=kube-system -l k8s-app=kube-dns

which shows a DNS error, seen below:

.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2
[ERROR] plugin/errors: 2 login.microsoftonline.com. A: dns: overflow unpacking uint32
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2

So now I have an error to work with!

@BenTheElder
Member

BenTheElder commented May 17, 2024

coredns/coredns#3305 suggests that this is a bug in the upstream DNS server; see coredns/coredns#3305 (comment).

@BenTheElder
Member

This seems to be somewhat environment-dependent, based on UDP packet size? In any case, we can track it here, but it doesn't look like there are good options for the kind project to directly affect this.

We could force DNS over TCP, but that would be a breaking change; it looks like you could enable it yourself as a workaround per coredns/coredns#3305 (comment).
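
For anyone who wants to try that now, it would amount to adding force_tcp to the forward block in the coreDNS Corefile. A sketch of the relevant fragment (edit it via kubectl -n kube-system edit configmap coredns and let the coredns pods reload):

# Fragment of the coredns ConfigMap's Corefile: send upstream queries
# over TCP instead of UDP.
forward . /etc/resolv.conf {
    force_tcp
}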

@mastrauckas
Author

This seems to be somewhat environment-dependent, based on UDP packet size? In any case, we can track it here, but it doesn't look like there are good options for the kind project to directly affect this.

We could force DNS over TCP, but that would be a breaking change; it looks like you could enable it yourself as a workaround per coredns/coredns#3305 (comment).

Sometime this weekend, I may use tshark to troubleshoot the issue.

Either way, in my opinion, kind shouldn't do anything because it's not a kind issue.

@BenTheElder BenTheElder added the kind/external upstream bugs label May 17, 2024
@BenTheElder
Member

I think there's a workaround available in https://github.com/coredns/coredns/pull/6277/commits but it's not in a release consumable by kubeadm yet because of coredns/coredns#6661?

If we can help sort that out, we can get it into a future kubernetes release and then into kind.

@mastrauckas
Author

I think there's a workaround available in https://github.com/coredns/coredns/pull/6277/commits but it's not in a release consumable by kubeadm yet because of coredns/coredns#6661?

If we can help sort that out, we can get it into a future kubernetes release and then into kind.

I will check this out. I got caught up handling something this weekend, so I was not able to troubleshoot. I will look over what you sent.

@aojea
Contributor

aojea commented May 22, 2024

If you can upload the pcap, that should be useful; it seems the issue is caused by a malformed answer, which should be visible in the pcap.

@BenTheElder
Member

@aojea I think coreDNS already has a mitigation merged, but there is no released container image, so kubeadm cannot upgrade; context above.

When coreDNS can release an image for the current tag, we need to get the image mirrored into registry.k8s.io, upgrade kubeadm, and then kind can ship a patched image.

In the meantime your options are somewhat limited, unfortunately. You could perhaps patch the coreDNS deployment to use your own coreDNS image built with the new code.
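
If someone needs the fix sooner, repointing the deployment would be something like this (a sketch; example.registry/coredns:patched is a hypothetical image you'd build yourself from the fixed coreDNS source):

# Swap the kube-system coredns Deployment onto a custom image.
kubectl -n kube-system set image deployment/coredns coredns=example.registry/coredns:patched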
