
cluster in new vpc with weavenet and coredns : pods can't reach each other or services #179

Open
cpekyaman opened this issue May 15, 2018 · 7 comments

@cpekyaman

I've recently created a cluster in a new VPC.
My choices for stack parameters were as follows (the Kubernetes version seems to be 1.9.5):

Name                 Value
AvailabilityZone     us-east-2a
BastionInstanceType  t2.micro
ClusterDNSProvider   CoreDNS
DiskSizeGb           30
InstanceType         m4.large
K8sNodeCapacity      2
NetworkingProvider   weave

It seems that the pods can neither reach each other nor access cluster-local services.
Everything at the node/VPC level seems normal (security groups etc.).
I can use kubectl port-forward to access pods without a problem.
I can ping the actual cluster nodes from inside the pods.
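
To illustrate the symptom (pod names and ports here are placeholders):

# This works from my workstation:
kubectl port-forward <some-pod> 8080:8080
curl http://localhost:8080/

# But the same pod is unreachable from another pod in the cluster:
kubectl exec -it <other-pod> -- curl http://<pod_ip>:8080/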

The only thing I could find is these warnings in the kube-proxy logs (failed to retrieve node info):

W0515 20:16:42.377124       1 server_others.go:289] Flag proxy-mode="" unknown, assuming iptables proxy
I0515 20:16:42.378192       1 server_others.go:138] Using iptables Proxier.
W0515 20:16:42.387391       1 server.go:586] Failed to retrieve node info: nodes "ip-10-0-19-97" not found
W0515 20:16:42.387538       1 proxier.go:463] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
W0515 20:16:42.387556       1 proxier.go:468] clusterCIDR not specified, unable to distinguish between internal and external traffic
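
That "nodes \"ip-10-0-19-97\" not found" warning suggests the name kube-proxy uses to look itself up may not match any registered Node object. A quick way to compare the two (I haven't dug into this yet):

# Node names the API server knows about:
kubectl get nodes -o name

# Hostname on the affected node (what kube-proxy uses by default):
hostname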

I've searched through Kubernetes issues mentioning that warning, but I'm still not sure whether it has anything to do with the problem I'm seeing.

Is there anything in particular I need to check (for example, at the Weave network layer)?
I'll try creating a stack with Calico and kube-dns, but first I want to see if there is something I can diagnose or fix with this setup.

Thanks.

@detiber detiber self-assigned this May 16, 2018
@detiber detiber added this to the May 2018 milestone May 16, 2018
@detiber
Contributor

detiber commented May 23, 2018

I'm testing a fix for the kube-proxy error message in #173. I'm hoping to attempt to replicate this issue tomorrow and see whether that fix resolves this as well.

@detiber
Contributor

detiber commented May 24, 2018

I just finished deploying a cluster into a new VPC with the changes from #173 using Weave, and I am not able to replicate this issue. I used http://docs.heptio.com/content/tutorials/aws-qs-helm-wordpress.html as a test case.
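
Roughly, that test case boils down to (Helm v2-era commands, paraphrasing the tutorial from memory):

# Install Tiller into the cluster (Helm v2):
helm init

# Deploy WordPress + MariaDB from the stable chart repo:
helm install stable/wordpress

# WordPress reaching its MariaDB service exercises pod-to-service traffic:
kubectl get pods,svc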

@cicciodifranco

Hi,
Have you tried to debug your services as described here? Mind that you can't ping a service!
I've recently created a Kubernetes cluster with this template (kube-dns/weave) and we hit the same issue, but kube-proxy works fine!
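
To expand on the ping point: a ClusterIP is virtual, implemented by kube-proxy as iptables DNAT rules, so nothing answers ICMP for it. Test the service port instead, e.g. (names here are placeholders):

curl http://my-service.default.svc.cluster.local:80/   # should connect on a healthy cluster
ping my-service.default.svc.cluster.local              # no reply is expected, even when the service works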

@detiber
Contributor

detiber commented May 24, 2018

Just to be sure, I tried to spin up a cluster with the currently released Quickstart and am not able to replicate this issue there either. Again, I'm using http://docs.heptio.com/content/tutorials/aws-qs-helm-wordpress.html as a test case.

@cpekyaman
Author

Hi,

I was very busy at work this week, so I couldn't check anything related to this issue.
I'm currently using a stack created with Calico and kube-dns without a problem.

I hope I will have time to create the stack with a few more combinations (weave + coredns, weave + kube-dns, etc.) and try to replicate the issue again tomorrow.
This time I'll first use the procedure in the link @cicciodifranco referenced, if I successfully replicate the issue (I guess I didn't debug it very systematically last time, so I may have missed something).

@detiber detiber removed this from the May 2018 milestone May 25, 2018
@cpekyaman
Author

Sorry for the late update; I couldn't spend any time on this.

I've created a stack again with weave + coredns, and the same problem happened again.

This time, I followed the steps in the Debug Services document @cicciodifranco mentioned before. I also used temporary pods from tutum/curl and tutum/dnsutils for testing access and service lookup, roughly as in the sketch below.
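
The throwaway pods looked like this (the service name is a placeholder):

# Interactive pod for HTTP tests:
kubectl run curl-test --image=tutum/curl -it --rm -- sh
#   inside: curl http://my-service.default.svc.cluster.local/   # times out in this cluster

# Interactive pod for DNS tests:
kubectl run dns-test --image=tutum/dnsutils -it --rm -- sh
#   inside: nslookup my-service.default                         # resolves fine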

The only additional information I got:

  • DNS resolution with nslookup works without a problem.
  • Since there are no pod scheduling or startup problems, I assume there are no node <=> master communication problems (I also checked with curl -k https://<master_ip>:6443 from the nodes).
  • The kube-proxy logs have changed a little (maybe due to the new version?):
I0603 15:59:31.202252       1 feature_gate.go:226] feature gates: &{{} map[]}
W0603 15:59:31.265248       1 server_others.go:290] Can't use ipvs proxier, trying iptables proxier
I0603 15:59:31.268074       1 server_others.go:140] Using iptables Proxier.
W0603 15:59:31.311495       1 proxier.go:311] clusterCIDR not specified, unable to distinguish between internal and external traffic
  • There were a bunch of errors regarding the kube-proxy and weave pods. It seems the weave pods restarted 4 times and kube-proxy once (maybe because some system-level stuff was not ready yet). At some point the pods became OK.
  • The document says that if the conntrack binary is not found, kube-proxy may not work correctly. dpkg showed conntrack as not installed on the node, but nothing changed after installing conntrack and restarting the nodes. I think the document assumes kube-proxy runs as an actual process on the node; in this cluster it runs as a pod.
  • The document says I should either see iptables rules for my service or see some errors about connecting to the master node. I see neither (the iptables rules on the node contain only general weave-, docker- and kube-related rules; see the sketch after this list for how I checked).
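
What I ran on a node to look for the per-service rules the document describes (the service name is a placeholder):

# kube-proxy in iptables mode should program per-service entries in the nat table:
sudo iptables-save -t nat | grep KUBE-SERVICES

# Rules for one specific service:
sudo iptables-save -t nat | grep my-service

# Here, only the generic weave/docker/kube chains show up.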

I will try a few more things whenever I can spare the time and post more updates.
If you have any more suggestions for debugging the problem, I can try those too.

@timothysc timothysc added this to the Jan 2019 milestone Jan 11, 2019
@johnSchnake
Contributor

Also unable to reproduce this. I used the debug steps at https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/ but used an Ubuntu pod for testing (using a busybox image does cause DNS problems, which seem beyond the scope here).
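
For reference, the debug pod was along these lines (exact flags are an approximation):

# Interactive Ubuntu pod for in-cluster testing:
kubectl run -it debug --image=ubuntu --restart=Never -- bash

# Inside the pod: install tools, then exercise DNS and service access:
apt-get update && apt-get install -y dnsutils curl
nslookup kubernetes.default
curl -k https://kubernetes.default.svc/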

@johnSchnake johnSchnake removed this from the Jan 2019 - v1.13.2 refresh milestone Feb 1, 2019