network between kubernetes PODs is down after one flanneld is stopped and datastore can't be reached #636
Bad network from a container on node01 to a container on node04; iptables trace in /var/log/syslog on node04:
Trace on node01, no ICMP reply:
Good network from the node01 host to a container on node04; iptables trace in /var/log/syslog on node04:
Trace on node01, got ICMP reply:
I got it. When flanneld on node04 stopped and could not restart (the kubernetes apiserver cannot work while etcd is down), nothing was left to inject the ARP table of the flannel vxlan interface on node04 with entries mapping node01's pod IPs to the MAC of node01's flannel vxlan interface; that had been the job of flanneld on node04. So pods on all nodes except node04 could not be reached from node04 due to ARP misses. This can be confirmed with this command on node04:
After that, ping from 172.16.0.3 to 172.16.3.10 works.
I feel it would be better if flanneld checked the bridge fdb and its subnet lease before exiting due to a broken k8s apiserver. If the fdb entries and the subnet lease are still valid, flanneld could do its best to keep injecting ARP entries.
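For reference, the two kernel tables mentioned here can be inspected with iproute2. A sketch in a throwaway namespace, with a vxlan device configured like flannel's defaults (VNI 1, UDP port 8472) and made-up remote values:

```shell
# Assumes the vxlan module is available; on a real node, inspect the
# existing flannel.1 instead of creating one.
unshare -rn sh -c '
  ip link add flannel.1 type vxlan id 1 dstport 8472
  ip link set flannel.1 up

  # FDB entry: remote flannel.1 MAC -> remote node (VTEP) IP, as flanneld
  # would install it; the MAC and node IP are illustrative
  bridge fdb append 02:42:ac:10:03:01 dev flannel.1 dst 192.168.0.14 self permanent

  ip neigh show dev flannel.1     # ARP: remote gateway IP -> remote flannel.1 MAC
  bridge fdb show dev flannel.1   # FDB: remote flannel.1 MAC -> remote node IP
'
```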
The vxlan code was significantly changed in the last couple of releases, so I don't think this is still a problem.
Thank you very much! That's awesome! I just verified it: flanneld now injects permanent ARP table entries for the pod subnets of the other nodes, so an exit of flanneld no longer affects communication among pods. I have 8 nodes; the picture was captured from a node with pod subnet 172.29.2.0/24.
I'm experimenting with various failure modes in a kubernetes cluster, and I found a strange problem.
My steps:
Pings from node01 to containers on node02 and node03 still work.
I suppose flanneld belongs to the network control plane, so its exit shouldn't interrupt the data plane: the flannel.1 vxlan interface and the cni0 bridge on node04 still exist after flanneld on node04 exits.
I tried setting rp_filter to 0 on all interfaces of node01 and node04; it didn't help.
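For completeness, this is what that rp_filter experiment looks like; strict reverse-path filtering (rp_filter=1) drops packets that arrive on an interface the reply would not leave through, which asymmetric vxlan paths can trigger. A disposable-namespace sketch:

```shell
# On a real node, write the same /proc paths (or use sysctl -w) as root,
# once per interface of interest.
unshare -rn sh -c '
  echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
  echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter
  cat /proc/sys/net/ipv4/conf/all/rp_filter
'
```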
The host OS is the latest Ubuntu 16.04, the ubuntu/xenial box shipped by Vagrant.
Kubernetes v1.5.4 and Flannel v0.7.0.