Major disconnection issue with docker overlay network #27268
Comments
Could you provide some more information: Is there anything useful in the daemon logs of that node?
docker version: Client: Server: (docker info and docker network inspect output are below.) So for example, from the docker network inspect output you can see I have a container called "dockeruser_tasksmanager_1". This container cannot ping 10.0.7.7, which is the container named "dockeruser_kafka_1". Any other container in the system can ping dockeruser_kafka_1, and the dockeruser_tasksmanager_1 container can successfully ping any container except kafka. Sorry for the long list... Let me know if there is a better way to do it. docker info: Containers: 290 docker network inspect: [
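For anyone reproducing this, the failing check can be expressed roughly like this (the container name and IP are the ones from the inspect output above; "someothercontainer" is a made-up placeholder):

```
# From the node hosting dockeruser_tasksmanager_1: this ping never gets a reply
docker exec dockeruser_tasksmanager_1 ping -c 3 -W 2 10.0.7.7

# The same target answers from any other container on any other node, e.g.:
docker exec someothercontainer ping -c 3 -W 2 10.0.7.7
```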
@groyee Just to confirm, you are not running swarm mode but you are using docker/swarm. Is that correct? Which means you are setting up the cluster using --cluster-store and --cluster-advertise?
@groyee Also please provide daemon logs from the node which was running the container from which you attempted the unsuccessful ping.
Yes, this is correct. No swarm mode. I am using Swarm standalone + Consul, with --cluster-store and --cluster-advertise. Here are the daemon logs since today. Please let me know if you need earlier daemon logs:
Daemon logs from the node running the dockeruser_tasksmanager_1 container (the container that cannot ping): docker-user@useralerts2-prod:~$ sudo journalctl -u docker.service --since today
Daemon logs from the node running the dockeruser_kafka_1 container (the container to which I cannot ping from the above container): docker-user@kafka1-prod:~$ sudo journalctl -u docker.service --since today
Here is one more interesting log. On the node that cannot ping, I am running this command: docker-user@useralerts2-prod:~$ sudo journalctl -u docker.service | grep kafka1-prod
The last message I see is: Oct 10 23:33:07 useralerts2-prod docker[1146]: time="2016-10-10T23:33:07.146403267Z" level=info msg="2016/10/10 23:33:07 [INFO] serf: EventMemberJoin: kafka1-prod 192.168.0.22\n"
It looks like it is OK. From this log I would expect that containers running on node useralerts2-prod would be able to ping containers running on node kafka1-prod without problems.
@groyee Since you provided the logs only since today I just want to make sure the problem happened today. Was the container
It happened 2 or 3 days ago (I believe) and since then this is the situation. I can provide the full daemon logs from the last several days, but this is going to be a very big log. Should I post it here?
You can do an attachment. Or you can post a link to a Gist.
@groyee There is a lot of node flapping happening in the serf gossip cluster from what I can see in the logs. Is there a problem in the underlying network in terms of congestion? In general I see network congestion in the logs, and it is affecting a number of different functions in docker which require a reliable network. Certain excerpts:
Gossip flapping
Node discovery
Image Pull
We are using the standard Azure network. Pinging Azure internal IPs always works. This issue happens only with the overlay network. Also, I don't really understand how this node can ping any other container in the swarm cluster, and why restarting the container, the docker daemon, or even the entire machine doesn't help.
Can it ping any other container on the same node as the other container which it failed to ping? It looks like there is some problem either with communicating with that node on port 7946/udp and 7946/tcp, or on 4789/udp. Can you try to use tools like nc to check?
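A rough sketch of those checks with netcat, assuming nc is installed on the node running the failing container (192.168.0.22 is the kafka1-prod address mentioned later in this thread; the UDP results are only indicative, since UDP has no handshake):

```
# From useralerts2-prod, probe the node hosting the unreachable container
nc -zv  192.168.0.22 7946    # serf/gossip control plane over TCP
nc -zvu 192.168.0.22 7946    # serf/gossip over UDP
nc -zvu 192.168.0.22 4789    # VXLAN data plane (UDP)
```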
Also there seems to be a general problem with image pulls timing out, which should not be using the overlay network. Not sure at the moment if they are related or not.
Any other container, running on any other node (except useralerts2-prod), can successfully ping the dockeruser_kafka_1 container (10.0.7.7). Please let me know if you still want me to run some tests.
Regarding your question (can it ping any other container on the same node as the one it failed to ping?): I just checked, and it can ping other containers running on kafka1-prod. 10.0.7.7 and 10.0.7.31 are both running on node kafka1-prod. And again, any container on any other node can ping 10.0.7.7. Crazy :-)
If it was a one-time issue, or at least we had some workaround, we could live with that for now. The problem is that it happens every day to a different container and there is no workaround. Currently what I do is delete the node from Azure and create a new one every time it happens.
@groyee I see that the gossip Query was building up from
You mentioned in your bug report that you have 200 containers. Are they spread across all 100 VMs? Also, do you bring them down and up too often?
@groyee can you try a reversed ping? (ping the container from the one that it can't reach) And see if it recovers? I think there are a couple of issues like this
We have about 200 containers spread across ~90 VMs. I ran nc and nc -u to port 7946 and it works fine. We do bring containers up and down based on the system load, sort of auto scaling. WOW. I just did a reversed ping and it recovered!!! Can you please explain to me what is going on here?
So, I guess the question is: what now? Is it a docker issue? Is it a libnetwork issue? Is it something else? I guess I could write some script that will run on each container and ping all other containers every few seconds, but I don't think this is a good solution for production. Should I drop the docker overlay network and use --net=host? I know it's probably a bad idea, but if it is at least stable without disconnections then it can be a temporary solution.
Yeah, that can be explained. What is happening in your case: let's say Container A is trying to ping Container B and failing. The node running Container A does not know enough information about how to forward traffic to Container B, but it seems like Container B knows how to reach Container A, since the reverse ping is working. Once you send a ping from Container B to Container A, the node running Container A auto-learns how to reach Container B when it receives the packet from Container B, so when it sends a response it knows exactly how to send it.
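If it helps to confirm that on the affected node, the state the overlay relies on (the VXLAN forwarding database and ARP neighbors inside the network's namespace) can be inspected with standard iproute2 tools. A minimal sketch, assuming nsenter and the bridge tool are available; the namespace name and the vxlan device name will differ per host:

```
# Overlay network namespaces live here; the names look like "1-<network-id>"
sudo ls /var/run/docker/netns/

# Inside that namespace, check whether the peer container's MAC/IP are known
sudo nsenter --net=/var/run/docker/netns/<overlay-ns> bridge fdb show
sudo nsenter --net=/var/run/docker/netns/<overlay-ns> ip neighbor show
```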
What would you suggest regarding pulling images? Also please see an issue I opened several weeks ago.
You can consider setting --max-concurrent-downloads on the daemon https://docs.docker.com/engine/reference/commandline/dockerd/ (either by setting that flag, or using a daemon.json configuration file).
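For example, a minimal /etc/docker/daemon.json with that option would look like this (it has to be present on every node that pulls, and the daemon needs a restart to pick it up); whether lowering it from the default actually helps here is an open question:

```
{
  "max-concurrent-downloads": 1
}
```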
@groyee I've had these sorts of issues for a long time now. For me it happened also during deploys (we have just a dozen machines), which would generate high download traffic (did not think about this till now)... and might be related to Serf and UDP packet losses (just an assumption). What I've done to minimize the occurrences of this was to set up my own serf agent that joins docker's serf cluster and a cron job that does a
I see that the default value is --max-concurrent-downloads=3. Do you think setting it to 1 will make a difference? Also, when I do docker-compose pull on the swarm master, where does it push the image to the nodes from? From its local machine? If so, maybe I should change --max-concurrent-uploads on the swarm?
In Swarm, each node pulls the image individually, so that option has to be set on each node / daemon. Neither Swarm nor Swarm mode pushes images to the nodes.
I see. So it still means that if I have 100 VMs then all of them will start downloading the image at once, and this can choke the network. I will try setting --max-concurrent-downloads=1 but I am doubtful it will change anything.
For "classic" Swarm, yes. Swarm mode does rolling updates, so you can specify how many nodes / service instances should update in parallel. |
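For reference only, since this cluster is classic Swarm: in Swarm mode that staggering is configured per service, roughly like this (service name and numbers are made up):

```
# Update at most 2 tasks at a time, pausing 30s between batches
docker service update --update-parallelism 2 --update-delay 30s dockeruser_webapi
```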
The reason we are doing pull is because there is a docker defect (or at least there was in previous versions) where sometimes it would give you the following error: ERROR: for dockeruser_webapi_2 Cannot create container for service webapi: Unable to find a node that satisfies the following conditions. Only after doing a pull was this issue resolved.
So I think I can confirm now that this issue has nothing to do with image pulls. For the last several days we didn't do even one pull, there were no network spikes, and this issue still happens. Currently we have one container where every time I restart it I need to do a reverse ping from the hosts it cannot ping. After the reverse ping it works, but when I restart the container again I need to do the same operation again. Just to be on the safe side, I tried again to reboot the VM but it doesn't help. Please let me know what logs you need. We really need to fix this issue.
@groyee Sorry, I've been busy with other issues. When you restarted the container in that VM did you have the daemon in debug mode? Can you get the daemon logs from that node after you tried a few unsuccessful pings and after some time, so that we can a) see that miss notifications were generated, and b) see whether it was queried on the cluster but somehow timed out?
No, unfortunately it wasn't. I can do it again as it happens every time. Is there a permanent way to boot docker in debug mode on that host?
you can use a daemon.json configuration file, and enable debug in that https://docs.docker.com/engine/reference/commandline/dockerd/
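A minimal /etc/docker/daemon.json for that, assuming no other options are already set there, would look like this (restart the daemon on that host afterwards):

```
{
  "debug": true
}
```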
Thanks! So I think I found some interesting logs. First, I attached the docker daemon debug log. (I hope this is debug; I changed the /lib/systemd/system/docker.service file to: ExecStart=/usr/bin/dockerd -D -H fd://) After I restarted the docker daemon and the container, this container couldn't ping two containers. One of them is Kafka and the other one is zookeeper. I watched the logs live to see if there is some new message when I do the reversed ping, but I didn't see anything in the docker logs. However, when I did tail -f /var/log/syslog, the moment I did the reverse ping from the other hosts I saw these two lines: Oct 16 16:39:19 webapi1-prod kernel: [89667.035914] vxlan1: 02:42:0a:00:07:07 migrated from 192.168.0.23 to 192.168.0.22
The IPs you see here are the Azure internal IPs of the nodes. One of them is where zookeeper is running and the other is where kafka is running.
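Small side note on how the -D flag was enabled: editing /lib/systemd/system/docker.service directly works but gets overwritten on package upgrades; a systemd drop-in is the usual alternative. A sketch (the drop-in file name is arbitrary):

```
# /etc/systemd/system/docker.service.d/debug.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -D -H fd://
```

followed by sudo systemctl daemon-reload && sudo systemctl restart docker.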
@groyee I think I see what is going on. To confirm my theory, can you tell me if the kafka and zookeeper containers were restarted after they were initially up and running, and whether that restart actually resulted in them being scheduled on different hosts (say, kafka was running on 192.168.0.23 and after a restart ran on 192.168.0.22, and zookeeper probably was originally running on 192.168.0.22 and migrated to 192.168.0.23)? Do you see this problem even if none of your containers are restarted after starting them in a fresh cluster?
What you said is correct. Both kafka and zookeeper were restarted, and swarm scheduled them on different hosts after the restart. That being said, other containers have no problem with that. For example, if I bring a new container into the cluster now, just a simple ubuntu image, it can ping everybody fine. I am not sure I understand your last question. I think this happens only when a container restarts, either by me or by itself.
@groyee do you mean the containers are configured with
No, sorry, my wording wasn't accurate. We don't use the on-node-failure feature. When the container restarts by itself it always restarts on the same host. When we do it manually (docker-compose scale=0 and then docker-compose scale=X), swarm can schedule it wherever it wants.
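For completeness, that manual restart looks roughly like this with Compose (zookeeper is used as an example service name from this thread; classic Swarm is then free to place the new container on any node that satisfies the constraints):

```
docker-compose scale zookeeper=0   # remove the running container
docker-compose scale zookeeper=1   # recreate it; Swarm may pick a different host
```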
@groyee There are quite a few errors in
Yes, this host ran out of space because of this defect: #21925. But I don't think it is related; right now it has plenty of space. All I need to do is restart the container. I attached new debug logs since the last docker daemon restart. It's like docker keeps a cache of the container addresses somewhere and then for some reason fails to renew that cache. This is the only explanation I can think of that explains why restarting the container or this VM doesn't help, while creating a new VM and then running the same container works fine (until the next time it happens). Here are a few log lines from the log file that look suspicious to me: Oct 19 02:44:44 webapi2-prod docker[19538]: time="2016-10-19T02:44:44.676200046Z" level=error msg="Could not open netlink handle during vni population for ns /var/run/docker/netns/3-d025aa804d: failed to set into network namespace 13 while creating netlink socket: invalid argument"
@groyee In the failure case, does the container move to a different host? @sanimej Could this problem be related to issue #25215? I see the following error.
Since we don't use the on-node-failure feature, I believe that in the failure case it doesn't move to a different host. I assume... But again, it can easily be moved to a different host when we do docker-compose scale=0 and then docker-compose scale=X. For example, Zookeeper doesn't do much and it has no persistent volume, so swarm can schedule it anywhere in the cluster every time we remove the container and install a new one. Your second question to @sanimej is interesting. We had many, many issues related to #25215. I upgraded all our servers to v1.12.2, but I can't say if the problem started before or after the upgrade. It could very well be that the problem started before I upgraded.
There might be 2 ways a container moves to a host with
I can't say for sure.
@groyee We introduced a concept called
We would love to give it a try, I am just trying to understand if we can do it without any downtime. Does it mean that I need to delete the current overlay network and create a new one? Also, it means that I need to upgrade to 1.13 not only on the failing hosts but also on the swarm itself, right?
Let me close this ticket for now, as it looks like it went stale.
We are using docker 1.12.2rc3 in our production environment (this issue happened also with previous versions). We have about 100 VMs with 200 containers. Everything is managed by Docker Swarm standalone (not swarm mode).
All containers are communicating through the same overlay network. Sometimes, every few hours, some container (randomly) cannot communicate with some other random container running on a different host. When I do a ping, I get:
Obviously all containers are up and I can successfully ping the same container from any other container. I couldn't find any logic to it. The only thing I can tell is that once it happens I have no workaround. Deleting the container or restarting it doesn't help. I did notice that when it happens, it happens to all containers running on the same host. So if I have 5 different containers on host A, suddenly they all cannot ping some container running on a different host. At first I thought that maybe this container disconnected from the overlay network, but it didn't, and I can communicate with all other containers except this one. Removing the container from the overlay network and reattaching it doesn't help either.
This is a major problem in our production and we have no solution. We have containers running kafka, elastic, redis, mysql, couchbase.... Every few hours, sometimes days, some container just stops communicating with some other container, and once it happens it will never ping it again, no matter how many times I restart either of those containers.