Performance degradation on cluster with overlay networking (cluster-store related) #1750
Cannot see anything in |
Oh right, is it caused by the content-addressable image? |
Testing on 1.10-rc3 standalone on a physical box and it's fast.
So it's obviously not caused by the Engine in its standalone mode. Still checking. |
Thanks @chanwit. We should also add an integration test for latency. |
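A latency integration test like the one suggested here could start from a small timing harness. This is a minimal sketch, assuming the container start can be wrapped in a Python callable; `measure_latency`, `within_budget`, and the budget value are hypothetical names, not part of Swarm's test suite.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(action: Callable[[], None], runs: int = 5) -> List[float]:
    """Call `action` `runs` times and return each call's wall-clock duration in seconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def within_budget(durations: List[float], budget_seconds: float) -> bool:
    """Return True if the median duration does not exceed the (hypothetical) latency budget."""
    # The median is less sensitive to a single slow outlier than the mean.
    return statistics.median(durations) <= budget_seconds
```

In a real test, `action` would start a container through the Swarm manager, and the budget would be whatever the maintainers consider acceptable; this thread does not specify one.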
Hi @chanwit, I'm guessing you are using container networking when creating those containers? Can you try destroying and re-creating the droplets (if there are only a few)? Otherwise, try to find the faulty droplet. I had the same issue when trying out container networking for docker Let us know if it's still slow after that, though I think it might be related to instance performance. |
Here are my test results. I don't see any obvious issue. I'm running directly on the latest code. A client on vm4 starts a container directly on the vm3 daemon.
A client on vm4 starts a container through the Swarm manager (vm2). The container is started on the vm3 daemon.
|
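The direct-versus-manager comparison described here can be scripted so both paths are timed the same way. This is a minimal sketch: it uses the Docker CLI's `-H` flag to target an endpoint, but the host names come from the test description while the ports (2375, 3375) and the `busybox` image are assumptions to adjust for your setup.

```python
import subprocess
import time
from typing import List

# Hypothetical endpoints: the vm3 daemon and the Swarm manager on vm2.
# The ports are assumptions; use whatever your daemon and manager listen on.
DIRECT_ENDPOINT = "tcp://vm3:2375"
MANAGER_ENDPOINT = "tcp://vm2:3375"


def run_command(endpoint: str, image: str = "busybox") -> List[str]:
    """Build a `docker run` invocation that targets a specific endpoint via -H."""
    return ["docker", "-H", endpoint, "run", "--rm", image, "true"]


def time_command(argv: List[str]) -> float:
    """Execute argv (raising on failure) and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, capture_output=True)
    return time.perf_counter() - start


def swarm_overhead(direct_seconds: float, swarm_seconds: float) -> float:
    """Extra time spent when the same container start goes through the Swarm manager."""
    return swarm_seconds - direct_seconds
```

Calling `time_command(run_command(DIRECT_ENDPOINT))` and `time_command(run_command(MANAGER_ENDPOINT))` gives the two timings. With the 9 s direct and 24 s Swarm figures reported in this issue, `swarm_overhead(9.0, 24.0)` is 15 seconds.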
@abronan I see, it may be about the networking setup. I'll double-check with a fresh cluster. |
@dongluochen could you try setting up a real cluster with overlay networking and double-check this? |
@abronan it's back to normal after re-creating the whole cluster, like you said.
I'm pinpointing the cluster store, as it's the only problem I'm aware of. |
Both this and #1752 are related AFAIK. |
Thanks @chanwit. We may need to run some tests to detect whether performance degrades on long-running clusters. If you run into this problem again, I think it would be helpful to collect network traces to see where the latency comes from. |
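One way to collect the network traces suggested here is to run `tcpdump` on each node while reproducing a slow container start. This is a minimal sketch that only builds the command line; the ports are assumed defaults (Consul's HTTP API on 8500, overlay gossip on 7946, VXLAN on 4789/udp) and the helper names are hypothetical.

```python
from typing import List, Sequence

# Assumed default ports (adjust for your setup):
#   8500      Consul HTTP API, used as the cluster store
#   7946      overlay network control-plane gossip (TCP and UDP)
#   4789/udp  VXLAN data-plane traffic between overlay endpoints
CLUSTER_PORTS = (8500, 7946)
VXLAN_PORT = 4789


def capture_filter(ports: Sequence[int] = CLUSTER_PORTS, vxlan_port: int = VXLAN_PORT) -> str:
    """Build a pcap filter expression matching cluster-store, gossip and VXLAN traffic."""
    clauses = [f"port {p}" for p in ports]
    clauses.append(f"udp port {vxlan_port}")
    return " or ".join(clauses)


def tcpdump_argv(output_file: str, interface: str = "any") -> List[str]:
    """Return a tcpdump command line that writes matching packets to `output_file`."""
    return ["tcpdump", "-i", interface, "-w", output_file, capture_filter()]
```

The resulting list can be passed to `subprocess.run` (tcpdump usually needs root). Opening the pcap afterwards shows whether the delay sits in round trips to the cluster store or elsewhere.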
@dongluochen it's from |
@dongluochen Let me know if you are able to confirm this :-) |
@chanwit I'm also experiencing something similar on docker @abronan What do you mean by droplets? I tried re-creating both my swarm managers from scratch and still have the same problem:
I saw docker/compose#3041 for Compose, which seems similar. I tried restarting the daemons as well and still no dice. I'm using docker Edit: After removing every Swarm node and all the Swarm managers, I was able to get it working again (full redeploy). I'm now adding the nodes back one at a time to find the one that might have caused the issue. |
@chanwit I have the same problem. With docker daemon
I think it's caused by libnetwork being too slow when calling Update: I changed my store for |
Same here with 1.10 and 1.11 and the compose v2 format (overlay network is the default): the same problems as described here and in related issues. The newer Swarm gets slower and slower over time; restarting all agents, including the master, speeds things up for a while and is required whenever our Consul backend is restarted. The older one "survives" Consul restarts without any problems. I tried the same setup on 1.12.1 and Swarm 1.2.5 today and things got worse 😞 Now I need to restart every engine and Swarm if the networking runs into problems. A bunch of related issues:
|
Hi there, I'm not sure if this is the right place to post this, but I really need some help with the following issue. I've created a swarm cluster as follows:
Then I added a node with:
I ran "docker node ls" and everything appeared to be OK. Then I created a compose file to be deployed with "docker stack deploy"; it returned successfully and created a stack with its own network and services.
The connection issues start to appear when I try to use the web app through TCP port 80. I basically tried curl requests like this:
and from inside the "frontend" containers:
However, every time I run the tests, the responses I get from the "backends" behave like a "round robin": one request works fine, and the next one hangs waiting for a response until it hits the timeout and dies. These timeouts reach the clients as server errors (5XX). In an attempt to get the stack working, I ran every component (backend, frontend, rabbit, celery) in a separate container and connected them through the physical network interfaces of the host instances. With this setup everything works fine, but the perks of scaling are gone. I've checked the UDP connections between hosts and they work fine. I also changed the deploy mode from "replicated" to "global", but none of these swarm setups got the cluster working properly. I would appreciate any help or advice on this issue. Thanks a lot. |
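The "every other request times out" pattern described here can be checked systematically rather than by hand with curl. This is a minimal sketch, assuming the service answers plain HTTP; the URL, the timeout, and the helper names `http_ok`, `probe`, and `looks_alternating` are hypothetical.

```python
import http.client
import urllib.request
from typing import Callable, List


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` succeeds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx), timeouts and connection resets all land here.
        return False


def probe(fetch: Callable[[], bool], attempts: int) -> List[bool]:
    """Call `fetch` `attempts` times in sequence and record which calls succeeded."""
    return [fetch() for _ in range(attempts)]


def looks_alternating(results: List[bool]) -> bool:
    """True when successes and failures strictly alternate, e.g. ok, fail, ok, fail."""
    return len(results) >= 2 and all(a != b for a, b in zip(results, results[1:]))
```

For example, `looks_alternating(probe(lambda: http_ok("http://localhost/"), 10))` returning True would match the ok/timeout pattern described, which may point to one backend task, or the overlay path to it, being unreachable while requests are load-balanced across tasks.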
Hi @sombralibre, I think you might want to open a new issue on the docker engine repository instead for more visibility (https://github.com/docker/docker). Cheers. |
@abronan I'll do it. thanks. |
Just FYI: we're having far fewer problems with Consul 0.7.x and docker/swarm 1.2.6
I'm not sure whether it's caused by the Engine or by Swarm, but it feels clearly slower when deploying Docker 1.10-rc3 with Swarm built from master. All nodes are wired through a cluster store, Consul.
24 seconds to start a new container through Swarm is too slow IMHO.
9 seconds for running directly is also strange.
Each DigitalOcean node has 512 MB of memory. Could this be the cause?
Is anyone able to confirm this?
Directly without Swarm:
Running through Swarm: