This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Performance degradation on cluster with overlay networking (cluster-store related) #1750

Closed
chanwit opened this issue Feb 3, 2016 · 20 comments

Comments

@chanwit
Contributor

chanwit commented Feb 3, 2016

I'm not sure whether it's caused by the Engine or by Swarm, but deployment feels clearly slower with Docker 1.10-rc3 and Swarm built from master.
All nodes are wired together through a cluster store (Consul).

24 seconds to start a new container through Swarm is too slow IMHO.
9 seconds to run one directly is also strange.
Each DigitalOcean node has 512 MB of memory. Could this be the cause?

Is anyone able to confirm this?

Directly without Swarm:

root@ocean-master:~# time `docker run -d -p 80 smebberson/alpine-nginx`
real    0m9.405s
user    0m0.024s
sys     0m0.020s

root@ocean-master:~# docker info
Containers: 7
 Running: 7
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.10.0-rc3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 40
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: overlay bridge null host
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 494.5 MiB
Name: ocean-master
ID: VPPB:2H2F:MVKH:GTEG:YFX2:6SVQ:EXLO:YMYW:4I2Q:CF3H:L27T:HIMG
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
Labels:
 provider=digitalocean
Cluster store: consul://128.199.244.241:8500
Cluster advertise: 104.236.37.137:2376
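(For reference, the Cluster store / Cluster advertise values above typically come from daemon flags along these lines; a minimal sketch using the addresses shown, not the exact configuration from these droplets.)

# Sketch: per-node daemon flags that produce the cluster-store wiring above.
docker daemon \
  --cluster-store=consul://128.199.244.241:8500 \
  --cluster-advertise=104.236.37.137:2376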

Running through Swarm:

debian-1gb-sgp1-02:~
# time `docker run -d -p 80 smebberson/alpine-nginx`

real    0m24.513s
user    0m0.104s
sys     0m0.036s
# docker-machine ls
NAME           ACTIVE      DRIVER         STATE     URL                          SWARM                   DOCKER        ERRORS
consul         -           digitalocean   Running   tcp://xxx.xxx.xxx.xxx:2376                           v1.10.0-rc2
ocean-1        -           digitalocean   Running   tcp://xx.xx.xx.xxx:2376      ocean-master            v1.10.0-rc3
ocean-2        -           digitalocean   Running   tcp://xxx.xxx.xxx.xxx:2376   ocean-master            v1.10.0-rc3
ocean-master   * (swarm)   digitalocean   Running   tcp://xxx.xxx.xx.xxx:2376    ocean-master (master)   v1.10.0-rc3
@chanwit
Contributor Author

chanwit commented Feb 3, 2016

I can't see anything in top; CPU and memory usage are quite low.
I hope it's just misleading baseline performance of a DigitalOcean node :(

@chanwit
Contributor Author

chanwit commented Feb 3, 2016

Oh right, could it be caused by the content-addressable image store?

@chanwit
Contributor Author

chanwit commented Feb 3, 2016

Tested 1.10-rc3 standalone on a physical box and it's fast.

root@os-1:/home/debian# time `docker run -d nginx`

real    0m0.082s
user    0m0.020s
sys     0m0.004s

So it's obviously not caused by the Engine in its standalone mode.

Still checking.

@amitshukla amitshukla added this to the 1.1.0 milestone Feb 3, 2016
@dongluochen
Contributor

Thanks @chanwit. We should also add an integration test for latency.
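(A minimal sketch of what such a latency check could look like, using placeholder hosts ENGINE_HOST and SWARM_MANAGER; this is not an existing test in the repo.)

# Hypothetical latency smoke test: compare container start time against an engine
# directly vs. through the Swarm manager (hosts and ports are placeholders).
for target in "tcp://$ENGINE_HOST:2375" "tcp://$SWARM_MANAGER:3375"; do
  start=$(date +%s%N)
  docker -H "$target" run -d -p 80 smebberson/alpine-nginx >/dev/null
  end=$(date +%s%N)
  echo "$target: $(( (end - start) / 1000000 )) ms"
done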

@abronan
Contributor

abronan commented Feb 3, 2016

Hi @chanwit,

I'm guessing you are using container networking when creating those containers? Can you try destroying and re-creating the droplets (if there are only a few)? Or otherwise find the faulty droplet.

I had the same issue when trying out container networking with Docker 1.9. I found out that one of the droplets was faulty and had its network traffic stalled somehow. After destroying and re-creating that specific droplet, the problem was gone and containers were created very quickly. Not sure what the cause was; probably interference from a bad VM on that specific droplet.

Let us know if it's still slow after that, but I think this might be related to instance performance.

@dongluochen
Contributor

Here is my test result. I don't see an obvious issue. I'm running directly on the latest code.

A client on vm4 starts a container directly on the vm3 daemon.

dchen@vm4:~$ time docker -H vm3:4444 run -d -p 80 smebberson/alpine-nginx
aab0c5999c797f76e1452c180b864968f8c1982c5c08046f2183c712d94ffa21

real    0m0.286s
user    0m0.016s
sys 0m0.028s

A client on vm4 starts a container through the Swarm manager (vm2). The container is started on the vm3 daemon.

dchen@vm4:~$ time docker -H vm2:2372 run -d -p 80 smebberson/alpine-nginx
e328f38349c5b33c7d30985a9bf65701d36928f514fc9409eacd829e15357828

real    0m0.314s
user    0m0.016s
sys 0m0.020s

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@abronan I see, it may be about the networking setup. I'll double-check with a fresh cluster.

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@dongluochen could you try setting up a real overlay-networking cluster and double-checking this?
A standalone cluster doesn't seem to be affected by this slowness.

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@abronan it's back to normal after re-creating the whole cluster like you said.

# time `docker run -d -p 80 smebberson/alpine-nginx`

real    0m1.661s
user    0m0.124s
sys     0m0.004s

I'm pinpointing the cluster store, as it's the only problem I'm aware of.
FYI, it likely caused the problem because I had been running the cluster store for 3 days before testing.

@chanwit chanwit changed the title Performance degradation when running containers Performance degradation on cluster with overlay networking (cluster-store related) Feb 4, 2016
@chanwit
Contributor Author

chanwit commented Feb 4, 2016

Both this and #1752 are related AFAIK.

@dongluochen
Contributor

Thanks @chanwit. We may need to do some tests to detect whether there is a degradation problem in long-running clusters. If you run into this problem again, I think it would be helpful to collect network traces to see where the latency comes from.
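(A minimal sketch of how such traces could be collected, assuming the Consul address from the docker info output above and tcpdump available on the engine node.)

# Capture engine <-> cluster-store traffic while reproducing a slow container start.
tcpdump -i eth0 -w cluster-store.pcap host 128.199.244.241 and port 8500 &
TCPDUMP_PID=$!
time docker run -d -p 80 smebberson/alpine-nginx
kill $TCPDUMP_PID

# Rough check of the KV store's own latency (libnetwork keeps its state under docker/).
time curl -s "http://128.199.244.241:8500/v1/kv/docker/?recurse" >/dev/null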

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@dongluochen it's from --cluster-store, related to the parameters discussed in moby/moby#18204.

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@dongluochen Let me know if you are able to confirm this :-)

@Z3r0Sum

Z3r0Sum commented Jun 10, 2016

@chanwit I'm also experiencing something similar on Docker 1.11.1. I have 12 nodes in my swarm, and things were fine when there were only 2-3 nodes. The docker network ls command even hangs, and creating new networks doesn't appear to be possible. When I try to create new containers on the custom network, that times out too.

@abronan What do you mean by droplets? I tried re-creating both of my swarm managers from scratch and still have the same problem:

docker -H :4000 network create --driver=overlay selenium-net
Error response from daemon: Error response from daemon: pool configuration failed because of Unexpected response code: 413 (Value exceeds 524288 byte limit)

I saw docker/compose#3041 for compose, which seems similar. I tried restarting the daemons as well and still no dice. I'm using docker 1.11.1 on all my nodes.

Edit: After removing every swarm node and all the swarm managers, I was able to get it working again (full redeploy). I'm now re-adding each node one at a time to find the one that might have caused the issue.
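(The 413 above is Consul rejecting a KV value larger than its 512 KiB limit. A rough sketch of how one might check the size of the values libnetwork stores, assuming its default docker/network/v1.0 prefix and a Consul agent reachable at consul:8500.)

# Dump libnetwork's keys from Consul and print the decoded size of each value in bytes.
curl -s "http://consul:8500/v1/kv/docker/network/v1.0/?recurse" > kv.json
python3 -c 'import json, base64; [print(e["Key"], len(base64.b64decode(e["Value"] or ""))) for e in json.load(open("kv.json"))]'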

@dongmx

dongmx commented Jun 30, 2016

@chanwit I'm hitting the same problem. With docker daemon --cluster-store etcd://127.0.0.1:2379, I'm running 200 containers whose networks are provided by a remote driver.

#time docker network ls
NETWORK ID          NAME                DRIVER
ea29194b2793        bridge              bridge
862f4af63e54        host                host
e22c4c251bde        none                null
d2907c5a339e        test                calico

real    0m24.776s
user    0m0.010s
sys 0m0.011s

I think it's caused by libnetwork being too slow when handling GET /v1.23/networks, and Swarm calls it constantly. So the daemon hangs there, and docker ps, docker run, and docker stop all become very slow.

Update: I changed my store from etcd to zk and the problem is fixed.
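(To separate daemon-side slowness from Swarm, the same endpoint can be timed directly against the engine socket; a minimal sketch, assuming a curl build with --unix-socket support and the default socket path.)

# Time the networks endpoint against the local engine directly, bypassing Swarm.
time curl -s --unix-socket /var/run/docker.sock http://localhost/v1.23/networks >/dev/null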

@schmunk42

schmunk42 commented Sep 5, 2016

Same here with 1.10 and 1.11 and the compose v2 format (where an overlay network is the default); same problems as described here and in the related issues.
We have a production swarm on 1.9 which we use only with the compose v1 format; it has been running at the same (fast) speed for months.

The newer swarm gets slower and slower over time. Restarting all agents, including the master, speeds things up for a while and is required whenever our Consul backend is restarted. The older one "survives" Consul restarts without any problems.

I tried the same setup with 1.12.1 and Swarm 1.2.5 today and things got worse 😞 Now I need to restart every engine in the swarm if the networking runs into problems.
Here's a log snippet from Consul 0.6.4 (docker image) shortly before it exits, presumably because of too many requests.

Bunch of related issues:

@sombralibre

Hi there,

I'm not sure if this is the right place to post this, but I really need some help with the following issues:

I've created a swarm cluster as follows:

docker swarm init --advertise-addr 10.0.0.240 --listen-addr 10.0.0.240

Added a node with:

docker swarm join --token SWMTKN-1-44vris9xsytcrms6kg-3p2yktgdubc828u4gzc6rxdas --advertise-addr 10.0.0.241 --listen-addr 10.0.0.241  10.0.0.240:2377

I run "docker node ls' and apparently everthing is ok, then I've created a compose file to be deployed with "docker stack deploy", with successful return, it created a stack with its own network and its services.

version: "3"
services:
 rabbitmq:
  image: rabbitmq:3
  networks:
   - " AlkaOverlay"
 backend:
  image: "IMAGEB:TAGB"
  networks:
   - " AlkaOverlay"
  ports:
   - "1080:80"
  environment:
   - "DJANGO_SETTINGS_MODULE=alka.settings_staging"
  command: "/srv/www/bin/gunicorn.sh"
  deploy:
   mode: "replicated"
   replicas: 2
 celery:
  image: "IMAGEB:TAGB"
  networks:
   - " AlkaOverlay"
  environment:
   - "DJANGO_SETTINGS_MODULE=alka.settings_staging"
  command: "/srv/www/bin/celery.sh"
 frontend:
  image: "IMAGEF:TAGF"
  networks:
   - " AlkaOverlay"
  ports:
   - "80:80"
   - "443:443"
  deploy:
   mode: "replicated"
   replicas: 2

networks:
 AlkaOverlay:
  driver: overlay
  ipam:
   driver: default
   config:
    - subnet: 10.0.253.0/24

The connection issues start to appear when I try to use the web app through TCP port 80. I've basically tried curl requests like this:

for t in {1..10};do echo "LOOP $t"; timeout 3s curl 10.0.0.240:1080 ;done

for t in {1..10};do echo "LOOP $t"; timeout 3s curl 10.0.0.241:1080 ;done

and from inside the "frontend" containers:

for t in {1..10};do echo "LOOP $t"; timeout 3s curl backend ;done

However, every time I run the tests, the responses from the "backend" behave like a round robin: one request works fine, the next gets stuck waiting for a response until it hits the timeout and dies; these timeouts translate into server errors (5XX) for the clients.

In an attempt to get the stack working, I ran every component (backend, frontend, rabbitmq, celery) in separate containers and made their network connections through the physical network interfaces of the host instances; with this setup everything works fine, but the perks of scaling are gone.

I've checked the UDP connections between the hosts and they work fine, and I've also changed the deploy mode from "replicated" to "global", but none of these changes gets the cluster to work correctly.

I would appreciate any help or advice with this issue. Thanks a lot.
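(For what it's worth, swarm-mode overlay networking needs TCP 2377, TCP/UDP 7946 and UDP 4789 open between the nodes; a minimal sketch of a reachability check from the worker back to the manager, assuming netcat is installed. UDP checks with nc are best-effort.)

# From 10.0.0.241: verify the ports swarm-mode overlay networking relies on.
nc -zv  10.0.0.240 2377   # cluster management (TCP)
nc -zv  10.0.0.240 7946   # node gossip (TCP)
nc -zvu 10.0.0.240 7946   # node gossip (UDP)
nc -zvu 10.0.0.240 4789   # VXLAN data plane (UDP)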

@abronan
Contributor

abronan commented Feb 23, 2017

Hi @sombralibre, I think you might want to open a new issue on the docker engine repository instead for more visibility (https://github.com/docker/docker). Cheers.

@sombralibre

@abronan I'll do that, thanks.

@schmunk42

Just FYI: we're having far fewer problems with Consul 0.7.x and docker/swarm 1.2.6.
