This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Performance degradation on cluster with overlay networking (cluster-store related) #1750

Closed
chanwit opened this issue Feb 3, 2016 · 20 comments

Comments

@chanwit
Contributor

chanwit commented Feb 3, 2016

I'm not sure whether it's caused by the Engine or by Swarm, but deployment feels clearly slower with Docker 1.10-rc3 and Swarm built from master.
All nodes are wired together through a cluster store (Consul).

24 seconds to start a new container through Swarm is too slow IMHO.
9 seconds to run one directly is also strange.
Each DigitalOcean node has 512 MB of memory. Could this be the cause?

Is anyone able to confirm this?

Directly without Swarm:

root@ocean-master:~# time `docker run -d -p 80 smebberson/alpine-nginx`
real    0m9.405s
user    0m0.024s
sys     0m0.020s

root@ocean-master:~# docker info
Containers: 7
 Running: 7
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.10.0-rc3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 40
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: overlay bridge null host
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 494.5 MiB
Name: ocean-master
ID: VPPB:2H2F:MVKH:GTEG:YFX2:6SVQ:EXLO:YMYW:4I2Q:CF3H:L27T:HIMG
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
Labels:
 provider=digitalocean
Cluster store: consul://128.199.244.241:8500
Cluster advertise: 104.236.37.137:2376
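(For reference, the Cluster store / Cluster advertise values above typically come from daemon flags along these lines; a minimal sketch using the addresses shown, not the exact configuration from these droplets.)

# Sketch: per-node daemon flags that produce the cluster-store wiring above.
docker daemon \
  --cluster-store=consul://128.199.244.241:8500 \
  --cluster-advertise=104.236.37.137:2376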

Running through Swarm:

debian-1gb-sgp1-02:~
# time `docker run -d -p 80 smebberson/alpine-nginx`

real    0m24.513s
user    0m0.104s
sys     0m0.036s
# docker-machine ls
NAME           ACTIVE      DRIVER         STATE     URL                          SWARM                   DOCKER        ERRORS
consul         -           digitalocean   Running   tcp://xxx.xxx.xxx.xxx:2376                           v1.10.0-rc2
ocean-1        -           digitalocean   Running   tcp://xx.xx.xx.xxx:2376      ocean-master            v1.10.0-rc3
ocean-2        -           digitalocean   Running   tcp://xxx.xxx.xxx.xxx:2376   ocean-master            v1.10.0-rc3
ocean-master   * (swarm)   digitalocean   Running   tcp://xxx.xxx.xx.xxx:2376    ocean-master (master)   v1.10.0-rc3
@chanwit
Contributor Author

chanwit commented Feb 3, 2016

I can't see anything in top; CPU and memory usage are quite low.
I hope it's just misleading baseline performance of a DigitalOcean node :(

@chanwit
Contributor Author

chanwit commented Feb 3, 2016

Oh right, could it be caused by the content-addressable image store?

@chanwit
Contributor Author

chanwit commented Feb 3, 2016

Tested 1.10-rc3 standalone on a physical box and it's fast.

root@os-1:/home/debian# time `docker run -d nginx`

real    0m0.082s
user    0m0.020s
sys     0m0.004s

So it's obviously not caused by the Engine in its standalone mode.

Still checking.

@amitshukla amitshukla added this to the 1.1.0 milestone Feb 3, 2016
@dongluochen
Contributor

Thanks @chanwit. We should also add an integration test for latency.
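(A minimal sketch of what such a latency check could look like, using placeholder hosts ENGINE_HOST and SWARM_MANAGER; this is not an existing test in the repo.)

# Hypothetical latency smoke test: compare container start time against an engine
# directly vs. through the Swarm manager (hosts and ports are placeholders).
for target in "tcp://$ENGINE_HOST:2375" "tcp://$SWARM_MANAGER:3375"; do
  start=$(date +%s%N)
  docker -H "$target" run -d -p 80 smebberson/alpine-nginx >/dev/null
  end=$(date +%s%N)
  echo "$target: $(( (end - start) / 1000000 )) ms"
done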

@abronan
Contributor

abronan commented Feb 3, 2016

Hi @chanwit,

I'm guessing you are using container networking when creating those containers? Can you try destroying and re-creating the droplets (if there are only a few)? Or otherwise find the faulty droplet.

I had the same issue when trying out container networking with Docker 1.9. I found out that one of the droplets was faulty and had its network traffic stalled somehow. After destroying and re-creating that specific droplet, the problem was gone and containers were created very quickly. Not sure what the cause was; probably interference from a bad VM on that specific droplet.

Let us know if it's still slow after that, but I think this might be related to instance performance.

@dongluochen
Contributor

Here is my test result. I don't see an obvious issue. I'm running directly on the latest code.

A client on vm4 starts a container directly on the vm3 daemon.

dchen@vm4:~$ time docker -H vm3:4444 run -d -p 80 smebberson/alpine-nginx
aab0c5999c797f76e1452c180b864968f8c1982c5c08046f2183c712d94ffa21

real    0m0.286s
user    0m0.016s
sys 0m0.028s

A client on vm4 starts a container through the Swarm manager (vm2). The container is started on the vm3 daemon.

dchen@vm4:~$ time docker -H vm2:2372 run -d -p 80 smebberson/alpine-nginx
e328f38349c5b33c7d30985a9bf65701d36928f514fc9409eacd829e15357828

real    0m0.314s
user    0m0.016s
sys 0m0.020s

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@abronan I see, it may be about the networking setup. I'll double-check with a fresh cluster.

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@dongluochen could you try setting up a real overlay-networking cluster and double-checking this?
A standalone cluster doesn't seem to be affected by this slowness.

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@abronan it's back to normal after re-creating the whole cluster like you said.

# time `docker run -d -p 80 smebberson/alpine-nginx`

real    0m1.661s
user    0m0.124s
sys     0m0.004s

I'm pinpointing the cluster store, as it's the only problem I'm aware of.
FYI, it likely caused the problem because I had been running the cluster store for 3 days before testing.

@chanwit chanwit changed the title Performance degradation when running containers Performance degradation on cluster with overlay networking (cluster-store related) Feb 4, 2016
@chanwit
Contributor Author

chanwit commented Feb 4, 2016

Both this and #1752 are related AFAIK.

@dongluochen
Contributor

Thanks @chanwit. We may need to do some tests to detect whether there is a degradation problem in long-running clusters. If you run into this problem again, I think it would be helpful to collect network traces to see where the latency comes from.
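(A minimal sketch of how such traces could be collected, assuming the Consul address from the docker info output above and tcpdump available on the engine node.)

# Capture engine <-> cluster-store traffic while reproducing a slow container start.
tcpdump -i eth0 -w cluster-store.pcap host 128.199.244.241 and port 8500 &
TCPDUMP_PID=$!
time docker run -d -p 80 smebberson/alpine-nginx
kill $TCPDUMP_PID

# Rough check of the KV store's own latency (libnetwork keeps its state under docker/).
time curl -s "http://128.199.244.241:8500/v1/kv/docker/?recurse" >/dev/null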

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@dongluochen it's from --cluster-store, related to the parameters discussed in moby/moby#18204.

@chanwit
Contributor Author

chanwit commented Feb 4, 2016

@dongluochen Let me know if you are able to confirm this :-)

@Z3r0Sum

Z3r0Sum commented Jun 10, 2016

@chanwit I'm also experiencing something similar on Docker 1.11.1. I have 12 nodes in my swarm, and things were fine when there were only 2-3 nodes. The docker network ls command even hangs, and creating new networks doesn't appear to be possible. When I try to create new containers on the custom network, that times out too.

@abronan What do you mean by droplets? I tried re-creating both of my swarm managers from scratch and still have the same problem:

docker -H :4000 network create --driver=overlay selenium-net
Error response from daemon: Error response from daemon: pool configuration failed because of Unexpected response code: 413 (Value exceeds 524288 byte limit)

I saw docker/compose#3041 for compose, which seems similar. I tried restarting the daemons as well and still no dice. I'm using docker 1.11.1 on all my nodes.

Edit: After removing every swarm node and all the swarm managers, I was able to get it working again (full redeploy). I'm now re-adding each node one at a time to find the one that might have caused the issue.
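(The 413 above is Consul rejecting a KV value larger than its 512 KiB limit. A rough sketch of how one might check the size of the values libnetwork stores, assuming its default docker/network/v1.0 prefix and a Consul agent reachable at consul:8500.)

# Dump libnetwork's keys from Consul and print the decoded size of each value in bytes.
curl -s "http://consul:8500/v1/kv/docker/network/v1.0/?recurse" > kv.json
python3 -c 'import json, base64; [print(e["Key"], len(base64.b64decode(e["Value"] or ""))) for e in json.load(open("kv.json"))]'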

@dongmx

dongmx commented Jun 30, 2016

@chanwit I'm hitting the same problem. With docker daemon --cluster-store etcd://127.0.0.1:2379, I'm running 200 containers whose networks are provided by a remote driver.

#time docker network ls
NETWORK ID          NAME                DRIVER
ea29194b2793        bridge              bridge
862f4af63e54        host                host
e22c4c251bde        none                null
d2907c5a339e        test                calico

real    0m24.776s
user    0m0.010s
sys 0m0.011s

I think it's caused by libnetwork being too slow when handling GET /v1.23/networks, and Swarm calls it constantly. So the daemon hangs there, and docker ps, docker run, and docker stop all become very slow.

Update: I changed my store from etcd to zk and the problem is fixed.
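(To separate daemon-side slowness from Swarm, the same endpoint can be timed directly against the engine socket; a minimal sketch, assuming a curl build with --unix-socket support and the default socket path.)

# Time the networks endpoint against the local engine directly, bypassing Swarm.
time curl -s --unix-socket /var/run/docker.sock http://localhost/v1.23/networks >/dev/null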

@schmunk42

schmunk42 commented Sep 5, 2016

Same here with 1.10 and 1.11 and the compose v2 format (where an overlay network is the default); same problems as described here and in the related issues.
We have a production swarm on 1.9 which we use only with the compose v1 format; it has been running at the same (fast) speed for months.

The newer swarm gets slower and slower over time. Restarting all agents, including the master, speeds things up for a while and is required whenever our Consul backend is restarted. The older one "survives" Consul restarts without any problems.

I tried the same setup with 1.12.1 and Swarm 1.2.5 today and things got worse 😞 Now I need to restart every engine in the swarm if the networking runs into problems.
Here's a log snippet from Consul 0.6.4 (docker image) shortly before it exits, presumably because of too many requests.

Bunch of related issues:

@sombralibre

Hi there,

I'm not sure if this is the right place to post this, but I really need some help with the following issues:

I've created a swarm cluster as follows:

docker swarm init --advertise-addr 10.0.0.240 --listen-addr 10.0.0.240

Added a node with:

docker swarm join --token SWMTKN-1-44vris9xsytcrms6kg-3p2yktgdubc828u4gzc6rxdas --advertise-addr 10.0.0.241 --listen-addr 10.0.0.241  10.0.0.240:2377

I run "docker node ls' and apparently everthing is ok, then I've created a compose file to be deployed with "docker stack deploy", with successful return, it created a stack with its own network and its services.

version: "3"
services:
 rabbitmq:
  image: rabbitmq:3
  networks:
   - " AlkaOverlay"
 backend:
  image: "IMAGEB:TAGB"
  networks:
   - " AlkaOverlay"
  ports:
   - "1080:80"
  environment:
   - "DJANGO_SETTINGS_MODULE=alka.settings_staging"
  command: "/srv/www/bin/gunicorn.sh"
  deploy:
   mode: "replicated"
   replicas: 2
 celery:
  image: "IMAGEB:TAGB"
  networks:
   - " AlkaOverlay"
  environment:
   - "DJANGO_SETTINGS_MODULE=alka.settings_staging"
  command: "/srv/www/bin/celery.sh"
 frontend:
  image: "IMAGEF:TAGF"
  networks:
   - " AlkaOverlay"
  ports:
   - "80:80"
   - "443:443"
  deploy:
   mode: "replicated"
   replicas: 2

networks:
 AlkaOverlay:
  driver: overlay
  ipam:
   driver: default
   config:
    - subnet: 10.0.253.0/24

The connection issues start to appear when I try to use the web app through TCP port 80. I've basically tried curl requests like this:

for t in {1..10};do echo "LOOP $t"; timeout 3s curl 10.0.0.240:1080 ;done

for t in {1..10};do echo "LOOP $t"; timeout 3s curl 10.0.0.241:1080 ;done

and from inside the "frontend" containers:

for t in {1..10};do echo "LOOP $t"; timeout 3s curl backend ;done

However, every time I run the tests, the responses from the "backend" behave like a round robin: one request works fine, the next gets stuck waiting for a response until it hits the timeout and dies; these timeouts translate into server errors (5XX) for the clients.

In an attempt to get the stack working, I ran every component (backend, frontend, rabbitmq, celery) in separate containers and made their network connections through the physical network interfaces of the host instances; with this setup everything works fine, but the perks of scaling are gone.

I've checked the UDP connections between the hosts and they work fine, and I've also changed the deploy mode from "replicated" to "global", but none of these changes gets the cluster to work correctly.

I would appreciate any help or advice with this issue. Thanks a lot.
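(For what it's worth, swarm-mode overlay networking needs TCP 2377, TCP/UDP 7946 and UDP 4789 open between the nodes; a minimal sketch of a reachability check from the worker back to the manager, assuming netcat is installed. UDP checks with nc are best-effort.)

# From 10.0.0.241: verify the ports swarm-mode overlay networking relies on.
nc -zv  10.0.0.240 2377   # cluster management (TCP)
nc -zv  10.0.0.240 7946   # node gossip (TCP)
nc -zvu 10.0.0.240 7946   # node gossip (UDP)
nc -zvu 10.0.0.240 4789   # VXLAN data plane (UDP)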

@abronan
Contributor

abronan commented Feb 23, 2017

Hi @sombralibre, I think you might want to open a new issue on the docker engine repository instead for more visibility (https://github.com/docker/docker). Cheers.

@sombralibre

@abronan I'll do that, thanks.

@schmunk42

Just FYI: we're having far fewer problems with Consul 0.7.x and docker/swarm 1.2.6.
