The container name \"/etcd-fix-perm\" is already in use by container #2632

Open
barnettZQG opened this issue Aug 4, 2021 · 10 comments

@barnettZQG

RKE version: v1.3.0-rc1.0.20210503155726-c25848db1e86

Docker version: (docker version,docker info preferred) 19.03.5

Operating system and kernel: (cat /etc/os-release, uname -r preferred) CentOS Linux 7 (Core)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VirtualBox

cluster.yml file: Irrelevant to the issue.

Steps to Reproduce:

Results:

time="2021-08-04T17:43:16+08:00" level=debug msg="FIXME: Got an status-code for which error does not match any expected type!!!: -1" module=api status_code=-1
time="2021-08-04T17:43:16+08:00" level=warning msg="Failed to create Docker container [etcd-fix-perm] on host [192.168.3.110]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time="2021-08-04T17:43:16+08:00" level=warning msg="Failed to create Docker container [etcd-fix-perm] on host [192.168.3.110]: Error response from daemon: Conflict. The container name \"/etcd-fix-perm\" is already in use by container \"f1b562493fa1d399976dc19ba31a156ea03194a0088d7eaaa1c0398275cfef1b\". You have to remove (or rename) that container to be able to reuse that name."
time="2021-08-04T17:43:16+08:00" level=warning msg="Failed to create Docker container [etcd-fix-perm] on host [192.168.3.110]: Error response from daemon: Conflict. The container name \"/etcd-fix-perm\" is already in use by container \"f1b562493fa1d399976dc19ba31a156ea03194a0088d7eaaa1c0398275cfef1b\". You have to remove (or rename) that container to be able to reuse that name."
time="2021-08-04T17:43:17+08:00" level=error msg="InstallKubernetes failure, Message: [etcd] Failed to bring up Etcd Plane: Failed to create [etcd-fix-perm] container on host [192.168.3.110]: Failed to create Docker container [etcd-fix-perm] on host [192.168.3.110]: Error response from daemon: Conflict. The container name \"/etcd-fix-perm\" is already in use by container \"f1b562493fa1d399976dc19ba31a156ea03194a0088d7eaaa1c0398275cfef1b\". You have to remove (or rename) that container to be able to reuse that name."

We can reproduce this issue repeatedly; each time a leftover etcd-fix-perm container is involved.

According to the log, there are three retries to create the container. On the first attempt the API reports an error even though the container is actually created, so the second and third attempts fail with a name conflict.

From the code's point of view, could container creation tolerate this error, i.e. treat a name conflict as success? For example:

https://github.com/rancher/rke/blob/master/docker/docker.go#L442

for i := 1; i <= RetryCount; i++ {
	created, err = dClient.ContainerCreate(ctx, imageCfg, hostCfg, nil, nil, containerName)
	if err != nil && !strings.Contains(err.Error(), "is already in use") {
		logrus.Warningf("Failed to create Docker container [%s] on host [%s]: %v", containerName, hostname, err)
		continue
	}
	return created, nil
}
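
For illustration only, here is a rough, self-contained sketch of what an idempotent create could look like, assuming the Docker Go SDK that RKE already vendors (github.com/docker/docker); the helper name ensureContainer, retryCount and the parameter list are hypothetical, not RKE's actual code. On a name conflict it inspects the container that already exists and reuses its ID instead of failing:

package docker

import (
	"context"
	"strings"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"github.com/sirupsen/logrus"
)

const retryCount = 3

// ensureContainer creates a named container, treating "name already in use"
// as success by reusing the container the daemon already created.
func ensureContainer(ctx context.Context, dClient *client.Client, imageCfg *container.Config,
	hostCfg *container.HostConfig, containerName, hostname string) (string, error) {
	var lastErr error
	for i := 1; i <= retryCount; i++ {
		created, err := dClient.ContainerCreate(ctx, imageCfg, hostCfg, nil, nil, containerName)
		if err == nil {
			return created.ID, nil
		}
		if strings.Contains(err.Error(), "is already in use") {
			// The first attempt may have succeeded on the daemon side even though
			// the API call returned an error; reuse the existing container.
			if existing, inspectErr := dClient.ContainerInspect(ctx, containerName); inspectErr == nil {
				return existing.ID, nil
			}
		}
		lastErr = err
		logrus.Warnf("Failed to create Docker container [%s] on host [%s]: %v", containerName, hostname, err)
		time.Sleep(time.Duration(i) * time.Second) // brief back-off before retrying
	}
	return "", lastErr
}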
@barnettZQG
Author

The other problem here is that creating the etcd-fix-perm container fails every time.

@superseb
Contributor

superseb commented Aug 4, 2021

So this is to work around an API error? How does this reproduce every time? Is there an instance or configuration that triggers it consistently? It doesn't reproduce when I test things out, but a solid reproducer would help (also for validating the fix).

@barnettZQG
Author

This issue is easy to reproduce on hosts with poor disk performance.

Once the first container-creation API call returns an error for any reason, the container may already exist, so the subsequent retries are useless.

So my idea is to make container creation idempotent.

@styk-tv

styk-tv commented Sep 7, 2021

I have performed about 200 Rancher redeployments (with automation) on two separate clusters in the last month. I have several conclusions:

  1. Avoid using Ubuntu LTS 20.04 (on air-gapped installs), mostly due to kernel bugs related to VXLAN. 5.4 is definitely not working, and even with 5.8 advertised as fixed there were issues in some components related to a full-chain kernel upgrade. I eventually got it up and running on LTS 20.04 built by Ubuntu for enterprise customers with kernel 5.11 (sorry, open source). There are ZERO problems with cloud images from AWS or Google, simply because by the time they make it to public use they have been patched by the cloud vendors to be immaculate.

  2. Installation with RKE is evil on slow hardware. It takes on average 5-15 failed attempts. Nothing is physically wrong with the hardware; I tried two different setups, with older SATA drives as well as attached storage. Retries reach eventual consistency, moving a few steps forward (out of 520+ steps) each time. I would officially call it "Rancher's Discrimination Against Slower Hardware". It's a mistake we are all guilty of: we trust the cloud runs and sign things off as perfect, but a vendor's cloud image is not the same thing as the cloud image available for download, or an OS on hardware with a slower disk, etc.

  3. It might be related to a fresh local node Docker populating all at once with containers from GCR (locking other disk ops). I can't really be certain.

All it would take, IMHO, is to unify the timeout for certain operations and expose it as a single variable in the cluster file. RKE is simply giving up too fast, but it doesn't have to. Please add some old boxes to the tests, for the sake of all of us.

I would always observe clusters coming up with watch -n5 docker ps on all nodes, and most of the timeout issues occur before the etcd installation, I would say 80% of all the timeouts. However, like I've said before, once you get to this point, retries of rke up lead to small steps of progress and eventual consistency and success.

Someone, somewhere will eventually write a utility that measures how far-reaching the bias really is: RAM speed, disk speed, CPU flags. It would make for an interesting experiment to keep track of this data and generate a crash report that can be pushed to Rancher for analysis. I wouldn't hesitate to send that back to Rancher if it meant someone was looking at it to improve the product.

Peace

@ElinUrevich8

Hey there!
This issue reproduces inconsistently in our environment.
Is there any workaround or method to increase these timeouts in order to avoid failing due to slow hardware?
Thanks in advance :)

@HectorB-2020

Hello @barnettZQG and @styk-tv,
I'm afraid we're observing this issue even in 2023, on rke v1.4.7 and on pretty fast virtual servers. In our case it trips over an existing container named nginx-proxy on a number of workers and fails.

OK, here is the setup for 30+ VMs.

  • rke tool v1.4.7
  • k8s 1.24.15-rancher1-1
  • CentOS 7.9
  • kernel 3.10.0-1160.el7
  • HW - VMWare based servers, settings vary: 6-14 CPU cores, 32-64 GB RAM
  • purely air-gapped environment

@superseb, this mostly answers your question about how often it is reproduced.
In our case we hit a 100% failure rate.

Initially everything starts clear and bright. The run begins normally:
DEBU[0000] Loglevel set to [debug]
INFO[0000] Running RKE version: v1.4.7
DEBU[0000] audit log policy found in cluster.yml
INFO[0000] Initiating Kubernetes cluster
DEBU[0000] Loading data.json from local source
DEBU[0000] data.json SHA256 checksum: 5f13312d74be24e7121c58dc299f16a728dac04d4635644e7655db30bdc76f68 
DEBU[0000] No DNS provider configured, setting default based on cluster version [1.24.15-rancher1-1] 
DEBU[0000] DNS provider set to [coredns]
DEBU[0000] Checking if cluster version [1.24.15-rancher1-1] needs to have kube-api audit log enabled 
DEBU[0000] Cluster version [1.24.15-rancher1-1] needs to have kube-api audit log enabled 
DEBU[0000] Enabling kube-api audit log for cluster version [v1.24.15-rancher1-1] 
DEBU[0000] No input provided for maxUnavailableWorker, setting it to default value of 10 percent 
DEBU[0000] No input provided for maxUnavailableControlplane, setting it to default value of 1 
DEBU[0000] Checking network mode for ingress for cluster version [v1.24.15-rancher1-1] 
DEBU[0000] Cluster version [v1.24.15-rancher1-1] needs to have ingress network mode set to hostPort 
DEBU[0000] Checking ingress default backend for cluster version [v1.24.15-rancher1-1] 
DEBU[0000] Cluster version [v1.24.15-rancher1-1] needs to have ingress default backend disabled 
...
INFO[0351] [/etc/kubernetes/audit-policy.yaml] Successfully deployed audit policy file to Cluster control nodes 
INFO[0351] [reconcile] Reconciling cluster state
INFO[0351] [reconcile] This is newly generated cluster
DEBU[0351] Encryption is disabled in both current and new spec; no action is required 
INFO[0351] Pre-pulling kubernetes images
It even managed to successfully start nginx-proxy on a few hosts: srv55, srv56, srv57, srv63
INFO[0606] Starting container [nginx-proxy] on host [srv63.company.net], try #1 
INFO[0607] Starting container [nginx-proxy] on host [srv56.company.net], try #1 
INFO[0607] Starting container [nginx-proxy] on host [srv55.company.net], try #1 
INFO[0607] [worker] Successfully started [nginx-proxy] container on host [srv63.company.net] 
DEBU[0607] [worker] Creating log link for Container [nginx-proxy] on host [srv63.company.net] 
...
DEBU[0607] Checking if image [srv64.company.net:5000/rancher/rke-tools:v0.1.89] exists on host [srv63.company.net], try #1 
INFO[0607] Starting container [nginx-proxy] on host [srv57.company.net], try #1 
...
INFO[0607] [worker] Successfully started [nginx-proxy] container on host [srv56.company.net] 
DEBU[0607] [worker] Creating log link for Container [nginx-proxy] on host [srv56.company.net] 
...
INFO[0607] [worker] Successfully started [nginx-proxy] container on host [srv55.company.net] 
DEBU[0607] [worker] Creating log link for Container [nginx-proxy] on host [srv55.company.net] 
...
INFO[0607] [worker] Successfully started [nginx-proxy] container on host [srv57.company.net] 
DEBU[0607] [worker] Creating log link for Container [nginx-proxy] on host [srv57.company.net] 
...
DEBU[0607] [worker] Successfully created log link for Container [nginx-proxy] on host [srv63.company.net] 
...
INFO[0608] Starting container [nginx-proxy] on host [srv45.company.net], try #1 
...
DEBU[0608] [worker] Successfully created log link for Container [nginx-proxy] on host [srv56.company.net] 
...
DEBU[0608] [worker] Successfully created log link for Container [nginx-proxy] on host [srv55.company.net] 
...
INFO[0608] Starting container [nginx-proxy] on host [srv22.company.net], try #1 
...
DEBU[0608] [worker] Successfully created log link for Container [nginx-proxy] on host [srv57.company.net] 
...
INFO[0608] Starting container [nginx-proxy] on host [srv46.company.net], try #1 
...
INFO[0608] [worker] Successfully started [nginx-proxy] container on host [srv45.company.net] 
DEBU[0608] [worker] Creating log link for Container [nginx-proxy] on host [srv45.company.net] 

But then something happens on the other nodes: they complain about the Docker daemon at unix:///var/run/docker.sock. The number of nodes with an allegedly failed Docker varies each run, but it is usually more or less the same set. I would not say they have poor hardware.
BTW, what is this FIXME about?

DEBU[0656] FIXME: Got an status-code for which error does not match any expected type!!!  error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" module=api status_code=-1
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv33.company.net]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 
DEBU[0656] FIXME: Got an status-code for which error does not match any expected type!!!  error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" module=api status_code=-1
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv28.company.net]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 
DEBU[0656] FIXME: Got an status-code for which error does not match any expected type!!!  error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" module=api status_code=-1

And later the same hosts suddenly complain that they already have nginx-proxy.

WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv38.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "6b98c978...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv51.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "5b767430...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv33.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "037c79cc...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv33.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "037c79cc...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv51.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "5b767430...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv38.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "6b98c978...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv39.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "feb8b241...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv28.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "639ecd3e...". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv39.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "feb8b241...". You have to remove (or rename) that container to be able to reuse that name. 
...
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv44.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "52ee5621cf0538349a801f9db4d9e1933cf8e2e181d010d7c168fed083013371". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv32.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "a7d936b0e58ceac67e37fcf465c3a69c4cec5f3df139e6a03ac971b020a49b26". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv44.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "52ee5621cf0538349a801f9db4d9e1933cf8e2e181d010d7c168fed083013371". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0656] Failed to create Docker container [nginx-proxy] on host [srv32.company.net]: Error response from daemon: Conflict. The container name "/nginx-proxy" is already in use by container "a7d936b0e58ceac67e37fcf465c3a69c4cec5f3df139e6a03ac971b020a49b26". You have to remove (or rename) that container to be able to reuse that name. 

Sometimes a node can successfully start nginx-proxy while on the next attempt the same node fails on this container. However, for the sake of accuracy, one set of nodes usually fails and roughly another set of nodes starts nginx-proxy normally.
And this is only about nginx-proxy; I've never seen other container names in this fatal error.


All right, we've dismantled the cluster with rke remove and assembled it again with rke up.
We intentionally didn't remove any remaining artifacts on the nodes, meaning all previously pulled images stayed on them.

The cluster has started, though none of the nodes are Ready. We hit exactly the same issue as described there:

@superseb
Contributor

superseb commented Aug 9, 2023

@HectorB-2020 Thanks for the data, this will require some investigation. As this seems to be mostly I/O bound, what type of disk and IOPS do you have on the VMs? If there were a 100% reproduction rate on this issue, it would be getting a lot more attention, so we need to figure out how you can reproduce it so often.

We probably also want Docker daemon debug logs from the time it reports the error, to see what it actually receives and responds with, and then decide how we can make RKE more resilient to this. Just continuing if the name is already in use is too easy, I think; we can't be sure the existing container matches what we wanted. We probably want to delete it and let the loop try again (possibly with a sleep to give the machine some time to recover). Any data is appreciated.
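
For illustration, a minimal sketch of that remove-and-retry idea (not the actual change), again assuming the Docker Go SDK that RKE vendors; the helper name createWithCleanup, the fixed retry count and the 30-second sleep are hypothetical example values:

package docker

import (
	"context"
	"strings"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"github.com/sirupsen/logrus"
)

// createWithCleanup creates a named container; on a name conflict it
// force-removes the leftover container, sleeps, and lets the loop create it
// again so the final container is known to have the intended configuration.
func createWithCleanup(ctx context.Context, dClient *client.Client, imageCfg *container.Config,
	hostCfg *container.HostConfig, containerName, hostname string) (string, error) {
	var lastErr error
	for i := 1; i <= 3; i++ {
		created, err := dClient.ContainerCreate(ctx, imageCfg, hostCfg, nil, nil, containerName)
		if err == nil {
			return created.ID, nil
		}
		lastErr = err
		if strings.Contains(err.Error(), "is already in use") {
			// Don't trust the half-created container; remove it and recreate it.
			if rmErr := dClient.ContainerRemove(ctx, containerName, types.ContainerRemoveOptions{Force: true}); rmErr != nil {
				logrus.Warnf("Failed to remove conflicting container [%s] on host [%s]: %v", containerName, hostname, rmErr)
			}
		}
		logrus.Warnf("Failed to create Docker container [%s] on host [%s]: %v, sleeping 30 seconds", containerName, hostname, err)
		time.Sleep(30 * time.Second) // give a slow host time to recover
	}
	return "", lastErr
}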

@HectorB-2020

@superseb, I was trying to examine the Docker logs but didn't find anything suspicious there. At least I didn't find evidence of a stopped Docker daemon, if I interpret this error correctly:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock.

Surely you know better than I do, but in my view it's unlikely that disk I/O plays any significant role here. We have quite powerful VMs based on a VMware enterprise hypervisor, and these VMs don't have anything else to run; they are purely empty, waiting for k8s to settle on them.

To me it looks more like a bottleneck in the network or in our private registry running on Docker Registry v2.
Imagine 30+ powerful nodes start pulling a bunch of images all at once from that poor registry. I've got a feeling they either saturate the network or the disk I/O on that poor Docker Registry v2. The latter is implausible, I'm afraid: they all pull the same images, so the first connected node forces the data to be loaded from disk into the memory cache, and the rest of the nodes keep the same data warm in that cache. The VM designated for the registry has 16 GB of RAM, which lets it cope with the amount of image data that needs to be sent to the nodes, I guess.

Another hunch is that the network topology could be the culprit. The fact that roughly the same set of nodes usually pulls images successfully while the other set usually fails lets me surmise that the successful nodes could share the same rack and ToR switch in the data center, while the other, less successful nodes reside in other racks. Unfortunately I'm not aware of the actual network topology.

@HectorB-2020

One thing has been a mystery for me since the very beginning: why is it always nginx-proxy?
@barnettZQG, the original reporter, mentions etcd-fix-perm. Could that be a result of changes introduced into the rke code since 2021?

@superseb
Contributor

The reason I point out disk I/O is that it is the only way I have been able to reproduce the issue. If I restrict reads and writes on the disk holding the Docker root directory, I can reproduce this fairly reliably.

This is the log with some added logging and sleeps on a limited I/O host:

WARN[0056] Failed to create Docker container [cluster-state-deployer] on host [68.183.3.68]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, sleeping 30 seconds 
WARN[0087] Failed to create Docker container [cluster-state-deployer] on host [68.183.3.68]: Error response from daemon: Conflict. The container name "/cluster-state-deployer" is already in use by container "b63aea395be7f74f6f9c487a5e06beb01c07b435412583fade45762c0c8ad1f2". You have to remove (or rename) that container to be able to reuse that name.
INFO[0087] Removing container [cluster-state-deployer] on host [68.183.3.68], try #1 
INFO[0087] Successfully removed Docker container [cluster-state-deployer] on host [68.183.3.68] 
WARN[0087] Failed to create Docker container [cluster-state-deployer] on host [68.183.3.68]: Error response from daemon: Conflict. The container name "/cluster-state-deployer" is already in use by container "b63aea395be7f74f6f9c487a5e06beb01c07b435412583fade45762c0c8ad1f2". You have to remove (or rename) that container to be able to reuse that name., sleeping 30 seconds
INFO[0133] Successfully created Docker container [cluster-state-deployer] on host [68.183.3.68] 
INFO[0133] Starting container [cluster-state-deployer] on host [68.183.3.68], try #1 
INFO[0134] [state] Successfully started [cluster-state-deployer] container on host [68.183.3.68]

Needs some more investigation but we should be able to implement a solution.

@superseb superseb removed this from the RKE v1.x - Backlog - Rancher v2.x milestone Sep 19, 2023
@superseb superseb added this to the v1.5.1 milestone Oct 2, 2023
@jiaqiluo jiaqiluo modified the milestones: v1.5.1, v1.5.x Mar 27, 2024