Driver devicemapper failed to remove root filesystem. Device is busy #27381
Is this happening with any container? What is running in the container, and what options do you use to start it? (e.g. are you using bind-mounted directories?)
We run all containers in pretty much the same way, and it happens randomly on any one of them.
I've just noticed that this happens only on servers with this filesystem (ext4).
This hits us hard in production :/ Any hints on how to remove the dead containers?
@thaJeztah Strange that this would happen only with ext4 and not XFS; I am not aware of any such thing. In general, people have reported the device being busy, and there can be many reasons for that.

@ceeko First of all, make sure the docker daemon is running in a slave mount namespace of its own and not the host mount namespace, so that mount points don't leak and the chances of getting such errors are lower. If you are using a systemd-driven docker, there should be a docker unit file, and it should have MountFlags=slave.
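For systemd-based installs, a minimal sketch of applying this via a drop-in rather than editing the shipped unit (the drop-in filename is illustrative; unit name and paths are the common defaults and may differ per distro):

```ini
# /etc/systemd/system/docker.service.d/mountflags.conf
[Service]
MountFlags=slave
```

Then reload and restart: `systemctl daemon-reload && systemctl restart docker`.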
@rhvgoyal Btw, it may be worth updating the storage-driver docs to recommend this as a best practice in production, since I couldn't find any reference to it. Thank you for your help.
This was changed a while back: 2aee081#diff-ff907ce70a8c7e795bde1de91be6fa68 (#22806). Per the discussion, this may be an issue if deferred removal is not enabled: #22806 (comment). Should we change the default back? @rhvgoyal
@thaJeztah I think it might be a good idea to change the default back to MountFlags=slave; we have done that. Ideally, the deferred removal and deferred deletion features should have taken care of this, with no need for MountFlags=slave. But deferred deletion alone is not sufficient: old kernels are missing a feature that allows removing a directory from one mount namespace even if it is mounted in a different mount namespace, and that's one reason container removal can fail. So until old kernels offer that feature, it might be a good idea to run the docker daemon in a slave mount namespace.
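For reference, a sketch of enabling those deferred features through the daemon configuration (assuming the devicemapper storage driver and a daemon that reads `/etc/docker/daemon.json`; the `dm.*` names are docker's documented storage options):

```json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.use_deferred_removal=true",
    "dm.use_deferred_deletion=true"
  ]
}
```

Note that deferred deletion also requires kernel support, which is part of why it alone is not sufficient on older kernels.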
@rhvgoyal The errors started to appear again even with MountFlags=slave.
We have just experienced the same error.
I confirm that the error still occurs on 1.12.2, even with MountFlags=slave.
Here's more info from the logs regarding one container which could not be removed:
The following message suggests that the directory removal failed. In an older kernel it can fail because the directory is mounted in some other mount namespace.

The core of the issue here is that the container is either still running, or some of its mount points have leaked into some other mount namespace. If we can figure out which mount namespace it leaked into and how it got there, we could try fixing it.

Once you run into this issue, you can try searching /proc/*/mountinfo and then see which pids have container-related mounts leaked into them. That might give some idea.
I have tried four containers which are all dead and cannot be removed due to the device being busy, and got nothing :/
Now I'm actually getting a slightly different error message:
Same thing: this directory can't be deleted because it is mounted in some other mount namespace. Try searching in /proc/*/mounts and grep for this id.
Here we go:
What processes do these pids map to? Try `ps aux | grep <pid>`.
These are all nginx worker processes shutting down after a config reload (see edited comment above). I'm wondering why they hold the mounts, since the containers do not bind any volumes.
So is the nginx process running in another container? Or is it running on the host?
Can you do the following and paste the output here: compare /proc/<docker-pid>/ns/mnt and /proc/<nginx-pid>/ns/mnt.
nginx runs on the host.

docker-pid:

nginx-pid:
Your docker-pid and host both seem to be sharing the same mount namespace, which means the docker daemon is running in the host mount namespace. It also appears that nginx started at some point after container start and is running in its own mount namespace; at that time, mount points leaked into nginx's mount namespace, and that's preventing deletion of the container. Please make sure MountFlags=slave is working for you. Once it is working, /proc/<pid>/ns/mnt will give different output for the docker daemon and a bash shell running in the host mount namespace.
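A small sketch of that check (the helper name is illustrative): each `/proc/<pid>/ns/mnt` is a symlink whose target looks like `mnt:[4026531840]`, and equal targets mean the two processes share a mount namespace.

```shell
#!/bin/sh
# same_mnt_ns PID1 PID2
# Succeed if the two processes share a mount namespace, by comparing the
# targets of their /proc/<pid>/ns/mnt symlinks.
same_mnt_ns() {
    [ "$(readlink "/proc/$1/ns/mnt")" = "$(readlink "/proc/$2/ns/mnt")" ]
}

# Example: with MountFlags=slave working, the daemon should NOT share the
# host (PID 1) mount namespace:
#   same_mnt_ns "$(pidof dockerd)" 1 && echo "dockerd is in the host mount namespace"
```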
@NeckBeardPrince Please don't waste our time with such pointless commentary. Other than that, there are a couple of ways of getting around this issue that have been posted here. |
The systemd unit file does not ship with MountFlags=slave.
Last time I had this problem, it was another process holding the mounts. Example for finding the process holding the mounts:

```
# container with the problem
docker rm efad7...
Error response from daemon: Driver devicemapper failed to remove root filesystem efad7...: remove /var/lib/docker/devicemapper/mnt/9bd66290ee...: device or resource busy

# Grep for parts of the mountpoint
grep docker /proc/*/mountinfo | grep 9bd66290ee
/proc/9736/mountinfo:776 427 253:24 / /var/lib/docker/devicemapper/mnt/9bd66290e...
/proc/9910/mountinfo:776 427 253:24 / /var/lib/docker/devicemapper/mnt/9bd66290e...

# Find which processes the pids belong to
ps aux | grep -E "9736|9910"
mysql 9736 0.0... /usr/bin/mysqld_safe --basedir=/usr
mysql 9910 9.8 ... /usr/libexec/mysqld --base...

# Do some extra research on one of the pids
grep docker /proc/9736/mountinfo | wc -l
70
grep docker /proc/9736/mountinfo | grep -o "/run/docker/netns/" | wc -l
17
grep docker /proc/9736/mountinfo | grep -o "/var/lib/docker/containers/" | wc -l
18
grep docker /proc/9736/mountinfo | grep -o "/var/lib/docker/devicemapper/mnt/" | wc -l
33
```

After restarting mariadb, it let go of the mountpoints; however, it grabbed a lot of them when it started:

```
grep docker /proc/16367/mountinfo | wc -l
52
```
Most of the removal failures are due to the mount point (and hence the device) being busy in some other mount namespace. I think the following proposed PR will help with this problem if the kernel is new enough. If you are running an old kernel, we have written a plugin called oci-umount to reduce mount-leaking problems.
@rhvgoyal Do you have a plan for which release of docker will include this PR? We are still dealing with this issue.
CentOS Linux release 7.4.1708 (Core): looks like it is finally fixed!
We are running Docker version 17.09.0-ce and still face the same issue. |
We are occasionally hitting this issue on Oracle Linux, with docker version 17.03.1-ce (from Oracle's repos).

The above is all fixed by the project's TDA, so we can't change any of it for the time being. 90% of our other environments are CentOS 7.3/7.4, and we've not seen the issue there.
Just managed to solve an instance of this issue with Docker 17.05 on Arch Linux, kernel 4.11.9.

This made the container finally disappear (not sure why, though).
@MonsieurWave As incredible as it looks, the "ls" trick worked perfectly for me when everything else did not!
We've been using Docker on CentOS 7.x (currently at 7.4) for over a year now. When we first installed Docker, everything and everyone said you had to use devicemapper with direct-lvm for the best performance and stability. https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/ still says you have to use devicemapper on CentOS with Docker EE. Fortunately, we use Docker CE, so we could switch to overlay2. I feel like the Docker folks slipped in the change in the default from devicemapper to overlay2 on CentOS in v1.13.0/1 with little fanfare or discussion. Is there any solid information on performance/stability of overlay2 versus devicemapper (direct-lvm) on CentOS 7? My googling hasn't found much.... |
We had a very bad time with stock CentOS 7.2 kernels (their 3.10.x frankenstein). Lots of crashes. We were running Kubernetes in a dev env, so the churn of our containers was very high, but even in relatively quiet installations we found the stock CentOS+overlay combo very unstable. Running a 4.10+ upstream kernel with overlay2 is much better. Haven't tried a newer CentOS release. You will need to either use an underlying filesystem that is ext4 or XFS formatted with "-n ftype=1". Docker will run if you have an improperly formatted XFS, but the results will be unpredictable. |
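A small sketch for verifying that point (the helper name is illustrative): `xfs_info <mountpoint>` reports the naming options, and overlay2 needs `ftype=1` there. The parser is split out so the logic can be checked without a real XFS mount.

```shell
#!/bin/sh
# check_ftype: read `xfs_info` output on stdin and succeed only if the
# filesystem was formatted with ftype=1 (required for overlay2 on XFS).
check_ftype() {
    grep -q 'ftype=1'
}

# Usage against a real mount (requires xfsprogs):
#   xfs_info /var/lib/docker | check_ftype && echo "ok for overlay2"
# If the check fails, the filesystem has to be recreated, e.g. with:
#   mkfs.xfs -n ftype=1 /dev/<device>
```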
Yeah, I've long since switched to overlay2, and I recommend that anyone still on devicemapper who can use overlay2 switch, since, even this issue aside, I've read that devicemapper is a very poor storage driver for docker in general.
Restarting ntpd fixed the issue I was having... so confusing. Is there any "recommended" daemon.json configuration for docker on CentOS 7?
Some improvements are coming down the pipeline. Specifically, the issue with these other system services appears to be a race condition between setting up mount namespaces (for those other system services) and docker's attempt to keep its own mounts private. The intention is for Docker to keep its mounts from leaking into containers; unfortunately this causes leakages elsewhere, with those services ending up holding private references to the mountpoints, which means they can't be unmounted in those namespaces except manually or when the process restarts. In addition, there have been some recent changes to deal with race conditions when using MS_PRIVATE mount propagation, in both runc and docker.
I got exactly the same thing as @ceecko with docker 12.1.1; no chance to update now. Is it fixed somewhere later? A quick fix is to kill the processes and restart the docker service, but...
These versions completely fix the issue for me.
@esabol We evaluated switching to overlay2 after we upgraded to CentOS 7.4. Sadly, it is too much work. The partitions we could use for storing the data are XFS, and before 7.4, CentOS's default XFS formatting options missed one parameter (I forget which one) needed to support overlay2 on top. So we would have to reformat the partition in order to use overlay2 on top of XFS. That's where the switch to overlay2 would cost us too much work to avoid downtime, while the latest 7.4 kernel + Docker 17.09 and the above recommendations for the LVM configuration helped a lot in avoiding the problem most of the time. Note:
#34573 fix released in 17.09.1-ce, 17.12.0-ce versions |
@jcberthon We recently bit the bullet and made the transition to overlay2, and I'm so glad we did! Performance improved 40% in the benchmarks of our unit tests.
This is fixed in 17.12.1 Thanks all. |
Before the fixed release, rebooting the physical node will solve the problem.
@ravilr @KevinTHU Regarding your comments #27381 (comment) and #27381 (comment), I've observed that changing the docker unit file on RHEL to PrivateTmp=true also fixes the issue.
@MohdAhmad I have never tried that, but I think it may be OK, as PrivateTmp=true in the docker unit file applies to docker only; it may even fix this problem better.
I hit the same issue. In my case it was because I had the folder open; closing the window solved it.
Description

Cannot remove containers; docker reports Driver devicemapper failed to remove root filesystem. Device is busy. This leaves containers in Dead state.

Steps to reproduce the issue:

docker rm container_id

Describe the results you received:

Error message is displayed:

Error response from daemon: Driver devicemapper failed to remove root filesystem ce2ea989895b7e073b9c3103a7312f32e70b5ad01d808b42f16655ffcb06c535: Device is Busy

Describe the results you expected:

Container should be removed.

Additional information you deem important (e.g. issue happens only occasionally):

This started to occur after upgrading from 1.11.2 to 1.12.2 and happens occasionally (10% of removals).

Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):

All environments we run servers in - AWS, gcloud, physical, etc.