Docker does not free up disk space after container, volume and image removal #32420

Closed

neerolyte opened this issue Apr 6, 2017 · 59 comments

@neerolyte

Similar to #21925 (but it didn't look like I should post there).

Description

I have some Docker hosts running CI builds. All Docker data is removed from them nightly, but /var/lib/docker/overlay2 keeps consuming more space.

If I remove all docker data, e.g. I just did:

docker rm -vf $(docker ps -aq)
docker rmi -f $(docker images -aq)
docker volume prune -f
docker system prune -a -f

There are still a few GB tied up in /var/lib/docker/overlay2:

[root@*** docker]# du -sh /var/lib/docker/overlay2/
5.7G	/var/lib/docker/overlay2/

These files are not left over from a prior upgrade, as I upgraded and ran rm -rf /var/lib/docker/* yesterday.

Steps to reproduce the issue:

Unfortunately I don't have a simple, fast, shareable set of steps to reproduce this. Fortunately our CI nodes are reliably in this state each morning, so with some help we can probably get to a repro case.

Describe the results you received:

More space is consumed by /var/lib/docker/overlay2 over time despite all attempts to clean up docker using its inbuilt commands.

Describe the results you expected:

Some way to clean out image, container and volume data.

Additional information you deem important (e.g. issue happens only occasionally):

There's obviously some reference between /var/lib/docker/image and /var/lib/docker/overlay2, but I don't understand exactly what it is.

With docker reporting no images:

[root@*** docker]# docker images -aq
[root@*** docker]#

I can see an ID for one of the base images we built a lot of stuff on top of:

[root@*** docker]# find image/ | grep 89afeb2e357b
image/overlay2/distribution/diffid-by-digest/sha256/89afeb2e357b60b596df9a1eeec0b32369fddc03bf5f54ce246d52f97fa0996c

If I run something in that image, the output is weird:

[root@*** docker]# time docker run -it --rm ringo/scientific:6.8 true
Unable to find image 'ringo/scientific:6.8' locally
6.8: Pulling from ringo/scientific
89afeb2e357b: Already exists
Digest: sha256:cb016e92a510334582303b9904d85a0266b4ecdb176b68ccb331a8afe136daf4
Status: Downloaded newer image for ringo/scientific:6.8

real	0m3.305s
user	0m0.026s
sys	0m0.022s

Weird things about that output:

  • says the image isn't local
  • but then says 89afeb2e357b already exists
  • says "Downloaded newer image" but then runs far faster than it could have if it had actually downloaded the image

If I then delete all images again:

[root@*** docker]# docker rmi -f $(docker images -qa)
Untagged: ringo/scientific:6.8
Untagged: ringo/scientific@sha256:cb016e92a510334582303b9904d85a0266b4ecdb176b68ccb331a8afe136daf4
Deleted: sha256:dfb081d8a404885996ba1b2db4cff7652f8f8d18acab02f9b001fb17a4f71603
[root@*** docker]#

With docker stopped, I disable the current overlay2 dir:

[root@*** docker]# systemctl stop docker
[root@*** docker]# mv /var/lib/docker/overlay2{,.disabled}
[root@*** docker]# systemctl start docker

It does indeed error out looking for the overlay2 counterpart:

[root@*** docker]# time docker run -it --rm ringo/scientific:6.8 true
Unable to find image 'ringo/scientific:6.8' locally
6.8: Pulling from ringo/scientific
89afeb2e357b: Already exists
Digest: sha256:cb016e92a510334582303b9904d85a0266b4ecdb176b68ccb331a8afe136daf4
Status: Downloaded newer image for ringo/scientific:6.8
docker: Error response from daemon: lstat /var/lib/docker/overlay2/bb184df27a8fc64cb5a00a42cfe106961cc5152e6d0aba88b491e3b56315fbac: no such file or directory.
See 'docker run --help'.

real	0m3.053s
user	0m0.021s
sys	0m0.020s

Output of docker version:

[root@*** docker]# docker version
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:05:44 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:05:44 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

[root@*** docker]# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.10.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.638 GiB
Name: ***
ID: FXGS:5RTR:ASN7:KKB3:TVTN:PFWV:RHDY:XYMG:7RWK:CPG4:YNVB:TBIC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

oVirt VM in a company cloud running stock CentOS 7 and SELinux. Docker installed from docker.com packages.

@neerolyte
Author

I should also mention that based on #24023 I've switched to running overlay2 (instead of the default of overlay 1 on CentOS 7). The issue exists against both overlay and overlay2, so I think it's docker internals and not storage driver specific.

I've configured overlay2 by modifying daemon.json:

[root@*** ~]# cat /etc/docker/daemon.json
{
	"storage-driver": "overlay2",
	"storage-opts": [
		"overlay2.override_kernel_check=1"
	]
}

And yes, I cleaned out the old data before starting docker with the new driver, by running rm -rf /var/lib/docker/*.

@thaJeztah
Member

Some quick questions;

  • are you using docker-in-docker in your CI?
  • if so, are you sharing the /var/lib/docker directory between the "docker in docker" container and the host?

Also note that removing containers with -f can result in layers being left behind; when using -f, docker currently ignores "filesystem in use" errors and removes the container even if it failed to remove the actual layers. This is a known issue, and changing that behavior is being looked into.
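
A hedged diagnostic sketch for spotting such left-behind layers follows; the layerdb paths are an assumption based on the overlay2 driver's on-disk layout from that era and may differ between versions, so treat it as read-only exploration rather than a supported interface:

# overlay2 dirs present on disk (skip the "l" shortcut-symlink dir)
ls /var/lib/docker/overlay2 | grep -v '^l$' | sort > /tmp/overlay2-on-disk

# overlay2 dirs the daemon still references: image layers (cache-id) plus
# container read-write layers and their init layers (mount-id / init-id)
{
  for f in /var/lib/docker/image/overlay2/layerdb/sha256/*/cache-id \
           /var/lib/docker/image/overlay2/layerdb/mounts/*/mount-id \
           /var/lib/docker/image/overlay2/layerdb/mounts/*/init-id; do
    [ -f "$f" ] && { cat "$f"; echo; }
  done
} | sort -u > /tmp/overlay2-referenced

# anything on disk but unreferenced is a candidate leftover
comm -23 /tmp/overlay2-on-disk /tmp/overlay2-referenced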

@neerolyte
Author

are you using docker-in-docker in your CI?

No. Fairly vanilla builds - mostly with Rocker but nothing special at run time (we haven't even switched to data volumes yet as we only just upgraded to a version of docker that has the docker volume command).

Also note that removing containers with -f can result in layers being left behind; when using -f, docker currently ignores "filesystem in use" errors and removes the container even if it failed to remove the actual layers. This is a known issue, and changing that behavior is being looked into.

Generally we shouldn't actually have any containers to remove (as all containers are run with docker run --rm ...), but I'll put some extra code around the container cleanup to see if it's ever actually cleaning anything up, and flag that as a problem.

Got a link to the relevant bugs?

@neerolyte
Author

OK, I just double-checked what we're doing, and the logic has been:

  • on every CI build, docker rm -vf any containers before starting the build (to catch anything that might be dangling) - I've switched this over to just stopping them, as it's hard to detect problems here without them looking like test-suite failures
  • nightly clean up everything

For the container part, the "clean up everything" step has been doing the following (well, something slightly more complicated, as we were feature-detecting the -v flag on docker rm - but I've just deleted that code as all our nodes have the -v flag now):

containers=($(docker ps -a -q))
if [[ "${#containers[@]}" -gt 0 ]]; then
	echo "Removing containers: ${containers[@]}"
	if ! docker rm -vf "${containers[@]}"; then
		ok=false
	fi
fi

This doesn't look to ever find anything in nightly runs.

We also do:

images=($(docker images -q -a))
if [[ "${#images[@]}" -gt 0 ]]; then
	# because "docker rmi -f" might remove an image in the list while removing
	# its parent, we use "docker inspect" to check if the image actually still
	# exists before requesting removal (so we can separate out genuine docker
	# errors from issues with the removal order)
	for image in "${images[@]}"; do
		if docker inspect "$image" > /dev/null 2>&1; then
			echo "Removing image: $image"
			if ! docker rmi -f "$image"; then
				ok=false
			fi
		fi
	done
fi

# double check there's no images left
if ! [[ -z "$(docker images -q -a)" ]]; then
	echo "WARNING: there are still some images left behind..."
	ok=false
fi

Which obviously does find stuff, but never errors - maybe the problem is with our image clean up though?

@neerolyte
Author

@thaJeztah

Also note that removing containers with -f can result in layers being left behind; when using -f, docker currently ignores "filesystem in use" errors and removes the container even if it failed to remove the actual layers. This is a known issue, and changing that behavior is being looked into.

Could you please let me know of any actual bug numbers relating to this?

@Kenji-K

Kenji-K commented May 29, 2017

@thaJeztah Any update on this?
@neerolyte did you find a workaround?

@neerolyte
Author

@Kenji-K Not exactly. I stop the docker service nightly and rm -rf /var/lib/docker now, so at least it's "stable".

@cpuguy83
Member

Don't use docker rm -f... the -f causes Docker to ignore errors and remove the container from memory anyway.
This should be better in 17.06 (fewer errors and -f no longer ignores errors).
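
As an illustration of that advice (a minimal sketch; the container name is a placeholder), stopping first and then removing without -f lets unmount failures surface instead of being swallowed:

cid=my-ci-container                 # placeholder name
docker stop "$cid"
if ! docker rm -v "$cid"; then
    echo "rm failed for $cid - its filesystem may still be in use" >&2
fi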

@thaJeztah
Member

Ah @cpuguy83 beat me to it, but 17.06 includes a change to not remove the container if it fails to unmount/remove the filesystem; see #31012

@neerolyte
Author

@cpuguy83 @thaJeztah So just to clarify - is there actually some known safe way to clean up container and image data in any version of docker atm?

Because atm I'm stopping the service and just rm'ing stuff under the hood - but even with that I end up with overlay mounts dangling every now and then and have to actually reboot the box.

@cpuguy83
Member

@neerolyte if you have mounts hanging around (and are on a kernel > 3.15), most likely you have run a container where the docker root has been mounted into a container and it is holding onto the mount.
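
A hedged way to check for that situation (a sketch, not from this thread) is to list each running container's bind-mount sources and look for the docker root or the host root:

# flag running containers that bind-mount the host root ("/") or the
# docker root ("/var/lib/docker") into themselves (exact-path matches only)
docker ps -q \
  | xargs -r docker inspect --format '{{.Name}}: {{range .Mounts}}{{.Source}} {{end}}' \
  | grep -E ' / | /var/lib/docker '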

@neerolyte
Author

I assume "most likely you have run a container where the docker root has been mounted into a container and it is holding onto the mount" would require running Docker-in-docker or something similar?

I'm not doing any complex containers at all, all data is housed entirely inside the container, we're not even using volumes.

The kernel is a little older because that's all we can get on CentOS 7 - 3.10.0-514.21.1.el7.x86_64 - do you have any reference to the specific bug (sometimes Red Hat backports fixes into EL)?

@cpuguy83
Member

@neerolyte I don't have a reference to a specific bug, but it's fixed in the upstream kernel at around 3.15. Supposed to be fixed in the upcoming RHEL 7.4 kernel.

One potential way to fix the issue is to use MountFlags=slave in the systemd unit for dockerd.
It may not fix all cases, but probably some (or most).
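
For reference, a minimal sketch of that systemd change (the drop-in file name is illustrative); after creating it, run systemctl daemon-reload followed by systemctl restart docker:

# /etc/systemd/system/docker.service.d/mount-flags.conf
[Service]
MountFlags=slave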

Another option is deferred device removal/deletion (devicemapper config). This doesn't really fix it, as it will still get the busy error, but it will not return the error to the user and will instead keep retrying the removal periodically in the background until it succeeds.
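
A sketch of those devicemapper options in daemon.json, only applicable when the devicemapper storage driver is in use (the reporter here is on overlay2, so this is just for completeness):

{
	"storage-driver": "devicemapper",
	"storage-opts": [
		"dm.use_deferred_removal=true",
		"dm.use_deferred_deletion=true"
	]
}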

@neerolyte
Author

Supposed to be fixed in the upcoming RHEL 7.4 kernel.

Ok I'll recheck when that's available.

Also using deferred device removal/deletion (devicemapper config).

I'm using overlay2, not devicemapper.

@milipili

milipili commented Sep 5, 2017

I have the problem as well. It seems that when Docker encounters a "no space left on device" error, it is no longer able to reclaim space.

@milipili

milipili commented Sep 5, 2017

The only possible solution is to stop the service, then delete /var/lib/docker/* manually. Seriously, this "product" never works correctly...

@boaz0
Member

boaz0 commented Sep 5, 2017

@milipili did you try cpuguy83's comment?

@jostyee

jostyee commented Sep 6, 2017

@ripcurld0 We're running Ubuntu 16.04 w/ 4.9.0-040900-generic kernel, still seeing this issue.

@markine

markine commented Sep 6, 2017

@jostyee FYI Ubuntu 16.04.3 LTS with 4.4.0-1030-aws kernel with aufs instead of overlay/overlay2 seems to run stable.

...
Server Version: 1.12.6
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 152
 Dirperm1 Supported: true
...

@jostyee

jostyee commented Sep 6, 2017

@markine @ripcurld0 Sorry, my bad, it was an unrelated issue for us; overlay2 is fine here.

@milipili

milipili commented Sep 7, 2017

@ripcurld0 I don't know. We don't use CentOS or RHEL (Ubuntu 16.04, latest patches). It happens with both AUFS and devicemapper (overlay, I don't know) every time the partition runs out of space. We never use -f (and anyway, if it is not safe for some reason, it should not be available). So yeah, nuking docker from time to time is currently our only option. But recently, in another project with Docker Swarm, we had to restart docker because the nodes were no longer able to communicate (even after removing and recreating the service). So I guess we're getting used to downtime...

@sheerun

sheerun commented Dec 3, 2017

Same issue, but the directory is /var/lib/docker/overlay ... I'm unable to clean it in any way; I even tried restarting. Docker 1.12.6 (for Kubernetes).

@sheerun

sheerun commented Dec 3, 2017

rm -rf /var/lib/docker/tmp/* helped a bit (18 GB of files with names like GetImageBlob537029535)

@wannymiarelli

I found a lot of unused images causing this issue; resolved it by running

docker rmi $(docker images -q)

@thaJeztah
Member

@wannymiarelli that's expected; images, containers, and volumes take up space. Have a look at docker system prune for easier cleanup than the command you showed.
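
For reference, the main prune variants (behaviour summarized from memory; confirm with docker system prune --help on your version):

docker system prune            # remove stopped containers, dangling images, unused networks
docker system prune -a         # as above, plus all images not used by at least one container
docker system prune --volumes  # additionally remove unused local volumes (not pruned by default)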

@wannymiarelli

@thaJeztah sure! I was just saying that using docker rmi cleaned up the overlay2 folder correctly. Like @jostyee, I'm sorry - I actually have no issue with the overlay folder.

@neerolyte
Author

At some point this situation has stabilised a lot for us, but I'm not entirely sure when.

We're running Docker version 17.09.0-ce, build afdb6d4 now, with a significantly bigger workload than when I lodged this issue, and I'm only having to nuke /var/lib/docker every few weeks (it's done automatically based on free space) instead of nightly.

We only clean up containers and untagged images regularly, I suspect our only blowout now is when we're changing something in the underlying stack (which generates new image tags).

I'd still appreciate docs somewhere on what different parts under /var/lib/docker/overlay2/ are doing, but think it's reasonable for that to be in a separate issue.

TLDR - happy for this to close.

@thaJeztah
Member

Thanks @neerolyte, let me go ahead and close this one 👍

@cpuguy83
Member

@chr0n1x The only mounts that should exist in /var/lib/docker are mounts for running containers. If docker is not running there then you should be able to safely unmount them.
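
A hedged sketch of that cleanup (it assumes the daemon really is stopped and nothing else is using paths under the docker root):

systemctl stop docker
# unmount anything still mounted under the docker root, deepest paths first
grep ' /var/lib/docker' /proc/mounts | awk '{print $2}' | sort -r | xargs -r -n1 umount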

@chr0n1x

chr0n1x commented Oct 17, 2018

@cpuguy83 good to know, thanks for clearing that up for me

@j-kaplan

@cpuguy83 We just tried upgrading from 17.03.2 to 17.12.1-ce and we're still seeing this issue, even after cleaning up layers leaked by prior versions.

@cpuguy83
Member

@j-kaplan please explain "we are still seeing this issue"

@j-kaplan

j-kaplan commented Oct 24, 2018

@cpuguy83 It appears that docker is somehow holding onto data on the filesystem and the only way to clear the space is with a reboot.

# df -ah
Filesystem      Size  Used Avail Use% Mounted on
sysfs              0     0     0    - /sys
proc               0     0     0    - /proc
udev            3.9G     0  3.9G   0% /dev
devpts             0     0     0    - /dev/pts
tmpfs           798M  708K  797M   1% /run
/dev/vda1        78G   64G   14G  83% /
securityfs         0     0     0    - /sys/kernel/security
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
cgroup             0     0     0    - /sys/fs/cgroup/unified
cgroup             0     0     0    - /sys/fs/cgroup/systemd
pstore             0     0     0    - /sys/fs/pstore
cgroup             0     0     0    - /sys/fs/cgroup/freezer
cgroup             0     0     0    - /sys/fs/cgroup/perf_event
cgroup             0     0     0    - /sys/fs/cgroup/net_cls,net_prio
cgroup             0     0     0    - /sys/fs/cgroup/rdma
cgroup             0     0     0    - /sys/fs/cgroup/memory
cgroup             0     0     0    - /sys/fs/cgroup/blkio
cgroup             0     0     0    - /sys/fs/cgroup/pids
cgroup             0     0     0    - /sys/fs/cgroup/hugetlb
cgroup             0     0     0    - /sys/fs/cgroup/devices
cgroup             0     0     0    - /sys/fs/cgroup/cpuset
cgroup             0     0     0    - /sys/fs/cgroup/cpu,cpuacct
systemd-1          -     -     -    - /proc/sys/fs/binfmt_misc
mqueue             0     0     0    - /dev/mqueue
debugfs            0     0     0    - /sys/kernel/debug
hugetlbfs          0     0     0    - /dev/hugepages
fusectl            0     0     0    - /sys/fs/fuse/connections
configfs           0     0     0    - /sys/kernel/config
/dev/vda15      105M  3.4M  102M   4% /boot/efi
nsfs               0     0     0    - /run/docker/netns/default
binfmt_misc        0     0     0    - /proc/sys/fs/binfmt_misc
tracefs            0     0     0    - /sys/kernel/debug/tracing
/dev/vda1        78G   64G   14G  83% /var/lib/docker/overlay2
tmpfs           798M     0  798M   0% /run/user/1000
# du -sh /var/lib/docker
2.1G	/var/lib/docker
# docker system prune -a -f
Total reclaimed space: 0 B

# docker volume prune -f
Total reclaimed space: 0 B
# docker ps -a 
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

I can't even seem to locate the open files with lsof:

# lsof | grep /var/lib/docker/overlay2

# lsof -nP | grep '(deleted)'

Edit: I dragged a coworker into this and they discovered it seems to be related to loopback devices:

losetup -a
/dev/loop15: [64513]:1345436 (/volumes.img (deleted))
/dev/loop1: [64513]:1300832 (/volumes.img (deleted))
/dev/loop13: [64513]:1345483 (/volumes.img (deleted))
/dev/loop11: [64513]:1345506 (/volumes.img (deleted))
/dev/loop8: [64513]:1347832 (/volumes.img (deleted))
/dev/loop6: [64513]:1301091 (/volumes.img (deleted))
/dev/loop4: [64513]:1301100 (/volumes.img (deleted))
/dev/loop2: [64513]:1301128 (/volumes.img (deleted))
/dev/loop14: [64513]:1345481 (/volumes.img (deleted))
/dev/loop0: [64513]:1302159 (/volumes.img (deleted))
/dev/loop12: [64513]:1345484 (/volumes.img (deleted))
/dev/loop9: [64513]:1347797 (/volumes.img (deleted))
/dev/loop10: [64513]:1345486 (/volumes.img (deleted))
/dev/loop7: [64513]:1347855 (/volumes.img (deleted))
/dev/loop5: [64513]:1301099 (/volumes.img (deleted))
/dev/loop3: [64513]:1300882 (/volumes.img (deleted))

Running this has cleared up all the unaccounted-for disk space:

for i in `losetup -a | grep deleted | awk '{print $1}' | sed 's/:$//'`; do losetup -d $i; done

Edit 2: To anybody else who might stumble upon this thread: if you're running Concourse in k8s and your k8s nodes are leaking loopback devices like I saw above, it is most likely Concourse. It appears that this leaking happens when a Concourse worker pod that was using the btrfs baggageclaim driver terminates. Switching to the naive baggageclaim driver appears to prevent the issue entirely.

@cpuguy83
Member

@j-kaplan The space used in your df output has nothing to do with docker disk usage. It is the full disk usage from the root mount (see how it is identical to /dev/vda1 78G 64G 14G 83% /).
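
As an aside (not from this thread), docker's own view of its disk usage is easier to read with docker system df:

docker system df -v   # per-image / per-container / per-volume breakdown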

It also seems like you have /var/lib/docker/overlay2 mounted to itself, something that old docker versions did but 17.12.1 should not be doing. Are you sure you've upgraded?

These loopback devices are not related to docker.

@j-kaplan

These loopback devices are not related to docker.

I wasn't sure what was creating the loopback devices, since these nodes are running docker for use as Kubernetes nodes. I will dig into that side of the world to see what's going on there.

Thanks for your help.

@arturopie

We are running into this issue too:

# docker --version
Docker version 18.03.1-ce, build 9ee9f40
# docker system prune -a -f
Total reclaimed space: 0B
# docker volume prune -f
Total reclaimed space: 0B
# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
# docker images -a
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
# lsof | grep /var/lib/docker/overlay2

# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1.9G   48K  1.8G   1% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
/dev/nvme0n1p1  493G  120G  372G  25% /
# du -sh /var/lib/docker
14G	/var/lib/docker
# du -h --max-depth 1 /var/lib/docker
4.0K	./trust
56K	./network
28K	./volumes
20K	./builder
14G	./overlay2
4.0K	./tmp
4.0K	./swarm
4.0K	./runtimes
120K	./containerd
20K	./plugins
201M	./image
4.0K	./containers
14G	.

I think this issue should be reopened. Let me know if there is any other info I can provide to debug this issue.

@j-kaplan

@arturopie Do you happen to be running Concourse in k8s? We're in the middle of tracking down a lead but it appears that is what is leaking the loopback devices.

@arturopie

@j-kaplan we are not running k8s.

I don't think we have any loopback device:

# losetup -a

@Dalzhim

Dalzhim commented Oct 26, 2018

Just like @arturopie, I am running in the same issue:

# docker --version
Docker version 18.03.1-ce, build 9ee9f40
# docker system prune -a -f
Total reclaimed space: 0 B

# docker volume prune -f
Total reclaimed space: 0 B
# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
# docker images -a
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
# lsof | grep /var/lib/docker/overlay2

# df -h
udev                        5.8G     0  5.8G   0% /dev
tmpfs                       1.2G  536K  1.2G   1% /run
/dev/mapper/aw--tests-root   90G   68G   18G  80% /
tmpfs                       5.9G     0  5.9G   0% /dev/shm
tmpfs                       5.0M     0  5.0M   0% /run/lock
tmpfs                       5.9G     0  5.9G   0% /sys/fs/cgroup
/dev/xvda1                  228M  155M   61M  72% /boot
tmpfs                       1.2G     0  1.2G   0% /run/user/1002
# sudo du -sh /var/lib/docker
1.4T	/var/lib/docker
# sudo du -h --max-depth 1 /var/lib/docker
86M	/var/lib/docker/image
4.0K	/var/lib/docker/runtimes
1.4T	/var/lib/docker/overlay2
448K	/var/lib/docker/network
348K	/var/lib/docker/containerd
4.0K	/var/lib/docker/trust
20K	/var/lib/docker/plugins
333M	/var/lib/docker/volumes
211M	/var/lib/docker/swarm
4.0K	/var/lib/docker/tmp
4.0K	/var/lib/docker/containers
20K	/var/lib/docker/builder
1.4T	/var/lib/docker

I'm also not running k8s and I don't have any loopback devices.

# losetup -a

What seems to be keeping this zombie disk space in use is that I have about 1500 mounts of elements in docker's overlay filesystem. Here's an excerpt:

# mount
overlay on /var/lib/docker/overlay2/4061c2b085c1bb9dc5c5eabe857446290f799ac9b7bf2645917c83cdb63b6d3b/merged type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4BMOONDLWUQ7ZWQXT3JORDKP7F:/var/lib/docker/overlay2/l/CGG6QC36WMFOWO6MAIHYWFYQ47:/var/lib/docker/overlay2/l/RYLPC32FMHRGH6VUZ4UDRVLSJV:/var/lib/docker/overlay2/l/VXZQOGGYK5C2KX27M7UAIZTCCQ:/var/lib/docker/overlay2/l/VXLVIIW6CAS2HNJO6NEHZLKBGM:/var/lib/docker/overlay2/l/QMMXIF2XL5PZJ5MFZXYBTK6RJS:/var/lib/docker/overlay2/l/4RNQ7THRJV7TWKOBAVC5XA5DEK,upperdir=/var/lib/docker/overlay2/4061c2b085c1bb9dc5c5eabe857446290f799ac9b7bf2645917c83cdb63b6d3b/diff,workdir=/var/lib/docker/overlay2/4061c2b085c1bb9dc5c5eabe857446290f799ac9b7bf2645917c83cdb63b6d3b/work)
overlay on /var/lib/docker/overlay2/a76b69897bd1b592fdb8790acd874e90863e8117b893703d8076f2b0a6b9fc13/merged type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4BMOONDLWUQ7ZWQXT3JORDKP7F:/var/lib/docker/overlay2/l/CGG6QC36WMFOWO6MAIHYWFYQ47:/var/lib/docker/overlay2/l/RYLPC32FMHRGH6VUZ4UDRVLSJV:/var/lib/docker/overlay2/l/VXZQOGGYK5C2KX27M7UAIZTCCQ:/var/lib/docker/overlay2/l/VXLVIIW6CAS2HNJO6NEHZLKBGM:/var/lib/docker/overlay2/l/QMMXIF2XL5PZJ5MFZXYBTK6RJS:/var/lib/docker/overlay2/l/4RNQ7THRJV7TWKOBAVC5XA5DEK,upperdir=/var/lib/docker/overlay2/a76b69897bd1b592fdb8790acd874e90863e8117b893703d8076f2b0a6b9fc13/diff,workdir=/var/lib/docker/overlay2/a76b69897bd1b592fdb8790acd874e90863e8117b893703d8076f2b0a6b9fc13/work)
overlay on /var/lib/docker/overlay2/9c3a878fe52f359b8e81540a8abb4ca5aaacf10c7d509be84d24e3dbc22a143b/merged type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4BMOONDLWUQ7ZWQXT3JORDKP7F:/var/lib/docker/overlay2/l/CGG6QC36WMFOWO6MAIHYWFYQ47:/var/lib/docker/overlay2/l/RYLPC32FMHRGH6VUZ4UDRVLSJV:/var/lib/docker/overlay2/l/VXZQOGGYK5C2KX27M7UAIZTCCQ:/var/lib/docker/overlay2/l/VXLVIIW6CAS2HNJO6NEHZLKBGM:/var/lib/docker/overlay2/l/QMMXIF2XL5PZJ5MFZXYBTK6RJS:/var/lib/docker/overlay2/l/4RNQ7THRJV7TWKOBAVC5XA5DEK,upperdir=/var/lib/docker/overlay2/9c3a878fe52f359b8e81540a8abb4ca5aaacf10c7d509be84d24e3dbc22a143b/diff,workdir=/var/lib/docker/overlay2/9c3a878fe52f359b8e81540a8abb4ca5aaacf10c7d509be84d24e3dbc22a143b/work)
overlay on /var/lib/docker/overlay2/96efd205e1fa795ab8f90c19aecab2cf71270a5901ff2546ca8926ac06925953/merged type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4BMOONDLWUQ7ZWQXT3JORDKP7F:/var/lib/docker/overlay2/l/CGG6QC36WMFOWO6MAIHYWFYQ47:/var/lib/docker/overlay2/l/RYLPC32FMHRGH6VUZ4UDRVLSJV:/var/lib/docker/overlay2/l/VXZQOGGYK5C2KX27M7UAIZTCCQ:/var/lib/docker/overlay2/l/VXLVIIW6CAS2HNJO6NEHZLKBGM:/var/lib/docker/overlay2/l/QMMXIF2XL5PZJ5MFZXYBTK6RJS:/var/lib/docker/overlay2/l/4RNQ7THRJV7TWKOBAVC5XA5DEK,upperdir=/var/lib/docker/overlay2/96efd205e1fa795ab8f90c19aecab2cf71270a5901ff2546ca8926ac06925953/diff,workdir=/var/lib/docker/overlay2/96efd205e1fa795ab8f90c19aecab2cf71270a5901ff2546ca8926ac06925953/work)

Unfortunately, even after cleaning up the mounts, no disk space is being freed and I still have to do a manual cleanup. The reboot doesn't help.

# sudo umount -a --types overlay

@arturopie

In my case, I don't have any overlay mount:

# mount | grep overlay

@cpuguy83
Member

If you still have layers sitting in the graph driver directory (and no images), then most likely these are from older versions of docker. Docker does not clean out layers that were left behind by an error in some previous version.

This would have happened on an older version of Docker when doing "docker rm -f ", and the layer could not be removed due to a mount leak.

@arturopie

@cpuguy83 we never upgrade Docker on the same machine, we create a new machine with an empty disk when we upgrade Docker, so I'm sure that's not the issue in our case.

@RyanGuyCode

I don't know if everyone is gone, but here are some tips and tricks. Just make the docker system cleanup a cron job: https://nickjanetakis.com/blog/docker-tip-32-automatically-clean-up-after-docker-daily

Start by finding which directory is the culprit:

du -hx --max-depth=1 /
and
df -h

For me the culprit was docker: /var/lib/docker/overlay2/
Short term this works:
docker system prune -a -f

Long term, run:
crontab -e

and insert this:
0 3 * * * /usr/bin/docker system prune -f

@mjramtech

We're seeing this with docker 18.09.6 on a fairly recently built server where docker definitely hasn't been upgraded.

Why has this not been reopened? The OP only agreed for it to be closed because the issue "went away"; there are plenty of perfectly valid subsequent reports.

@bmulcahy

bmulcahy commented Sep 8, 2019

Just encountered this same issue on Ubuntu Disco Dingo.

The issue was related to a running container not being stopped until it was force-stopped. After force-stopping the container I was able to delete the image, and then docker system prune -v -f ran and cleaned up all the overlay2 bloat.

@neerolyte
Author

@mjramtech Everyone thinks they have the same issue, but it's clear a lot of the noise is from completely unrelated issues, e.g. @bmulcahy: prune removes unused data; if you have a running container, the associated image is not unused.

If someone can come up with a repro case with open data, it's worth opening a new issue imo. I can't repro any more and I could never repro with data I could share.

@Dmitry1987

We have the same issue on Docker version 19.03.5, build 633a0ea838.
The docker daemon was restarted and the container recreated (it had only node-exporter running!); lsof can't find any deleted open files, and the container's PID 1 doesn't show any open files.
df was showing the usage [screenshot not reproduced here], while du thinks differently about the usage - it doesn't see the 82 GB in use.
But even after deleting the folder that shows up in df (I didn't care about recreating node-exporter; it was its folder), the space wasn't freed, and the mount for /var shows this:

mount | grep var
/dev/nvme1n1p1 on /var type ext4 (rw,noatime,data=ordered)
/var/lib/snapd/snaps/core_8268.snap on /snap/core/8268 type squashfs (ro,nodev,relatime)
/var/lib/snapd/snaps/amazon-ssm-agent_1480.snap on /snap/amazon-ssm-agent/1480 type squashfs (ro,nodev,relatime)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
/var/lib/snapd/snaps/amazon-ssm-agent_1816.snap on /snap/amazon-ssm-agent/1816 type squashfs (ro,nodev,relatime)
/var/lib/snapd/snaps/core_8592.snap on /snap/core/8592 type squashfs (ro,nodev,relatime)

any ideas why "df" calculates used space in a way other tools can't find the invisible files/ghosts? Should I inspect the filesystem somehow with advanced tools to understand what's going on?

@rahulsoni43

@Dmitry1987 I notice the same issue. FS size is 58 GB and overlay2 is 60GB+

@Dmitry1987

@thaJeztah it feels like the issue should be reopened :D

@Dmitry1987

Docker version 19.03.5 is fairly new, but it still has the problem reported in 2017...

@thaJeztah
Member

I don't want to reopen this issue, because it became somewhat of a "kitchen sink" of "possibly related, but could be different" issues. Some issues were fixed, and other reports really need more details to investigate whether there are still issues to address. It's better to start a fresh ticket (but feel free to link to this issue).

Looking at your earlier comment

But even after deleting the folder that shows up in df (I didn't care about recreating node-exporter; it was its folder), the space wasn't freed, and the mount for /var shows this

First of all, I really don't recommend manually removing directories from under /var/lib/docker as those are managed by the docker daemon, and removing files could easily mess up state. If that directory belonged to a running container, then removing it won't actually remove the files until they're no longer in use (and/or unmounted). See https://unix.stackexchange.com/a/68532

I see you're mentioning you're running node-exporter (https://github.com/prometheus/node_exporter), which can bind-mount the whole system's filesystem into the container. Possibly depending on mount-propagation settings, this can be problematic. If you bind-mount /var/lib/docker into a container, that bind-mount can "pull in" all mounts from other containers into that container's mount namespace, which means that none of those containers' filesystems can be removed until the node-exporter container is removed (files unmounted). I believe there were also some kernel bugs that could result in those mounts never being released.
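
As an illustration of sidestepping that propagation trap when running node-exporter (a sketch loosely based on node_exporter's usual documentation, not on anything in this thread; verify the flags against your versions):

docker run -d --name node-exporter \
  --mount type=bind,source=/,target=/host,readonly,bind-propagation=rslave \
  prom/node-exporter \
  --path.rootfs=/host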

As to differences between df and du, I'd have to do testing, but the combination of mounts being in different namespaces, together with overlay filesystems could easily lead to inconsistencies in reporting the actual space used (e.g. multiple containers sharing the same image would each mount the same image layers, but tools could traverse those mounts and account their size multiple times).

I see some mention of snaps and lxc in your comment; this could be unrelated, but if you installed docker using the snap packages, those packages are maintained by Canonical, and I've seen many reports of those being problematic; I'd recommend (if possible) to test if it also reproduces on the official packages (https://docs.docker.com/engine/install/ubuntu/)

(Per earlier comments above) it's possible that files are still in use (which could be by stopped containers or untagged images); in that case, the disk use may be legitimate. If possible, verify whether disk space does go down after removing all images and containers.

If you think there's a bug at hand, and have ways to reproduce the issue so that it can be looked into, feel free to open a new ticket, but when doing so;

  • at least provide the details that are requested in the issue template
  • provide exact steps to reproduce the issue (but try to write a minimal case to reproduce so that it could be used for writing a test if needed)

@Dmitry1987

Thank you for the suggestions @thaJeztah, it does make sense to report new specific cases with full details. The mount-namespace information is new to me; I will try to learn more about that. But what do you mean by

If you bind-mount /var/lib/docker into a container, that bind-mount can "pull in" all mounts from other containers into that container's mount namespace

is it that all mounts defined by all containers will be locked by the one which mounts all of /var/lib/docker or /var, for example, and the disk space can be affected by that? (The kernel or filesystem won't be able to release disk space from files deleted in these folders, or something?)

@thaJeztah
Member

is it that all mounts defined by all containers will be locked by the one which mounts all of /var/lib/docker or /var, for example, and the disk space can be affected by that? (The kernel or filesystem won't be able to release disk space from files deleted in these folders, or something?)

I'm a bit "hazy" on the exact details (I know @kolyshkin and @cpuguy83 dove more deeply into this when debugging some "nasty" situations), but my "layman explanation" is that "container A" has mounts in it's own namespace (and thus only visible within that namespace), now if "container B" mounts those paths, those paths can only be unmounted if both "container B" and "container A" unmount them. But things can become more tricky than that; if "container A" has mounts with mount-propagation set (slave? shared?), the mounts of "container B" will also propagate to "container A", and now there's an "infinite loop" (container B's mounts cannot be unmounted until container A's mounts are unmounted, which cannot be unmounted until container A's mounts are unmounted).

@cyril-bouthors

I was able to recover 32 million inodes and 500GB of storage in /var/lib/docker/overlay2/ by removing unused GitLab Docker images and containers:

docker system prune -fa
