Multiple parallel `docker build` runs leak disk space that can't be recovered (with reproduction)
#46136
Comments
There are lots of other bug reports about leaked disk space that may or may not be related:
Thanks for reporting; I'm indeed not sure whether all of those are directly related. For some extra context: these builds are running with BuildKit enabled (so not the classic builder)?
Yes, using BuildKit.
I've confirmed this is the case. It reproduces in all recent Moby versions, but does not reproduce with buildx + the container driver.
As a workaround, I configured all affected machines to run only one CI job with no concurrency. Since then I have not experienced "leaking" disk space any more. A `docker system prune` cron job always cleans up everything.
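The prune-on-a-schedule workaround described above could look something like the cron entry below. This is a sketch, not the poster's actual configuration: the schedule, file path, and log destination are all assumptions (the comment only says "a cron job").

```shell
# /etc/cron.d/docker-prune  (hypothetical path; any root crontab works)
# Run nightly at 03:00 -- the schedule is an assumption. Removes all
# stopped containers, unused images, build cache, and volumes.
0 3 * * * root docker system prune -af --volumes >> /var/log/docker-prune.log 2>&1
```

Note that per this thread, a prune like this is only reliable alongside the "one CI job at a time" restriction: space leaked by parallel builds is precisely the space that `docker system prune` fails to recover.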
I played around with this one and noticed multiple occurrences of this message in the log:
I tracked down where this message comes from and tried a dirty workaround with a retry mechanism:

```diff
diff --git a/vendor/github.com/moby/buildkit/source/local/local.go b/vendor/github.com/moby/buildkit/source/local/local.go
index d2cd9d989c..f6a855d053 100644
--- a/vendor/github.com/moby/buildkit/source/local/local.go
+++ b/vendor/github.com/moby/buildkit/source/local/local.go
@@ -133,7 +133,13 @@ func (ls *localSourceHandler) snapshot(ctx context.Context, caller session.Calle
 		return nil, err
 	}
 	for _, si := range sis {
-		if m, err := ls.cm.GetMutable(ctx, si.ID()); err == nil {
+		m, err := ls.cm.GetMutable(ctx, si.ID())
+		for err != nil && errors.Is(err, cache.ErrLocked) {
+			time.Sleep(time.Millisecond * 1)
+			m, err = ls.cm.GetMutable(ctx, si.ID())
+		}
+
+		if err == nil {
 			bklog.G(ctx).Debugf("reusing ref for local: %s", m.ID())
 			mutable = m
 			break
```

and it seems to resolve the issue. It's not a real solution though, because it seems to cause deadlocks in subsequent builds. Maybe the cache created here is not properly cleaned up later? I'm not very familiar with the BuildKit code, but maybe this can point @tonistiigi in the right direction?
@vvoland No, this shouldn't be related. The destination for the local context being locked is expected during parallel requests; it ensures that one build does not overwrite the context files of another. The strategy is to just use a new directory when this happens, instead of sleeping until the older build completes and releases its references.
I debugged this, and the reason data leaks seems to be an incorrect reference count in this code path:

Line 438 in fa517bb

It's not entirely clear to me yet why this is the case, or why parallel builds affect it, but after switching to #45966 (to make sure I'm not chasing changes in an old version) the issue does not reproduce anymore. So #45966 fixes this issue. If there is a desire to understand the behavior in other versions better, I can look into it more. My best guess is that it is related to the namespace changes in v0.12. cc @neersighted
Could it be that when buildkitd builds multiple images in parallel that share common layers, like in the repro from @intgr, layers get rebuilt (because the shared destination is locked) and are then not reference counted, thus leaking storage? I am asking because I am experiencing common layers getting rebuilt "randomly", even though everything is cached using a registry.

@tonistiigi When common layers are locked, is it supposed to just rebuild that layer instead of waiting for it to get unlocked?
Yes. Also, this is not a layer, but a destination directory for the context upload. If you are debugging this issue, then try #45966, per #46136 (comment).
@tonistiigi The fix is part of BuildKit v0.12, so it will be in Moby v25.0, correct? Or is there a fix that can be backported to v0.11 that you know of?
Only v0.12, AFAIK.
I'm still seeing this exact issue with Docker 25 / BuildKit 0.12, using the repro script from the original post. Docker thinks everything is pruned, but overlay2 keeps growing almost every time I run the script. It only occurs if builds are run in parallel.

Docker version:

Initial state:

After several runs of the script, with parallel disabled:

After several runs of the script, with parallel enabled:

Not sure about the etiquette here (whether I'm supposed to open a new issue or just post to this one), but it doesn't seem like it is actually fixed in Docker 25 / BuildKit 0.12.
Looks like another fix was applied for this issue and will be part of the next release. |
@darintay Please test again with 26.0.0-rc.2 |
Did not seem to help. I'm using Docker version 26.0.0, build 2ae903e, and have to purge

This change is a recent one: I doubled the image build frequency to every 4 hours as a first naive workaround on March 23. Prior to that, I had been running these builds without problems (with two weeks, not just two days, of retention) for a few years.
I reran @intgr's script from the original post with Docker 25.0.5, which contains the fix (https://github.com/moby/moby/releases/tag/v25.0.5). The issue no longer reproduces.

As for a more real-world repro: my parallel builder machine has been much happier since the upgrade. I haven't had a chance to prune to zero recently to inspect it, but if it is still leaking, it is an order of magnitude less than before.
@mirekphd Can you reproduce the leak in your setup with the script provided in the original issue post? If not, you're probably hitting a different bug that also leaks disk space, in which case I can only suggest trying to boil your Docker build down to a minimal reproduction, like I did, and reporting it as a new issue. Without a repro, unfortunately, probably nothing will be done, as with all the other non-specific issues listed here: #46136 (comment)
Description
My CI machines, which run lots of `docker build` and `docker run` commands, often in parallel, keep running out of disk space. I have figured out that when running multiple `docker build` commands in parallel, Docker loses track of some directories and files it creates under the `/var/lib/docker/overlay2` directory. This issue does not occur when the build commands are run in sequence (e.g. by removing the trailing `&` in `repro.sh`).

After the build, despite running `docker system prune -af --volumes` to delete all build cache/artifacts and using `docker system df` to verify that there should be no disk space in use, the size of Docker's `overlay2` directory grows every time, with no limit.

Reproduce
I have published a shell script and Dockerfile that systematically reproduce this issue at https://github.com/intgr/bug-reports/tree/main/docker-build-disk-space-leak
Run the `./repro.sh` shell script multiple times and notice the `overlay2` directory increasing in size.

The script needs to run `docker` commands and uses `sudo` to monitor the size of the `overlay2` directory. It can be tested, for example, in the public playground at https://labs.play-with-docker.com/

```shell
git clone https://github.com/intgr/bug-reports
cd bug-reports/docker-build-disk-space-leak
./repro.sh
```
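For readers who can't fetch the linked repository, here is a minimal sketch of the pattern the repro exercises as described above: several concurrent builds, a full prune, then a direct look at `overlay2`. This is not the actual `repro.sh`; the image names, build count, and `--no-cache` flag are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of the parallel-build leak pattern (NOT the real repro.sh,
# which lives in the linked repository).

OVERLAY_DIR=/var/lib/docker/overlay2

run_parallel_builds() {
    # The trailing '&' is the key ingredient: per the report, the leak
    # only occurs when builds overlap, not when they run sequentially.
    for i in 1 2 3 4; do
        docker build --no-cache -t "leak-test-$i" . &
    done
    wait
}

measure_leak() {
    # Docker believes everything has been reclaimed...
    docker system prune -af --volumes
    docker system df
    # ...but the on-disk size of overlay2 keeps growing regardless.
    sudo du -sh "$OVERLAY_DIR"
}

# Only exercise a real daemon when a build context is present.
if command -v docker >/dev/null 2>&1 && [ -f Dockerfile ]; then
    run_parallel_builds
    measure_leak
fi
```

Comparing `docker system df` (which reports zero bytes in use after the prune) against `du -sh` on the `overlay2` directory is what exposes the leak.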
Example output when running the script
Notice that the ACTUAL number of items and disk space keeps growing every time `./repro.sh` is run, despite Docker reporting 0 bytes used.

Expected behavior
When I delete all containers, images, volumes, caches, everything, then Docker's disk usage should return to near what it uses after a clean installation.
docker version
```
Client:
 Version:           24.0.2
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        cb74dfc
 Built:             Thu May 25 21:50:49 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       659604f
  Built:            Thu May 25 21:35:04 2023
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          v1.7.1
  GitCommit:        1677a17964311325ed1c31e2c0a3589ce6d5c30d
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```
docker info
Additional Info
No response