Cache --mount=bind output smartly #2821

Open
zen0wu opened this issue Apr 23, 2022 · 2 comments

Comments

zen0wu commented Apr 23, 2022

When I first saw --mount=bind, I assumed it was a (much) better version of COPY/ADD: that it would figure out which files my later command actually "touches" and put only those files in the cache layer.

This would be useful in a monorepo, for example, where I want to copy a small subset of files from all my modules into the image. I could just run cp with a glob, and only those files would be copied. There are some caveats here, but I assume cp won't actually read the contents of the other files (it only stats them).

What currently happens is that the entire mount source is treated as cache input, so the layer cache is busted whenever any file there changes (even if that file is not what's being copied into the image), as mentioned here: moby/moby#15858 (comment)
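To illustrate the difference, here is a minimal Python sketch, not BuildKit's actual checksum code. It assumes a simplified digest over relative paths and file contents only (no metadata); `tree_digest` and its `only` parameter are hypothetical names. Hashing the whole mount source changes when an unrelated file appears, while hashing only the file the command uses does not.

```python
import hashlib
import os
import tempfile


def tree_digest(root, only=None):
    """Hash relative paths and contents under root.

    `only` is an optional set of relative paths; when given, every other
    file is ignored. This is a simplified model, not BuildKit's algorithm.
    """
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if only is not None and rel not in only:
                continue
            with open(full, "rb") as f:
                data = f.read()
            h.update(rel.encode() + b"\0")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for name in ("a", "x"):
            open(os.path.join(root, name), "w").close()
        whole_before = tree_digest(root)
        used_before = tree_digest(root, only={"a"})

        # Add a file that the build command never reads.
        open(os.path.join(root, "y"), "w").close()

        print("whole mount changed:", tree_digest(root) != whole_before)
        print("used files changed:", tree_digest(root, only={"a"}) != used_before)
```

The first comparison models today's behaviour (any change invalidates the step); the second models what this issue asks for.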

This would completely solve moby/moby#15858 in a much more elegant and automatic way, and would allow arbitrary Linux commands.

The other way of solving this is to not bust the cache if the produced layer ends up exactly the same as before. I don't know enough about Docker's internals to say whether this is possible at all.
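As a rough sketch of that second idea, assuming a toy model where a layer is a dict of path to bytes: if later steps are keyed on the digest of the layer produced by the previous step, an identical layer yields an identical downstream key. All names here (`layer_digest`, `run_cp_step`, `downstream_key`) are hypothetical and are not Docker or BuildKit APIs. Note that in this model the step itself still has to re-run to discover that its output is unchanged; only the steps after it are spared.

```python
import hashlib


def layer_digest(layer):
    """Digest of a toy layer, modelled as {path: bytes}."""
    h = hashlib.sha256()
    for path in sorted(layer):
        data = layer[path]
        h.update(path.encode() + b"\0")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def run_cp_step(mount):
    """Hypothetical stand-in for `cp /test/a /a`: the layer holds only /a."""
    return {"/a": mount["a"]}


def downstream_key(parent_layer, instruction):
    """Cache key for the next step, derived from the parent layer's content."""
    payload = layer_digest(parent_layer) + "\n" + instruction
    return hashlib.sha256(payload.encode()).hexdigest()


if __name__ == "__main__":
    before = run_cp_step({"a": b"", "x": b""})
    after = run_cp_step({"a": b"", "x": b"", "y": b""})  # unrelated file added
    same = downstream_key(before, "RUN echo done") == downstream_key(after, "RUN echo done")
    print("downstream key unchanged:", same)
```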

zen0wu changed the title from "Make --mount=bind only consider actually files that are opened" to "Cache --mount=bind output smartly" on Apr 23, 2022

zen0wu commented Apr 24, 2022

Here's a quick repro demonstrating what I mean:

Dockerfile:

```dockerfile
# syntax=docker/dockerfile:1.4

FROM alpine

RUN --mount=type=bind,target=/test,source=./test cp /test/a /a

ADD ./test/x /x
```

Commands:

```shell
mkdir test
touch test/a test/x

# initial build
DOCKER_BUILDKIT=1 docker build .

# subsequent build without changing anything
# output says "CACHED [stage-0 2/3]" and "CACHED [stage-0 3/3]"
DOCKER_BUILDKIT=1 docker build .

# add a file under test that's not being used by the build
touch test/y

# build again, cache is busted
DOCKER_BUILDKIT=1 docker build .
```

tonistiigi (Member) commented

Something like this was indeed discussed early on as one possible smarter caching strategy. I guess it isn't completely clear how many cases would benefit from it enough to justify the additional development and maintenance complexity. It also has some overhead. Doing it only for user-defined mounts is an interesting twist on the original idea that might help with the overhead issue.

Multiple components needed to make this work are currently missing. When running the process, we would need to capture which files the container accesses. Probably the only option here would be to add a bunch of seccomp notifiers.

The other part is that this requires different cache-key logic. After the process has completed, we would need to store both the cache key (a digest computed from the files' content) and the list of files that were accessed. This list could be quite big, which may become an issue for remote cache backends. When checking for cache matches before running the process a second time, we can't just compute the content digest again; we need to ask "what are the possible sets of files that we have existing cache keys for?" and then compute the cache key for each of these sets to see if any of them matches (I guess; it is unclear how to keep this from growing out of control). The current cache logic only computes the cache key before running the process and can directly check whether any record matching that key exists in any backend.
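The lookup described above could be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions: the mount is modelled as a dict of path to bytes, the accessed-file list is assumed to be supplied by some tracing mechanism (not shown), and `AccessSetCache`, `digest_of`, `store` and `lookup` are hypothetical names rather than BuildKit APIs.

```python
import hashlib


def digest_of(files, paths):
    """Digest over the selected paths only; a missing path hashes as absent."""
    h = hashlib.sha256()
    for path in sorted(paths):
        h.update(path.encode() + b"\0")
        data = files.get(path)
        if data is None:
            h.update(b"absent\0")
        else:
            h.update(b"present\0" + len(data).to_bytes(8, "big") + data)
    return h.hexdigest()


class AccessSetCache:
    """Toy cache: one record per (step, set of accessed files)."""

    def __init__(self):
        self.records = {}  # step -> list of (paths, digest, result)

    def store(self, step, files, accessed, result):
        # Called after the process ran: keep the access list and its digest.
        paths = frozenset(accessed)
        record = (paths, digest_of(files, paths), result)
        self.records.setdefault(step, []).append(record)

    def lookup(self, step, files):
        # Called before running: recompute a digest for every recorded set.
        for paths, digest, result in self.records.get(step, []):
            if digest_of(files, paths) == digest:
                return result
        return None


if __name__ == "__main__":
    cache = AccessSetCache()
    step = "cp /test/a /a"
    cache.store(step, {"a": b"", "x": b""}, ["a"], "layer-1")
    print(cache.lookup(step, {"a": b"", "x": b"", "y": b""}))  # unrelated file added
    print(cache.lookup(step, {"a": b"changed", "x": b""}))     # accessed file changed
```

The loop in `lookup` is where the cost concern shows up: each recorded access set needs its own digest computation, so the work grows with the number of stored records per step.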
