
Should not declare VOLUME for /data/db #306

Open
aparamon opened this issue Sep 28, 2018 · 12 comments
Labels: Request (Request for image modification or feature)

Comments

aparamon commented Sep 28, 2018

Currently, the Dockerfile declares a VOLUME for /data/db and /data/configdb:

VOLUME /data/db /data/configdb

This is sub-optimal because some workflows in derived images become excessively complicated.
For example, seeding a database from a dump now requires

RUN mkdir /data/local-db

COPY mongo-dump /var/mongo-dump
RUN mongod --fork --dbpath /data/local-db --logpath /var/log/mongodb.log && \
    mongorestore --db mydb /var/mongo-dump/mydb && \
    mongod --shutdown  --dbpath /data/local-db
RUN chown -R mongodb:mongodb /data/local-db

CMD ["mongod", "--dbpath", "/data/local-db"]

instead of just

RUN mongod --fork --logpath /var/log/mongodb.log && \
    mongorestore --db mydb /var/mongo-dump/mydb && \
    mongod --shutdown

because /data/db doesn't persist between docker build and docker run invocations.

It is proposed to remove the VOLUME directive and leave volume configuration up to the end user.

@wglambert added the Request (Request for image modification or feature) label Sep 28, 2018
yosifkit (Member) commented Oct 3, 2018

Storing the already initialized database files in the image's layers is not a great idea. Because it would be in a copy-on-write filesystem, the moment you start a new container and MongoDB changes any of its files, it uses twice as much space.

See also https://docs.docker.com/storage/storagedriver/overlayfs-driver/#modifying-files-or-directories:

Writing to a file for the first time: The first time a container writes to an existing file, that file does not exist in the container (upperdir). The overlay/overlay2 driver performs a copy_up operation to copy the file from the image (lowerdir) to the container (upperdir). The container then writes the changes to the new copy of the file in the container layer.

However, OverlayFS works at the file level rather than the block level. This means that all OverlayFS copy_up operations copy the entire file, even if the file is very large and only a small part of it is being modified. This can have a noticeable impact on container write performance.

And https://docs.docker.com/storage/storagedriver/overlayfs-driver/#performance-best-practices:

Use volumes for write-heavy workloads: Volumes provide the best and most predictable performance for write-heavy workloads. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write.

Have you thought of using /docker-entrypoint-initdb.d/ to have it restore the database on start instead? (hub docs)
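For reference, a rough sketch of that approach, assuming the dump is laid out as in the example above (paths and database name are just placeholders) and relying on the entrypoint executing *.sh scripts from /docker-entrypoint-initdb.d/ on first initialization, as described in the hub docs:

FROM mongo

# Hypothetical paths; the dump layout mirrors the original example.
COPY mongo-dump /var/mongo-dump
COPY restore-dump.sh /docker-entrypoint-initdb.d/restore-dump.sh

with restore-dump.sh along the lines of:

#!/bin/bash
# Runs only when the container starts with an empty data directory;
# the entrypoint's temporary mongod is already listening on localhost,
# so mongorestore can use its default connection.
mongorestore --db mydb /var/mongo-dump/mydb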

aparamon (Author) commented Oct 4, 2018

Hi @yosifkit, thank you for your detailed explanation!
My original motivation for restoring the DB at build time was to optimize container start-up time. However, it now seems that the actual speed-up would only occur in a read-only scenario, since for write operations Docker would still have to copy the huge files.
What do you believe to be the optimal way of setting up a pre-seeded Mongo container for read-only usage?

chobolt commented Mar 2, 2020

I spent quite some time understanding why my bind volume was not being used by the mongo Docker image, and eventually discovered the hardcoded VOLUME in the Dockerfile.

I'm fine with rebuilding the image without this declaration (which I did), but I find it weird to force the use of volumes when it should be up to the end user to decide how to store data.

mildmojo commented Jan 21, 2021

In my case, I want to construct a seeded database image that will be the basis for runtime containers that use a persistent, named volume for /data/db in my development environment. I can't base my seeded image Dockerfile on the official mongo image because of the VOLUME declaration. (It took me a couple of days to figure out why my Dockerfile writes to /data/db were being discarded at build time--yikes!)

If I fork the official mongo Dockerfile and remove the VOLUME instruction, the resulting image works great. I can base my Dockerfile on this -no-volume base image, do RUN seed-my-database.sh at build time, then later invoke docker or use a compose file that mounts a volume at /data/db, and my seed data is copied to the volume when the container's created. Perfect.
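A rough sketch of that layering, with mongo-no-volume standing in for a locally rebuilt image without the VOLUME line and seed-my-database.sh as a hypothetical seeding script:

# Hypothetical base: the official Dockerfile rebuilt without "VOLUME /data/db /data/configdb".
FROM mongo-no-volume

COPY seed-my-database.sh /usr/local/bin/
# With no VOLUME declared for /data/db, whatever the seed script writes there
# stays in the image layer.
RUN chmod +x /usr/local/bin/seed-my-database.sh && seed-my-database.sh

At run time, mounting an empty named volume at /data/db (docker run -v dbdata:/data/db my-seeded-image) copies the baked-in files into the volume on first use.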

Storing the already initialized database files in the image's layers is not a great idea. Because it would be in a copy-on-write filesystem, the moment you start a new container and MongoDB changes any of its files, it uses twice as much space.

The storage space tradeoff is worth it for me, since it saves me so much time when I need to reset to a known state. My seeded DB is a few hundred megabytes or a gigabyte, which I can easily spare on my dev machine or CI instances.

Have you thought of using /docker-entrypoint-initdb.d/ to have it restore the database on start instead?

In my case, the DB seed process from a known DB dump takes 8-10 minutes to complete on my workstation. I may need to reset the DB to a known state dozens of times while debugging DB migrations or a new feature. That reset takes seconds if I have a seeded image, but would be untenable if I had to restore the DB every time.

I can work around it, but I'd love to see an official -no-volume variant image. I'd expect the extra maintenance effort to be very small, given that the rest of it would be identical.

tianon (Member) commented Jan 21, 2021 via email

@mildmojo commented:

If you add "--dbpath" or set ".storage.dbPath" in a specified "--config" file, that value will be respected.

I ended up using this, but it means I have to make sure compose files or docker run commands that use this image duplicate my --dbpath /mongo-data/db argument if they're providing their own command values, which isn't great or obvious. I'd really prefer to store my data in the default location.

tianon (Member) commented Feb 6, 2021

If you're building an image with the data pre-seeded, you can combat that by setting CMD:

CMD ["--dbpath", "/mongo-data/db"]

(and then it'll be the default for users who don't specify a command)
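Putting the pieces together, a sketch of a pre-seeded image using a non-VOLUME path (the /mongo-data/db path, dump location, and database name are just examples):

FROM mongo

COPY mongo-dump /var/mongo-dump
# Seed into a path that the image's VOLUME declaration does not cover,
# so the data survives the build.
RUN mkdir -p /mongo-data/db && \
    mongod --fork --dbpath /mongo-data/db --logpath /var/log/mongodb.log && \
    mongorestore --db mydb /var/mongo-dump/mydb && \
    mongod --shutdown --dbpath /mongo-data/db && \
    chown -R mongodb:mongodb /mongo-data/db

# Default arguments passed to the entrypoint; users who override the command
# need to repeat --dbpath themselves.
CMD ["--dbpath", "/mongo-data/db"]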

mildmojo commented Feb 7, 2021

If you're building an image with the data pre-seeded, you can combat that by setting CMD:

Yeah, this is the approach aparamon outlined, and it's the strategy I'm currently using.

I agree with aparamon that a seeded image is a legitimate use case, and the official image doesn't work well for seeding at build time because it contains VOLUME /data/db. The workaround for seeded images adds Dockerfile chaff, creates an extra unused anonymous volume at runtime, and yields images that carry an asterisk: your seed data disappears when your users' containers naïvely provide their own commands (e.g. a compose file with command: --auth or a docker run --rm mongo-seeded --auth).

As a tooling developer, it would help to have an official mongo image without VOLUME so that I could build seeded mongo images that are easy-to-use drop-in replacements for the official (empty) image.

polarathene commented Mar 12, 2024

I'm not sure why, but this is a common issue across the DB images (equivalent issues exist for postgres, mysql, redis).

These issues have remained open for many years, with little justification given for why VOLUME is kept. Do they remain open for visibility? Undecided? Or until there is consensus for all of the images to make the switch together?

EDIT: I understand for these DB images:

  • The VOLUME paths are initially empty with the image and then populated at runtime for each container instance.
  • That you can provide a different location via a runtime option (--dbpath, as has been communicated in the comments above), which still:
    • Accumulates disk usage, but within the container instead of a separate volume.
    • Redundantly creates empty anonymous volumes.

Because it would be in a copy-on-write filesystem, the moment that you start a new container and MongoDB changes any of its files, it now uses twice as much space.

The concern for "twice as much space" seems moot as that's already what is happening implicitly? For MongoDB, this is 300MB+ per container instance created.

Prepopulated `VOLUME` (not applicable to mongo usage)

An implicit anonymous volume copies data from the image to the host per container instance created. This is wasteful and accumulates over runs (if containers are not removed afterwards via --rm).

The example I link to is fairly simple:

  • Golang image builds CoreDNS.
  • VOLUME instruction used in Dockerfile for Go's package storage and build cache. This represents over 2GB of data.
  • Outcome:
    • Each container instance that is run implicitly copies that VOLUME-declared data during container startup. This adds notable delay to container startup time as well.
    • Stopping the containers does not release these anonymous volumes, and they are not easy to inspect via the Docker CLI (Docker Desktop is more informative; see the commands below).
    • Four of these containers amounts to over 9GB of data in anonymous volumes, where there would otherwise be none.
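For inspecting what has piled up, the standard Docker CLI offers at least the following:

# Volumes no longer referenced by any container (removed containers leave
# their anonymous volumes behind here):
docker volume ls --filter dangling=true

# Per-volume disk usage:
docker system df -v

# Remove unused local volumes:
docker volume prune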

Use volumes for write-heavy workloads: Volumes provide the best and most predictable performance for write-heavy workloads. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write.

It should be opt-in. A container still persists internal state until it's destroyed/removed.

If a user wants to persist the data or have better performance, they should provide a volume explicitly?

  • Anonymous and named volumes will copy the content (if any) from the image like VOLUME does, and since that is done explicitly by the user there are no surprises.
  • Bind volume mounts have different semantics (no copy by default) but are also worthwhile, and compatible with this image (see the sketch below).

So while volumes are a best practice, I disagree about the VOLUME instruction (as this comment states, the Dockerfile should be concerned with the image build, not with how to manage runtime state).
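As a rough illustration of the difference (volume and directory names are just examples):

# Named volume: if "dbdata" is empty, Docker copies the image's content at
# /data/db into it on first mount (empty for this image, but a seeded
# derivative would carry data).
docker run -d --name mongo-named -v dbdata:/data/db mongo

# Bind mount: the host directory is used as-is; nothing is copied from the
# image into it.
docker run -d --name mongo-bind -v "$PWD/db:/data/db" mongo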


Other known concerns with VOLUME

  • Extending a base image (that has a VOLUME instruction) does not support an opt-out.
  • BuildKit has a slight difference in behaviour for VOLUME at build time (1, 2), which may have been applicable to the original Dockerfile issue of this thread.
    • Although that's more of a bugfix, I still see users building projects with the legacy builder, so VOLUME usage may still result in bug reports that add to maintainer burden and could otherwise be avoided.
  • When providing an explicit volume while the implicit VOLUME covers a subpath of it (see the sketch after this list):
    • The anonymous volume has priority; it will be seeded from the image content and remain.
    • Even if your explicit volume is a bind mount, the anonymous volume will replace the equivalent subpath with the image content. To prevent this you need to explicitly mount that subpath too.
    • Why would this behaviour be desired for this image rather than avoided by removing VOLUME? If a user is unaware, they may be led to think their explicit volume has all data persisted, treating any anonymous volumes as disposable and resulting in data loss.
  • A summary of an earlier investigation of mine on VOLUME (with plenty of references to justify VOLUME as a bad practice).
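A sketch of that subpath situation with this image (host paths are just examples):

# The host directory is bind mounted at /data, but the image's VOLUME for
# /data/db and /data/configdb still produces anonymous volumes mounted on
# top of it, so the database files never land in ./data on the host.
docker run -d -v "$PWD/data:/data" mongo

# To actually capture the data on the host, the subpaths have to be mounted explicitly too:
docker run -d \
  -v "$PWD/data/db:/data/db" \
  -v "$PWD/data/configdb:/data/configdb" \
  mongo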

Docker Compose quirk

I did come across this reasoning by @yosifkit (2017) that Docker Compose will try to preserve the anonymous volume across image upgrades (verified as still applicable).

At a rough glance it seems tied to the service name and that the new image has the expected VOLUME <path> declared.

  • Any change to the image data the VOLUME would seed from within that image would be ignored.
  • --force-recreate can remove other internal container state, but won't discard/reset the implicit anonymous volume.

The feature is well intentioned, but I can see how an intentional breaking-change update to the image could easily clash with it:

  • A new VOLUME introduced as a subpath within an existing VOLUME.
  • A subpath to a location that users had already been using explicit mounts on.
  • A developer iterating on a local image of their own with VOLUME, unaware of this Compose-specific "feature", potentially with some confused troubleshooting.

The main concern expressed by @yosifkit was that images which remove their VOLUME instruction(s) would cause data loss for compose.yaml users who rely on this implicit feature to persist their data. It's questionable that anyone relies on this when they really care about the data; is it advised/encouraged somewhere? Documented, or a hidden feature?

@yosifkit later notes in 2021 that this Docker Compose feature was implemented prior to proper external volume support, back when we only had anonymous volumes with the VOLUME instruction (which changed in 2015).
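A minimal setup where this quirk shows up (hypothetical compose.yaml):

services:
  db:
    image: mongo:7.0
    # No explicit volumes: the data lands in the anonymous volume that the
    # image's VOLUME /data/db produces.

Recreating the service (even with docker compose up --force-recreate, or after bumping the image tag) re-attaches the previous anonymous volume; docker compose up --renew-anon-volumes (-V) is the switch that discards it.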

@LaurentGoderre (Member) commented:

For the original use case, an init container would serve this much better. You could have an image with your data copied in and use mongorestore with the hostname of the other container. In that case you don't need to modify the default behavior, and you still maintain startup speed.
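A rough sketch of that, expressed with plain docker commands (network, names, and paths are just examples; a compose file with depends_on would express the same thing):

# Long-lived database container:
docker network create mongo-net
docker run -d --name mongo --network mongo-net mongo:7.0

# One-shot "init" container: restores the dump over the network into the
# mongo container, then exits.
docker run --rm --network mongo-net \
  -v "$PWD/mongo-dump:/dump:ro" \
  mongo:7.0 \
  mongorestore --host mongo --db mydb /dump/mydb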

@mildmojo commented:

For the original use case, an init container would serve this much better.

This could work for some users. It seems best used with strong orchestration (e.g. docker compose) and a very small DB seed. Some drawbacks:

  • An init container could add a penalty of several minutes each time the DB container is recreated if your seed is more than a few hundred megs with a handful of collections and indexes, vs. seconds for an image with data baked in. This can feel really demoralizing during a dev/test loop.
  • A long initial import on container creation would mean the DB is up & responsive but in an invalid state until the restore is complete, which can cause other services to fail and require manual intervention after e.g. docker compose up.
  • Like using --dbpath in CMD, using an additional container means you have another asterisk, another step, to get a working database when you're not using orchestration or you're writing/refactoring orchestration config.

Baking DB data into your own mongo image makes it fast and simple to launch new database containers that are in a valid state from first boot, and the official mongo images don't work well as a base for that. And it's time-consuming and difficult to figure out why your build-time data goes missing when you base your Dockerfile on the official mongo images.

However, the comment polarathene linked seems to observe that BuildKit treats VOLUME as EXPOSE-style advice only, and data written to volumes during build steps persists in the final image. So maybe this is... fixed by BuildKit? 😶

tianon (Member) commented Mar 18, 2024

However, the comment polarathene linked seems to observe that BuildKit treats VOLUME as EXPOSE-style advice only, and data written to volumes during build steps persists in the final image. So maybe this is... fixed by BuildKit? 😶

Yep!

FROM mongo:7.0
RUN touch /data/db/foo

$ docker buildx build .
...
#6 writing image sha256:1ca35025fc343c6e199f7abedad161821524f8374d78d474a923aab571d8473f done
#6 DONE 0.0s

$ docker run --rm sha256:1ca35025fc343c6e199f7abedad161821524f8374d78d474a923aab571d8473f ls -l /data/db/
total 0
-rw-r--r-- 1 root root 0 Mar 18 23:11 foo
