
Introduce the rawblock snapshotter #3130

Closed
wants to merge 2 commits into from

Conversation

bergwolf
Contributor

@bergwolf bergwolf commented Mar 26, 2019

Hello,

This PR introduces a new snapshotter for containerd named rawblock, based on the original work by @laijs in hyperhq/hyperd. It is a file-based block storage snapshotter: each snapshot is a file on the host file system that can be mounted via a loopback device to provide a file system view. Child snapshots are created via the local file system's reflink capability (when available). Reflink is implemented by several Linux kernel file systems such as btrfs, xfs and NFSv4.2. With it, the rawblock snapshotter can be quite space efficient and convenient to use.
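
To illustrate the idea, here is a minimal, hypothetical Go sketch (not code from this PR; the helper name is made up): a child snapshot file is cloned from its parent with the FICLONE ioctl when the host file system supports reflink, and falls back to a plain copy otherwise.

```go
package sketch

import (
	"io"
	"os"

	"golang.org/x/sys/unix"
)

// createChildSnapshot clones parent into child, sharing data blocks via
// reflink where possible and copying the data otherwise.
func createChildSnapshot(parent, child string) error {
	src, err := os.Open(parent)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(child, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0600)
	if err != nil {
		return err
	}
	defer dst.Close()

	// Reflink (FICLONE): constant-time, shares blocks until they are written.
	if err := unix.IoctlFileClone(int(dst.Fd()), int(src.Fd())); err == nil {
		return nil
	}

	// Fallback for file systems without reflink support: full data copy.
	_, err = io.Copy(dst, src)
	return err
}
```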

A few interesting questions people might ask:

  1. Does it share page cache among different layers?
  • No, not currently. Different layers are mounted as different file systems and each maintains its own page cache. However, it is possible to bypass the page cache completely if the host file system providing reflink also supports DAX. Of course, one would need fast storage (e.g., NVMe) to win that performance back.
  2. How many layers of page cache are involved?
  • One. The loopback-device double-caching problem is a thing of the past: the loopback device is configured to use direct IO and thus avoids the page cache of the host file system (see the sketch below).
  3. Can it be used to implement other types of file-based block storage snapshotters, like qcow2?
  • Yes, it can. The only difference between the various types of file-based block storage snapshotters is how new snapshot files are created. We can add a type field to the snapshotter config to specify which method to use. But let's add things one at a time ;)
  4. What are the advantages and disadvantages compared to devmapper?

Comparing them is essentially comparing dm-thin with file system snapshots. Here is what I can think of so far:

  • Pros:
    • Simplicity. dm-thin requires an additional block (btree) mapping from the thin device to the blocks in the data file, and needs a userspace daemon to monitor the thin-pool data file and grow it automatically. Rawblock has no such complexity: the snapshot functionality is handled by the host file system together with its file block mapping, and free disk space is also managed by the host file system internally.
    • The dm-thin data file does not return disk space to the host file system after thin provisionings/snapshots are deleted. There is a long-standing moby issue for this: Device-mapper does not release free space from removed images (moby/moby#3182). In theory it can be fixed with fallocate(2) FALLOC_FL_PUNCH_HOLE, so I'm not sure whether it is still the case.
  • Cons:
    • If the host file system does not support reflink, rawblock snapshot creation falls back to a plain data copy and can be very slow.
    • dm-thin allows the metadata and data files to be placed separately, so it is possible to put the metadata file on fast storage (e.g., NVMe) for better metadata performance.
    • dm-thin pool metadata and data files can themselves be raw block devices, removing the extra host file system layer; that makes it possible to be faster than the rawblock snapshotter because there is one less file system layer.
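
As a rough illustration of the direct IO point in answer 2 above, the following hypothetical Go sketch (not code from this PR; the function name is made up) enables direct IO on an already-attached loop device with the LOOP_SET_DIRECT_IO ioctl, so reads and writes bypass the host file system's page cache.

```go
package sketch

import "golang.org/x/sys/unix"

// enableLoopDirectIO asks the loop driver to access its backing file with
// O_DIRECT; the kernel rejects this if the backing file system cannot honor it.
func enableLoopDirectIO(loopPath string) error {
	fd, err := unix.Open(loopPath, unix.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	return unix.IoctlSetInt(fd, unix.LOOP_SET_DIRECT_IO, 1)
}
```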

/cc @laijs @lifupan

@codecov-io

codecov-io commented Mar 26, 2019

Codecov Report

Merging #3130 into master will increase coverage by 0.05%.
The diff coverage is 44.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3130      +/-   ##
==========================================
+ Coverage   43.58%   43.63%   +0.05%     
==========================================
  Files         104      107       +3     
  Lines       11156    11527     +371     
==========================================
+ Hits         4862     5030     +168     
- Misses       5555     5714     +159     
- Partials      739      783      +44
Flag        Coverage Δ
#linux      47.47% <44.44%> (-0.1%) ⬇️
#windows    40.47% <ø> (-0.03%) ⬇️

Impacted Files                   Coverage Δ
mount/mount_linux.go             28.99% <0%> (-2.02%) ⬇️
mount/losetup_linux.go           0% <0%> (ø)
snapshots/rawblock/rawblock.go   53.73% <53.73%> (ø)
snapshots/rawblock/config.go     73.91% <73.91%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90a7da8...7794595. Read the comment docs.

@bergwolf bergwolf force-pushed the rawblock branch 2 times, most recently from f315333 to 982b426 Compare March 27, 2019 03:06
@renzhengeek

The dm-thin data file does not return disk space to the host file system after thin provisionings/snapshots are deleted.

@bergwolf Hi, do you mean the following?

Even though running fstrim on the mountpoint sends discard requests to the thin device, so the freed disk space is returned to the thin pool, the host file system still cannot use that space?

Hmm, if it is used by the kata runtime, each thin device is passed through to the guest, so why is there a "host filesystem" involved?

@bergwolf
Contributor Author

@renzhengeek

Hmm, if it is used by the kata runtime, each thin device is passed through to the guest, so why is there a "host filesystem" involved?

It isn't kata-specific. I was referring to the dm-thin pool data and metadata files; they are usually files on the host file system.

Even though running fstrim on the mountpoint sends discard requests to the thin device, so the freed disk space is returned to the thin pool, the host file system still cannot use that space?

dm-thin needs to translate the trim command into FALLOC_FL_PUNCH_HOLE to tell the host file system to make the data file sparse again. I'm not sure whether that is implemented. It should be able to fix moby/moby#3182 though.
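
For illustration, a minimal Go sketch of the punch-hole operation mentioned above (a hypothetical helper, not dm-thin or containerd code):

```go
package sketch

import "golang.org/x/sys/unix"

// punchHole deallocates the given byte range of an open file so the host
// file system can reclaim the blocks while the file size stays unchanged.
func punchHole(fd int, offset, length int64) error {
	// FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE.
	return unix.Fallocate(fd, unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, offset, length)
}
```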

@renzhengeek

Hi,

dm-thin needs to translate the trim command into FALLOC_FL_PUNCH_HOLE to tell the host file system to make the data file sparse again. I'm not sure whether that is implemented. It should be able to fix moby/moby#3182 though.

Aha, I think I get the point now. The DM devices you are talking about here use files as the underlying storage, not raw disks. Thanks.

@dmcgowan
Member

dmcgowan commented Apr 2, 2019

Is there a separate repository where this driver currently exists and has been shown to be usable through the proxy driver interface?

@bergwolf
Contributor Author

bergwolf commented Apr 3, 2019

@dmcgowan No, there isn't yet. Due to its simplicity, I would like to propose making it built-in. The actual implementation is quite simple; almost half of the patch is to enable the mount.All API to handle the `-o loop` option. That part of the code is actually a missing piece of the API and can be used to simplify the already-merged devmapper snapshotter as well.

@bergwolf
Contributor Author

bergwolf commented Apr 3, 2019

@dmcgowan For testing, one can simply use the `--snapshotter rawblock` option. Are there any special testing utilities you would like to see integrated with this snapshotter?

@theopenlab-ci

theopenlab-ci bot commented Mar 13, 2020

Build succeeded.

@theopenlab-ci

theopenlab-ci bot commented Mar 13, 2020

Build succeeded.

If a mount specifies the `loop` option, we need to handle it on our
own instead of passing it to the kernel. In that case, create a
loopback device, attach the mount source to it, and mount the loopback
device rather than the mount source.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
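
The flow described in this commit message might look roughly like the following hypothetical Go sketch (assumed function name and simplified error handling; not the actual patch):

```go
package sketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// attachLoop picks a free loop device, attaches the mount source file to it,
// and returns the loop device path to be mounted in place of the source.
func attachLoop(source string) (string, error) {
	ctl, err := unix.Open("/dev/loop-control", unix.O_RDWR, 0)
	if err != nil {
		return "", err
	}
	defer unix.Close(ctl)

	// Ask the loop driver for the index of a free device, e.g. /dev/loop3.
	idx, err := unix.IoctlRetInt(ctl, unix.LOOP_CTL_GET_FREE)
	if err != nil {
		return "", err
	}
	loopPath := fmt.Sprintf("/dev/loop%d", idx)

	loopFd, err := unix.Open(loopPath, unix.O_RDWR, 0)
	if err != nil {
		return "", err
	}
	defer unix.Close(loopFd)

	backing, err := unix.Open(source, unix.O_RDWR, 0)
	if err != nil {
		return "", err
	}
	defer unix.Close(backing)

	// Attach the backing file; the caller then mounts loopPath instead of source.
	if err := unix.IoctlSetInt(loopFd, unix.LOOP_SET_FD, backing); err != nil {
		return "", err
	}
	return loopPath, nil
}
```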
@theopenlab-ci

theopenlab-ci bot commented Mar 14, 2020

Build succeeded.

@theopenlab-ci

theopenlab-ci bot commented Mar 14, 2020

Build succeeded.

@bergwolf
Contributor Author

CI failure seems unrelated?

E0314 09:51:02.722787 3768 remote_runtime.go:200] CreateContainer in sandbox "7fb3519b6b340b0ab229a65499e59cc3262b9fb1a5954d8885cc460e1b6f4b9b" from runtime service failed: rpc error: code = Unknown desc = : failed to generate seccomp spec opts: invalid seccomp profile "/tmp/seccomp-tests102614447/block-host-name.json"

It creates a local file-based block device and formats the
configured file system on it. If the local file system supports
reflink (e.g., btrfs, ocfs2, xfs, NFSv4.2), snapshot creation
can be very fast and the underlying data blocks are shared by the
local file system.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
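
A rough, hypothetical Go sketch of the create-and-format step described in this commit message (assumed names and parameters; not the snapshotter's actual code):

```go
package sketch

import (
	"fmt"
	"os"
	"os/exec"
)

// createSnapshotImage allocates a sparse backing file of the requested size
// and formats the configured file system onto it with mkfs.
func createSnapshotImage(path, fsType string, sizeBytes int64) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0600)
	if err != nil {
		return err
	}
	// Truncate makes the file sparse: blocks are allocated lazily on write.
	if err := f.Truncate(sizeBytes); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// Note: some mkfs variants need a force flag (e.g. mkfs.ext4 -F) when the
	// target is a regular file rather than a block device.
	out, err := exec.Command(fmt.Sprintf("mkfs.%s", fsType), path).CombinedOutput()
	if err != nil {
		return fmt.Errorf("mkfs failed: %v: %s", err, out)
	}
	return nil
}
```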
@theopenlab-ci

theopenlab-ci bot commented Mar 16, 2020

Build succeeded.

@bergwolf
Contributor Author

Another network issue?

Failures
 - mingw (exited 1) - mingw not installed. An error occurred during installation:
 The remote name could not be resolved: 'packages.chocolatey.org'
Command exited with code 1

@fuweid
Member

fuweid commented Mar 16, 2020

Another network issue?

Failures
 - mingw (exited 1) - mingw not installed. An error occurred during installation:
 The remote name could not be resolved: 'packages.chocolatey.org'
Command exited with code 1

Re-running it and hoping it goes green.

@AkihiroSuda
Member

Due to its simplicity, I want to propose to make it builtin.

Maybe we can accept this as a non-core project and as a gRPC plugin first.
We can consider making this built-in later if there is significant demand.

@bergwolf
Contributor Author

bergwolf commented Apr 9, 2020

@AkihiroSuda In that case, we still need the first commit 37c9ddb to add loopback support to containerd's mount package. What do others think? I can split the PR if that is the decision.

@AkihiroSuda
Member

In that case, we still need the first commit 37c9ddb to add loopback support to containerd's mount package.

SGTM

@bergwolf
Contributor Author

Please see #4178, which adds loopback support to the mount package.

@AkihiroSuda AkihiroSuda removed this from the 1.4 milestone May 9, 2020
@mxpv
Member

mxpv commented Jan 8, 2021

The loopback mounts PR is in; please open a proposal for a non-core project in order to get started with the gRPC version of the plugin.
