Freezing File Systems With dmsetup suspend Versus fsfreeze

Eric Weber edited this page Apr 19, 2024 · 40 revisions

Purpose

The goal of https://github.com/longhorn/longhorn/issues/2187 is to find a safe way to freeze a file system before taking a user-requested snapshot. This should help ensure the consistency of the snapshot.

  • https://github.com/longhorn/longhorn/issues/2187#issuecomment-2026172943 outlines an approach that ensures we can safely use the fsfreeze -f command without accidentally freezing the root file system while taking the snapshot (in the event that the Longhorn volume is unmounted during the process). A previous implementation allowed for this possibility.
  • https://github.com/longhorn/longhorn/issues/2187#issuecomment-1714234075 outlines an approach that adds a Device Mapper linear device on top of a Longhorn volume and uses dmsetup suspend. dmsetup suspend targets a block device, so it is inherently safe from an accidental freeze of the root file system. This approach, while more complicated initially, may also allow future versions of Longhorn to live upgrade v1 volumes into upgraded instance-managers without the need for downtime.
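
The hazard behind the first approach can be illustrated with a small check. This is a minimal sketch and NOT the approach from the linked comment: it only confirms that a mount point is still backed by the expected device before fsfreeze -f is called. The helper name is_mounted_from is hypothetical, and the sketch assumes the device appears in /proc/mounts under the same path that is passed in.

```shell
# Hypothetical helper: succeed only if MOUNTPOINT is currently backed by DEVICE.
# Reads mount entries in /proc/mounts format on stdin.
is_mounted_from() {
    # $1: expected device, $2: mount point
    awk -v dev="$1" -v mp="$2" '
        $2 == mp { src = $1 }             # later entries shadow earlier ones
        END { exit (src == dev) ? 0 : 1 }
    '
}

# Example (requires root and a mounted Longhorn volume):
#   is_mounted_from /dev/longhorn/test /mnt/test < /proc/mounts && fsfreeze -f /mnt/test
```

A check like this is inherently racy, because the volume could still be unmounted between the check and the freeze. That gap is one reason an operation that targets the block device directly is attractive.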

The decision was made to pursue the fsfreeze approach in the short term if it could be done safely. While implementing it, we discovered a test case in which an instance-manager crashing at the wrong time could strand workload processes in an uninterruptible sleep state. The only resolution to this situation is a full (potentially hard) node reboot. Further investigation showed that this was possible in the dmsetup suspend case as well.

This page serves as a location to document the ongoing investigation into this potential issue.

Resolution summary

For reasons discussed in depth below, kernels < v5.17 are vulnerable to the issue under investigation. Depending on the particular backports they include, these kernels may exhibit different behaviors when running ext4 vs xfs file systems. Kernels >= v5.17 are not vulnerable with ext4 or xfs. Other file systems were not tested.
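
A node can be compared against this threshold with a short version check. This is a minimal sketch: the helper name kernel_at_least is hypothetical, and it compares only the leading major.minor of the release string. It cannot tell whether an older distribution kernel carries the relevant backports, so a "too old" result means "possibly vulnerable", not "certainly vulnerable".

```shell
# Hypothetical helper: succeed if the kernel release string $1 is at least
# version $2.$3. Only the leading "major.minor" is compared; distribution
# suffixes such as "-102-generic" are ignored.
kernel_at_least() {
    local release major rest minor
    release=$1
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Example:
#   kernel_at_least "$(uname -r)" 5 17 || echo "kernel predates v5.17"
```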

Failing test case

Overview

In certain environments, with certain file systems, if the instance-manager crashes while a dmsetup suspend or fsfreeze -f is ongoing, a workload process writing to the file system gets stuck in uninterruptible sleep. We have not found a way to get the process unstuck without rebooting the node. As a result:

  • The workload container cannot be stopped, even with a --force. (It is possible to remove it from the Kubernetes API, but its process still runs on the host.)
  • The mounted file system cannot be unmounted (due to the stuck workload process). The CSI flow cannot make progress.
  • The dirty page cache on the node remains as full as it was when the instance-manager crashed. The file system cannot be unmounted, and the pages cannot be written out (or apparently discarded).

After the testing and discussion below, we have determined that kernels >= 5.17.0 or kernels with ALL of the top four patches from this branch do NOT fail in this way.

Tested environments

We tested both ext4 and xfs file systems in the following environments:

Rocky

[eweber@eweber-engine-test ~]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.3 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.3 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"

[root@eweber-engine-test eweber]# dmsetup version
Library version:   1.02.195 (2023-04-21)
Driver version:    4.48.0

[root@eweber-engine-test eweber]# fsfreeze -V
fsfreeze from util-linux 2.37.4

[root@eweber-engine-test eweber]# uname -r
5.14.0-362.24.1.el9_3.0.1.x86_64

Ubuntu (old)

PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

root@eweber-v126-worker-9c1451b4-kgxdq:~# dmsetup version
Library version:   1.02.175 (2021-01-08)
Driver version:    4.45.0

root@eweber-v126-worker-9c1451b4-kgxdq:~# fsfreeze -V
fsfreeze from util-linux 2.37.2

root@eweber-v126-worker-9c1451b4-kgxdq:~# uname -r
5.15.0-102-generic

Ubuntu (new)

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 23.10"
NAME="Ubuntu"
VERSION_ID="23.10"
VERSION="23.10 (Mantic Minotaur)"
VERSION_CODENAME=mantic
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=mantic
LOGO=ubuntu-logo

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# dmsetup version
Library version:   1.02.185 (2022-05-18)
Driver version:    4.48.0

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# fsfreeze -V
fsfreeze from util-linux 2.39.1

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# uname -r
6.5.0-9-generic

Test methodology

  1. Create a simple Longhorn volume with one replica and one engine inside a container.
# Use docker.
docker run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:master-head launch-simple-longhorn test 10g tgt-blockdev

# Or, use nerdctl.
nerdctl run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:master-head launch-simple-longhorn test 10g tgt-blockdev
  2. Create an ext4 or xfs file system on the volume.
  3. Mount the volume to /mnt/test.
  4. Use dd on a loop to keep up a stream of constant writes to a file on the file system.
while true; do dd of=/mnt/test/file if=/dev/urandom bs=16M; done
  5. Monitor the dirty page cache. While dd is running, there may be as much as 1 GiB of data in it.
while true; do cat /proc/meminfo | grep -i dirty; sleep 0.5; done
  6. Identify the process group for the processes running in the instance-manager container.
# 5977 in this example.
[root@eweber-engine-test eweber]# ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | grep long
   5977    5965    5977 root     Sl    0.3 longhorn-instan ep_poll
   6042    5977    5977 root     Sl    0.2 longhorn        futex_wait_queue
   6050    6042    5977 root     Sl    0.0 longhorn        futex_wait_queue
   6134    5977    5977 root     Sl    1.1 longhorn        futex_wait_queue
  7. Execute fsfreeze -f or dmsetup suspend.
  8. Before the command completes, kill the instance-manager process group. It typically takes at least five seconds for the data to flush.
  9. Wait two minutes for the iSCSI timeout. Before the timeout, the executed command and dd are definitely stuck (as they are waiting for I/O to complete).
  10. Monitor the reason the processes are stuck.
while true; do ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | grep -e freeze -e dmsetup -e " dd"; sleep 1; done
  11. After two minutes, evaluate success.
    • The test is successful if:
      • dd stops on its own OR the dd process can be killed.
      • The file system can be unmounted.
    • The test fails if:
      • dd remains stuck in uninterruptible sleep (the D state) and cannot be killed.
      • The file system cannot be unmounted.
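
The steps above identify the process group by eye and do not show the kill command itself. The following is a minimal sketch of both parts, assuming the same ps column layout used in the methodology. The helper name pgids_for_comm is hypothetical.

```shell
# Hypothetical helper: print the unique process group IDs of processes whose
# command name starts with PREFIX. Expects the output of
#   ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32
# on stdin (column 3 is the PGID, column 7 is the command name).
pgids_for_comm() {
    awk -v prefix="$1" 'index($7, prefix) == 1 && !seen[$3]++ { print $3 }'
}

# Example (requires root; 5977 is the group from the listing above):
#   ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | pgids_for_comm longhorn
#   kill -KILL -- -5977     # a negative PID addresses the whole process group
```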

Results

Typical failure output

These results are for an ext4 file system in the Ubuntu (old) environment with fsfreeze -f.

Some numbers from the dirty page cache monitor before fsfreeze -f:

Dirty:            585920 kB
Dirty:            540712 kB
Dirty:            507844 kB

Stuck processes during the two minute iSCSI timeout window:

   8214    5585    8214 root     D+   42.8 dd              percpu_rwsem_wait
   8362    7281    8362 root     D+    0.0 fsfreeze        wb_wait_for_completion
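
Listings like the one above can be reduced to just the stuck processes by filtering on the process state. This is a minimal sketch that assumes the same ps column layout as the monitoring loop in the methodology. The helper name uninterruptible is hypothetical.

```shell
# Hypothetical helper: print "PID COMMAND WCHAN" for every process in
# uninterruptible sleep (a STAT value starting with D). Expects the output of
#   ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32
# on stdin.
uninterruptible() {
    awk '$5 ~ /^D/ { print $1, $7, $8 }'
}

# Example:
#   ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | uninterruptible
```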

Some numbers from the dirty page cache monitor while stuck:

Dirty:            409608 kB
Dirty:            409608 kB
Dirty:            409608 kB
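
The unchanging counter is the signal that writeback has stopped making progress. The following is a minimal sketch of how that observation could be automated over a series of samples. The helper name dirty_trend and its two output words are hypothetical.

```shell
# Hypothetical helper: report whether the Dirty counter changed across the
# samples read from stdin. Each relevant line is expected to look like
#   Dirty:            409608 kB
dirty_trend() {
    awk '$1 == "Dirty:" {
             if (n++ && $2 != last) changed = 1   # compare with the previous sample
             last = $2
         }
         END {
             result = (n > 1 && !changed) ? "stalled" : "changing"
             print result
         }'
}

# Example: take five samples, half a second apart.
#   for i in 1 2 3 4 5; do grep -i dirty: /proc/meminfo; sleep 0.5; done | dirty_trend
```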

fsfreeze -f looks to have completed successfully AFTER I/O errors (paused for two minutes):

[root@eweber-engine-test eweber]# fsfreeze -f /mnt/test
[root@eweber-engine-test eweber]# echo $?
0

Single stuck process after fsfreeze -f became unstuck:

   8214    5585    8214 root     D+    5.4 dd              percpu_rwsem_wait

Can supposedly unfreeze:

[root@eweber-engine-test eweber]# fsfreeze -u /mnt/test
[root@eweber-engine-test eweber]# echo $?
0

Cannot kill dd:

[root@eweber-engine-test eweber]# ps -ef | grep " dd "
root        8214    5585  3 22:19 pts/1    00:00:09 dd of=/mnt/test/file if=/dev/urandom bs=16M
root       13403    7281  0 22:24 pts/4    00:00:00 grep --color=auto  dd 

[root@eweber-engine-test eweber]# kill -9 8214

[root@eweber-engine-test eweber]# ps -ef | grep " dd "
root        8214    5585  3 22:19 pts/1    00:00:09 dd of=/mnt/test/file if=/dev/urandom bs=16M
root       13570    7281  0 22:24 pts/4    00:00:00 grep --color=auto  dd 

Cannot unmount or otherwise clean up:

[root@eweber-engine-test eweber]# umount /mnt/test
umount: /mnt/test: target is busy.

[root@eweber-engine-test eweber]# fuser -v /mnt/test/file
                     USER        PID ACCESS COMMAND
/mnt/test/file:      root       8214 F.... dd

Matrix

Environment | Result | Notes
fsfreeze, ext4, Rocky | FAIL (2x) |
fsfreeze, ext4, Ubuntu (old) | PASS (2x) | fsfreeze -f failed with Input/output error, could get things unstuck
fsfreeze, xfs, Rocky | FAIL (2x) | fsfreeze: cannot open /mnt/test: Input/output error when trying to unfreeze
fsfreeze, xfs, Ubuntu (old) | FAIL (2x) | fsfreeze: cannot open /mnt/test: Input/output error when trying to unfreeze
dmsetup, ext4, Rocky | FAIL (2x) |
dmsetup, ext4, Ubuntu (old) | PASS (1x) | dmsetup suspend failed with Input/output error, could get things unstuck
dmsetup, xfs, Rocky | PASS (3x) | dmsetup suspend succeeded, but a dmsetup resume could get things unstuck
dmsetup, xfs, Ubuntu (old) | PASS (2x) | dmsetup suspend succeeded, but a dmsetup resume could get things unstuck
fsfreeze, ext4, Ubuntu (old) | PASS (1x) | fsfreeze -f failed with Input/output error, could get things unstuck
fsfreeze, xfs, Ubuntu (old) | PASS (1x) | fsfreeze -f failed with Input/output error, could get things unstuck
dmsetup, ext4, Ubuntu (old) | PASS (1x) | dmsetup suspend failed with Input/output error, could get things unstuck
dmsetup, xfs, Ubuntu (old) | PASS (1x) | dmsetup suspend failed with Input/output error, could get things unstuck

Explanations with kernel code

How to obtain kernel code

Rocky

Execute the following on the Rocky system. There isn't a particularly easy way to browse this source online.

yum install rpm-build
yumdownloader --source kernel
rpmbuild -rp kernel-5.14.0-362.24.1.el9_3.0.1.src.rpm
rm kernel-5.14.0-362.24.1.el9_3.0.1.src.rpm

# The source tree is now located in this directory.
ls ~/rpmbuild/BUILD/kernel-5.14.0-362.24.1.el9_3/linux-5.14.0-362.24.1.el9.0.1.x86_64/

Ubuntu

Execute the following on the Ubuntu system.

# Enable source for jammy and jammy-updates.
# Seriously, enable source for jammy-updates. It took me forever to understand why I had outdated source code.
vi /etc/apt/sources.list
apt source linux-image-unsigned-$(uname -r)

# The source tree is now located in this directory.
ls linux-5.15.0

# The list of changes from the base kernel is in this file.
ls ~/linux-5.15.0/debian/changelog

Alternatively, Ubuntu code can be browsed online.

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/tree/?h=Ubuntu-5.15.0-102.112

Do fsfreeze and dmsetup suspend call the same code?

Yes. At the outset of this investigation, we assumed that dmsetup suspend and fsfreeze -f were fundamentally different operations that had similar effects (flushing I/O and ensuring no additional writes could reach the volume until reversed). However, where file system interactions are concerned, the two commands execute the same kernel code, starting with a call to the freeze_super function.

Investigation methodology

  1. Create a simple Longhorn volume with one replica and one engine inside a container.
# Use docker.
docker run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:master-head launch-simple-longhorn test 10g tgt-blockdev

# Or, use nerdctl.
nerdctl run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:master-head launch-simple-longhorn test 10g tgt-blockdev
  2. Create an ext4 or xfs file system on the volume.
  3. Mount the volume to /mnt/test.
  4. Use dd on a loop to keep up a stream of constant writes to a file on the file system. (Unnecessary in this section, but done for consistency.)
while true; do dd of=/mnt/test/file if=/dev/urandom bs=16M; done
  5. Optional, use strace to monitor system calls as either command executes.
  6. Use trace-cmd as a frontend to the kernel's builtin ftrace tracer to monitor kernel function calls as either command executes.

Evidence

dmsetup suspend

In the dmsetup case, we must first create a Device Mapper linear device on top of the volume. Then, we mount that device to the expected location.

[root@eweber-engine-test eweber]# dmsetup create test --table "0 $(blockdev --getsz /dev/longhorn/test) linear /dev/longhorn/test 0"

[root@eweber-engine-test eweber]# mkfs.ext4 /dev/mapper/test
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done                            
Creating filesystem with 2621440 4k blocks and 655360 inodes
Filesystem UUID: 3a118df8-31b5-4f6e-8e4e-ecaa03dc5f7a
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

[root@eweber-engine-test eweber]# mount /dev/mapper/test /mnt/test
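
The table string passed to dmsetup create above has the form "start length linear device offset", with lengths in 512-byte sectors. The following is a minimal sketch that builds the same string and outlines the suspend and resume sequence it enables. The helper name linear_table is hypothetical, and the snapshot step is left as a placeholder.

```shell
# Hypothetical helper: build a Device Mapper table line that maps SECTORS
# 512-byte sectors one-to-one onto DEVICE, starting at sector 0 of both.
linear_table() {
    # $1: size in 512-byte sectors (as printed by `blockdev --getsz`), $2: device
    printf '0 %s linear %s 0\n' "$1" "$2"
}

# Example (requires root and an existing /dev/longhorn/test):
#   dmsetup create test --table "$(linear_table "$(blockdev --getsz /dev/longhorn/test)" /dev/longhorn/test)"
#   dmsetup suspend test     # flushes and freezes the file system on /dev/mapper/test
#   ...take the snapshot...
#   dmsetup resume test      # thaws the file system
```

For the 10 GiB volume used on this page, the sector count is 20971520, which matches the 2621440 4k blocks reported by mkfs.ext4.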

dmsetup suspend calls freeze_super (and the underlying file system's freeze functions).

           <...>-3212  [001]    0.878436227234: funcgraph_entry:                   |  dev_suspend() {
           <...>-3212  [001]    0.878436228568: funcgraph_entry:                   |    down_read() {
           <...>-3212  [001]    0.878436228896: funcgraph_entry:        0.253 us   |      __cond_resched();
           <...>-3212  [001]    0.878436229565: funcgraph_exit:         1.144 us   |    }
           <...>-3212  [001]    0.878436229859: funcgraph_entry:                   |    __find_device_hash_cell() {
           <...>-3212  [001]    0.878436230813: funcgraph_entry:        0.635 us   |      dm_get();
           <...>-3212  [001]    0.878436231775: funcgraph_exit:         1.948 us   |    }
           <...>-3212  [001]    0.878436231972: funcgraph_entry:        0.187 us   |    up_read();
           <...>-3212  [001]    0.878436232727: funcgraph_entry:        0.180 us   |    dm_suspended_md();
           <...>-3212  [001]    0.878436233152: funcgraph_entry:                   |    dm_suspend() {
           <...>-3212  [001]    0.878436233429: funcgraph_entry:                   |      mutex_lock() {
           <...>-3212  [001]    0.878436233690: funcgraph_entry:        0.351 us   |        __cond_resched();
           <...>-3212  [001]    0.878436234092: funcgraph_exit:         0.686 us   |      }
           <...>-3212  [001]    0.878436234747: funcgraph_entry:                   |      __dm_suspend() {
           <...>-3212  [001]    0.878436235210: funcgraph_entry:        1.030 us   |        dm_table_presuspend_targets();
           <...>-3212  [001]    0.878436236630: funcgraph_entry:                   |        freeze_bdev() {
           <...>-3212  [001]    0.878436236899: funcgraph_entry:                   |          mutex_lock() {
           <...>-3212  [001]    0.878436237075: funcgraph_entry:        0.179 us   |            __cond_resched();
           <...>-3212  [001]    0.878436237565: funcgraph_exit:         0.778 us   |          }
           <...>-3212  [001]    0.878436237968: funcgraph_entry:                   |          get_active_super() {
           <...>-3212  [001]    0.878436238155: funcgraph_entry:        0.193 us   |            _raw_spin_lock();
           <...>-3212  [001]    0.878436246604: funcgraph_entry:        0.888 us   |            grab_super();
           <...>-3212  [001]    0.878436247666: funcgraph_entry:        0.256 us   |            up_write();
           <...>-3212  [001]    0.878436248050: funcgraph_exit:       + 10.114 us  |          }
           <...>-3212  [001]    0.878436248783: funcgraph_entry:                   |          freeze_super() {
           <...>-3212  [001]    0.878436249064: funcgraph_entry:        0.275 us   |            down_write();
           <...>-3212  [001]    0.878436249586: funcgraph_entry:        0.175 us   |            up_write();
           <...>-3212  [001]    0.878436249987: funcgraph_entry:      # 6932.678 us |            percpu_down_write();
           <...>-3212  [001]    0.878443184838: funcgraph_entry:        0.509 us   |            down_write();
           <...>-3212  [001]    0.878443185559: funcgraph_entry:      # 5974.983 us |            percpu_down_write();
           <...>-3212  [001]    0.878449162587: funcgraph_entry:      + 27.267 us  |            sync_filesystem();
           <...>-3212  [001]    0.878449190034: funcgraph_entry:      # 5981.525 us |            percpu_down_write();
           <...>-3212  [001]    0.878455174124: funcgraph_entry:      # 10180.372 us |            ext4_freeze();
           <...>-3212  [001]    0.878465356669: funcgraph_entry:        0.504 us   |            up_write();
           <...>-3212  [001]    0.878465357296: funcgraph_exit:       # 29108.544 us |          }
           <...>-3212  [001]    0.878465358018: funcgraph_entry:        0.283 us   |          deactivate_super();
           <...>-3212  [001]    0.878465358578: funcgraph_entry:                   |          filemap_write_and_wait_range() {
           <...>-3212  [001]    0.878465358986: funcgraph_entry:        1.162 us   |            __filemap_fdatawrite_range();
           <...>-3212  [001]    0.878465360421: funcgraph_entry:        1.384 us   |            __filemap_fdatawait_range();
           <...>-3212  [001]    0.878465361984: funcgraph_entry:        0.298 us   |            filemap_check_errors();
           <...>-3212  [001]    0.878465362405: funcgraph_exit:         3.856 us   |          }
           <...>-3212  [001]    0.878465362770: funcgraph_entry:        0.187 us   |          mutex_unlock();
           <...>-3212  [001]    0.878465363188: funcgraph_exit:       # 29126.589 us |        }
...

fsfreeze -f

fsfreeze -f calls freeze_super (and the underlying file system's freeze functions). We might expect to see a call to ioctl_fsfreeze in this trace, but the compiler appears to have inlined it (making it invisible to us).

[root@eweber-engine-test eweber]# trace-cmd stream -g do_vfs_ioctl --max-graph-depth 3 -p function_graph fsfreeze -f /mnt/test
  plugin 'function_graph'
           <...>-5086  [002]    0.1353871399735: funcgraph_entry:                   |  do_vfs_ioctl() {
           <...>-5086  [002]    0.1353871401279: funcgraph_entry:                   |    ns_capable() {
           <...>-5086  [002]    0.1353871401703: funcgraph_entry:        1.898 us   |      security_capable();
           <...>-5086  [002]    0.1353871403894: funcgraph_exit:         2.704 us   |    }
           <...>-5086  [002]    0.1353871404302: funcgraph_entry:                   |    freeze_super() {
           <...>-5086  [002]    0.1353871404575: funcgraph_entry:        0.355 us   |      down_write();
           <...>-5086  [002]    0.1353871405103: funcgraph_entry:        0.179 us   |      up_write();
           <...>-5086  [002]    0.1353871405475: funcgraph_entry:      # 8157.458 us |      percpu_down_write();
           <...>-5086  [002]    0.1353879564566: funcgraph_entry:        0.752 us   |      down_write();
           <...>-5086  [002]    0.1353879565499: funcgraph_entry:      # 5965.246 us |      percpu_down_write();
           <...>-5086  [002]    0.1353885532231: funcgraph_entry:      + 22.944 us  |      sync_filesystem();
           <...>-5086  [002]    0.1353885555723: funcgraph_entry:      # 6011.502 us |      percpu_down_write();
           <...>-5086  [002]    0.1353891569581: funcgraph_entry:      # 3429.802 us |      ext4_freeze();
           <...>-5086  [002]    0.1353895001075: funcgraph_entry:        0.333 us   |      up_write();
           <...>-5086  [002]    0.1353895001564: funcgraph_exit:       # 23597.295 us |    }
           <...>-5086  [002]    0.1353895001814: funcgraph_exit:       # 23603.238 us |  }

Why does ext4 always fail on Rocky (with both dmsetup suspend and fsfreeze) but always succeed on Ubuntu (old)?

The Ubuntu (old) version of freeze_super has this patch, but the Rocky version does not: https://www.diffchecker.com/9APEhdvg/.

Using trace-cmd, we see the following:

  • In the Rocky system, after sync_filesystem, percpu_down_write and ext4_freeze are called. This completes the freeze.
  • In the Ubuntu (old) system, after sync_filesystem, percpu_up_write is called and no file system freeze functions are called. This is the work of the patch. Anything freeze_super did is undone and freeze_super returns.

https://www.diffchecker.com/asMoaRPn/
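
The comparisons linked on this page were made with an online diff tool. A similar comparison can be made locally by reducing each function_graph trace to its sequence of function names and diffing the results. This is a minimal sketch: the helper name trace_functions and the trace file names are hypothetical, and it assumes the trace-cmd output format shown in the Evidence section.

```shell
# Hypothetical helper: print the kernel function names from trace-cmd
# function_graph output on stdin, one per line, in call order.
trace_functions() {
    awk -F'|' 'NF > 1 {
        call = $2
        gsub(/^[ \t]+/, "", call)                  # drop the call-depth indentation
        if (call ~ /^[A-Za-z_][A-Za-z0-9_]*\(/) {  # skip closing braces and "..."
            sub(/\(.*/, "", call)                  # keep only the function name
            print call
        }
    }'
}

# Example (bash; rocky.trace and ubuntu.trace are saved trace-cmd outputs):
#   diff <(trace_functions < rocky.trace) <(trace_functions < ubuntu.trace)
```

Dropping the timestamps and durations keeps the diff focused on which functions were called, which is the difference that matters here.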

So both dmsetup suspend and fsfreeze -f FAIL to freeze an ext4 file system on Ubuntu (old) when the block device is broken (leaving it unfrozen and unstuck, which is good). They both SUCCEED in freezing an ext4 file system on Rocky when the block device is broken.

Why does xfs always fail when fsfreeze is used?

In both environments, freeze_super actually succeeds in freezing the xfs file system when the block device is gone.

https://www.diffchecker.com/qGPCnkI6/

This is a bit surprising, and we don't have a perfect answer for why. Some ideas:

  • Rocky is missing this patch and Ubuntu (old) is missing this patch. Perhaps if either distribution had both, the behavior would be different? (TODO: Test with a maximally updated kernel.)
  • ext4 remounts read only when I/O errors occur, but xfs does not. The Rocky system already ignores sync_filesystem errors, but perhaps __sync_blockdev does not return an error here while the file system is mounted read/write?

ext4 remounts read-only when I/O errors occur, but xfs does not. An strace shows that, in the ext4 case, fsfreeze -u succeeds in opening the mount point read-only (openat(AT_FDCWD, "/mnt/test", O_RDONLY) = 3) and then executes the ioctl(3, FITHAW) system call as expected. (Whether or not this system call can succeed is the topic of the previous question.) On the other hand, in the xfs case, fsfreeze -u fails to execute openat(AT_FDCWD, "/mnt/test", O_RDONLY) and never makes the system call to thaw the file system. This system call WOULD be successful, as it is when dmsetup resume calls the same underlying code (see the next question).

https://www.diffchecker.com/4LmHO5xP/

These observations do not explain why we cannot simply unfreeze the erroneously frozen ext4 file system on Rocky. In fact, both fsfreeze -u and dmsetup resume return successfully (indicating that the file system is unfrozen).

https://www.diffchecker.com/BwieIIyH/

However, comparing this to a thaw_super call for an ext4 file system with an unbroken block device makes the issue clear.

https://www.diffchecker.com/XnfFK0Ap/

ext4_unfreeze and percpu_up_write are not called in the broken block device case. The percpu_up_write call, specifically, is necessary to unblock workload processes in uninterruptible sleep. They are stuck in percpu_rwsem_wait and MUST have the thaw code release the sb->s_writers.rw_sem semaphore to wake up. Perhaps it is a kernel bug.

In all relevant versions of the kernel (Rocky, Ubuntu old, and the kernel master branch), the necessary code is not called if the file system superblock is read-only:

  • Rocky jumps over the call to sb_freeze_unlock.
static int thaw_super_locked(struct super_block *sb)
{
	int error;

	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
		up_write(&sb->s_umount);
		return -EINVAL;
	}

	if (sb_rdonly(sb)) {
		sb->s_writers.frozen = SB_UNFROZEN;
		goto out;
	}

	lockdep_sb_freeze_acquire(sb);

	if (sb->s_op->unfreeze_fs) {
		error = sb->s_op->unfreeze_fs(sb);
		if (error) {
			printk(KERN_ERR
				"VFS:Filesystem thaw failed\n");
			lockdep_sb_freeze_release(sb);
			up_write(&sb->s_umount);
			return error;
		}
	}

	sb->s_writers.frozen = SB_UNFROZEN;
	sb_freeze_unlock(sb);
out:
	wake_up(&sb->s_writers.wait_unfrozen);
	deactivate_locked_super(sb);
	return 0;
}

static void sb_freeze_unlock(struct super_block *sb)
{
	int level;

	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
		percpu_up_write(sb->s_writers.rw_sem + level);
}

In all relevant versions of the kernel (Rocky, Ubuntu old, and the kernel master branch), ext4 remounts itself read-only before the unfreeze call by the following mechanism:

  • After the two minute iSCSI timeout, I/O errors occur.
[Wed Apr 17 15:14:10 2024]  connection3:0: detected conn error (1020)
[Wed Apr 17 15:16:12 2024]  session3: session recovery timed out after 120 secs
[Wed Apr 17 15:16:12 2024] sd 3:0:0:1: rejecting I/O to offline device
[Wed Apr 17 15:16:12 2024] blk_print_req_error: 12 callbacks suppressed
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5673984 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5694112 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5687808 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5685248 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5690368 op 0x1:(WRITE) flags 0x0 phys_seg 148 prio class 2
[Wed Apr 17 15:16:12 2024] EXT4-fs warning (device sda): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 711296)
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5679104 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5701632 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5699232 op 0x1:(WRITE) flags 0x0 phys_seg 300 prio class 2
[Wed Apr 17 15:16:12 2024] EXT4-fs warning (device sda): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 712591)
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5704192 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] I/O error, dev sda, sector 5682672 op 0x1:(WRITE) flags 0x4000 phys_seg 320 prio class 2
[Wed Apr 17 15:16:12 2024] EXT4-fs warning (device sda): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 710654)
[Wed Apr 17 15:16:12 2024] EXT4-fs warning (device sda): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 710208)
[Wed Apr 17 15:16:12 2024] EXT4-fs warning (device sda): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 712404)
[Wed Apr 17 15:16:12 2024] buffer_io_error: 4086 callbacks suppressed
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708608
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708609
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708610
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708611
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708612
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708613
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708614
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708615
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708616
[Wed Apr 17 15:16:12 2024] Buffer I/O error on device sda, logical block 708617
[Wed Apr 17 15:16:12 2024] EXT4-fs warning (device sda): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 714624)
[Wed Apr 17 15:16:12 2024] EXT4-fs warning (device sda): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 714637)
[Wed Apr 17 15:16:12 2024] Aborting journal on device sda-8.
[Wed Apr 17 15:16:12 2024] buffer_io_error: 8 callbacks suppressed
[Wed Apr 17 15:16:12 2024] Buffer I/O error on dev sda, logical block 1081344, lost sync page write
[Wed Apr 17 15:16:12 2024] JBD2: I/O error when updating journal superblock for sda-8.
[Wed Apr 17 12:58:59 2024] EXT4-fs error (device dm-0): ext4_check_bdev_write_error:217: comm kworker/u8:3: Error while async write back metadata
[Wed Apr 17 12:58:59 2024] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[Wed Apr 17 12:58:59 2024] EXT4-fs (dm-0): I/O error while writing superblock
[Wed Apr 17 12:58:59 2024] EXT4-fs (dm-0): Delayed block allocation failed for inode 12 at logical offset 2009088 with max blocks 2048 with error 30
[Wed Apr 17 12:58:59 2024] EXT4-fs (dm-0): This should not happen!! Data will be lost

[Wed Apr 17 12:58:59 2024] EXT4-fs error (device dm-0) in ext4_writepages:2848: Journal has aborted
[Wed Apr 17 12:58:59 2024] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[Wed Apr 17 12:58:59 2024] EXT4-fs (dm-0): I/O error while writing superblock
[Wed Apr 17 12:58:59 2024] EXT4-fs error (device dm-0): ext4_journal_check_start:83: comm kworker/u8:0: Detected aborted journal
[Wed Apr 17 12:58:59 2024] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[Wed Apr 17 12:58:59 2024] EXT4-fs (dm-0): I/O error while writing superblock
[Wed Apr 17 12:58:59 2024] EXT4-fs (dm-0): Remounting filesystem read-only
  • The "Detected aborted journal" error is handled with a call to ext4_abort, which calls __ext4_error with force_ro == true. This calls ext4_handle_error with force_ro == true, resulting in a read-only remount of the file system.
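
Because thaw_super_locked skips sb_freeze_unlock when the superblock is read-only, it can be useful to check whether a file system has already been remounted read-only before relying on a thaw. This is a minimal sketch that inspects mount options in /proc/mounts format. The helper name mount_is_readonly is hypothetical. A "ro" option can also come from a deliberately read-only mount, so the result is a hint and not proof that the error path above was taken.

```shell
# Hypothetical helper: succeed if MOUNTPOINT is listed with the "ro" option.
# Reads mount entries in /proc/mounts format on stdin.
mount_is_readonly() {
    # $1: mount point
    awk -v mp="$1" '
        $2 == mp { opts = $4 }            # later entries shadow earlier ones
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") exit 0
            exit 1
        }'
}

# Example:
#   mount_is_readonly /mnt/test < /proc/mounts && echo "/mnt/test is read-only"
```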

Why does xfs always succeed when dmsetup suspend is used?

The thaw_super code has no problem unfreezing a frozen xfs file system when its block device is broken in either environment.

# Truncated and edited.
[root@eweber-engine-test eweber]# trace-cmd stream -g dm_resume --max-graph-depth 7 -p function_graph dmsetup resume /dev/mapper/test
  plugin 'function_graph'
           <...>-30399 [002]    0.53441717859251: funcgraph_entry:                   |  dm_resume() {
                                                                                        ...
           <...>-30399 [002]    0.53441717897710: funcgraph_entry:                   |    thaw_bdev() {
                                                                                          ...
           <...>-30399 [002]    0.53441717920684: funcgraph_entry:                   |      thaw_super() {
           <...>-30399 [002]    0.53441717920881: funcgraph_entry:                   |        down_write() {
           <...>-30399 [002]    0.53441717921058: funcgraph_entry:        0.181 us   |          __cond_resched();
           <...>-30399 [002]    0.53441717921532: funcgraph_exit:         0.683 us   |        }
           <...>-30399 [002]    0.53441717921727: funcgraph_entry:                   |        thaw_super_locked() {
           <...>-30399 [002]    0.53441717922171: funcgraph_entry:                   |          xfs_fs_unfreeze() {
           <...>-30399 [002]    0.53441717922435: funcgraph_entry:                   |            xfs_restore_resvblks() {
           <...>-30399 [002]    0.53441717922974: funcgraph_entry:        2.417 us   |              xfs_reserve_blocks();
           <...>-30399 [002]    0.53441717925559: funcgraph_exit:         3.151 us   |            }
           <...>-30399 [002]    0.53441717925891: funcgraph_entry:                   |            xfs_log_work_queue() {
           <...>-30399 [002]    0.53441717926281: funcgraph_entry:        0.196 us   |              __msecs_to_jiffies();
           <...>-30399 [002]    0.53441717926710: funcgraph_entry:        2.871 us   |              queue_delayed_work_on();
           <...>-30399 [002]    0.53441717929681: funcgraph_exit:         3.835 us   |            }
           <...>-30399 [002]    0.53441717930115: funcgraph_entry:        0.188 us   |            xfs_blockgc_start();
           <...>-30399 [002]    0.53441717930537: funcgraph_entry:        0.206 us   |            xfs_inodegc_start();
           <...>-30399 [002]    0.53441717930855: funcgraph_exit:         8.717 us   |          }
           <...>-30399 [002]    0.53441717932076: funcgraph_entry:                   |          percpu_up_write()
           <...>-30399 [002]    0.53441717958812: funcgraph_entry:                   |          percpu_up_write()
           <...>-30399 [002]    0.53441717961854: funcgraph_entry:                   |          percpu_up_write()
           <...>-30399 [002]    0.53441717975566: funcgraph_entry:                   |          __wake_up()
           <...>-30399 [002]    0.53441717977372: funcgraph_entry:                   |          deactivate_locked_super()
           <...>-30399 [002]    0.53441717978129: funcgraph_exit:       + 56.436 us  |        }
           <...>-30399 [002]    0.53441717978369: funcgraph_exit:       + 57.736 us  |      }
           <...>-30399 [002]    0.53441717978644: funcgraph_entry:        0.176 us   |      mutex_unlock();
           <...>-30399 [002]    0.53441717978948: funcgraph_exit:       + 81.270 us  |    }
           <...>-30399 [002]    0.53441717979196: funcgraph_entry:        0.205 us   |    mutex_unlock();
           <...>-30399 [002]    0.53441717979566: funcgraph_exit:       ! 121.879 us |  }

The dmsetup userspace utility has no need to open the mount point, since it operates on the Device Mapper block device. It can therefore reach the thaw_super code when the fsfreeze utility cannot (see the question above).

It stands to reason that, even if dmsetup suspend can erroneously freeze an ext4 file system with a broken block device on Rocky, dmsetup resume should be able to unfreeze it.

Why do all tests pass for Ubuntu (new)?

As mentioned previously, Rocky and Ubuntu (old) are each missing different patches from https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=return-sync-fs-errors-5.17.

  • Because Rocky is missing this patch, it is not able to abort out of ANY freeze_super call as a result of sync_filesystem errors.
  • Because Ubuntu (old) is missing this patch, it is not able to report xfs-specific errors up through sync_filesystem to freeze_super. As a result, it is not able to abort out of freeze_super calls when these xfs-specific errors occur.

All four patches hit the kernel in v5.17. Since Ubuntu (new) runs v6.5.0.9-generic, it benefits. In all four file system and command combinations we tested, freeze_super correctly fails during sync_filesystem. Actual file system freeze code is not run, and the file system remains unfrozen, allowing for cleanup.
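As a rough pre-check, the v5.17 threshold above can be compared against the running kernel. This is a minimal sketch, not something Longhorn does: version_ge is a hypothetical helper name, it assumes GNU sort -V is available, and it only looks at the upstream part of the release string. Because distribution kernels may carry backports (or lack them), the result is a heuristic only.

```shell
# Hypothetical helper: succeeds when version "$1" is >= version "$2".
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare only the upstream part of the running kernel's release string
# (everything before the first "-").
running="$(uname -r | cut -d- -f1)"
if version_ge "$running" "5.17"; then
    echo "kernel $running: >= 5.17, freeze_super is expected to abort on sync errors"
else
    echo "kernel $running: < 5.17, may be affected unless the patches were backported"
fi
```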

# ext4, fsfreeze

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# trace-cmd record -g freeze_super --max-graph-depth 4 -p function_graph -F fsfreeze -f /mnt/test
  plugin 'function_graph'
fsfreeze: /mnt/test: freeze failed: Input/output error

        fsfreeze-27697 [002]  2199.131288: funcgraph_entry:                   |      sync_blockdev() {
        fsfreeze-27697 [002]  2199.131288: funcgraph_entry:        3.780 us   |        filemap_write_and_wait_range();
        fsfreeze-27697 [002]  2199.131292: funcgraph_exit:         4.545 us   |      }
        fsfreeze-27697 [002]  2199.131293: funcgraph_exit:       $ 124998814 us |    }

# sync_blockdev fails, so the `percpu_sem` is released instead of freeze code being executed.

        fsfreeze-27697 [002]  2199.131293: funcgraph_entry:                   |    percpu_up_write() {
        fsfreeze-27697 [002]  2199.131294: funcgraph_entry:                   |      __wake_up() {
        fsfreeze-27697 [002]  2199.131294: funcgraph_entry:        0.816 us   |        __wake_up_common_lock();
        fsfreeze-27697 [002]  2199.131295: funcgraph_exit:         1.509 us   |      }
        fsfreeze-27697 [002]  2199.131296: funcgraph_entry:                   |      rcu_sync_exit() {
        fsfreeze-27697 [002]  2199.131296: funcgraph_entry:        0.372 us   |        _raw_spin_lock_irq();
        fsfreeze-27697 [002]  2199.131297: funcgraph_entry:        1.554 us   |        call_rcu();
        fsfreeze-27697 [002]  2199.131299: funcgraph_entry:        0.355 us   |        _raw_spin_unlock_irq();
        fsfreeze-27697 [002]  2199.131300: funcgraph_exit:         3.610 us   |      }
        fsfreeze-27697 [002]  2199.131300: funcgraph_exit:         6.381 us   |    }
        fsfreeze-27697 [002]  2199.131300: funcgraph_entry:                   |    percpu_up_write() {
        fsfreeze-27697 [002]  2199.131301: funcgraph_entry:                   |      __wake_up() {
        fsfreeze-27697 [002]  2199.131301: funcgraph_entry:      + 15.984 us  |        __wake_up_common_lock();
        fsfreeze-27697 [002]  2199.131318: funcgraph_exit:       + 16.704 us  |      }
        fsfreeze-27697 [002]  2199.131318: funcgraph_entry:                   |      rcu_sync_exit() {
        fsfreeze-27697 [002]  2199.131318: funcgraph_entry:        0.377 us   |        _raw_spin_lock_irq();
        fsfreeze-27697 [002]  2199.131319: funcgraph_entry:        0.743 us   |        call_rcu();
        fsfreeze-27697 [002]  2199.131320: funcgraph_entry:        0.335 us   |        _raw_spin_unlock_irq();
        fsfreeze-27697 [002]  2199.131321: funcgraph_exit:         2.684 us   |      }
        fsfreeze-27697 [002]  2199.131321: funcgraph_exit:       + 20.554 us  |    }

# xfs, fsfreeze

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# trace-cmd record -g freeze_super --max-graph-depth 4 -p function_graph -F fsfreeze -f /mnt/test
  plugin 'function_graph'
fsfreeze: /mnt/test: freeze failed: Input/output error

        fsfreeze-34581 [001]  2555.782108: funcgraph_entry:                   |      xfs_fs_sync_fs() {
        fsfreeze-34581 [001]  2555.782109: funcgraph_entry:        6.537 us   |        xfs_log_force();
        fsfreeze-34581 [001]  2555.782116: funcgraph_exit:         7.497 us   |      }
        fsfreeze-34581 [001]  2555.782116: funcgraph_exit:       $ 125695011 us |    }

# xfs_fs_sync_fs fails, so the `percpu_sem` is released instead of freeze code being executed.

        fsfreeze-34581 [001]  2555.782117: funcgraph_entry:                   |    percpu_up_write() {
        fsfreeze-34581 [001]  2555.782117: funcgraph_entry:                   |      __wake_up() {
        fsfreeze-34581 [001]  2555.782117: funcgraph_entry:        1.053 us   |        __wake_up_common_lock();
        fsfreeze-34581 [001]  2555.782119: funcgraph_exit:         1.629 us   |      }
        fsfreeze-34581 [001]  2555.782119: funcgraph_entry:                   |      rcu_sync_exit() {
        fsfreeze-34581 [001]  2555.782120: funcgraph_entry:        0.287 us   |        _raw_spin_lock_irq();
        fsfreeze-34581 [001]  2555.782121: funcgraph_entry:        1.246 us   |        call_rcu();
        fsfreeze-34581 [001]  2555.782122: funcgraph_entry:        0.399 us   |        _raw_spin_unlock_irq();
        fsfreeze-34581 [001]  2555.782123: funcgraph_exit:         3.402 us   |      }
        fsfreeze-34581 [001]  2555.782123: funcgraph_exit:         6.315 us   |    }
        fsfreeze-34581 [001]  2555.782123: funcgraph_entry:                   |    percpu_up_write() {
        fsfreeze-34581 [001]  2555.782124: funcgraph_entry:                   |      __wake_up() {
        fsfreeze-34581 [001]  2555.782124: funcgraph_entry:      + 17.528 us  |        __wake_up_common_lock();
        fsfreeze-34581 [001]  2555.782142: funcgraph_exit:       + 18.121 us  |      }
        fsfreeze-34581 [001]  2555.782142: funcgraph_entry:                   |      rcu_sync_exit() {
        fsfreeze-34581 [001]  2555.782142: funcgraph_entry:        0.246 us   |        _raw_spin_lock_irq();
        fsfreeze-34581 [001]  2555.782143: funcgraph_entry:        0.497 us   |        call_rcu();
        fsfreeze-34581 [001]  2555.782143: funcgraph_entry:        0.227 us   |        _raw_spin_unlock_irq();
        fsfreeze-34581 [001]  2555.782144: funcgraph_exit:         1.779 us   |      }
        fsfreeze-34581 [001]  2555.782144: funcgraph_exit:       + 20.760 us  |    }

# ext4, dmsetup suspend

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# trace-cmd record -g dm_suspend --max-graph-depth 7 -p function_graph -F dmsetup suspend /dev/mapper/test
  plugin 'function_graph'
device-mapper: suspend ioctl on test  failed: Input/output error
Command failed.

         dmsetup-649130 [000]  4374.119724: funcgraph_entry:                   |            sync_blockdev() {
         dmsetup-649130 [000]  4374.119724: funcgraph_entry:        3.833 us   |              filemap_write_and_wait_range();
         dmsetup-649130 [000]  4374.119728: funcgraph_exit:         4.496 us   |            }
         dmsetup-649130 [000]  4374.119728: funcgraph_exit:       $ 122709683 us |          }

# sync_blockdev fails, so the `percpu_sem` is released instead of freeze code being executed.

         dmsetup-649130 [000]  4374.119729: funcgraph_entry:                   |          percpu_up_write() {
         dmsetup-649130 [000]  4374.119730: funcgraph_entry:                   |            __wake_up() {
         dmsetup-649130 [000]  4374.119730: funcgraph_entry:        1.086 us   |              __wake_up_common_lock();
         dmsetup-649130 [000]  4374.119731: funcgraph_exit:         1.721 us   |            }
         dmsetup-649130 [000]  4374.119732: funcgraph_entry:                   |            rcu_sync_exit() {
         dmsetup-649130 [000]  4374.119732: funcgraph_entry:        0.383 us   |              _raw_spin_lock_irq();
         dmsetup-649130 [000]  4374.119733: funcgraph_entry:        1.585 us   |              call_rcu();
         dmsetup-649130 [000]  4374.119735: funcgraph_entry:        0.362 us   |              _raw_spin_unlock_irq();
         dmsetup-649130 [000]  4374.119735: funcgraph_exit:         3.793 us   |            }
         dmsetup-649130 [000]  4374.119736: funcgraph_exit:         6.728 us   |          }
         dmsetup-649130 [000]  4374.119736: funcgraph_entry:                   |          percpu_up_write() {
         dmsetup-649130 [000]  4374.119737: funcgraph_entry:                   |            __wake_up() {
         dmsetup-649130 [000]  4374.119737: funcgraph_entry:      + 10.881 us  |              __wake_up_common_lock();
         dmsetup-649130 [000]  4374.119748: funcgraph_exit:       + 11.660 us  |            }
         dmsetup-649130 [000]  4374.119749: funcgraph_entry:                   |            rcu_sync_exit() {
         dmsetup-649130 [000]  4374.119749: funcgraph_entry:        0.748 us   |              _raw_spin_lock_irq();
         dmsetup-649130 [000]  4374.119750: funcgraph_entry:        0.719 us   |              call_rcu();
         dmsetup-649130 [000]  4374.119751: funcgraph_entry:        0.311 us   |              _raw_spin_unlock_irq();
         dmsetup-649130 [000]  4374.119752: funcgraph_exit:         3.053 us   |            }
         dmsetup-649130 [000]  4374.119752: funcgraph_exit:       + 15.741 us  |          }

# xfs, dmsetup suspend

root@ubuntu-s-4vcpu-8gb-sfo3-01:~# trace-cmd record -g dm_suspend --max-graph-depth 7 -p function_graph -F dmsetup suspend /dev/mapper/test
  plugin 'function_graph'
device-mapper: suspend ioctl on test  failed: Input/output error

         dmsetup-1231464 [002] 69949.597824: funcgraph_entry:                   |            xfs_fs_sync_fs() {
         dmsetup-1231464 [002] 69949.597825: funcgraph_entry:        7.674 us   |              xfs_log_force();
         dmsetup-1231464 [002] 69949.597833: funcgraph_exit:         8.694 us   |            }
         dmsetup-1231464 [002] 69949.597833: funcgraph_exit:       $ 131845458 us |          }

# xfs_fs_sync_fs fails, so the `percpu_sem` is released instead of freeze code being executed.

         dmsetup-1231464 [002] 69949.597834: funcgraph_entry:                   |          percpu_up_write() {
         dmsetup-1231464 [002] 69949.597835: funcgraph_entry:                   |            __wake_up() {
         dmsetup-1231464 [002] 69949.597835: funcgraph_entry:        0.762 us   |              __wake_up_common_lock();
         dmsetup-1231464 [002] 69949.597836: funcgraph_exit:         1.258 us   |            }
         dmsetup-1231464 [002] 69949.597836: funcgraph_entry:                   |            rcu_sync_exit() {
         dmsetup-1231464 [002] 69949.597837: funcgraph_entry:        0.284 us   |              _raw_spin_lock_irq();
         dmsetup-1231464 [002] 69949.597838: funcgraph_entry:        1.711 us   |              call_rcu();
         dmsetup-1231464 [002] 69949.597840: funcgraph_entry:        0.275 us   |              _raw_spin_unlock_irq();
         dmsetup-1231464 [002] 69949.597840: funcgraph_exit:         3.813 us   |            }
         dmsetup-1231464 [002] 69949.597840: funcgraph_exit:         6.184 us   |          }
         dmsetup-1231464 [002] 69949.597841: funcgraph_entry:                   |          percpu_up_write() {
         dmsetup-1231464 [002] 69949.597841: funcgraph_entry:                   |            __wake_up() {
         dmsetup-1231464 [002] 69949.597841: funcgraph_entry:      + 14.798 us  |              __wake_up_common_lock();
         dmsetup-1231464 [002] 69949.597856: funcgraph_exit:       + 15.344 us  |            }
         dmsetup-1231464 [002] 69949.597857: funcgraph_entry:                   |            rcu_sync_exit() {
         dmsetup-1231464 [002] 69949.597857: funcgraph_entry:        0.271 us   |              _raw_spin_lock_irq();
         dmsetup-1231464 [002] 69949.597858: funcgraph_entry:        0.557 us   |              call_rcu();
         dmsetup-1231464 [002] 69949.597858: funcgraph_entry:        0.268 us   |              _raw_spin_unlock_irq();
         dmsetup-1231464 [002] 69949.597859: funcgraph_exit:         2.166 us   |            }
         dmsetup-1231464 [002] 69949.597859: funcgraph_exit:       + 18.497 us  |          }
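In all four traces above, the exit line marked with "$" shows the call taking roughly two minutes before it fails (function_graph prefixes durations over one second with "$"). The following is a minimal sketch of a filter that pulls such lines out of a longer trace. slow_calls is a hypothetical helper name, and the parsing assumes the duration is the field immediately before the "us" token, as in the output above.

```shell
# Hypothetical filter: read function_graph output on stdin and print the
# duration (in seconds) and task for every line whose duration is at least
# $1 microseconds.
slow_calls() {
    awk -v min="$1" '{
        for (i = 2; i <= NF; i++)
            if ($i == "us" && $(i - 1) + 0 >= min + 0) {
                printf "%.1f s %s\n", $(i - 1) / 1000000, $1
                break
            }
    }'
}

# Example use (trace.txt is a placeholder for saved trace output):
#   slow_calls 1000000 < trace.txt
```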

Other investigation questions

Does dmsetup suspend freeze all file systems on a block device?

Quick answer

No. dmsetup suspend only freezes a file system if it is on the "root partition". If a Longhorn volume is partitioned and dmsetup suspend is aimed at the volume itself (e.g. /dev/mapper/volume or /dev/dm-0), dmsetup suspend does not freeze file systems on the partitions.

If dmsetup suspend targets a device that contains partitions:

  • The device is protected from any further I/O. Processes attempting to write to it or any of its partitions become stuck in uninterruptible sleep.
  • None of the file systems on its partitions are synced. No data in the dirty page cache is written down, so a backup of the device is not file-system consistent for any of the contained file systems.
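Given the behavior described above, it may be worth checking whether a device has mounted child devices before suspending it. The following is a minimal sketch under stated assumptions: mounted_children is a hypothetical helper name, and it assumes the input comes from lsblk -rno NAME,MOUNTPOINT, where the first line is the device itself and later lines are its children.

```shell
# Hypothetical pre-check: read `lsblk -rno NAME,MOUNTPOINT <device>` output on
# stdin and print every child device (any line after the first) that is mounted.
mounted_children() {
    awk 'NR > 1 && $2 != "" { print $1 " mounted at " $2 }'
}

# Example use against the device from this page (requires the device to exist):
#   lsblk -rno NAME,MOUNTPOINT /dev/mapper/test | mounted_children
```

Any output would suggest that dmsetup suspend aimed at the parent device will block I/O without freezing those file systems.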

Test methodology

  1. Create a simple Longhorn volume with one replica and one engine inside a container.
# Use docker.
docker run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:master-head launch-simple-longhorn test 10g tgt-blockdev

# Or, use nerdctl.
nerdctl run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:master-head launch-simple-longhorn test 10g tgt-blockdev
  1. Create a Device Mapper linear device on top of the Longhorn volume.
dmsetup create test --table "0 $(blockdev --getsz /dev/longhorn/test) linear /dev/longhorn/test 0"
  1. Create two partitions on the volume.
[root@eweber-engine-test eweber]# fdisk -l /dev/mapper/test
Disk /dev/mapper/test: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xf309a5c4

Device            Boot   Start      End Sectors Size Id Type
/dev/mapper/test1         2048  8390655 8388608   4G 83 Linux
/dev/mapper/test2      8390656 16779263 8388608   4G 83 Linux

[root@eweber-engine-test eweber]# kpartx -a /dev/mapper/test 
  1. Create an ext4 file system on one partition and an xfs file system on the other.
  2. Mount the partitions to /mnt/partition-1 and /mnt/partition-2.
[root@eweber-engine-test eweber]# mount /dev/mapper/test1 /mnt/partition-1
[root@eweber-engine-test eweber]# mount /dev/mapper/test2 /mnt/partition-2
  1. Use dd on a loop to keep up a stream of constant writes to a file on one of the file systems.
while true; do dd of=/mnt/test/file if=/dev/urandom bs=16M; done
  1. Monitor the dirty page cache. While dd is running, there may be as much as 1 GiB of data in it.
while true; do cat /proc/meminfo | grep -i dirty; sleep 0.5; done
  1. Use trace-cmd as a frontend to the kernel's builtin ftrace tracer to monitor kernel function calls as either command executes.
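For scripting, the dirty page cache monitoring from the steps above can be reduced to a single number. This is a minimal sketch: dirty_kb is a hypothetical helper name, and the optional file argument exists only so the function can be exercised against a saved copy of /proc/meminfo.

```shell
# Hypothetical helper: print the "Dirty:" value (in kB) from a meminfo-format
# file. Defaults to /proc/meminfo.
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}

# Sample twice per second, as in the loop above (Ctrl-C to stop):
#   while true; do echo "Dirty: $(dirty_kb) kB"; sleep 0.5; done
```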

Results

Suspend the overall device

dmsetup suspend does not find an active superblock to freeze. If it did, we would see a call to freeze_super after get_active_super.

[root@eweber-engine-test eweber]# trace-cmd stream -g freeze_bdev --max-graph-depth 2 -p function_graph dmsetup suspend /dev/mapper/test
  plugin 'function_graph'
           <...>-48125 [000]    0.17713285524674: funcgraph_entry:                   |  freeze_bdev() {
           <...>-48125 [000]    0.17713285525945: funcgraph_entry:        0.642 us   |    mutex_lock();
           <...>-48125 [000]    0.17713285527105: funcgraph_entry:        9.421 us   |    get_active_super();
           <...>-48125 [000]    0.17713285536857: funcgraph_entry:        3.139 us   |    filemap_write_and_wait_range();
           <...>-48125 [000]    0.17713285540231: funcgraph_entry:        0.216 us   |    mutex_unlock();
           <...>-48125 [000]    0.17713285540667: funcgraph_exit:       + 17.550 us  |  }

dmsetup suspend does not empty the dirty page cache (likely because that is the responsibility of the freeze_super code). The device is frozen (it will accept no more I/O), but the file systems on its partitions are not consistent on disk. Processes using a file system on one of the partitions are stuck in different code locations than they usually are in response to a freeze.

[root@eweber-engine-test eweber]# while true; do cat /proc/meminfo | grep -i dirty; sleep 0.05; done
Dirty:            691168 kB
Dirty:            691168 kB
Dirty:            691168 kB
Dirty:            691168 kB
...

[root@eweber-engine-test eweber]# while true; do ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | grep -e sync -e touch -e freeze -e dmsetup -e " dd"; sleep 1; done
  43717    2558   43717 root     D+    5.2 dd              ext4_da_map_blocks.constprop.0
  46734   46682   46734 root     D+    0.0 touch           vfs_utimes

Suspend a partition of the device

On the other hand, dmsetup suspend DOES freeze a file system if directed at that file system's partition.

[root@eweber-engine-test eweber]# trace-cmd stream -g freeze_bdev --max-graph-depth 2 -p function_graph dmsetup suspend /dev/mapper/test1
  plugin 'function_graph'
           <...>-56194 [002]    0.20142737856666: funcgraph_entry:                   |  freeze_bdev() {
           <...>-56194 [002]    0.20142737858840: funcgraph_entry:        1.075 us   |    mutex_lock();
           <...>-56194 [002]    0.20142737860564: funcgraph_entry:      + 11.694 us  |    get_active_super();
           <...>-56194 [002]    0.20142737873097: funcgraph_entry:      # 41062.640 us |    freeze_super();
           <...>-56194 [002]    0.20142778939215: funcgraph_entry:        0.750 us   |    deactivate_super();
           <...>-56194 [002]    0.20142778940296: funcgraph_entry:        4.560 us   |    filemap_write_and_wait_range();
           <...>-56194 [002]    0.20142778945199: funcgraph_entry:        0.360 us   |    mutex_unlock();
           <...>-56194 [002]    0.20142778945802: funcgraph_exit:       # 41091.056 us |  }

dmsetup suspend empties the dirty page cache. Processes using a file system on the partition are stuck in the code location we have come to associate with a locked super block.

[root@eweber-engine-test eweber]# while true; do cat /proc/meminfo | grep -i dirty; sleep 0.05; done
Dirty:                 0 kB
Dirty:                 0 kB
Dirty:                 0 kB
Dirty:                 0 kB

[root@eweber-engine-test eweber]# while true; do ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | grep -e sync -e touch -e freeze -e dmsetup -e " dd"; sleep 1; done
  49589    2558   49589 root     D+   12.1 dd              percpu_rwsem_wait
  50713   46682   50713 root     D+    0.0 touch           percpu_rwsem_wait
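The two ps listings above differ only in the wait channel of the stuck processes. The following is a minimal sketch of a filter that picks out the processes waiting on a frozen superblock. blocked_on_freeze is a hypothetical helper name, and the column positions assume the exact ps -eo format used above (a command name containing spaces would shift the columns).

```shell
# Hypothetical filter for `ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32`
# output: print "PID COMM" for tasks in uninterruptible sleep (stat starts
# with D) whose wait channel is percpu_rwsem_wait.
blocked_on_freeze() {
    awk '$5 ~ /^D/ && $8 == "percpu_rwsem_wait" { print $1, $7 }'
}

# Example use:
#   ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | blocked_on_freeze
```

Fed the listing from the whole-device suspend, the filter prints nothing. Fed the listing from the partition suspend, it prints both dd and touch.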

Can we reduce danger by calling sync before dmsetup suspend or fsfreeze?

  • We only seem to be in "danger" while the freeze operation is ongoing (whether initiated by fsfreeze -f or dmsetup suspend), and
  • The vast majority of the freeze time is spent flushing dirty data from the page cache to disk.

It seems plausible that calling sync purposefully before calling fsfreeze -f or dmsetup suspend would reduce the size of the dirty page cache and reduce the "danger window".
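The idea can be sketched as a small wrapper. This is only an illustration of the sequence, not Longhorn's implementation: presync_freeze is a hypothetical name, and SYNC_CMD and FREEZE_CMD are illustrative overrides (not Longhorn settings) that allow the flow to be dry-run without root.

```shell
# Hypothetical wrapper: flush dirty pages first, then freeze the file system
# mounted at "$1". If the sync step fails, no freeze is attempted.
presync_freeze() {
    ${SYNC_CMD:-sync} || return 1
    ${FREEZE_CMD:-fsfreeze -f} "$1"
}

# Example use (requires root and a mounted file system):
#   presync_freeze /mnt/test
# For the dmsetup variant:
#   FREEZE_CMD="dmsetup suspend" presync_freeze /dev/mapper/test
```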

Quick answer

We can reduce the time it takes to call fsfreeze -f or dmsetup suspend by ~66% under a constant write load (from about three seconds to about one second). Results may vary by environment.

Results

In a limited number of tests using the dd invocation typical of this investigation:

fsfreeze -f takes an average of 3.79 s.

[root@eweber-engine-test eweber]# cat /proc/meminfo | grep -i dirty; time fsfreeze -f /mnt/test
Dirty:            794576 kB

real    0m4.174s
user    0m0.001s
sys     0m0.001s

Calling sync before fsfreeze -f reduces the average time to 1.09 s. (Note that the dirty page cache is not empty when fsfreeze -f is called. sync causes existing data to flush to disk, but does not prevent new data from being written.)

[root@eweber-engine-test eweber]# cat /proc/meminfo | grep -i dirty; sync; cat /proc/meminfo | grep -i dirty; time fsfreeze -f /mnt/test
Dirty:            663444 kB
Dirty:            294852 kB

real    0m1.125s
user    0m0.000s
sys     0m0.003s

dmsetup suspend takes an average of 3.10 s.

[root@eweber-engine-test eweber]# cat /proc/meminfo | grep -i dirty; time dmsetup suspend /dev/mapper/test 
Dirty:            663688 kB

real    0m3.109s
user    0m0.002s
sys     0m0.003s

Calling sync before dmsetup suspend reduces the average time to 1.03 s. (Note that the dirty page cache is not empty when dmsetup suspend is called. sync causes existing data to flush to disk, but does not prevent new data from being written.)

[root@eweber-engine-test eweber]# cat /proc/meminfo | grep -i dirty; sync; cat /proc/meminfo | grep -i dirty; time dmsetup suspend /dev/mapper/test
Dirty:            647144 kB
Dirty:            229332 kB

real    0m0.893s
user    0m0.001s
sys     0m0.002s
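The percentage reductions implied by the averages above can be worked out directly. This is a small arithmetic sketch; reduction_pct is a hypothetical helper name.

```shell
# Percentage reduction in freeze time: (before - after) / before * 100.
reduction_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

reduction_pct 3.79 1.09   # fsfreeze -f averages
reduction_pct 3.10 1.03   # dmsetup suspend averages
```

This prints 71.2 for fsfreeze -f and 66.8 for dmsetup suspend, so both commands finish in roughly a third of the original time when sync is called first.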

How do these methods impact block mode volumes?

fsfreeze

Quick answer

fsfreeze -f does NOT affect block volumes. The longhorn-engine runs in an instance-manager container (and thus in that container's mount namespace). If a workload consumes a Longhorn volume in block mode and mounts a file system it contains, the file system is mounted in the workload pod's mount namespace. The mount cannot be seen from the instance-manager's mount namespace, nor from the host's mount namespace. This prevents longhorn-engine from finding a mount point to run fsfreeze -f against.
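Whether two processes share a mount namespace can be checked from /proc. The following is a minimal sketch, assuming a Linux host with /proc mounted. same_mnt_ns is a hypothetical helper name, and inspecting processes owned by other users generally requires root.

```shell
# Hypothetical helper: succeed when two PIDs share a mount namespace.
# /proc/<pid>/ns/mnt is a symlink whose target identifies the namespace;
# equal link targets mean the same namespace. Returns 2 if a PID cannot
# be inspected.
same_mnt_ns() {
    a="$(readlink "/proc/$1/ns/mnt")" || return 2
    b="$(readlink "/proc/$2/ns/mnt")" || return 2
    [ "$a" = "$b" ]
}

# Example use (the PIDs are placeholders for an instance-manager process and
# a workload process):
#   same_mnt_ns "$im_pid" "$workload_pid" || echo "different mount namespaces"
```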

Evidence

Modify the block_volume.yaml example to include a privileged security context and deploy it in a cluster running Longhorn master-head.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-block-vol
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-volume-test
  namespace: default
spec:
  containers:
    - name: block-volume-test
      image: nginx:stable-alpine
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - devicePath: /dev/longhorn/testblk
          name: block-vol
      ports:
        - containerPort: 80
      securityContext:
        privileged: true
  volumes:
    - name: block-vol
      persistentVolumeClaim:
        claimName: longhorn-block-vol

Mount the block volume to a mount point inside its container.

eweber@laptop:~/longhorn> k exec block-volume-test -- mkdir /mnt/testblk

eweber@laptop:~/longhorn> k exec block-volume-test -- apk add e2fsprogs
(1/4) Installing libblkid (2.38.1-r1)
(2/4) Installing libcom_err (1.46.6-r0)
(3/4) Installing e2fsprogs-libs (1.46.6-r0)
(4/4) Installing e2fsprogs (1.46.6-r0)
Executing busybox-1.35.0-r29.trigger
OK: 45 MiB in 66 packages

eweber@laptop:~/longhorn> k exec block-volume-test -- mkfs.ext4 /dev/longhorn/testblk
mke2fs 1.46.6 (1-Feb-2023)
Discarding device blocks: done                            
Creating filesystem with 524288 4k blocks and 131072 inodes
Filesystem UUID: d9264251-08cc-4bef-98ad-9f687a189c40
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

eweber@laptop:~/longhorn> k exec block-volume-test -- mount /dev/longhorn/testblk /mnt/testblk

eweber@laptop:~/longhorn> k exec block-volume-test -- touch /mnt/testblk/file
# Successful.

eweber@laptop:~/longhorn> k exec block-volume-test -- mount | grep test
/dev/longhorn/testblk on /mnt/testblk type ext4 (rw,relatime)

Try to see the mount from the instance-manager container and the host. It is not possible.

eweber@laptop:~/longhorn> kl exec instance-manager-699da83c0e9d22726e667344227e096b -- mount | grep test
# No result.

eweber@laptop:~/longhorn> ssh root@143.198.232.225 mount | grep test
# No result.

dmsetup suspend

Quick answer

dmsetup suspend DOES affect block volumes. The DM_SUSPEND ioctl looks for file system superblocks in kernel space. It can successfully find and freeze the file system on a Longhorn block volume even if that file system is mounted in some other mount namespace (assuming the file system is on the root partition).

Evidence

We cannot currently do this with Longhorn in Kubernetes because we must create the Device Mapper linear device manually and expose it to the container.

Run a simple Longhorn volume with Docker.

docker run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:master-head launch-simple-longhorn test 10g tgt-blockdev

Create a Device Mapper linear device and bind mount it into a workload container.

[root@eweber-engine-test longhorn-engine]# dmsetup create test --table "0 $(blockdev --getsz /dev/longhorn/test) linear /dev/longhorn/test 0"
[root@eweber-engine-test longhorn-engine]# docker run -it -v /dev/mapper/test:/dev/longhorn/test --privileged --rm --name fake-vm ubuntu
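The table passed to dmsetup create above has the form "start_sector num_sectors linear device offset_sectors", with all sizes in 512-byte sectors (which is what blockdev --getsz reports). The following is a minimal sketch that builds the same table from a size in bytes; linear_table is a hypothetical helper name.

```shell
# Hypothetical helper: build a dm-linear table line that maps an entire
# device, given its size in bytes and its path.
linear_table() {
    size_bytes="$1"
    device="$2"
    echo "0 $((size_bytes / 512)) linear $device 0"
}

# A 10 GiB volume is 10737418240 bytes, i.e. 20971520 sectors of 512 bytes.
linear_table 10737418240 /dev/longhorn/test
```

This prints "0 20971520 linear /dev/longhorn/test 0".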

Inside the workload container, create and mount a file system.

root@a53f8945a38c:/# mkfs.ext4 /dev/longhorn/test 
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done                            
Creating filesystem with 2621440 4k blocks and 655360 inodes
Filesystem UUID: 4f676067-17d2-48ca-b59c-55c195e66e2e
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

root@a53f8945a38c:/# mkdir /mnt/test

root@a53f8945a38c:/# mount /dev/longhorn/test /mnt/test

root@a53f8945a38c:/# touch /mnt/test/file
# Successful.

Use trace-cmd to run dmsetup suspend on the host. Notice that it finds an ext4 file system and freezes it.

[root@eweber-engine-test longhorn-engine]# trace-cmd stream -g dm_suspend --max-graph-depth 7 -p function_graph -F dmsetup suspend /dev/mapper/test
  plugin 'function_graph'
           ...
           <...>-326357 [001]    0.95164195440664: funcgraph_entry:                   |            sync_filesystem() {
           <...>-326357 [001]    0.95164195441236: funcgraph_entry:        1.422 us   |              writeback_inodes_sb();
           <...>-326357 [001]    0.95164195443559: funcgraph_entry:        5.755 us   |              ext4_sync_fs();
           <...>-326357 [001]    0.95164195449713: funcgraph_entry:        2.117 us   |              sync_blockdev_nowait();
           <...>-326357 [001]    0.95164195452089: funcgraph_entry:        3.872 us   |              sync_inodes_sb();
           <...>-326357 [001]    0.95164195456225: funcgraph_entry:        8.699 us   |              ext4_sync_fs();
           <...>-326357 [001]    0.95164195465172: funcgraph_entry:        2.199 us   |              sync_blockdev();
           <...>-326357 [001]    0.95164195467599: funcgraph_exit:       + 26.951 us  |            }
           ...
           <...>-326357 [001]    0.95164205439711: funcgraph_entry:                   |            ext4_freeze() {
           <...>-326357 [001]    0.95164205440158: funcgraph_entry:        0.959 us   |              jbd2_journal_lock_updates();
           <...>-326357 [001]    0.95164205441577: funcgraph_entry:      # 5745.082 us |              jbd2_journal_flush();
           <...>-326357 [001]    0.95164211189715: funcgraph_entry:        0.429 us   |              ext4_orphan_file_empty();
           <...>-326357 [001]    0.95164211190195: funcgraph_entry:      # 1674.269 us |              ext4_commit_super();
           <...>-326357 [001]    0.95164212865624: funcgraph_entry:        1.410 us   |              jbd2_journal_unlock_updates();
           <...>-326357 [001]    0.95164212867121: funcgraph_exit:       # 7427.433 us |            }
           ...

The file system cannot be written to from the workload container. This isn't a very good test, though. Even if the file system weren't frozen, touch would still hang because I/O to the underlying block device is suspended. (Perhaps we should have looked to see if the touch was stuck in percpu_rwsem_wait.)

root@a53f8945a38c:/# touch /mnt/test/file
# Stuck.