ZFS direct write differs from same write through block device (block device write amplification) #16174

woodholly opened this issue May 7, 2024 · 3 comments
Labels: Type: Defect (incorrect behavior, e.g. crash, hang)

woodholly commented May 7, 2024

I was playing with ZFS and found something interesting: there is a big difference between writing to a ZFS file directly and writing to the same file mounted as a filesystem / block device.
Writing 2 GB directly leads to 2.8 GB of real writes to disk, 40% more writes; this is OK.
But writing 2 GB into the same file attached as a loop device leads to 4.2 GB of real writes to disk.
What kind of magic is at work here?

The same happens with zvols, and with files mounted as a filesystem: 4.2 GB.
But qcow2 is an EXCEPTION (because it does not use an intermediate block device?): writes to qcow2 are 2.8 GB too, not 4.2 GB (the same whether mounted inside a VM or directly using qemu-nbd).

OpenZFS 2.1.11-1
UPDATE: tested with OpenZFS 2.2.3-1, same results

Tests:

  1. Prepare:
zpool create -d -o feature@async_destroy=enabled -o feature@empty_bpobj=enabled -o feature@lz4_compress=enabled -o ashift=12 -O compression=lz4 -O acltype=posixacl -O xattr=sa -O utf8only=on -O atime=off -O relatime=on -O recordsize=64k pool0 /dev/sda_lol
#Nothing special here, just recordsize=64k for filesystems.
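# note: /dev/random produces incompressible data, so compression=lz4 will not
# reduce the on-disk write volume in any of the tests below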
zfs create pool0/test
  2. Tests (bash script):
#!/bin/bash
POOL="pool0"
DISK_NAME="sda"
DISK_SECTOR_SIZE=512

# pre-create a 4 GB image file:
dd if=/dev/random of=/$POOL/test/image2 bs=2M count=2000 conv=notrunc
sync; sleep 20

# find the starting number of megabytes written to the "sda" disk so far:
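# (field 7 of /sys/block/<dev>/stat is the number of sectors written since
#  boot; see Documentation/block/stat.rst in the kernel source)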
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
# 302983 in my case

# start writing 2 GB to the ZFS file directly:
dd if=/dev/random of=/$POOL/test/image2 bs=2M count=1000 conv=notrunc
sync

# wait 15 seconds; yes, sync is not enough, ZFS continues to write even after sync:
sleep 15
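# (alternatively, 'zpool sync pool0' forces outstanding transaction groups to
#  be written, which may be more deterministic than a fixed sleep)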

# find the number of megabytes written to the "sda" disk now:
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
#305856

echo "Direct write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 305856-302983=2873, or ~2.8 GB: +873 MB of extra writes, +40%


# Same test with /$POOL/test/image2 attached as a loop device:
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
losetup /dev/loop0 /$POOL/test/image2 
dd if=/dev/random of=/dev/loop0 bs=2M count=1000 conv=notrunc
sync
losetup -d /dev/loop0 
sleep 15
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
echo "Loop device write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 4144, ~4.1 GB: +107% writes, more than double; why not 2.8 GB?

# Same test with /$POOL/test/image2 formatted and mounted as an ext4 filesystem (same as a raw file in a VM):
mkfs.ext4 /$POOL/test/image2; mkdir /mnt/test; mount /$POOL/test/image2 /mnt/test
sync
sleep 10
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
# start test:
dd if=/dev/random of=/mnt/test/junk.raw bs=2M count=1000 conv=notrunc
sync
umount /mnt/test; rmdir /mnt/test
sleep 15
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
echo "Raw ext4 write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 4155, ~4.1 GB: the same +107% writes


# Same test with a new qcow2 image /$POOL/test/image.qcow2, formatted and mounted as an ext4 filesystem via qemu-nbd (same as a qcow2 file in a VM):
qemu-img create -f qcow2 -o preallocation=full /$POOL/test/image.qcow2 4G
modprobe nbd
qemu-nbd --connect=/dev/nbd0 /$POOL/test/image.qcow2
mkfs.ext4 /dev/nbd0; mkdir /mnt/test; mount /dev/nbd0 /mnt/test
sync
sleep 20
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
dd if=/dev/random of=/mnt/test/junk.raw bs=2M count=1000 conv=notrunc
sync
umount /mnt/test; rmdir /mnt/test
qemu-nbd -d /dev/nbd0
sleep 15
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
echo "Qcow2 write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 2839, ~2.8 GB: +40%; why not 4.1 GB?
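
Summary of the four tests (2 GB written in each case):

Direct write to the ZFS file:     ~2.8 GB on disk (+40%)
Loop device over the file:        ~4.1 GB on disk (+107%)
ext4 on the file (via loop):      ~4.1 GB on disk (+107%)
ext4 on qcow2 (via qemu-nbd):     ~2.8 GB on disk (+40%)
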
GregorKopka (Contributor) commented:

ext4 defaults to a 4 KB block size, while your backing file has a 64 KB recordsize.
This could lead to partial-record rewrites of the backing file.
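
If that were the cause, a quick way to check (a sketch, reusing the pool from the report; the dataset and file names here are made up) would be to repeat the loop-device test on a dataset whose recordsize matches the ext4 block size and see whether the inflation disappears:

# hypothetical check: does the inflation disappear with recordsize=4k?
zfs create -o recordsize=4k pool0/test4k
dd if=/dev/random of=/pool0/test4k/image bs=2M count=2000 conv=notrunc
sync; sleep 15
losetup /dev/loop1 /pool0/test4k/image
dd if=/dev/random of=/dev/loop1 bs=2M count=1000 conv=notrunc
sync; losetup -d /dev/loop1; sleep 15
# then compare the /sys/block/sda/stat deltas as in the script above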

woodholly (Author) commented May 8, 2024

The ext4 block size is unrelated, because there is a test without ext4 that shows the same result (losetup + dd).

GregorKopka (Contributor) commented:

Having looked at this a bit closer:
Both cases where you see the write inflation are those where the write path goes through a loop device (mounting a filesystem from a file automagically sets one up). Here is what losetup tells me about the loop device created when mounting a filesystem from a file:

# losetup -l /dev/loop0
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop0                          0  0             0     512
# mount /tmp/testfile /mnt
# losetup -l /dev/loop0
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE     DIO LOG-SEC
/dev/loop0         0      0         1  0 /tmp/testfile   0     512

(this is from mount /tmp/testfile /mnt) and if I interpret this correctly, it tells us that the loop device runs without direct I/O (DIO=0), i.e. buffered through the page cache, and with a logical sector size of 512 bytes.

That could lead to big writes being chopped up into smaller I/Os that ZFS cannot coalesce, which would explain the write amplification as read/modify/write cycles on the data and/or metadata of the backing file.

As for your test with the qcow2 image not showing that write amplification: I suspect that nbd uses a different write path that does not suffer from whatever the loop device path does.
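
If the loop device settings are the culprit, this should be testable by attaching the loop device manually with direct I/O enabled and a larger logical sector size (a sketch; --direct-io and --sector-size need util-linux 2.30 or newer, and --direct-io=on only works if the backing filesystem accepts O_DIRECT opens):

losetup --direct-io=on --sector-size=4096 /dev/loop0 /pool0/test/image2
losetup -l /dev/loop0   # DIO should now show 1, LOG-SEC 4096
dd if=/dev/random of=/dev/loop0 bs=2M count=1000 conv=notrunc
sync; losetup -d /dev/loop0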
