ZFS direct write differs from same write through block device (block device write amplification) #16174

woodholly opened this issue May 7, 2024 · 3 comments
Labels: Type: Defect (incorrect behavior, e.g. crash, hang)

woodholly commented May 7, 2024

I was playing with ZFS and found something interesting: there is a big difference between writing to a ZFS file directly and writing to the same file mounted as a filesystem / block device.
Writing 2 GB directly leads to 2.8 GB of real writes to disk, 40% more writes; this is OK.
But writing 2 GB into the same file attached as a loop device leads to 4.2 GB of real writes to disk.
What kind of magic is at work here?

The same happens with zvols, and with files mounted as a filesystem: 4.2 GB.
But qcow2 is an EXCEPTION (because it does not use an intermediate block device?): writes to qcow2 are 2.8 GB too, not 4.2 GB (the same whether mounted inside a VM or directly using qemu-nbd).

OpenZFS 2.1.11-1
UPDATE: tested with OpenZFS 2.2.3-1, same results

Tests:

  1. Prepare:
zpool create -d -o feature@async_destroy=enabled -o feature@empty_bpobj=enabled -o feature@lz4_compress=enabled -o ashift=12 -O compression=lz4 -O acltype=posixacl -O xattr=sa -O utf8only=on -O atime=off -O relatime=on -O recordsize=64k pool0 /dev/sda_lol
#Nothing special here, just recordsize=64k for filesystems.
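# note: /dev/random produces incompressible data, so compression=lz4 will not
# reduce the on-disk write volume in any of the tests below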
zfs create pool0/test
  2. Tests (bash script):
#!/bin/bash
POOL="pool0"
DISK_NAME="sda"
DISK_SECTOR_SIZE=512

# pre-create a 4 GB image file:
dd if=/dev/random of=/$POOL/test/image2 bs=2M count=2000 conv=notrunc
sync; sleep 20

# find the starting number of megabytes written to the "sda" disk so far:
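# (field 7 of /sys/block/<dev>/stat is the number of sectors written since
#  boot; see Documentation/block/stat.rst in the kernel source)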
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
# 302983 in my case

# start writing 2 GB to the ZFS file directly:
dd if=/dev/random of=/$POOL/test/image2 bs=2M count=1000 conv=notrunc
sync

# wait 15 seconds; yes, sync is not enough, ZFS continues to write even after sync:
sleep 15
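# (alternatively, 'zpool sync pool0' forces outstanding transaction groups to
#  be written, which may be more deterministic than a fixed sleep)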

# find the number of megabytes written to the "sda" disk now:
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
#305856

echo "Direct write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 305856-302983=2873, or ~2.8 GB: +873 MB of extra writes, +40%


# Same test with /$POOL/test/image2 attached as a loop device:
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
losetup /dev/loop0 /$POOL/test/image2 
dd if=/dev/random of=/dev/loop0 bs=2M count=1000 conv=notrunc
sync
losetup -d /dev/loop0 
sleep 15
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
echo "Loop device write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 4144, ~4.1 GB: +107% writes, more than double; why not 2.8 GB?

# Same test with /$POOL/test/image2 formatted and mounted as an ext4 filesystem (same as a raw file in a VM):
mkfs.ext4 /$POOL/test/image2; mkdir /mnt/test; mount /$POOL/test/image2 /mnt/test
sync
sleep 10
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
# start test:
dd if=/dev/random of=/mnt/test/junk.raw bs=2M count=1000 conv=notrunc
sync
umount /mnt/test; rmdir /mnt/test
sleep 15
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
echo "Raw ext4 write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 4155, ~4.1 GB: the same +107% writes


# Same test with a new qcow2 image /$POOL/test/image.qcow2, formatted and mounted as an ext4 filesystem via qemu-nbd (same as a qcow2 file in a VM):
qemu-img create -f qcow2 -o preallocation=full /$POOL/test/image.qcow2 4G
modprobe nbd
qemu-nbd --connect=/dev/nbd0 /$POOL/test/image.qcow2
mkfs.ext4 /dev/nbd0; mkdir /mnt/test; mount /dev/nbd0 /mnt/test
sync
sleep 20
WRITES_START=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
dd if=/dev/random of=/mnt/test/junk.raw bs=2M count=1000 conv=notrunc
sync
umount /mnt/test; rmdir /mnt/test
qemu-nbd -d /dev/nbd0
sleep 15
WRITES_END=$(( $(cat /sys/block/$DISK_NAME/stat|awk '{print $7}')*$DISK_SECTOR_SIZE/1024/1024 ))
echo "Qcow2 write test:" $(( $WRITES_END - $WRITES_START ))
echo "====================================================="
# 2839, ~2.8 GB: +40%; why not 4.1 GB?
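
Summary of the four tests (2 GB written in each case):

Direct write to the ZFS file:     ~2.8 GB on disk (+40%)
Loop device over the file:        ~4.1 GB on disk (+107%)
ext4 on the file (via loop):      ~4.1 GB on disk (+107%)
ext4 on qcow2 (via qemu-nbd):     ~2.8 GB on disk (+40%)
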
GregorKopka (Contributor) commented:

ext4 defaults to a 4 KB block size, while your backing file has a 64 KB recordsize.
This could lead to partial-record rewrites of the backing file.
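
If that were the cause, a quick way to check (a sketch, reusing the pool from the report; the dataset and file names here are made up) would be to repeat the loop-device test on a dataset whose recordsize matches the ext4 block size and see whether the inflation disappears:

# hypothetical check: does the inflation disappear with recordsize=4k?
zfs create -o recordsize=4k pool0/test4k
dd if=/dev/random of=/pool0/test4k/image bs=2M count=2000 conv=notrunc
sync; sleep 15
losetup /dev/loop1 /pool0/test4k/image
dd if=/dev/random of=/dev/loop1 bs=2M count=1000 conv=notrunc
sync; losetup -d /dev/loop1; sleep 15
# then compare the /sys/block/sda/stat deltas as in the script above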

woodholly (Author) commented May 8, 2024

The ext4 block size is unrelated, because there is a test without ext4 that shows the same result (losetup + dd).

GregorKopka (Contributor) commented:

Having looked at this a bit closer:
Both cases where you see the write inflation are those where the write path goes through a loop device (mounting a filesystem from a file automagically sets one up). Here is what losetup tells me about the loop device created when mounting a filesystem from a file:

# losetup -l /dev/loop0
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop0                          0  0             0     512
# mount /tmp/testfile /mnt
# losetup -l /dev/loop0
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE     DIO LOG-SEC
/dev/loop0         0      0         1  0 /tmp/testfile   0     512

(this is from mount /tmp/testfile /mnt) and if I interpret this correctly, it tells us that the loop device runs without direct I/O (DIO=0), i.e. buffered through the page cache, and with a logical sector size of 512 bytes.

That could lead to big writes being chopped up into smaller I/Os that ZFS cannot coalesce, which would explain the write amplification as read/modify/write cycles on the data and/or metadata of the backing file.

As for your test with the qcow2 image not showing that write amplification: I suspect that nbd uses a different write path that does not suffer from whatever the loop device path does.
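
If the loop device settings are the culprit, this should be testable by attaching the loop device manually with direct I/O enabled and a larger logical sector size (a sketch; --direct-io and --sector-size need util-linux 2.30 or newer, and --direct-io=on only works if the backing filesystem accepts O_DIRECT opens):

losetup --direct-io=on --sector-size=4096 /dev/loop0 /pool0/test/image2
losetup -l /dev/loop0   # DIO should now show 1, LOG-SEC 4096
dd if=/dev/random of=/dev/loop0 bs=2M count=1000 conv=notrunc
sync; losetup -d /dev/loop0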
