Hierarchical bandwidth and operations rate limits. #16205

Open: pjd wants to merge 1 commit into master from the ratelimits branch

Conversation

@pjd (Contributor) commented May 17, 2024

Introduce six new properties: limit_{bw,op}_{read,write,total}.

The limit_bw_* properties limit the read, write, or combined bandwidth, respectively, that a dataset and its descendants can consume. Limits are applied to both file systems and ZFS volumes.

The configured limits are hierarchical, just like quotas; i.e., even if a higher limit is configured on the child dataset, the parent's lower limit will be enforced.

The limits are applied at the VFS level, not at the disk level. The dataset is charged for each operation even if no disk access is required (e.g., due to caching, compression, deduplication, or NOP writes) or if the operation will cause more traffic (due to the copies property, mirroring, or RAIDZ).
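
For illustration, a minimal sketch of the hierarchical behavior using hypothetical dataset names (the limit_bw_write property comes from this change; zfs set itself is standard):

$ zfs set limit_bw_write=100M tank/projects          # parent limit
$ zfs set limit_bw_write=500M tank/projects/build    # higher child limit
# Writes under tank/projects/build are still held to roughly 100 MB/s,
# because the parent's lower limit applies to the whole subtree.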

Read bandwidth consumption is based on:

  • read-like syscalls, e.g., aio_read(2), pread(2), preadv(2), read(2), readv(2), sendfile(2)

  • syscalls like getdents(2) and getdirentries(2)

  • reading via memory-mapped (mmap(2)) files

  • zfs send

Write bandwidth consumption is based on:

  • write-like syscalls, e.g., aio_write(2), pwrite(2), pwritev(2), write(2), writev(2)

  • writing via memory-mapped (mmap(2)) files

  • zfs receive

The limit_op_* properties limit the rate of read, write, or combined metadata operations, respectively, that a dataset and its descendants can generate.

Read operations consumption is based on:

  • read-like syscalls where the number of operations is equal to the number of blocks being read (never less than 1)

  • reading via memory-mapped (mmap(2)) files, where the number of operations is equal to the number of pages being read (never less than 1)

  • syscalls accessing metadata: readlink(2), stat(2)

Write operations consumption is based on:

  • write-like syscalls where the number of operations is equal to the number of blocks being written (never less than 1)

  • writing via memory-mapped (mmap(2)) files, where the number of operations is equal to the number of pages being written (never less than 1)

  • syscalls modifying a directory's contents: bind(2) (UNIX-domain sockets), link(2), mkdir(2), mkfifo(2), mknod(2), open(2) (file creation), rename(2), rmdir(2), symlink(2), unlink(2)

  • syscalls modifying metadata: chflags(2), chmod(2), chown(2), utimes(2)

  • updating the access time of a file when reading it

Just like limit_bw_* limits, the limit_op_* limits are also hierarchical and applied at the VFS level.
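
As a rough sketch only (dataset name hypothetical, units assumed to be operations per second as discussed in the review below), capping metadata traffic on a dataset might look like:

$ zfs set limit_op_write=1000 tank/ci    # ~1000 metadata-modifying ops per second
$ zfs set limit_op_read=5000 tank/ci     # ~5000 metadata reads per second
$ zfs get limit_op_read,limit_op_write tank/ci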

@tonyhutter (Contributor)

Some first-pass questions/comments:

Maybe I missed it, but do you specify the units anywhere? Are the limit_bw_* props in units of bytes/second and limit_op_* in ops/second? Can you mention the units in the man pages?

If you specify 100MB/s and the system is idle with 8GB/s bandwidth available, will it still only use 100MB/s?

What happens if someone specifies a *_total value that is larger/smaller than *_read or *_write combined? Does it just get capped to the minimum?

Does anything bad happen if you cap it to something super low, like 1 byte/sec?

The zfsprops.7 man page is roughly in alphabetical order. Could you move the limit_* section to just after the keylocation sections?

Is copy_file_range() counted?

@pjd (Contributor, Author) commented May 20, 2024

Maybe I missed it, but do you specify the units anywhere? Are the limit_bw_* props in units of bytes/second and limit_op_* in ops/second? Can you mention the units in the man pages?

Sure, I'll add that.

If you specify 100MB/s and the system is idle with 8GB/s bandwidth available, will it still only use 100MB/s?

Correct.

What happens if someone specifies a *_total value that is larger/smaller than *_read or *_write combined? Does it just get capped to the minimum?

Correct. The lowest limit is always enforced. The same applies if the parent has a lower limit than its child or children combined.
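
A hedged sketch of that rule with hypothetical values; the effective cap is always the minimum of the applicable limits:

$ zfs set limit_bw_read=300M tank/db
$ zfs set limit_bw_write=300M tank/db
$ zfs set limit_bw_total=200M tank/db
# Even though reads and writes are each allowed 300 MB/s, combined traffic
# on tank/db stays at about 200 MB/s because of the lower *_total limit.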

Does anything bad happen if you cap it to something super low, like 1 byte/sec?

It cannot be lower than the resolution (which is 16 per second), so it will be rounded up to 16. However, it will also allocate a large number of slots to keep the history; in that case, one slot per byte of each pending request.
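
For example (hypothetical dataset, behavior as described above rather than verified against the code):

$ zfs set limit_bw_write=1 tank/tiny    # requests 1 byte/s
# With a 1/16 s resolution, anything below 16 bytes/s is effectively
# enforced as 16 bytes/s, and such a low limit also allocates one history
# slot per byte of each pending request.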

The zfsprops.7 man page is roughly in alphabetical order. Could you move the limit_* section to just after the keylocation sections?

Sure.

Is copy_file_range() counted?

It was counted when it fell back to read/write, but it wasn't counted in the block-cloning case, which I think it should be, so I just added that.

@tonyhutter (Contributor)

I did some hand testing of this and it works just as described 👍

$ ./zpool create -f tank -O compression=off /dev/nvme{0..8}n1 && dd if=/dev/zero of=/tank/bigfile conv=sync bs=1M count=10000 && ./zpool destroy tank 
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 3.91677 s, 2.7 GB/s


$ ./zpool create -f tank -O compression=off /dev/nvme{0..8}n1 && ./zfs set limit_bw_write=$((1024 * 1024 * 200)) tank && dd if=/dev/zero of=/tank/bigfile conv=sync bs=1M count=10000 && ./zpool destroy tank
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 50.9869 s, 206 MB/s

I also verified that it worked for multithreaded writes, and verified that a top-level dataset's values correctly overrode its children's values.
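
For reference, a rough sketch of that parent-override check (device and dataset names are placeholders, not the exact commands used):

$ zpool create -f tank /dev/nvme0n1
$ zfs create tank/child
$ zfs set limit_bw_write=100M tank        # lower limit on the parent
$ zfs set limit_bw_write=1G tank/child    # higher limit on the child
$ dd if=/dev/zero of=/tank/child/bigfile bs=1M count=2000 conv=sync
# Expected: dd reports roughly 100 MB/s, i.e. the parent's lower limit wins.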

@pjd (Contributor, Author) commented May 24, 2024

Thank you!

BTW, you can use suffixes for the limit_bw_* properties, e.g., limit_bw_write=200M.
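
For instance, the earlier test command could set the limit with a suffix instead of a computed byte count:

$ zfs set limit_bw_write=200M tank    # instead of limit_bw_write=$((1024 * 1024 * 200))
$ zfs get limit_bw_write tank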

Comment on lines -11 to -13
# This run file contains all of the common functional tests. When
# adding a new test consider also adding it to the sanity.run file
# if the new test runs to completion in only a few seconds.
@tonyhutter (Contributor)

Did you want this change included in this PR?

@pjd (Contributor, Author)

Not really part of this PR, but I was hoping to smuggle it in instead of creating a separate PR for it. :)

@pjd force-pushed the ratelimits branch 2 times, most recently from ce9c37a to 61f0f95 on May 24, 2024 at 22:29.
Introduce six new properties: limit_{bw,op}_{read,write,total}.

Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>