Several improvements to ARC shrinking #16197

Open

wants to merge 1 commit into master
Conversation

@amotin (Member) commented May 14, 2024

Motivation and Context

Since updating to the Linux 6.6 kernel and increasing the maximum ARC size in TrueNAS SCALE 24.04, which happened around the same time, we've started receiving multiple complaints from people about excessive swapping making systems unresponsive. While I attribute a significant part of the problem to the new Multi-Gen LRU code enabled in the 6.6 kernel (disabling it helps), I ended up with this set of smaller tunings on the ZFS side, trying to make it behave a bit more nicely in this terrible environment.

Description

  • When receiving a memory pressure signal from the OS, be more strict about trying to free some memory; otherwise the kernel may come back and request much more. Return as a result how much arc_c was actually reduced due to the request, which may be less than what was asked for (see the sketch after this list).
  • Add a new module parameter, zfs_arc_shrinker_seeks, to balance ARC eviction cost relative to the page cache and other subsystems.
  • Slightly update the Linux arc_set_sys_free() math.
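
As a rough illustration of the first item (a minimal sketch, not the actual patch; it assumes arc_reduce_target_size() is changed to return the number of bytes by which arc_c was really lowered, and simplifies the Linux shrinker callback in arc_os.c):

```c
#include <linux/mm.h>
#include <linux/shrinker.h>

static unsigned long
arc_shrinker_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	/* The kernel asks in pages; the ARC accounts in bytes. */
	uint64_t asked = (uint64_t)sc->nr_to_scan << PAGE_SHIFT;

	/*
	 * Assumed change: arc_reduce_target_size() returns by how many bytes
	 * arc_c was actually lowered, which may be less than asked if arc_c
	 * is already close to arc_c_min.
	 */
	uint64_t reduced = arc_reduce_target_size(asked);

	/*
	 * Report only what was really taken off the target, so the kernel's
	 * reclaim logic does not assume the whole request was satisfied.
	 */
	return (reduced >> PAGE_SHIFT);
}
```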

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin requested review from ahrens and behlendorf May 14, 2024 15:56
@amotin amotin added the Status: Code Review Needed Ready for review and testing label May 14, 2024
@amotin amotin requested a review from grwilson May 14, 2024 15:57
@adamdmoss (Contributor) left a comment

Half of these changes are definitely 👍 (I've been running with similar local changes to track and return how much was actually evicted); the rest I feel neutral or suspicious about, as commented.
FWIW, have you tried zfs_arc_shrinker_limit=0 rather than the more complicated approach of estimating eviction cost etc.? limit=0 allegedly used to cause ARC collapse, but I've not been able to trigger that for a long time, at least in combination with eviction code that accounts for how much was actually evicted.

restore evicted page.
Bigger values makes ARC more precious and evictions smaller, comparing to
other kernel subsystems.
Value of 4 means parity with page cache.
Contributor

IMHO this shouldn't be yet-another-tuneable, but not a big deal

Member Author

Considering this affects the amount of eviction by an order of magnitude, I can hardly see why it should be hard-coded. The other point is that, thanks to zfs_arc_shrinker_limit, we really for the most part ignore what the kernel wants from us. Partially that is because the kernel is at times completely insane in its requests, and the mentioned new Multi-Gen LRU brings this to an extreme, sometimes requesting eviction of a quarter to half of the ARC for no reason. I've already sent a complaint email to the author; I hope it leads to something.
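
For context on why the seeks value scales eviction so strongly: the sizing math in the kernel's do_shrink_slab() (mm/vmscan.c) divides the request by the shrinker's seeks value. A paraphrase of that math (details vary between kernel versions, and shrink_request_for() is just an illustrative helper, not kernel or ZFS code):

```c
static unsigned long
shrink_request_for(unsigned long freeable, int priority, int seeks)
{
	/* Base pressure, analogous to what the page cache itself sees. */
	unsigned long delta = freeable >> priority;

	/*
	 * The kernel multiplies by 4 and divides by the shrinker's seeks
	 * value (DEFAULT_SEEKS is 2).  seeks == 2 therefore means twice the
	 * page-cache rate, seeks == 4 parity with it, and larger values
	 * shrink the request proportionally.
	 */
	delta *= 4;
	delta /= seeks;

	return (delta);
}
```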

@@ -325,13 +335,17 @@ arc_set_sys_free(uint64_t allmem)
/*
* Base wmark_low is 4 * the square root of Kbytes of RAM.
*/
long wmark = 4 * int_sqrt(allmem/1024) * 1024;
long wmark = int_sqrt(allmem / 1024 * 16) * 1024;
Contributor

What's the goal of this change? It seems to be reducing precision for no good reason.

Member Author

It is actually improving precision, even though not by much. And primarily it is what newer kernels actually do.
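
To illustrate the precision point (numbers worked out by hand, not taken from the PR): 4 * int_sqrt(x) can be up to 3 less than int_sqrt(16 * x), because the truncation of the integer square root gets multiplied by 4 in the old form. For example, with allmem = 48 GiB:

```c
/*
 * allmem / 1024 == 50331648 KiB
 *
 * old:   4 * int_sqrt(50331648)  == 4 * 7094            == 28376   (KiB)
 * new:   int_sqrt(50331648 * 16) == int_sqrt(805306368) == 28377   (KiB)
 * exact: 4 * sqrt(50331648)                             ~= 28377.9 (KiB)
 *
 * Both results are then multiplied by 1024 to get wmark in bytes; the new
 * form ends up one step closer to the exact value.
 */
```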

*/
wmark = MAX(wmark, 128 * 1024);
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 7, 0)
wmark = MIN(wmark, 256 * 1024 * 1024);
Contributor

I kinda dislike having kernel-specific magic limits here - at least without a great explanation in the code.

Member Author

All of this chunk is kernel-specific magic. I am just updating it for newer kernels.

if (!arc_evict_needed) {
arc_evict_needed = B_TRUE;
zthr_wakeup(arc_evict_zthr);
if (lax) {
Contributor

I feel that 'lax' is not self-explanatory - I can see what it's doing, but not why some callers want it and some don't?

Member Author

When the OS signals memory pressure, something is already not good, so we should be more careful about overflow handling and should complete evictions before returning. If the same happens as part of routine I/O, there is no point in being strict.
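
To make the intent more concrete, a minimal sketch of the idea (not the exact patch; arc_evict_wait_for() is a hypothetical stand-in for however the real code blocks on the eviction thread):

```c
static void
arc_wait_for_eviction_sketch(uint64_t amount, boolean_t lax)
{
	/* Make sure the eviction thread is working on our behalf. */
	if (!arc_evict_needed) {
		arc_evict_needed = B_TRUE;
		zthr_wakeup(arc_evict_zthr);
	}

	if (lax) {
		/*
		 * Routine I/O path: a small, temporary overshoot of arc_c is
		 * tolerable, so don't block the caller.
		 */
		return;
	}

	/*
	 * OS-signalled memory pressure: block until the requested amount has
	 * really been evicted, otherwise the kernel will just come back and
	 * ask for more.
	 */
	arc_evict_wait_for(amount);	/* hypothetical helper */
}
```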

Member Author

I've added more comments here.

 - When receiving a memory pressure signal from the OS, be more strict
about trying to free some memory.  Otherwise the kernel may come back
and request much more.  Return as a result how much arc_c was actually
reduced due to the request, which may be less than requested.
 - Add a new module parameter zfs_arc_shrinker_seeks to balance ARC
eviction cost relative to the page cache and other subsystems.
 - Slightly update the Linux arc_set_sys_free() math.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
@snajpa (Contributor) commented May 16, 2024

FWIW, I think there's yet another possible source of excessive swapping in addition to your observations: it might be caused by too high a zfs_abd_scatter_max_order. In our setup, it only takes a few days until excessive reclaim kicks in; then we have to add a zram-based swap device. When we lower zfs_abd_scatter_max_order to below 3, the excessive reclaim of course doesn't fully disappear, as there are other sources of pressure in the kernel to get higher-order buddies, but the difference is very noticeable (load drops by 100 on a machine with 600 first-level containers and tons more nested).

In our situation, since we run with txg_timeout = 15 and a pretty high dirty_data_max so that we really mostly sync on the 15 s mark, it's those syncs that trigger a lot of paging out to swap. Using zram has so far mitigated it, as we tend to have at least 100G+ of free memory, but it's easily available only in 4k chunks...

@amotin (Member Author) commented May 16, 2024

@snajpa Yes, I was also thinking about zfs_abd_scatter_max_order. I don't have my own numbers, but my thinking was that on FreeBSD, where the ARC allocates only individual PAGE_SIZE pages, it takes the least convenient memory from the OS, while on Linux the ARC always allocates the best contiguous chunks it can, which leaves the other subsystems that are more sensitive to fragmentation to suffer. Contiguous chunks should be good for I/O efficiency, and on FreeBSD I do measure some per-page overheads, but there must be some sweet spot.

@snajpa (Contributor) commented May 16, 2024

@amotin I haven't looked at the code yet, but if it doesn't do it already, it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all, and then decrementing the requested order on failure.

we could also optimize further by saving the last successful order :) and only sometimes (whatever that means for now) going for a higher order

@amotin (Member Author) commented May 16, 2024

it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all and then decrement the requested order on fail

That is what ZFS does. It tries to allocate big first, but if that fails, it requests smaller and smaller until it gets enough. But that way it consumes all the remaining big chunks first.
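
A rough sketch of the strategy described above (the real code is abd_alloc_chunks() in the Linux abd_os.c and is considerably more involved; alloc_abd_chunk_sketch() is just an illustration):

```c
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *
alloc_abd_chunk_sketch(int nid, gfp_t gfp, unsigned int max_order)
{
	unsigned int order;
	struct page *page;

	for (order = max_order; ; order--) {
		/*
		 * Higher orders are opportunistic: __GFP_NORETRY keeps the
		 * allocator from reclaiming too hard before we fall back to
		 * a smaller request.
		 */
		gfp_t flags = (order > 0) ? (gfp | __GFP_NORETRY) : gfp;

		page = alloc_pages_node(nid, flags, order);
		if (page != NULL || order == 0)
			return (page);
	}
}
```

The visible downside, as noted above, is that the ARC keeps winning the remaining high-order chunks until only fragmented memory is left for everything else.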

@snajpa (Contributor) commented May 16, 2024

It actually seems to directly call kvmalloc() when HAVE_KVMALLOC. In the 6.8 source I'm looking at, kvmalloc seems to use __GFP_NORETRY, for which the documentation says it does one round of reclaim in this implementation. I'm tempted to change that line to kmalloc_flags &= ~__GFP_DIRECT_RECLAIM; to see what happens :D Not sure what to do (if anything) at the ZFS level with this information, though.

@amotin (Member Author) commented May 17, 2024

@snajpa Most of ARC capacity is allocated by abd_alloc_chunks() via alloc_pages_node().

@snajpa (Contributor) commented May 17, 2024

I've tried bpftrace-ing spl_kvmalloc calls, and it seems at least dsl_dir_tempreserve_space and dmu_buf_hold_array_by_dnode are calling spl_kvmalloc (which ends up doing one round of reclaim). This is on a comparatively pretty much idle staging node, yet it's IMHO way too many calls in too little time...

[root@node1.stg.vpsfree.cz]
 ~ # timeout --foreground 10 bpftrace -e 'kprobe:spl_kvmalloc{ printf("%s: %s(%d)\n", probe, comm, pid); }' | wc -l
231947

Interestingly, it always seems to be called for pretty similar amounts of memory, ranging from 273408 to 273856 bytes (?)

Labels
Status: Code Review Needed Ready for review and testing

3 participants