Several improvements to ARC shrinking #16197

Open

wants to merge 1 commit into master
Conversation

@amotin (Member) commented May 14, 2024

Motivation and Context

Since updating to the Linux 6.6 kernel and increasing the maximum ARC size in TrueNAS SCALE 24.04, which happened around the same time, we've started receiving multiple complaints from people about excessive swapping making systems unresponsive. While I attribute a significant part of the problem to the new Multi-Gen LRU code enabled in the 6.6 kernel (disabling it helps), I ended up with this set of smaller tunings on the ZFS side, trying to make it behave a bit more nicely in this terrible environment.

Description

  • When receiving a memory pressure signal from the OS, be more strict about trying to free some memory; otherwise the kernel may come back and request much more. Return as a result how much arc_c was actually reduced due to the request, which may be less than what was asked for (see the sketch after this list).
  • Add a new module parameter, zfs_arc_shrinker_seeks, to balance ARC eviction cost relative to the page cache and other subsystems.
  • Slightly update the Linux arc_set_sys_free() math.
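
As a rough illustration of the first item (a minimal sketch, not the actual patch; it assumes arc_reduce_target_size() is changed to return the number of bytes by which arc_c was really lowered, and simplifies the Linux shrinker callback in arc_os.c):

```c
#include <linux/mm.h>
#include <linux/shrinker.h>

static unsigned long
arc_shrinker_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	/* The kernel asks in pages; the ARC accounts in bytes. */
	uint64_t asked = (uint64_t)sc->nr_to_scan << PAGE_SHIFT;

	/*
	 * Assumed change: arc_reduce_target_size() returns by how many bytes
	 * arc_c was actually lowered, which may be less than asked if arc_c
	 * is already close to arc_c_min.
	 */
	uint64_t reduced = arc_reduce_target_size(asked);

	/*
	 * Report only what was really taken off the target, so the kernel's
	 * reclaim logic does not assume the whole request was satisfied.
	 */
	return (reduced >> PAGE_SHIFT);
}
```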

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin requested review from ahrens and behlendorf May 14, 2024 15:56
@amotin amotin added the Status: Code Review Needed Ready for review and testing label May 14, 2024
@amotin amotin requested a review from grwilson May 14, 2024 15:57
@adamdmoss (Contributor) left a comment

Half of these changes are definitely 👍 (I've been running with similar local changes to track and return how much was actually evicted); the rest I feel neutral or suspicious about, as commented.
FWIW, have you tried zfs_arc_shrinker_limit=0 rather than the more complicated approach of estimating eviction cost etc.? limit=0 allegedly used to cause ARC collapse, but I've not been able to trigger that for a long time, at least in combination with eviction code that accounts for how much was actually evicted.

restore evicted page.
Bigger values makes ARC more precious and evictions smaller, comparing to
other kernel subsystems.
Value of 4 means parity with page cache.
Contributor

IMHO this shouldn't be yet-another-tuneable, but not a big deal

Member Author

Considering this affects the amount of eviction by an order of magnitude, I can hardly see why it should be hard-coded. The other point is that, thanks to zfs_arc_shrinker_limit, we really for the most part ignore what the kernel wants from us. Partially that is because the kernel is at times completely insane in its requests, and the mentioned new Multi-Gen LRU brings this to an extreme, sometimes requesting eviction of a quarter to half of the ARC for no reason. I've already sent a complaint email to the author; I hope it leads to something.
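
For context on why the seeks value scales eviction so strongly: the sizing math in the kernel's do_shrink_slab() (mm/vmscan.c) divides the request by the shrinker's seeks value. A paraphrase of that math (details vary between kernel versions, and shrink_request_for() is just an illustrative helper, not kernel or ZFS code):

```c
static unsigned long
shrink_request_for(unsigned long freeable, int priority, int seeks)
{
	/* Base pressure, analogous to what the page cache itself sees. */
	unsigned long delta = freeable >> priority;

	/*
	 * The kernel multiplies by 4 and divides by the shrinker's seeks
	 * value (DEFAULT_SEEKS is 2).  seeks == 2 therefore means twice the
	 * page-cache rate, seeks == 4 parity with it, and larger values
	 * shrink the request proportionally.
	 */
	delta *= 4;
	delta /= seeks;

	return (delta);
}
```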

@@ -325,13 +335,17 @@ arc_set_sys_free(uint64_t allmem)
/*
* Base wmark_low is 4 * the square root of Kbytes of RAM.
*/
long wmark = 4 * int_sqrt(allmem/1024) * 1024;
long wmark = int_sqrt(allmem / 1024 * 16) * 1024;
Contributor

What's the goal of this change? It seems to be reducing precision for no good reason.

Member Author

It is actually improving precision, even though not by much. And primarily it is what newer kernels actually do.
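
To illustrate the precision point (numbers worked out by hand, not taken from the PR): 4 * int_sqrt(x) can be up to 3 less than int_sqrt(16 * x), because the truncation of the integer square root gets multiplied by 4 in the old form. For example, with allmem = 48 GiB:

```c
/*
 * allmem / 1024 == 50331648 KiB
 *
 * old:   4 * int_sqrt(50331648)  == 4 * 7094            == 28376   (KiB)
 * new:   int_sqrt(50331648 * 16) == int_sqrt(805306368) == 28377   (KiB)
 * exact: 4 * sqrt(50331648)                             ~= 28377.9 (KiB)
 *
 * Both results are then multiplied by 1024 to get wmark in bytes; the new
 * form ends up one step closer to the exact value.
 */
```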

*/
wmark = MAX(wmark, 128 * 1024);
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 7, 0)
wmark = MIN(wmark, 256 * 1024 * 1024);
Contributor

I kinda dislike having kernel-specific magic limits here - at least without a great explanation in the code.

Member Author

All of this chunk is kernel-specific magic. I am just updating it for newer kernels.

if (!arc_evict_needed) {
arc_evict_needed = B_TRUE;
zthr_wakeup(arc_evict_zthr);
if (lax) {
Contributor

I feel that 'lax' is not self-explanatory - I can see what it's doing, but not why some callers want it and some don't?

Member Author

When the OS signals memory pressure, something is already not good, so we should be more careful about overflow handling and should complete evictions before returning. If the same happens as part of routine I/O, there is no point in being strict.
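
To make the intent more concrete, a minimal sketch of the idea (not the exact patch; arc_evict_wait_for() is a hypothetical stand-in for however the real code blocks on the eviction thread):

```c
static void
arc_wait_for_eviction_sketch(uint64_t amount, boolean_t lax)
{
	/* Make sure the eviction thread is working on our behalf. */
	if (!arc_evict_needed) {
		arc_evict_needed = B_TRUE;
		zthr_wakeup(arc_evict_zthr);
	}

	if (lax) {
		/*
		 * Routine I/O path: a small, temporary overshoot of arc_c is
		 * tolerable, so don't block the caller.
		 */
		return;
	}

	/*
	 * OS-signalled memory pressure: block until the requested amount has
	 * really been evicted, otherwise the kernel will just come back and
	 * ask for more.
	 */
	arc_evict_wait_for(amount);	/* hypothetical helper */
}
```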

Member Author

I've added more comments here.

 - When receiving a memory pressure signal from the OS, be more strict
about trying to free some memory.  Otherwise the kernel may come back
and request much more.  Return as a result how much arc_c was actually
reduced due to the request, which may be less than requested.
 - Add a new module parameter zfs_arc_shrinker_seeks to balance ARC
eviction cost relative to the page cache and other subsystems.
 - Slightly update the Linux arc_set_sys_free() math.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
@snajpa (Contributor) commented May 16, 2024

FWIW, I think there's yet another possible source of excessive swapping in addition to your observations: it might be caused by too high a zfs_abd_scatter_max_order. In our setup, it only takes a few days until excessive reclaim kicks in; then we have to add a zram-based swap device. When we lower zfs_abd_scatter_max_order to below 3, the excessive reclaim of course doesn't fully disappear, as there are other sources of pressure in the kernel to get higher-order buddies, but the difference is very noticeable (load drops by 100 on a machine with 600 first-level containers and tons more nested).

In our situation, since we run with txg_timeout = 15 and a pretty high dirty_data_max so that we really mostly sync on the 15 s mark, it's those syncs that trigger a lot of paging out to swap. Using zram has so far mitigated it, as we tend to have at least 100G+ of free memory, but it's easily available only in 4k chunks...

@amotin (Member Author) commented May 16, 2024

@snajpa Yes, I was also thinking about zfs_abd_scatter_max_order. I don't have my own numbers, but my thinking was that on FreeBSD, where the ARC allocates only individual PAGE_SIZE pages, it takes the least convenient memory from the OS, while on Linux the ARC always allocates the best contiguous chunks it can, which leaves the other subsystems that are more sensitive to fragmentation to suffer. Contiguous chunks should be good for I/O efficiency, and on FreeBSD I do measure some per-page overheads, but there must be some sweet spot.

@snajpa (Contributor) commented May 16, 2024

@amotin I haven't looked at the code yet, but if it doesn't do it already, it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all, and then decrementing the requested order on failure.

we could also optimize further by saving the last successful order :) and only sometimes (whatever that means for now) going for a higher order

@amotin (Member Author) commented May 16, 2024

it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all and then decrement the requested order on fail

That is what ZFS does. It tries to allocate big first, but if that fails, it requests smaller and smaller until it gets enough. But that way it consumes all the remaining big chunks first.
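
A rough sketch of the strategy described above (the real code is abd_alloc_chunks() in the Linux abd_os.c and is considerably more involved; alloc_abd_chunk_sketch() is just an illustration):

```c
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *
alloc_abd_chunk_sketch(int nid, gfp_t gfp, unsigned int max_order)
{
	unsigned int order;
	struct page *page;

	for (order = max_order; ; order--) {
		/*
		 * Higher orders are opportunistic: __GFP_NORETRY keeps the
		 * allocator from reclaiming too hard before we fall back to
		 * a smaller request.
		 */
		gfp_t flags = (order > 0) ? (gfp | __GFP_NORETRY) : gfp;

		page = alloc_pages_node(nid, flags, order);
		if (page != NULL || order == 0)
			return (page);
	}
}
```

The visible downside, as noted above, is that the ARC keeps winning the remaining high-order chunks until only fragmented memory is left for everything else.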

@snajpa (Contributor) commented May 16, 2024

It actually seems to directly call kvmalloc() when HAVE_KVMALLOC. In the 6.8 source I'm looking at, kvmalloc seems to use __GFP_NORETRY, for which the documentation says it does one round of reclaim in this implementation. I'm tempted to change that line to kmalloc_flags &= ~__GFP_DIRECT_RECLAIM; to see what happens :D Not sure what to do (if anything) at the ZFS level with this information, though.

@amotin (Member Author) commented May 17, 2024

@snajpa Most of ARC capacity is allocated by abd_alloc_chunks() via alloc_pages_node().

@snajpa (Contributor) commented May 17, 2024

I've tried bpftrace-ing spl_kvmalloc calls, and it seems at least dsl_dir_tempreserve_space and dmu_buf_hold_array_by_dnode are calling spl_kvmalloc (which ends up doing one round of reclaim). This is on a comparatively pretty much idle staging node, yet it's IMHO way too many calls in too little time...

[root@node1.stg.vpsfree.cz]
 ~ # timeout --foreground 10 bpftrace -e 'kprobe:spl_kvmalloc{ printf("%s: %s(%d)\n", probe, comm, pid); }' | wc -l
231947

Interestingly, it always seems to be called for pretty similar amounts of memory, ranging from 273408 to 273856 bytes (?)

Labels
Status: Code Review Needed Ready for review and testing

3 participants