Several improvements to ARC shrinking #16197
base: master
Conversation
Half of these changes are definitely 👍 (I've been running with similar local changes to track and return how much was actually evicted), the rest I feel neutral or suspicious about as commented.
FWIW, have you tried zfs_arc_shrinker_limit=0 rather than the more complicated approach of estimating eviction cost etc.? limit=0 allegedly used to cause ARC collapse, but I've not been able to trigger that for a long time, at least in combination with eviction code that accounts for how much was actually evicted.
restore an evicted page.
Bigger values make ARC more precious and evictions smaller, compared to
other kernel subsystems.
A value of 4 means parity with the page cache.
IMHO this shouldn't be yet-another-tuneable, but not a big deal
Considering this affects the amount of eviction by an order of magnitude, I can hardly see why it should be hard-coded. The other question is that, thanks to zfs_arc_shrinker_limit, we really for the most part ignore what the kernel wants from us. Partially this is because the kernel at times is completely insane in its requests, and the newly mentioned Multi-Gen LRU takes it to an extreme, sometimes requesting eviction of a quarter to half of ARC for no reason. I've already sent a complaint email to the author; I hope it leads to something.
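For context, here is a rough userspace model (my paraphrase of Linux's do_shrink_slab() in mm/vmscan.c, simplified; not the PR's code) of how a shrinker's seeks value scales the scan target. With seeks = 4, the shrinker is asked to scan exactly the freeable >> priority fraction that the page-cache LRUs are scanned at, which is what "parity with page cache" means above:

```c
#include <assert.h>

/*
 * Simplified model of the Linux do_shrink_slab() scan-target math
 * (mm/vmscan.c); the real code also batches and clamps the result.
 */
static unsigned long
shrink_delta(unsigned long freeable, int priority, int seeks)
{
	unsigned long delta = freeable >> priority; /* page-cache scan fraction */
	delta *= 4;
	return (delta / seeks); /* seeks == 4 gives parity with page cache */
}
```

With the kernel's DEFAULT_SEEKS (2), a shrinker sees twice the page-cache pressure; a zfs_arc_shrinker_seeks of 4 would halve that to parity.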
@@ -325,13 +335,17 @@ arc_set_sys_free(uint64_t allmem)
	/*
	 * Base wmark_low is 4 * the square root of Kbytes of RAM.
	 */
-	long wmark = 4 * int_sqrt(allmem/1024) * 1024;
+	long wmark = int_sqrt(allmem / 1024 * 16) * 1024;
What's the goal of this change? It seems to be reducing precision for no good reason.
It is actually improving precision, even though not by much. And primarily it is what newer kernels actually do.
	 */
	wmark = MAX(wmark, 128 * 1024);
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 7, 0)
	wmark = MIN(wmark, 256 * 1024 * 1024);
I kinda dislike having kernel-specific magic limits here - at least without a great explanation in the code.
All this chunk is a kernel-specific magic. I am just updating it for newer kernels.
	if (!arc_evict_needed) {
		arc_evict_needed = B_TRUE;
		zthr_wakeup(arc_evict_zthr);
	if (lax) {
I feel that 'lax' is not self-explanatory - I can see what it's doing, but not why some callers want it and some don't?
When the OS signals memory pressure, something is not good, so we should be more careful about overflow handling and should complete evictions before returning. If the same happens as part of routine I/O, there is no point in being strict.
I've added more comments here.
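A hypothetical toy model (structure and names invented here, not the PR's actual code) of the caller contract described above: a strict caller responding to OS memory pressure completes eviction before returning, while a lax caller on routine I/O only records the deficit for the background evict thread:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, invented for illustration: tracks how many bytes are
 * still owed to eviction and how many have actually been evicted.
 */
struct arc_model {
	long deficit;	/* bytes still owed to eviction */
	long evicted;	/* bytes actually evicted */
};

static long
arc_model_reduce(struct arc_model *a, long bytes, bool lax)
{
	a->deficit += bytes;
	if (lax)
		return (0);		/* async: evict thread catches up later */
	long done = a->deficit;		/* strict: finish before returning */
	a->evicted += done;
	a->deficit = 0;
	return (done);
}
```

The strict path also returns how much was actually freed, matching the commit message's point about reporting real rather than requested reductions.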
- When receiving a memory pressure signal from the OS, be more strict in trying to free some memory. Otherwise the kernel may come again and request much more. Return as a result how much arc_c was actually reduced due to this request, which may be less than requested.
- Add a new module parameter zfs_arc_shrinker_seeks to balance ARC eviction cost relative to the page cache and other subsystems.
- Slightly update the Linux arc_set_sys_free() math.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
FWIW I think there's yet another possible source for excessive swapping in addition to your observations - it might be caused by too high In our situation, since we run with
@snajpa Yes, I was also thinking about
@amotin I haven't looked at the code yet, but if it doesn't do it already, it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all, and then decrementing the requested order on failure. We could also optimize further by saving the last successful order :) and only sometimes (whatever that means for now) going for a higher order.
That is what ZFS does. It tries to allocate big first, but if that fails, it requests smaller and smaller until it gets enough. But that way it consumes all the remaining big chunks first.
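The step-down strategy can be sketched like this (illustrative userspace code; alloc_best_order and the stub are invented here, standing in for abd_alloc_chunks()'s loop over alloc_pages_node()):

```c
#include <assert.h>

/*
 * Illustrative sketch (function names invented): try the largest page
 * order first and step down on failure until an allocation succeeds.
 * try_alloc stands in for alloc_pages_node(); returns nonzero on success.
 */
static int
alloc_best_order(int max_order, int (*try_alloc)(int))
{
	for (int order = max_order; order >= 0; order--)
		if (try_alloc(order))
			return (order);
	return (-1);	/* even single pages unavailable */
}

/* Stub: pretend fragmentation leaves nothing above order 2 free. */
static int
fragmented_alloc(int order)
{
	return (order <= 2);
}
```

As the comment above notes, the cost of this greedy approach is that the remaining high-order chunks get consumed first.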
It actually seems to directly call
@snajpa Most of ARC capacity is allocated by abd_alloc_chunks() via alloc_pages_node().
I've tried
Interestingly, it seems to be called to always get pretty similar amounts of memory - ranging from 273408 to 273856 bytes (?)
Motivation and Context
Since simultaneously updating to the Linux 6.6 kernel and increasing the maximum ARC size in TrueNAS SCALE 24.04, we've started to receive multiple complaints from people about excessive swapping making systems unresponsive. While I attribute a significant part of the problem to the new Multi-Gen LRU code enabled in the 6.6 kernel (disabling it helps), I ended up with this set of smaller tunings on the ZFS side, trying to make it behave a bit more nicely in this terrible environment.