Missing ratio-based purge results in unbounded memory overhead #2596

Open
BigBigos opened this issue Jan 18, 2024 · 1 comment

@BigBigos

Hello.

I would like to understand why the ratio-based purge algorithm was removed. Our application currently uses an ancient jemalloc 3.4.1, and upgrading to anything newer than 4.x is blocked because memory usage sporadically skyrockets, causing OOM issues (either from the kernel or from our internal "kill a task if it uses too much memory" mechanism).

Our application exhibits spikes in memory usage: a thread may allocate a lot of memory and then release it a second later (more precisely, a different thread frees the memory for it). The ratio-based purge algorithm ensured that the "cached" dirty memory was constrained to 1/8th (by default) of the active memory, so we only needed to overprovision memory a little. With the time-based purge algorithm, however, the dirty memory can grow beyond what the ratio would allow, increasing memory pressure. Since each thread operates on a separate arena, another thread might exhibit a similar spike in memory consumption without being able to reuse the dirty pages left by the previous thread, greatly amplifying the total memory consumption of the process.
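To put rough numbers on the concern, here is a minimal sketch of the two bounds being compared. It assumes the ratio rule caps an arena's dirty pages at the active page count shifted right by the `lg_dirty_mult` setting (3 corresponds to the 1/8 default mentioned above), and that in the worst case under time-based decay every arena still holds its entire freed spike. The function names are hypothetical, not jemalloc APIs.

```c
#include <stddef.h>

/* Per-arena cap on cached dirty pages under ratio-based purging:
 * active pages divided by 2^lg_dirty_mult (3 gives the 1/8 ratio). */
size_t ratio_dirty_cap(size_t active_pages, unsigned lg_dirty_mult) {
    return active_pages >> lg_dirty_mult;
}

/* Worst case under time-based decay before any purging has happened:
 * each arena still caches the whole spike it just freed.
 * (Hypothetical helper for illustration only.) */
size_t time_based_worst_case(size_t narenas, size_t spike_pages) {
    return narenas * spike_pages;
}
```

For example, with 4 arenas that each keep 10000 pages active after freeing a 100000-page spike, the ratio rule would allow at most 4 × 1250 = 5000 dirty pages in total, while the time-based worst case is 400000 dirty pages until decay catches up.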

Have I missed something? Can we somehow tweak jemalloc to work within our constraints? Could the ratio-based purge algorithm somehow be brought back?

@interwq
Member

interwq commented Jan 18, 2024

Bounded caching is indeed one big advantage of ratio-based purging. However, another part of the issue here is inactivity at the allocator / arena / thread level, for which we do have a well-tested solution. Can you check whether setting background_thread:true in the MALLOC_CONF environment variable resolves the issue? Details are in https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md.
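For anyone who would rather bake the option into the binary than export the environment variable, jemalloc also reads a global `malloc_conf` string at startup. The following is a minimal sketch, assuming an unprefixed jemalloc 5.x build (builds configured with a symbol prefix expose a prefixed name such as `je_malloc_conf` instead). The `dirty_decay_ms` value is an illustrative assumption, not a recommendation from this thread.

```c
/* Read by jemalloc during initialization when the program is linked
 * against it; without jemalloc this is just an unused global string.
 * Equivalent in spirit to running the program with
 *   MALLOC_CONF="background_thread:true,dirty_decay_ms:1000"
 * set in the environment. */
const char *malloc_conf =
    "background_thread:true,"  /* let jemalloc's own threads drive decay purging */
    "dirty_decay_ms:1000";     /* assumed value: shorter dirty-page decay window */
```

Settings from the `MALLOC_CONF` environment variable are applied after this string, so the environment can still override the compiled-in defaults.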

As for the switch to time-based purging: in the vast majority of workloads, we observed a significant overall speed benefit (and in many cases a memory benefit as well). See Jason's talk on the topic: http://applicative.acm.org/2015/applicative.acm.org/speaker-JasonEvans.html

We have debated a couple of times whether to incorporate a ratio into the time-based purging. However, in the last few similar cases, enabling background threads solved the problem. To be clear, the inactivity issue affects both ratio-based and time-based purging, but time-based purging is affected much more. Background threads ensure that the time-based decay keeps making progress, so even though there is no guarantee or bound, the issue usually won't build up to the worst case where all threads' caches peak simultaneously.
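The inactivity effect described above can be illustrated with a toy model. This is a sketch under stated assumptions, not jemalloc's implementation: decay is modeled as halving the dirty pages once per second, whereas real jemalloc follows a smoother decay curve, and all names are hypothetical. The only point it makes is that a decay step runs only when something drives it, either activity on the arena or a background thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy arena: tracks only the number of cached dirty pages. */
typedef struct { size_t dirty_pages; } toy_arena;

/* One toy decay step: purge half of the remaining dirty pages. */
static void toy_decay_step(toy_arena *a) {
    a->dirty_pages -= a->dirty_pages / 2;
}

/* Dirty pages left after `seconds` ticks. A tick purges only if the
 * arena's owner is active or a background thread is running; an idle
 * arena without a background thread never advances its decay. */
size_t toy_dirty_after(size_t initial_dirty, unsigned seconds,
                       bool owner_active, bool background_thread) {
    toy_arena a = { initial_dirty };
    for (unsigned s = 0; s < seconds; s++) {
        if (owner_active || background_thread) {
            toy_decay_step(&a);
        }
    }
    return a.dirty_pages;
}
```

In this model, an idle arena holding 1024 dirty pages still holds all 1024 after three seconds without a background thread, and 128 with one.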
