
Add bin batching #2608

Open · wants to merge 6 commits into dev from bin_batcher
Conversation

davidtgoldblatt (Member)

One of the costliest operations we do is tcache flushing to arena bins in many-to-many producer/consumer workloads. When this happens, we can acquire lots of bin mutexes just to flush one or two items, which can cause lock convoying, wasted time spinning, etc.

After this stack of commits, if we aren't able to immediately acquire the bin lock (and we don't have very many items to flush, along with some other bookkeeping conditions), we instead move the items into a small per-bin queue (managed by a new type, the batcher_t), which can be pushed into in a lock-free manner so long as there is space.

In various large-scale production workloads, this reduces total CPU consumption by something like 0.2% of cycles when measured in a simple, obvious way. The true win is likely much larger: these mutexes also show up among the most frequent causes of off-CPU futex waiting, so we get locality and latency wins in callers as well.
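
For intuition, here's a minimal sketch of a bounded lock-free push of this flavor; the field names, capacity, and memory orderings are illustrative assumptions, not the actual batcher_t from the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_NELEMS 8

typedef struct batch_sketch_s {
	/* Number of slots claimed by pushers so far. */
	atomic_size_t nclaimed;
	/* Slots start NULL; a NULL slot means a claimed push hasn't landed yet. */
	_Atomic(void *) elems[BATCH_NELEMS];
} batch_sketch_t;

/*
 * Called when the flusher fails to grab the bin lock and has only a few items.
 * Returns false if the batch is full, in which case the caller falls back to
 * the ordinary locked flush.
 */
static bool
batch_sketch_push(batch_sketch_t *batch, void *item) {
	size_t n = atomic_load_explicit(&batch->nclaimed, memory_order_relaxed);
	do {
		if (n >= BATCH_NELEMS) {
			return false;
		}
	} while (!atomic_compare_exchange_weak_explicit(&batch->nclaimed, &n,
	    n + 1, memory_order_relaxed, memory_order_relaxed));
	/* Publish the item; the bin owner drains slots with acquire loads. */
	atomic_store_explicit(&batch->elems[n], item, memory_order_release);
	return true;
}

On the pop side, the bin owner (under the bin lock) would read the claimed count and drain the non-NULL slots; a still-NULL slot corresponds to the mid-push case discussed further down the thread.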

@davidtgoldblatt (Member Author)

I'm not 100% sure I'm reading them right, but I think that the static analysis results are incorrect -- the error pathway in both seems to involve a condition in which binind >= bin_info_nbatched_bins in one part of the control path (so that the arena_dalloc_bin_locked_info_t doesn't get initialized) followed by binind < bin_info_nbatched_bins (causing the stats to get accessed); but of course these can't both be true in one call stack.
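
For reference, a toy reconstruction of the shape of the flagged control flow (all names here are stand-ins, not the actual jemalloc code):

#include <stdio.h>

typedef struct { int ndalloc; } locked_info_t;

enum { NBATCHED_SIZES = 4 };

static void
flagged_path(unsigned binind) {
	locked_info_t info;  /* deliberately not initialized up front */
	if (binind < NBATCHED_SIZES) {
		info.ndalloc = 0;  /* the "initialization" arm */
	}
	/* ... unrelated flushing work that never modifies binind ... */
	if (binind < NBATCHED_SIZES) {
		/*
		 * For info to be uninitialized here, the first test must have
		 * been false and this one true for the same binind -- which
		 * can't happen in a single call, so the report looks like a
		 * false positive.
		 */
		printf("%d\n", info.ndalloc);
	}
}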

@davidtgoldblatt force-pushed the bin_batcher branch 4 times, most recently from c35a43c to a5fe8cd on February 22, 2024.
@davidtgoldblatt (Member Author)

(This is ready for review btw)

@nadavrot commented Mar 5, 2024

Very cool. The numbers look fantastic.

@guangli-dai (Contributor) left a comment:

Really a great idea to reduce contention by adding a lock-free per-bin batch. Regarding the new stats added: would adding a push_attempts counter help users tell whether batches_per_bin should be larger?
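
(As a hypothetical illustration of how such a stat could be read: if push_attempts were 1,000 but only 700 pushes succeeded, roughly 30% of contended flushes found the batch full and fell back to the locked path -- a sign that batches_per_bin, or the batch capacity, is too small.)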

JEMALLOC_ALWAYS_INLINE void
arena_bin_flush_batch_after_unlock(tsdn_t *tsdn, arena_t *arena, bin_t *bin,
    unsigned binind, arena_bin_flush_batch_state_t *state) {
	if (binind >= bin_info_nbatched_sizes) {
Contributor:

I think the static analysis is not correct here; it looks safe to me. Nit: can we use arena_bin_has_batch in places like this?

Member Author:

Done.

src/batcher.c (outdated excerpt):
* make it into shared data structures). Since this can happen
* regardless, we can just discard the data in any slots that were
* mid-push (because the caller can't tell the difference between a push
* call that didn't quite start and one that started but didn't finish).
Contributor:

While this seems inevitable, will the data loss here lead to memory leaks?

Member Author:

Yes, but that bug is already present. The thing to note is that using the batcher can't introduce any new bugs. I beefed up the comment here; PTAL.
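
To make the "discard mid-push slots" case concrete, here is a hedged sketch of the kind of drain that comment describes, reusing the illustrative batch_sketch_t from the earlier sketch. It assumes it runs where no pusher can still make progress (e.g., the post-fork child); it is not the actual patch:

static size_t
batch_sketch_drain_abandoned(batch_sketch_t *batch, void **out, size_t out_cap) {
	size_t nclaimed = atomic_load_explicit(&batch->nclaimed,
	    memory_order_acquire);
	if (nclaimed > BATCH_NELEMS) {
		nclaimed = BATCH_NELEMS;
	}
	size_t npopped = 0;
	for (size_t i = 0; i < nclaimed && npopped < out_cap; i++) {
		void *item = atomic_load_explicit(&batch->elems[i],
		    memory_order_relaxed);
		atomic_store_explicit(&batch->elems[i], NULL,
		    memory_order_relaxed);
		if (item == NULL) {
			/*
			 * Slot was claimed but never filled: the push was
			 * interrupted, so the item is simply dropped (leaked),
			 * matching what the interrupted pusher alone could
			 * already cause.
			 */
			continue;
		}
		out[npopped++] = item;
	}
	atomic_store_explicit(&batch->nclaimed, 0, memory_order_relaxed);
	return npopped;
}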

@interwq (Member) left a comment:

Still reading but have some early questions

(Resolved review threads on include/jemalloc/internal/batcher.h, src/batcher.c, and src/tcache.c.)
@interwq (Member) commented Apr 12, 2024

My biggest concern right now is the complexity of the atomics-based / lock-free synchronization, and its implications for reproducibility and debuggability. To justify that it's worth the cost, we should show that:

  1. it brings a significant performance improvement, and
  2. there is no easy lock-based solution with close-enough efficiency.

For (1), the usual workload benchmarking should suffice. For (2), here's what I'm asking: can you try adding another mutex per batcher (on top of your current approach) that covers the necessary operations like push and pop? Then benchmark that approach to see how much additional overhead there is -- if it's a lot slower, that shows the benefit of the atomics; if the overhead is small, it implies that most of the synchronization cost here comes from hardware cache coherency. I hope it's not too much pain to try the mutex there. If you have a simpler idea, please feel free to propose it as well.

@davidtgoldblatt (Member Author)

Ran some canaries + local tests -- the mutex approach takes a modest but not huge hit (something like 0.05%, although it gets hard to measure stuff that small). (Also, ran some canaries for the cases @guangli-dai is looking at where ncached_max gets cranked up for hot threads -- still results in a ~0.15% e2e cycle reduction on top).

But a mutex would also let us unify the batches and share space (giving a lot more flexibility in the number of items flushed, if we like), so I'm inclined to try a "peeking" approach, where we still mostly do mutex synchronization but first make an educated guess about whether the push attempt will succeed. Will try it out in the next day or two.
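
A rough sketch of what that peek-then-lock push might look like (illustrative names and plain pthreads rather than jemalloc's internal mutex type; a sketch of the idea, not the eventual change):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_PUSH_LIMIT 8

typedef struct mtx_batch_s {
	pthread_mutex_t mtx;
	atomic_size_t nelems;          /* readable without the mutex for the peek */
	void *elems[BATCH_PUSH_LIMIT];
} mtx_batch_t;

static bool
mtx_batch_try_push(mtx_batch_t *batch, void **items, size_t nitems) {
	/* Peek: give up cheaply if the batch already looks too full. */
	if (atomic_load_explicit(&batch->nelems, memory_order_relaxed)
	    + nitems > BATCH_PUSH_LIMIT) {
		return false;
	}
	pthread_mutex_lock(&batch->mtx);
	size_t n = atomic_load_explicit(&batch->nelems, memory_order_relaxed);
	/* Re-check under the lock; the peek was only an educated guess. */
	if (n + nitems > BATCH_PUSH_LIMIT) {
		pthread_mutex_unlock(&batch->mtx);
		return false;
	}
	for (size_t i = 0; i < nitems; i++) {
		batch->elems[n + i] = items[i];
	}
	atomic_store_explicit(&batch->nelems, n + nitems, memory_order_relaxed);
	pthread_mutex_unlock(&batch->mtx);
	return true;
}

If either check fails, the flusher simply falls back to the ordinary locked flush.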

David Goldblatt added 6 commits on May 1, 2024:

This can be used to batch up simple operation commands for later use by another thread.

In the next commit, we'll start using the batcher to eliminate mutex traffic. To avoid cluttering up that commit with the random bits of busy-work it entails, we'll centralize them here. This commit introduces:
- A batched bin type.
- The ability to mix batched and unbatched bins in the arena.
- Conf parsing to set batches per size and a max batched size.
- mallctl access to the corresponding opt-namespace keys.
- Stats output of the above.

The main bits of shared code are the edata filtering and the stats-flushing logic, both of which are fairly simple to read and not so painful to duplicate. The shared code comes at the cost of guarding all the subtle logic with `if (small)`, which doesn't feel worth it.

This accomplishes two things:
- It avoids a full array scan (and any attendant branch-prediction misses, etc.) while holding the bin lock.
- It allows us to know the number of items that will be flushed before flushing them, which will (in an upcoming commit) let us know whether it's safe to use the batched flush (in which case we won't acquire a lock at all).

This adds a fast path for threads freeing a small number of allocations to bins which are not their "home base" and which encounter lock contention in attempting to do so. In producer-consumer workflows, such small lock hold times can cause lock convoying that greatly increases overall bin mutex contention.

This lets us easily see what fraction of flush load is being taken up by the bins, and helps guide future optimization approaches (for example: should we prefetch during cache-bin fills? It depends on how many objects the average fill pops out of the batch).
@davidtgoldblatt (Member Author)

Now using a mutex (locks) for the bin batching. Somewhat slower (2 CASes per op instead of one, and pushers block one another during a memcpy), but a big enough advantage over "wait for a pairing-heap operation" that it hopefully doesn't end up mattering. It also buys some flexibility in terms of push limits, so I added a tunable there.
