
Add bin batching #2608

Open · wants to merge 6 commits into dev from bin_batcher
Conversation

davidtgoldblatt (Member)

One of the costliest operations we do is tcache flushing to arena bins in many-to-many producer/consumer workloads. When this happens, we can acquire lots of bin mutexes just to flush one or two items, which can cause lock convoying, wasted time spinning, etc.

After this stack of commits, if we aren't able to immediately acquire the bin lock (and we don't have very many items to flush, along with some other bookkeeping conditions), we instead move the items into a small per-bin queue (managed by a new type, the batcher_t), which can be pushed into in a lock-free manner so long as there is space.

In various large-scale production workloads, this reduces total CPU consumption by something like 0.2% of cycles when measured in a simple, obvious way. The true win is likely much larger: these mutexes also show up among the most frequent causes of off-CPU futex waiting, so we get locality and latency wins in callers as well.
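
For intuition, here's a minimal sketch of a bounded lock-free push of this flavor; the field names, capacity, and memory orderings are illustrative assumptions, not the actual batcher_t from the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_NELEMS 8

typedef struct batch_sketch_s {
	/* Number of slots claimed by pushers so far. */
	atomic_size_t nclaimed;
	/* Slots start NULL; a NULL slot means a claimed push hasn't landed yet. */
	_Atomic(void *) elems[BATCH_NELEMS];
} batch_sketch_t;

/*
 * Called when the flusher fails to grab the bin lock and has only a few items.
 * Returns false if the batch is full, in which case the caller falls back to
 * the ordinary locked flush.
 */
static bool
batch_sketch_push(batch_sketch_t *batch, void *item) {
	size_t n = atomic_load_explicit(&batch->nclaimed, memory_order_relaxed);
	do {
		if (n >= BATCH_NELEMS) {
			return false;
		}
	} while (!atomic_compare_exchange_weak_explicit(&batch->nclaimed, &n,
	    n + 1, memory_order_relaxed, memory_order_relaxed));
	/* Publish the item; the bin owner drains slots with acquire loads. */
	atomic_store_explicit(&batch->elems[n], item, memory_order_release);
	return true;
}

On the pop side, the bin owner (under the bin lock) would read the claimed count and drain the non-NULL slots; a still-NULL slot corresponds to the mid-push case discussed further down the thread.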

@davidtgoldblatt (Member Author)

I'm not 100% sure I'm reading them right, but I think that the static analysis results are incorrect -- the error pathway in both seems to involve a condition in which binind >= bin_info_nbatched_bins in one part of the control path (so that the arena_dalloc_bin_locked_info_t doesn't get initialized) followed by binind < bin_info_nbatched_bins (causing the stats to get accessed); but of course these can't both be true in one call stack.
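
For reference, a toy reconstruction of the shape of the flagged control flow (all names here are stand-ins, not the actual jemalloc code):

#include <stdio.h>

typedef struct { int ndalloc; } locked_info_t;

enum { NBATCHED_SIZES = 4 };

static void
flagged_path(unsigned binind) {
	locked_info_t info;  /* deliberately not initialized up front */
	if (binind < NBATCHED_SIZES) {
		info.ndalloc = 0;  /* the "initialization" arm */
	}
	/* ... unrelated flushing work that never modifies binind ... */
	if (binind < NBATCHED_SIZES) {
		/*
		 * For info to be uninitialized here, the first test must have
		 * been false and this one true for the same binind -- which
		 * can't happen in a single call, so the report looks like a
		 * false positive.
		 */
		printf("%d\n", info.ndalloc);
	}
}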

@davidtgoldblatt force-pushed the bin_batcher branch 4 times, most recently from c35a43c to a5fe8cd on February 22, 2024.
@davidtgoldblatt (Member Author)

(This is ready for review btw)

@nadavrot commented Mar 5, 2024

Very cool. The numbers look fantastic.

@guangli-dai (Contributor) left a comment:

Really a great idea to reduce contention by adding a lock-free per-bin batch. Regarding the new stats added: would adding a push_attempts counter help users tell whether batches_per_bin should be larger?
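
(As a hypothetical illustration of how such a stat could be read: if push_attempts were 1,000 but only 700 pushes succeeded, roughly 30% of contended flushes found the batch full and fell back to the locked path -- a sign that batches_per_bin, or the batch capacity, is too small.)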

JEMALLOC_ALWAYS_INLINE void
arena_bin_flush_batch_after_unlock(tsdn_t *tsdn, arena_t *arena, bin_t *bin,
    unsigned binind, arena_bin_flush_batch_state_t *state) {
	if (binind >= bin_info_nbatched_sizes) {
Contributor:

I think the static analysis is not correct here; it looks safe to me. Nit: can we use arena_bin_has_batch in places like this?

Member Author:

Done.

src/batcher.c (outdated excerpt):
* make it into shared data structures). Since this can happen
* regardless, we can just discard the data in any slots that were
* mid-push (because the caller can't tell the difference between a push
* call that didn't quite start and one that started but didn't finish).
Contributor:

While this seems inevitable, will the data loss here lead to memory leaks?

Member Author:

Yes, but that bug is already present. The thing to note is that using the batcher can't introduce any new bugs. I beefed up the comment here; PTAL.
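
To make the "discard mid-push slots" case concrete, here is a hedged sketch of the kind of drain that comment describes, reusing the illustrative batch_sketch_t from the earlier sketch. It assumes it runs where no pusher can still make progress (e.g., the post-fork child); it is not the actual patch:

static size_t
batch_sketch_drain_abandoned(batch_sketch_t *batch, void **out, size_t out_cap) {
	size_t nclaimed = atomic_load_explicit(&batch->nclaimed,
	    memory_order_acquire);
	if (nclaimed > BATCH_NELEMS) {
		nclaimed = BATCH_NELEMS;
	}
	size_t npopped = 0;
	for (size_t i = 0; i < nclaimed && npopped < out_cap; i++) {
		void *item = atomic_load_explicit(&batch->elems[i],
		    memory_order_relaxed);
		atomic_store_explicit(&batch->elems[i], NULL,
		    memory_order_relaxed);
		if (item == NULL) {
			/*
			 * Slot was claimed but never filled: the push was
			 * interrupted, so the item is simply dropped (leaked),
			 * matching what the interrupted pusher alone could
			 * already cause.
			 */
			continue;
		}
		out[npopped++] = item;
	}
	atomic_store_explicit(&batch->nclaimed, 0, memory_order_relaxed);
	return npopped;
}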

@interwq (Member) left a comment:

Still reading but have some early questions

(Resolved review threads on include/jemalloc/internal/batcher.h, src/batcher.c, and src/tcache.c.)
@interwq (Member) commented Apr 12, 2024

My biggest concern right now is the complexity of the atomics-based / lock-free synchronization, and its implications for reproducibility and debuggability. To justify that it's worth the cost, we should show that:

  1. it brings a significant performance improvement, and
  2. there is no easy lock-based solution with close-enough efficiency.

For (1), the usual workload benchmarking should suffice. For (2), here's what I'm asking: can you try adding another mutex per batcher (on top of your current approach) that covers the necessary operations like push and pop? Then benchmark that approach to see how much additional overhead there is -- if it's a lot slower, that shows the benefit of the atomics; if the overhead is small, it implies that most of the synchronization cost here comes from hardware cache coherency. I hope it's not too much pain to try the mutex there. If you have a simpler idea, please feel free to propose it as well.

@davidtgoldblatt (Member Author)

Ran some canaries + local tests -- the mutex approach takes a modest but not huge hit (something like 0.05%, although it gets hard to measure stuff that small). (Also, ran some canaries for the cases @guangli-dai is looking at where ncached_max gets cranked up for hot threads -- still results in a ~0.15% e2e cycle reduction on top).

But a mutex would also let us unify the batches and share space (giving a lot more flexibility in the number of items flushed, if we like), so I'm inclined to try a "peeking" approach, where we still mostly do mutex synchronization but first make an educated guess about whether the push attempt will succeed. Will try it out in the next day or two.
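
A rough sketch of what that peek-then-lock push might look like (illustrative names and plain pthreads rather than jemalloc's internal mutex type; a sketch of the idea, not the eventual change):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_PUSH_LIMIT 8

typedef struct mtx_batch_s {
	pthread_mutex_t mtx;
	atomic_size_t nelems;          /* readable without the mutex for the peek */
	void *elems[BATCH_PUSH_LIMIT];
} mtx_batch_t;

static bool
mtx_batch_try_push(mtx_batch_t *batch, void **items, size_t nitems) {
	/* Peek: give up cheaply if the batch already looks too full. */
	if (atomic_load_explicit(&batch->nelems, memory_order_relaxed)
	    + nitems > BATCH_PUSH_LIMIT) {
		return false;
	}
	pthread_mutex_lock(&batch->mtx);
	size_t n = atomic_load_explicit(&batch->nelems, memory_order_relaxed);
	/* Re-check under the lock; the peek was only an educated guess. */
	if (n + nitems > BATCH_PUSH_LIMIT) {
		pthread_mutex_unlock(&batch->mtx);
		return false;
	}
	for (size_t i = 0; i < nitems; i++) {
		batch->elems[n + i] = items[i];
	}
	atomic_store_explicit(&batch->nelems, n + nitems, memory_order_relaxed);
	pthread_mutex_unlock(&batch->mtx);
	return true;
}

If either check fails, the flusher simply falls back to the ordinary locked flush.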

David Goldblatt added 6 commits on May 1, 2024:

This can be used to batch up simple operation commands for later use by another thread.

In the next commit, we'll start using the batcher to eliminate mutex traffic. To avoid cluttering up that commit with the random bits of busy-work it entails, we'll centralize them here. This commit introduces:
- A batched bin type.
- The ability to mix batched and unbatched bins in the arena.
- Conf parsing to set batches per size and a max batched size.
- mallctl access to the corresponding opt-namespace keys.
- Stats output of the above.

The main bits of shared code are the edata filtering and the stats-flushing logic, both of which are fairly simple to read and not so painful to duplicate. The shared code comes at the cost of guarding all the subtle logic with `if (small)`, which doesn't feel worth it.

This accomplishes two things:
- It avoids a full array scan (and any attendant branch-prediction misses, etc.) while holding the bin lock.
- It allows us to know the number of items that will be flushed before flushing them, which will (in an upcoming commit) let us know whether it's safe to use the batched flush (in which case we won't acquire a lock at all).

This adds a fast path for threads freeing a small number of allocations to bins which are not their "home base" and which encounter lock contention in attempting to do so. In producer-consumer workflows, such small lock hold times can cause lock convoying that greatly increases overall bin mutex contention.

This lets us easily see what fraction of flush load is being taken up by the bins, and helps guide future optimization approaches (for example: should we prefetch during cache-bin fills? It depends on how many objects the average fill pops out of the batch).
@davidtgoldblatt (Member Author)

Now using a mutex (locks) for the bin batching. Somewhat slower (2 CASes per op instead of one, and pushers block one another during a memcpy), but a big enough advantage over "wait for a pairing-heap operation" that it hopefully doesn't end up mattering. It also buys some flexibility in terms of push limits, so I added a tunable there.
