Add bin batching #2608
base: dev
Conversation
I'm not 100% sure I'm reading them right, but I think that the static analysis results are incorrect -- the error pathway in both seems to involve a condition in which binind >= bin_info_nbatched_bins in one part of the control path (so that the …
Force-pushed from c35a43c to a5fe8cd
(This is ready for review btw)
Very cool. The numbers look fantastic.
Really a great idea to reduce contention by adding a lock-free per-bin batch. Regarding the new stats: would adding a push_attempts counter help users know whether batches_per_bin should be larger?
```c
JEMALLOC_ALWAYS_INLINE void
arena_bin_flush_batch_after_unlock(tsdn_t *tsdn, arena_t *arena, bin_t *bin,
    unsigned binind, arena_bin_flush_batch_state_t *state) {
	if (binind >= bin_info_nbatched_sizes) {
```
I think the static analysis is not correct here; it looks safe to me. nit: can we use `arena_bin_has_batch` in places like this?
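For illustration, here's a minimal sketch of the kind of helper the reviewer is suggesting: centralizing the "is this a batched bin?" comparison instead of open-coding `binind >= bin_info_nbatched_sizes` at each call site. The constant's value here is a made-up stand-in, not jemalloc's real configuration.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for jemalloc's global: the number of size classes that have
 * batching enabled (hypothetical value, for illustration only). */
static const unsigned bin_info_nbatched_sizes = 8;

/* Sketch of the suggested helper: one place that answers "does this bin
 * index have a batch?", so call sites don't repeat the raw comparison. */
static inline bool
arena_bin_has_batch(unsigned binind) {
	return binind < bin_info_nbatched_sizes;
}
```

With this, the early-return above becomes `if (!arena_bin_has_batch(binind))`, which also makes the intent obvious to static analyzers and readers alike.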
Done.
src/batcher.c
Outdated
```c
 * make it into shared data structures). Since this can happen
 * regardless, we can just discard the data in any slots that were
 * mid-push (because the caller can't tell the difference between a push
 * call that didn't quite start and one that started but didn't finish).
```
While this seems inevitable, will the data loss here lead to memory leaks?
Yes, but that bug is already present. The thing to note is that using the batcher can't introduce any new bugs. I beefed up the comment here; PTAL.
Force-pushed from a5fe8cd to ac0c08d
Still reading but have some early questions
Force-pushed from ac0c08d to 0674025
My biggest concern right now is the complexity of the atomics-based / lock-free synchronization, and its implications for reproducibility and debuggability. To justify that it's worth the cost, we should show that: (1) it delivers a real performance win; and (2) the win actually requires the lock-free approach rather than a simpler mutex.
For (1), the usual workload benchmarking should suffice. For (2), here's what I'm asking: can you try adding another mutex per batcher (to your current approach) that covers the necessary operations like push and pop? Then benchmark that approach to see how much additional overhead there is. If it's a lot slower, that demonstrates the benefit of the atomics; if the overhead is small, it implies that most of the synchronization cost here comes from hardware cache coherency. I hope it's not too much pain to try the mutex there. If you have a simpler idea, please feel free to propose it as well.
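To make the proposed experiment concrete, a mutex-per-batcher variant might look roughly like this. All names (`mtx_batcher_t`, `BATCH_CAP`, etc.) are illustrative stand-ins, not jemalloc's actual types; the point is just that one lock covers both push and pop.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_CAP 8  /* hypothetical per-bin batch capacity */

/* Hypothetical mutex-based batcher: every operation takes the lock,
 * trading the lock-free CAS loop for simplicity and debuggability. */
typedef struct {
	pthread_mutex_t mtx;
	size_t nelems;
	void *elems[BATCH_CAP];
} mtx_batcher_t;

static void
mtx_batcher_init(mtx_batcher_t *b) {
	pthread_mutex_init(&b->mtx, NULL);
	b->nelems = 0;
}

/* Returns true on success, false if the batch was full. */
static bool
mtx_batcher_push(mtx_batcher_t *b, void *elem) {
	pthread_mutex_lock(&b->mtx);
	bool ok = b->nelems < BATCH_CAP;
	if (ok) {
		b->elems[b->nelems++] = elem;
	}
	pthread_mutex_unlock(&b->mtx);
	return ok;
}

/* Pops the whole batch into out[]; returns the number of items. */
static size_t
mtx_batcher_pop_all(mtx_batcher_t *b, void *out[BATCH_CAP]) {
	pthread_mutex_lock(&b->mtx);
	size_t n = b->nelems;
	for (size_t i = 0; i < n; i++) {
		out[i] = b->elems[i];
	}
	b->nelems = 0;
	pthread_mutex_unlock(&b->mtx);
	return n;
}
```

Benchmarking this against the lock-free version is exactly the comparison asked for: if the numbers are close, most of the cost is cache-line traffic rather than the synchronization primitive itself.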
Ran some canaries + local tests -- the mutex approach takes a modest but not huge hit (something like 0.05%, although it gets hard to measure anything that small). (Also ran some canaries for the cases @guangli-dai is looking at, where ncached_max gets cranked up for hot threads -- that still yields a ~0.15% e2e cycle reduction on top.) But a mutex would also let us unify the batches and share space (giving a lot more flexibility in the number of items flushed, if we like), so I'm inclined to try out a "peeking" approach, where we still mostly do mutex synchronization but first make an educated guess about whether the push attempt will succeed. Will try it out in the next day or two.
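A sketch of what such a "peeking" push could look like: the count is peeked with a relaxed atomic load so a pusher can bail out without ever touching the lock when the batch looks full, then re-checked under the mutex since the peek can be stale. This is my reading of the idea, with illustrative names, not the PR's actual implementation.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_CAP 8  /* hypothetical per-bin batch capacity */

/* Hypothetical "peeking" batcher: synchronization is still an ordinary
 * mutex, but push first peeks at the fill count without locking. */
typedef struct {
	pthread_mutex_t mtx;
	atomic_size_t nelems;   /* written under mtx; peeked without it */
	void *elems[BATCH_CAP];
} peek_batcher_t;

static void
peek_batcher_init(peek_batcher_t *b) {
	pthread_mutex_init(&b->mtx, NULL);
	atomic_init(&b->nelems, 0);
}

static bool
peek_batcher_push(peek_batcher_t *b, void *elem) {
	/* Educated guess: if the batch already looks full, skip the lock
	 * entirely.  The guess can be stale, hence the re-check below. */
	if (atomic_load_explicit(&b->nelems, memory_order_relaxed)
	    >= BATCH_CAP) {
		return false;
	}
	pthread_mutex_lock(&b->mtx);
	size_t n = atomic_load_explicit(&b->nelems, memory_order_relaxed);
	bool ok = n < BATCH_CAP;
	if (ok) {
		b->elems[n] = elem;
		atomic_store_explicit(&b->nelems, n + 1,
		    memory_order_relaxed);
	}
	pthread_mutex_unlock(&b->mtx);
	return ok;
}
```

The attraction is that the common failure case (batch full) stays lock-free, while the success path keeps the reproducibility of plain mutex synchronization.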
This can be used to batch up simple operation commands for later use by another thread.
In the next commit, we'll start using the batcher to eliminate mutex traffic. To avoid cluttering up that commit with the random bits of busy-work it entails, we'll centralize them here. This commit introduces:
- A batched bin type.
- The ability to mix batched and unbatched bins in the arena.
- Conf parsing to set batches per size and a max batched size.
- mallctl access to the corresponding opt-namespace keys.
- Stats output of the above.
The main bits of shared code are the edata filtering and the stats flushing logic, both of which are fairly simple to read and not so painful to duplicate. The shared code comes at the cost of guarding all the subtle logic with `if (small)`, which doesn't feel worth it.
This accomplishes two things:
- It avoids a full array scan (and any attendant branch prediction misses, etc.) while holding the bin lock.
- It allows us to know the number of items that will be flushed before flushing them, which will (in an upcoming commit) let us know if it's safe to use the batched flush (in which case we won't acquire a lock at all).
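The pre-counting idea can be sketched trivially: scan the flush array once, outside any lock, to learn how many items belong to a given bin before deciding which flush path is safe. The function and its parameters here are hypothetical stand-ins for the commit's actual edata filtering.

```c
#include <stddef.h>

/* Sketch: count how many of the items queued for flushing belong to the
 * bin `binind`, without holding any lock.  item_binds[i] is the bin index
 * of the i-th item (an illustrative flattening of jemalloc's edata walk). */
static size_t
count_items_for_bin(const unsigned *item_binds, size_t nitems,
    unsigned binind) {
	size_t n = 0;
	for (size_t i = 0; i < nitems; i++) {
		if (item_binds[i] == binind) {
			n++;
		}
	}
	return n;
}
```

Knowing `n` up front is what lets a later commit decide "will these fit in the batch?" before committing to either the batched or the locked flush.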
This adds a fast-path for threads freeing a small number of allocations to bins which are not their "home-base" and which encounter lock contention in attempting to do so. In producer-consumer workflows, such small lock hold times can cause lock convoying that greatly increases overall bin mutex contention.
This lets us easily see what fraction of flush load is being taken up by the bins, and helps guide future optimization approaches (for example: should we prefetch during cache bin fills? It depends on how many objects the average fill pops out of the batch).
Force-pushed from 0674025 to 4ad55bb
Now using a mutex + locks for the bin batching. Somewhat slower (2 CASes per op instead of one, and pushers block one another during a memcpy), but a big enough advantage over "wait for a pairing heap operation" that it hopefully doesn't end up mattering. It also buys some flexibility in terms of push limits, so I added a tunable there.
One of the costliest operations we do is tcache flushing to arena bins in many-to-many producer/consumer workloads. When this happens, we can acquire lots of bin mutexes just to flush one or two items. This can cause lock convoying, wasted time spinning etc.
After this stack of commits, if we aren't able to immediately acquire the bin lock (and we don't have very many items to flush, along with some other bookkeeping conditions), we instead just move the items to flush into a small per-bin queue (managed by a new type, the `batcher_t`) which can (so long as there is space) be pushed into in a lock-free manner.

In various large-scale production workloads, this reduces total CPU consumption by something like 0.2% of cycles when measured in a simple, obvious way. The true win is likely much larger; these mutexes also show up among the most frequent causes of off-CPU futex waiting, so we get locality and latency wins in callers as well.
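The overall fast path described above can be sketched as: try-lock the bin mutex, flush directly on success, and on contention park the items in the per-bin batch instead of blocking. Everything here (`toy_bin_t`, the counters, the unsynchronized batch) is a deliberately simplified stand-in for jemalloc's real structures, just to show the control flow.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_CAP 4  /* hypothetical per-bin batch capacity */

/* Toy stand-in for a bin: a lock, a flush counter, and a batch. */
typedef struct {
	pthread_mutex_t lock;
	size_t nfreed;          /* items flushed under the lock */
	size_t batch_nelems;    /* items parked in the batch */
	void *batch[BATCH_CAP];
} toy_bin_t;

/* Returns true if the items were handled (flushed or batched); false
 * means the caller must fall back to a blocking lock acquisition. */
static bool
toy_bin_flush_small(toy_bin_t *bin, void **items, size_t n) {
	if (pthread_mutex_trylock(&bin->lock) == 0) {
		/* Uncontended: flush directly, as before. */
		bin->nfreed += n;
		pthread_mutex_unlock(&bin->lock);
		return true;
	}
	/* Lock contended: park the items in the batch if they fit.  (The
	 * real batcher push is synchronized; this toy is single-threaded
	 * for brevity.) */
	if (bin->batch_nelems + n > BATCH_CAP) {
		return false;
	}
	for (size_t i = 0; i < n; i++) {
		bin->batch[bin->batch_nelems++] = items[i];
	}
	return true;
}
```

The key property is that a thread freeing one or two stray items never queues behind the bin lock; it either gets the lock for free or does a cheap batch push, which is what breaks the lock-convoying pattern.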