Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use popcount + PDEP for uniform random slot selection on x86 for Haswell and later #26

Open
thestinger opened this issue Aug 30, 2018 · 7 comments

Comments

@thestinger
Copy link
Member

thestinger commented Aug 30, 2018

PDEP allows selecting the nth unset bit efficiently (a couple cycles) so it's a fantastic way of implementing this. There's no clear way to do it at all efficiently elsewhere, which is why the current portable implementation only randomizes the search start index and then uses the ffs intrinsic.

@pgera
Copy link

pgera commented Apr 1, 2021

I don't have much context about your overall problem, but you can do select reasonably efficiently without PDEP too. See folly for example (https://github.com/facebook/folly/blob/master/folly/experimental/Select64.h).

@thestinger
Copy link
Member Author

@thestinger
Copy link
Member Author

The non-random implementation is quite fast already and the only reason we don't manually unroll the loop is because the compiler generates code that's as good or better from this. I do have an implementation of unrolling it around somewhere.

The random implementation could be faster and ideally it would be a uniform random choice, not simply starting the search in a random location as we're currently doing. I don't want to pay any additional performance cost to make it better though.

@thestinger
Copy link
Member Author

Ideally, we would have an implementation that is faster and does uniform random choice everywhere, not only modern x86_64.

@jvoisin
Copy link
Contributor

jvoisin commented Jan 4, 2022

As explained on matrix by @thestinger :

it could potentially be improved another way: you can see where that's done in get_free_slot there's a bitmap consisting of 4 x u64 depending on the size class, 1-4 u64 may be used, so it has to iterate only over the ones that are used and for many size classes, the final u64 is only partially used so it has a function (get_mask) to mask out the bits it shouldn't look at each bit represents a slot in the slab most is 256 (for size 0 and 16) and then drops to 128 (so only 2 of the u64 used) for 32 byte allocations. Anyway ideally this would be made faster, the way it currently randomizes slot selection is picking a random index between [0..number_of_slots) and then it starts the search from there, wrapping back around to check the portion before that if needed so that's not a uniform random choice. If we used PDEP we could probably make it both faster and uniform or even faster and more uniform than otherwise but not fully

@PaulusParssinen
Copy link

PaulusParssinen commented Oct 18, 2022

Just noting that PDEP is terrible on AMD processors before Zen 3. It essentially is emulated in the microcode and gets slower (8 uops) for every bit set in the PDEP mask operand.
Performance of PDEP is terrible even on reg-to-reg and its even worse with memory operand.

More discussion/info about it here:

@jvoisin
Copy link
Contributor

jvoisin commented Jan 2, 2024

There are similar optimisations for Arm NEON, as used by isoalloc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants