Add --with-lg-tcache-limit configuration option to allow for more than 4094 tcaches #2384

Open: veselink1 wants to merge 2 commits into base: dev

veselink1 commented Feb 20, 2023

In the Couchbase Server Data Service (KV Engine), we use one tcache per arena, per thread (so the two counts multiply).

We have one arena per bucket, and we currently support 30 buckets, but might eventually bump this up to 100 buckets per instance.

However, because we allocate a tcache per thread as well, the number of tcaches we use ends up being num-buckets x num-threads. Tcaches are allocated lazily, when a thread first decides to use a bucket's arena, so not all threads will need a tcache, but most will.

jemalloc can automatically allocate and make use of a tcache, but we've found that can result in incorrect accounting of memory stats per-arena, which is a deal-breaker for us, because we rely on some of these stats to calculate per-arena memory fragmentation.

We're currently evaluating this change in our test environments, and plan to ship a release with it and with jemalloc configured with --with-lg-tcache-limit=15, which allows up to 32K tcaches to be created. We don't expect to actually reach that limit, but we want to be able to run correctly with roughly 4-7K tcaches on some larger machine configurations.
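
To make the sizing above concrete, here is a minimal sketch of the arithmetic, not code from the patch. The function names are hypothetical, and the thread counts in the usage note are assumptions chosen for illustration; only the "two reserved IDs" rule and the 12-bit and 15-bit figures come from this PR.

```c
#include <stdint.h>

/* Hypothetical helper: usable tcache IDs for a field that is `lg` bits wide.
 * IDs 0 and 1 are reserved, as stated in this PR. Valid for lg in 1..63;
 * returns 0 outside that range. */
static uint64_t usable_tcaches(unsigned lg) {
    if (lg < 1 || lg > 63) {
        return 0;
    }
    return (UINT64_C(1) << lg) - 2;
}

/* Hypothetical helper: smallest field width (in bits) that can name one
 * tcache per (bucket, thread) pair plus the two reserved IDs. */
static unsigned min_lg_tcache_limit(uint32_t buckets, uint32_t threads) {
    uint64_t needed = (uint64_t)buckets * threads + 2;
    unsigned lg = 0;
    /* The lg < 64 guard keeps the shift well defined for huge inputs. */
    while (lg < 64 && (UINT64_C(1) << lg) < needed) {
        lg++;
    }
    return lg;
}
```

For example, assuming 64 threads, 100 buckets would need 6,400 tcaches, which exceeds the 4094 available with 12 bits and needs at least 13; picking 15 leaves headroom beyond the 4-7K range mentioned above.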

This change is ABI-breaking, but does not change the jemalloc API. We link to jemalloc statically, so this is not an issue for us.

Make the file autoconf-generated. In the follow-up change, the
representation of the mallocx flags will be made configurable at
compile-time.

The old limit on the number of tcaches was largely dependent on the
representation of the tcache ID in the 32-bit flags through which it is
passed to je_mallocx.

---

Flag bits:
a: arena
t: tcache
.: configurable, dependent on --with-lg-tcache-limit:
   The higher this value, the more of the configurable bits will be treated as
   tcache bits (the rest are arena bits).
0: unused
z: zero
n: alignment

Old representation:
  aaaaaaaa aaaatttt tttttttt 0znnnnnn
New representation:
  aaaaaaa. ........ ....tttt 0znnnnnn

Before this change, we had 12 bits for the arena index, 12 bits for the
tcache ID, giving us 4094 (tcache 0 and 1 reserved) tcaches and 4096
arenas.

By taking bits from the arena index representation, we can make the
tcache ID field wider (or narrower), allowing up to 2^17 tcache IDs to
be specified (though 0 and 1 are reserved by jemalloc). At that maximum
width, however, the number of arenas falls to 128.
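
The trade-off between the two fields can be sketched as a small bit-packing model. This is an illustration under assumptions, not the patch itself: the names below are hypothetical and are not jemalloc's macros, and the functions take raw field values rather than user-facing arena indices or tcache IDs. The layout (8 low bits for alignment, zero and unused; 24 bits shared between tcache and arena; 4 to 17 tcache bits) is read off the diagram above.

```c
#include <stdint.h>

/* Hypothetical model of the flag layout described in this commit message. */
enum {
    FLAGS_TCACHE_SHIFT = 8,  /* bits 0-7: alignment, zero, unused */
    FLAGS_FIELD_BITS = 24,   /* bits 8-31: shared by tcache and arena */
    LG_TCACHE_MIN = 4,       /* the fixed "tttt" bits */
    LG_TCACHE_MAX = 17       /* leaves the 7 fixed "aaaaaaa" bits */
};

/* Bits left for the arena field once the tcache field width is chosen. */
static unsigned lg_arena_limit(unsigned lg_tcache_limit) {
    return FLAGS_FIELD_BITS - lg_tcache_limit;
}

/* Pack raw field values into the upper 24 bits of the flags word.
 * Returns 0 on success, -1 if the width is out of range or a value
 * does not fit in its field. */
static int flags_pack(unsigned lg_tcache_limit, uint32_t arena_field,
                      uint32_t tcache_field, uint32_t *out) {
    if (lg_tcache_limit < LG_TCACHE_MIN || lg_tcache_limit > LG_TCACHE_MAX) {
        return -1;
    }
    if ((tcache_field >> lg_tcache_limit) != 0) {
        return -1;
    }
    if ((arena_field >> lg_arena_limit(lg_tcache_limit)) != 0) {
        return -1;
    }
    *out = (tcache_field << FLAGS_TCACHE_SHIFT) |
           (arena_field << (FLAGS_TCACHE_SHIFT + lg_tcache_limit));
    return 0;
}

/* Extract the tcache field: the lg_tcache_limit bits starting at bit 8. */
static uint32_t flags_tcache_field(unsigned lg_tcache_limit, uint32_t flags) {
    return (flags >> FLAGS_TCACHE_SHIFT) &
           ((UINT32_C(1) << lg_tcache_limit) - 1);
}

/* Extract the arena field: everything above the tcache field. */
static uint32_t flags_arena_field(unsigned lg_tcache_limit, uint32_t flags) {
    return flags >> (FLAGS_TCACHE_SHIFT + lg_tcache_limit);
}
```

Under this model a width of 12 reproduces the old 12/12 split, 17 leaves 7 arena bits (128 values), and 15 would leave 9 arena bits.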

interwq (Member) commented Feb 21, 2023

Thanks for sharing the patch, @veselink1. The changes look good to me and I get why you needed it. However, I'm not sure we want to go this far in terms of committing to the added options long term. For example, we have been talking about embedding the arena index into the radix tree (which has its own bit limitations as well) for fast "remote" arena detection. Would you be fine maintaining your own jemalloc branch for this feature? My feeling is that this part isn't going to change too frequently, i.e. you won't get many merge conflicts, unless we do change the specifics around the tcache and arena bits.
