Add --with-lg-tcache-limit configuration option to allow for more than 4094 tcaches #2384

veselink1 · 2023-02-20T16:25:36Z

In the Couchbase Server Data Service (KV Engine), we use one tcache per arena, per thread (so the counts multiply).

We have one arena per bucket, and we currently support 30 buckets, but might eventually bump this up to 100 buckets per instance.

However, because we also allocate a tcache per thread, the number of tcaches we use ends up being num-buckets x num-threads. Tcaches are allocated lazily, when a thread first uses a bucket's arena, so not every thread will need a tcache for every arena, but most will.
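To make the lazy, per-(thread, arena) pattern concrete, here is a minimal C sketch. It does not call jemalloc: `create_tcache()` is a hypothetical stand-in for creating an explicit tcache (in jemalloc that is done through the `tcache.create` mallctl), and `NUM_ARENAS` and `TCACHE_ID_LIMIT` are illustrative values. Because each thread keeps its own table, T threads that each touch every arena consume T x NUM_ARENAS IDs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative values only (hypothetical). */
#define NUM_ARENAS 100u       /* e.g. one arena per bucket */
#define TCACHE_ID_LIMIT 4094u /* usable tcache IDs with the stock 12-bit field */

/* Process-wide ID counter; stands in for jemalloc handing out tcache IDs. */
static atomic_uint next_tcache_id;

/* Hypothetical stand-in for explicit tcache creation. */
static bool create_tcache(unsigned *id) {
    unsigned v = atomic_fetch_add(&next_tcache_id, 1u);
    if (v >= TCACHE_ID_LIMIT) {
        return false; /* the ID space is exhausted */
    }
    *id = v;
    return true;
}

/* Per-thread table indexed by arena; 0 means "no tcache yet",
 * otherwise the slot stores the tcache ID + 1. */
static _Thread_local unsigned thread_tcache[NUM_ARENAS];

/* Returns the calling thread's tcache for `arena`, creating it on first use. */
static bool tcache_for_arena(unsigned arena, unsigned *id) {
    assert(arena < NUM_ARENAS);
    if (thread_tcache[arena] == 0u) {
        unsigned fresh;
        if (!create_tcache(&fresh)) {
            return false;
        }
        thread_tcache[arena] = fresh + 1u;
    }
    *id = thread_tcache[arena] - 1u;
    return true;
}
```

With 100 arenas, the stock limit of 4094 is exceeded once 41 threads have each touched every arena (41 x 100 = 4100).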

jemalloc can automatically allocate and use a tcache, but we've found that this can result in incorrect accounting of per-arena memory stats, which is a deal-breaker for us because we rely on some of these stats to calculate per-arena memory fragmentation.

We're currently testing how this change performs in our test environments, but plan to ship a release with these changes and jemalloc configured with --with-lg-tcache-limit=15, allowing up to 32K tcaches to be created. We don't expect to actually reach that limit, but on some larger machine configurations we need to run correctly with somewhere in the range of 4-7K tcaches.
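Assuming the configured value is the width of the tcache ID field (as the 32K figure implies), the headroom can be checked at compile time. In this sketch `LG_TCACHE_LIMIT` and `EXPECTED_MAX_TCACHES` are hypothetical names, 7000 is taken from the upper end of the 4-7K estimate, and two encodings are subtracted because jemalloc reserves them.

```c
#include <assert.h>

/* Hypothetical: mirrors the value passed to --with-lg-tcache-limit. */
#ifndef LG_TCACHE_LIMIT
#define LG_TCACHE_LIMIT 15
#endif

/* Two encodings of the tcache field are reserved, so 2^lg - 2 remain usable. */
#define TCACHE_IDS_USABLE ((1u << LG_TCACHE_LIMIT) - 2u)

/* Upper end of the 4-7K range expected on larger machines. */
#define EXPECTED_MAX_TCACHES 7000u

static_assert(TCACHE_IDS_USABLE >= EXPECTED_MAX_TCACHES,
              "LG_TCACHE_LIMIT is too small for the expected number of tcaches");

/* Usable tcache IDs for a given field width. */
static unsigned usable_tcache_ids(unsigned lg_tcache_limit) {
    return (1u << lg_tcache_limit) - 2u;
}
```

By the same arithmetic, a value of 13 (8190 usable IDs) would already cover 7000; 15 (32766 usable IDs) leaves more than 4x headroom.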

This change is ABI-breaking, but does not change the jemalloc API. We link to jemalloc statically, so this is not an issue for us.


The old limit on the number of tcaches was largely determined by how the tcache ID is encoded in the 32-bit flags argument passed to je_mallocx.

Flag bits:

a: arena
t: tcache
.: configurable, dependent on --with-lg-tcache-limit; the higher this value, the more of the configurable bits are treated as tcache bits (the rest are arena bits)
0: unused
z: zero
n: alignment

Old representation: aaaaaaaa aaaatttt tttttttt 0znnnnnn
New representation: aaaaaaa. ........ ....tttt 0znnnnnn

Before this change, we had 12 bits for the arena index and 12 bits for the tcache ID, giving us 4094 tcaches (two encodings are reserved) and 4096 arenas. By taking bits from the arena index field, the tcache ID field can be made wider (or narrower), allowing up to 2^17 tcache ID encodings (two of which are reserved by jemalloc). In that extreme configuration, however, the arena field shrinks to 7 bits and the number of arenas falls to 128.
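The two layouts can be modeled in a few lines of C. This is a sketch, not the patch itself: `flag_layout`, `encode_*` and `decode_*` are hypothetical names, only the tcache and arena fields are modeled, and the +2/+1 biases mirror those applied by jemalloc's public `MALLOCX_TCACHE` and `MALLOCX_ARENA` macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the mallocx flag layout described above.
 * Bits 0-5 hold the alignment (n), bit 6 the zero flag (z), bit 7 is
 * unused; only the tcache (t) and arena (a) fields are modeled here. */
#define TCACHE_SHIFT 8u

typedef struct {
    unsigned tcache_bits; /* 12 in the old layout; 4..17 with the patch */
} flag_layout;

/* The arena field occupies every bit above the tcache field. */
static unsigned arena_shift(flag_layout l) { return TCACHE_SHIFT + l.tcache_bits; }
static unsigned arena_bits(flag_layout l) { return 32u - arena_shift(l); }

/* Tcache IDs are stored with a +2 bias (encodings 0 and 1 are reserved),
 * arena indices with a +1 bias (0 means "no arena specified"). */
static uint32_t encode_tcache(flag_layout l, unsigned tc) {
    assert(tc + 2u < (1u << l.tcache_bits));
    return (uint32_t)(tc + 2u) << TCACHE_SHIFT;
}

static uint32_t encode_arena(flag_layout l, unsigned a) {
    assert(a + 1u < (1u << arena_bits(l)));
    return (uint32_t)(a + 1u) << arena_shift(l);
}

/* Returns the tcache ID, or -1 for a reserved encoding. */
static long decode_tcache(flag_layout l, uint32_t flags) {
    uint32_t field = (flags >> TCACHE_SHIFT) & ((1u << l.tcache_bits) - 1u);
    return field < 2u ? -1L : (long)field - 2L;
}

/* Returns the arena index, or -1 if no arena was specified. */
static long decode_arena(flag_layout l, uint32_t flags) {
    uint32_t field = flags >> arena_shift(l);
    return field == 0u ? -1L : (long)field - 1L;
}
```

For example, arena 3 with tcache 100 encodes to 0x406600 in the old layout. Decoded with a 15-bit tcache field, the same integer reads as tcache 16484 with no arena specified, which illustrates why the change is ABI-breaking: code compiled against one layout cannot be mixed with a library built for another.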

interwq · 2023-02-21T23:36:37Z

Thanks for sharing the patch, @veselink1. The changes look good to me and I get why you needed it. However, I'm not sure we want to go this far in terms of committing to the added options long term -- for example, we have been talking about embedding the arena index into the radix tree (which has its own bit limitations as well) for fast "remote" arena detection. Would you be fine maintaining your own jemalloc branch for this feature? My feeling is that this part isn't going to change too frequently, i.e. you won't get many merge conflicts unless we do change the specifics around the tcache and arena bits.

veselink1 added 2 commits February 20, 2023 16:09

Generate jemalloc_internal_types.h using autoconf

9b22d8d

Make the file autoconf-generated. In the follow-up change, the representation of the mallocx flags will be made configurable at compile-time.

veselink1 force-pushed the dev branch from 723ffd5 to 7b756f1 February 20, 2023 16:46
