Prefetch the next object returned by malloc #2193

Open

dizhao-ampere wants to merge 3 commits into dev
Conversation

dizhao-ampere

When small allocations are very frequent, this can speed up access to the
allocated memory. And when that is not the case, the cost is small.

- Added config option `with-prefetch-objects` to control how many cache
  lines to prefetch at most. Objects larger than this number of cache
  lines are not prefetched.
- By default `with-prefetch-objects=0` (disabled).
- `__builtin_prefetch()` is used.
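For illustration, a minimal sketch of the gating the option describes; the names `prefetch_next_object` and `opt_prefetch_objects` here are illustrative placeholders, not the PR's actual internals:

#include <stddef.h>

#define CACHELINE 64

/* Illustrative only: prefetch the object that the next allocation of this
 * size would return, but only if it spans at most opt_prefetch_objects
 * cache lines (0 disables prefetching entirely). */
static inline void
prefetch_next_object(void *next, size_t usize, size_t opt_prefetch_objects) {
  size_t nlines = (usize + CACHELINE - 1) / CACHELINE;
  if (next == NULL || nlines > opt_prefetch_objects) {
    return; /* disabled (0), or the object is too large to bother */
  }
  for (size_t i = 0; i < nlines; i++) {
    /* rw = 1: prefetch for write (a constructor will likely write next);
     * locality = 3: keep the lines in all cache levels. */
    __builtin_prefetch((char *)next + i * CACHELINE, 1, 3);
  }
}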
@dizhao-ampere
Author

Sorry for the excessive commit history. I'll submit a new pull request to squash them if that's OK, in case there are further conflicts.

@davidtgoldblatt
Member

(BTW, not ignoring this, and I think it's a good idea; I'm just not in the office for a few more days; sorry for the delayed response. A rebase-via-squash and then a force-push should be fine, no need to create a new PR).

@interwq
Member

interwq commented Dec 31, 2021

Thanks for the PR @dizhao-ampere. Intuitively this does make sense -- however, have you tried it on any workload that showed improvements from the prefetching? For performance optimizations I'd prefer to have known cases that benefit from the change. In the past we tried this a few times but unfortunately never really got a clean win from it. Sharing some thoughts from past experiments:

  • the small sizes are where this matters -- however they tend to be recycled fast and frequently, which in turn means they have a good chance of being in the CPU cache already;
  • and the common usage pattern is that right after malloc returns, a constructor will initialize the allocated memory. These initialization writes also serve as cache warmups (and since they are writes, they are also non-blocking).

Those, combined with hardware prefetchers getting more intelligent, are my suspicions for why it didn't help in our previous experiments. In fact I once spent weeks inserting prefetches into various places (not limited to jemalloc), but frankly the vast majority of the changes didn't help and some clearly regressed.

It's been quite a while since our last attempt though; so if there are new / different architectures that can benefit from this, I'd certainly be glad to give it another try.

@dizhao-ampere
Author

I tested this change on the SPEC CPU 2017 integer test cases. On some arm machines there's a 3~4% improvement on the omnetpp test case (one with many memory allocations). On some newer x86 machines there's approximately a 2% improvement. No significant change is observed on the other test cases.

Regarding the experiences @interwq shared:

  • I think that is mostly the case, but I guess if many objects are allocated in a short time (and freed after a while), then chances are that they are not in cache. I wrote a small piece of code to simulate this (attached below). On the arm machine we use, there's nearly a 2% improvement with these prefetches. On some x86 machines there's approximately a 1% improvement (not very stable). On some other x86 machines I haven't found any improvement.
  • In this change, the 'next' object to be returned by malloc is prefetched (see the sketch below), so later when the constructor writes to it, it's already in cache and the writes are faster. I also tested prefetching the 'current' object to be returned by malloc; there was no improvement.
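To make the 'current' vs. 'next' distinction concrete, here is a rough sketch of a free-list pop with the prefetch applied; the `bin_t`/`avail`/`ncached` names are illustrative of a tcache-style bin, not the exact code in this change:

#include <stddef.h>

/* Illustrative tcache-style bin: a stack of cached object pointers. */
typedef struct {
  void **avail;   /* stack of available objects */
  size_t ncached; /* number of objects currently cached */
} bin_t;

/* Pop and return the current head; prefetch the object that the *next*
 * allocation from this bin would return, so that by the time its
 * constructor writes to it, the cache lines are already being fetched. */
static void *
bin_pop_with_prefetch(bin_t *bin) {
  void *ret = bin->avail[--bin->ncached];
  if (bin->ncached > 0) {
    /* Prefetching 'ret' itself would be too late to hide any latency;
     * the 'next' object has a full allocation's worth of time to arrive. */
    __builtin_prefetch(bin->avail[bin->ncached - 1], 1, 3);
  }
  return ret;
}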

The small example:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

struct S {
  long ll[16];
  int ii[16];
  short ss[16];
};

static struct S ss;

void round_1(size_t count) {
  struct S **array = malloc(sizeof(struct S *) * count);
  // Allocate many objects in quick succession.
  for (size_t i = 0; i < count; i++) {
    array[i] = malloc(sizeof(struct S));
    *(array[i]) = ss;
  }
  // Free the objects.
  for (size_t i = 0; i < count; i++) {
    free(array[i]);
  }
  free(array);
}

int main(int argc, char **argv) {
  printf("size: %zu\n", sizeof(struct S));
  for (size_t i = 0; i < 12; i++) {
    ss.ll[i] = i * i * argc;
    ss.ii[i] = i * argc;
    ss.ss[i] = i * argc * argc;
  }
  struct timeval start, stop;
  gettimeofday(&start, NULL);
  for (size_t i = 0; i < 100000; i++) {
    round_1(80000);
  }
  gettimeofday(&stop, NULL);
  printf("us: %ld\n",
         (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
}
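For reference, assuming the example is saved as bench.c and jemalloc is installed under the default /usr/local prefix, it can be built against jemalloc either by linking directly or via LD_PRELOAD:

gcc -O2 bench.c -o bench -ljemalloc
./bench
# or, without relinking:
LD_PRELOAD=/usr/local/lib/libjemalloc.so.2 ./bench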

Thanks.

@interwq
Member

interwq commented Jan 7, 2022

Thanks for sharing the benchmark results @dizhao-ampere. They sound promising -- we didn't get to test this on arm hosts, so glad to hear it's helpful there. We just talked about this yesterday, and @Lapenkov plans to benchmark it in our production workload as well, though it may take us a couple of weeks to get to that (we have some administrative work at the beginning of the year). We will update here once the results are in.

@dizhao-ampere
Author

(Hi, just checking in about how this is going, and if there's something I can help with :) )

@Lapenkov
Member

> (Hi, just checking in about how this is going, and if there's something I can help with :) )

Hello. I'm planning to benchmark your change this week, will let you know the results once I have them.

@Lapenkov
Member

Keeping you posted - I'm currently evaluating the impact of prefetching one cache line on large services; so far it seems to give positive results (fewer DTLB misses, CPU savings). I'm performing some additional benchmarking and will share the results once I have them.

@dizhao-ampere
Author

Sorry to bother you again -- how is the benchmarking going? Will this be accepted in the near future? Thanks :)

@Lapenkov
Member

Hi. Sorry that this is taking so long. The results I was getting were inconclusive, i.e. no clear and stable win could be detected. I was testing different strategies, such as prefetching only large objects, but haven't had time to interpret the results. Also, I would like to extend the testing to another service. This all takes time and shifts this beyond simple benchmarks, hence the significant delay. I can't say it'll be merged soon, but I hope I can prioritize the extended benchmarking again in the near future.
