Prefetch the next object returned by malloc #2193

Open

dizhao-ampere wants to merge 3 commits into dev
Conversation

dizhao-ampere

When small allocations are very frequent, this can speed up access to the
allocated memory. And when that is not the case, the cost is small.

- Added config option `with-prefetch-objects` to control how many cache
  lines to prefetch at most. Objects larger than this number of cache
  lines are not prefetched.
- By default `with-prefetch-objects=0` (disabled).
- `__builtin_prefetch()` is used.
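For illustration, a minimal sketch of the gating the option describes; the names `prefetch_next_object` and `opt_prefetch_objects` here are illustrative placeholders, not the PR's actual internals:

#include <stddef.h>

#define CACHELINE 64

/* Illustrative only: prefetch the object that the next allocation of this
 * size would return, but only if it spans at most opt_prefetch_objects
 * cache lines (0 disables prefetching entirely). */
static inline void
prefetch_next_object(void *next, size_t usize, size_t opt_prefetch_objects) {
  size_t nlines = (usize + CACHELINE - 1) / CACHELINE;
  if (next == NULL || nlines > opt_prefetch_objects) {
    return; /* disabled (0), or the object is too large to bother */
  }
  for (size_t i = 0; i < nlines; i++) {
    /* rw = 1: prefetch for write (a constructor will likely write next);
     * locality = 3: keep the lines in all cache levels. */
    __builtin_prefetch((char *)next + i * CACHELINE, 1, 3);
  }
}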
@dizhao-ampere
Author

Sorry for the excessive commit history. I'll submit a new pull request to squash them if that's OK, in case there are further conflicts.

@davidtgoldblatt
Member

(BTW, not ignoring this, and I think it's a good idea; I'm just not in the office for a few more days; sorry for the delayed response. A rebase-via-squash and then a force-push should be fine, no need to create a new PR).

@interwq
Member

interwq commented Dec 31, 2021

Thanks for the PR @dizhao-ampere. Intuitively this does make sense -- however, have you tried it on any workload that showed improvements from the prefetching? For performance optimizations I'd prefer to have known cases that benefit from the change. In the past we tried this a few times but unfortunately never really got a clean win from it. Sharing some thoughts from past experiments:

  • the small sizes are where this matters -- however they tend to be recycled fast and frequently, which in turn means they have a good chance of being in the CPU cache already;
  • and the common usage pattern is that right after malloc returns, a constructor will initialize the allocated memory. These initialization writes also serve as cache warmups (and since they are writes, they are also non-blocking).

Those, combined with hardware prefetchers getting more intelligent, are my suspicions for why it didn't help in our previous experiments. In fact I once spent weeks inserting prefetches into various places (not limited to jemalloc), but frankly the vast majority of the changes didn't help and some clearly regressed.

It's been quite a while since our last attempt though; so if there are new / different architectures that can benefit from this, I'd certainly be glad to give it another try.

@dizhao-ampere
Author

I tested this change on the SPEC CPU 2017 integer test cases. On some arm machines there's a 3~4% improvement on the omnetpp test case (one with many memory allocations). On some newer x86 machines there's approximately a 2% improvement. No significant change is observed on the other test cases.

Regarding the experiences @interwq shared:

  • I think that is mostly the case, but I guess if many objects are allocated in a short time (and freed after a while), then chances are that they are not in cache. I wrote a small piece of code to simulate this (attached below). On the arm machine we use, there's nearly a 2% improvement with these prefetches. On some x86 machines there's approximately a 1% improvement (not very stable). On some other x86 machines I haven't found any improvement.
  • In this change, the 'next' object to be returned by malloc is prefetched (see the sketch below), so later when the constructor writes to it, it's already in cache and the writes are faster. I also tested prefetching the 'current' object to be returned by malloc; there was no improvement.
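To make the 'current' vs. 'next' distinction concrete, here is a rough sketch of a free-list pop with the prefetch applied; the `bin_t`/`avail`/`ncached` names are illustrative of a tcache-style bin, not the exact code in this change:

#include <stddef.h>

/* Illustrative tcache-style bin: a stack of cached object pointers. */
typedef struct {
  void **avail;   /* stack of available objects */
  size_t ncached; /* number of objects currently cached */
} bin_t;

/* Pop and return the current head; prefetch the object that the *next*
 * allocation from this bin would return, so that by the time its
 * constructor writes to it, the cache lines are already being fetched. */
static void *
bin_pop_with_prefetch(bin_t *bin) {
  void *ret = bin->avail[--bin->ncached];
  if (bin->ncached > 0) {
    /* Prefetching 'ret' itself would be too late to hide any latency;
     * the 'next' object has a full allocation's worth of time to arrive. */
    __builtin_prefetch(bin->avail[bin->ncached - 1], 1, 3);
  }
  return ret;
}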

The small example:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

struct S {
  long ll[16];
  int ii[16];
  short ss[16];
};

static struct S ss;

void round_1(size_t count) {
  struct S **array = malloc(sizeof(struct S *) * count);
  // Allocate many objects in quick succession.
  for (size_t i = 0; i < count; i++) {
    array[i] = malloc(sizeof(struct S));
    *(array[i]) = ss;
  }
  // Free the objects.
  for (size_t i = 0; i < count; i++) {
    free(array[i]);
  }
  free(array);
}

int main(int argc, char **argv) {
  printf("size: %zu\n", sizeof(struct S));
  for (size_t i = 0; i < 12; i++) {
    ss.ll[i] = i * i * argc;
    ss.ii[i] = i * argc;
    ss.ss[i] = i * argc * argc;
  }
  struct timeval start, stop;
  gettimeofday(&start, NULL);
  for (size_t i = 0; i < 100000; i++) {
    round_1(80000);
  }
  gettimeofday(&stop, NULL);
  printf("us: %ld\n",
         (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
}
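For reference, assuming the example is saved as bench.c and jemalloc is installed under the default /usr/local prefix, it can be built against jemalloc either by linking directly or via LD_PRELOAD:

gcc -O2 bench.c -o bench -ljemalloc
./bench
# or, without relinking:
LD_PRELOAD=/usr/local/lib/libjemalloc.so.2 ./bench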

Thanks.

@interwq
Member

interwq commented Jan 7, 2022

Thanks for sharing the benchmark results @dizhao-ampere. They sound promising -- we didn't get to test this on arm hosts, so glad to hear it's helpful there. We just talked about this yesterday, and @Lapenkov plans to benchmark it in our production workload as well, though it may take us a couple of weeks to get to that (we have some administrative work at the beginning of the year). We will update here once the results are in.

@dizhao-ampere
Author

(Hi, just checking in about how this is going, and if there's something I can help with :) )

@Lapenkov
Member

> (Hi, just checking in about how this is going, and if there's something I can help with :) )

Hello. I'm planning to benchmark your change this week, will let you know the results once I have them.

@Lapenkov
Member

Keeping you posted - I'm currently evaluating the impact of prefetching one cache line on large services; so far it seems to give positive results (fewer DTLB misses, CPU savings). I'm performing some additional benchmarking and will share the results once I have them.

@dizhao-ampere
Author

Sorry to bother you again -- how is the benchmarking going? Will this be accepted in the near future? Thanks :)

@Lapenkov
Member

Hi. Sorry that this is taking so long. The results I was getting were inconclusive, i.e. no clear and stable win could be detected. I was testing different strategies, such as prefetching only large objects, but haven't had time to interpret the results. Also, I would like to extend the testing to another service. This all takes time and shifts this beyond simple benchmarks, hence the significant delay. I can't say it'll be merged soon, but I hope I can prioritize the extended benchmarking again in the near future.
