Prefetch the next object returned by malloc #2193
base: dev
Conversation
When small allocations are very frequent, this can speed up accessing allocated memory. And if that is not the case, the cost is little.

- Added config option `with-prefetch-objects` to control how many cache lines to prefetch. Objects larger than this number of cache lines are not prefetched.
- By default, `with-prefetch-objects=0` (disabled).
- `__builtin_prefetch()` is used.
Sorry for the excessive commit history. I'll submit a new pull request to squash them if this is OK, in case there are further conflicts.
(BTW, not ignoring this, and I think it's a good idea; just not in the office for a few more days; sorry for the delayed response. A rebase-via-squash and a force-push should be fine, no need to create a new PR.)
Thanks for the PR @dizhao-ampere. Intuitively this does make sense -- however, have you tried it on any workload that showed improvements from the prefetching? For performance optimizations I'd prefer to have known cases that benefit from the change. In the past we tried it a few times but unfortunately never really got a clean win from it. Sharing some thoughts from past experiments:
These, combined with hardware prefetchers getting more intelligent, are my suspicions for why it didn't help in our previous experiments. In fact I once spent weeks inserting prefetches into various places (not limited to jemalloc), but frankly the vast majority of the changes didn't help and some clearly regressed. It's been quite a while since our last attempt though; so if there are new / different architectures that can benefit from this, I'd certainly be glad to give it another try.
I tested this change on SPEC 2017 integer test cases. On some arm machines there's a 3~4% improvement on the omnetpp test case (one with many memory allocations). On some new x86 machines there's approximately a 2% improvement. No significant change is observed on the other test cases. Regarding the experiences @interwq shared:
The small example:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

struct S {
    long ll[16];
    int ii[16];
    short ss[16];
};

static struct S ss;

void round_1(size_t count) {
    struct S **array = malloc(sizeof(struct S *) * count);
    // Allocate many objects in quick succession.
    for (size_t i = 0; i < count; i++) {
        array[i] = malloc(sizeof(struct S));
        *(array[i]) = ss;
    }
    // Free the objects.
    for (size_t i = 0; i < count; i++) {
        free(array[i]);
    }
    free(array);
}

int main(int argc, char **argv) {
    (void)argv;
    printf("size: %zu\n", sizeof(struct S));
    // Use argc in the values so the compiler can't constant-fold them.
    for (size_t i = 0; i < 12; i++) {
        ss.ll[i] = i * i * argc;
        ss.ii[i] = i * argc;
        ss.ss[i] = (short)(i * argc * argc);
    }
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 100000; i++) {
        round_1(80000);
    }
    gettimeofday(&stop, NULL);
    printf("us: %ld\n",
           (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
}
```

Thanks.
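For reference, one way to build with the new option and run a benchmark like the one above against it might look as follows (a sketch only: the value `2`, the file names, and the install paths are illustrative assumptions, not taken from the PR):

```shell
# Build jemalloc with prefetching of up to 2 cache lines per object.
# --with-prefetch-objects is the configure option this PR adds; the
# value and paths below are examples only.
./configure --with-prefetch-objects=2
make -j

# Compile the microbenchmark (saved as bench.c) and run it with the
# freshly built jemalloc interposed over the system allocator.
gcc -O2 bench.c -o bench
LD_PRELOAD=./lib/libjemalloc.so ./bench
```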
Thanks for sharing the benchmark results @dizhao-ampere. They sound promising -- we didn't get to test it on arm hosts, so glad to hear it's helpful there. We just talked about this yesterday, and @Lapenkov plans to benchmark it in our production workload as well. Though it may take us a couple of weeks to get to that (we have some administrative work at the beginning of the year). We will update here once the results are in.
(Hi, just checking in about how this is going, and if there's something I can help with :) ) |
Hello. I'm planning to benchmark your change this week, will let you know the results once I have them. |
Keeping you posted - I'm currently evaluating the impact of prefetching one cache line on large services; so far it seems to give positive results (fewer DTLB misses, CPU savings). I'm performing some additional benchmarking and will share the results once I have them.
Sorry to bother you again, but how is the benchmarking going? Will this be accepted in the near future? Thanks :)
Hi. Sorry that it's taking so long. The results I was getting were inconclusive, i.e. no clear and stable win could be detected. I was testing different strategies, such as prefetching only large objects, but haven't had time to interpret the results. Also, I would like to extend the testing to another service. This all takes time and shifts this beyond simple benchmarks, hence the significant delay. I can't say it'll be merged soon, but I hope I can prioritize the extended benchmarking again in the near future.