JEMalloc on ARM uses 10x more VIRT than the same work on Intel..? #2624
Can you share the full malloc stats output from both x64 and ARM? The higher page size could contribute a bit to the VSIZE, but I wouldn't expect it to cause this much difference.
I am adding two files, one for Intel and one for ARM. I have compressed them; I hope that's not too annoying?
@madscientist thanks for sharing the stats. I think I know what went wrong -- can you help check the value of LG_HUGEPAGE? In some places, like when reserving extra VM space (in a batched way, to avoid doing mmaps often), we use the huge page size as a heuristic (line 6 in 92aa52c), which isn't great if the hugepage size is large, as in this case. As you observed, the impact on RSS should be minimal given that it's mostly unused VM. However, the VSIZE is rather confusing, and core dumps will take longer. The good news is, the fix should be straightforward.
You are exactly right; I checked on Intel and LG_HUGEPAGE is 21 == 2M, while on ARM LG_HUGEPAGE is 29 == 512M! I'll retry after adding --with-lg-hugepage=21 and see how it goes. Just curious, is there some reason for this? |
There's also a configure setting for lg-page, which is obviously different (intel = 12 == 4k, arm = 16 == 64k) but I thought this value was tied to the kernel's page size setting and couldn't be changed. Or, is that just for profiling purposes? What is the impact of changing this via the --with-lg-page configure option? |
Rebuilding with lg-hugepage=21 gives me good results for VSIZE:
Top-line stats:
Core file sizes are now equivalent to Intel.
This is useful for cross compilation, or when you want to allow higher page size compatibility (the specified page size can be higher than the kernel page size, but not the other way around).
Okay, this confirms the exp_grow default value issue. It should be a safe workaround for you, as long as you don't enable any huge page features (they are all off by default, and we haven't evaluated them on ARM yet). I'll fix it on the dev branch soon. Thanks for reporting and helping with the investigation @madscientist !
I received a bug report that our core dumping procedure on ARM was many times slower than on Intel. A quick investigation shows that the VIRT memory for our processes on ARM is ~11x larger than the same workload on Intel, although the resident sizes are equivalent. Here's the result of top on Intel:
Here's the top output for ARM, with the same workload:
Here are the top-line jemalloc stats retrieved from both (I added the commas myself for readability :) ):
I can get full stats if useful.
For reference I rebuilt my ARM program using system alloc (glibc) and I got similar numbers to Intel from top (obviously no jemalloc stats):
The important details: I built the HEAD version of the dev branch as of this week, tried it on ARM, and got the same behavior. Is there a reason we should expect the ARM implementation to have such higher mapped (9x Intel) and retained (30x Intel!) values while most other values are similar? (I guess tcache_bytes is also almost 2x, but not sure if that's relevant.)
I posted something on Gitter but not sure if people hang out there anymore so I filed this.