JEMalloc on ARM uses 10x more VIRT than the same work on Intel..? #2624

Open
madscientist opened this issue Mar 27, 2024 · 7 comments · May be fixed by #2628

@madscientist
Contributor

I received a bug report that our core dumping procedure on ARM was many times slower than on Intel. A quick investigation shows that the VIRT memory for our processes on ARM is ~11x larger than the same workload on Intel, although the resident sizes are equivalent. Here's the result of top on Intel:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  92905 pds       20   0 1721744   1.2g  22716 S   0.0   3.8   0:50.32 myprog
  92945 pds       20   0 1605004   1.0g  32900 S   8.7   3.3   0:51.80 myprog

Here's the top output for ARM, with the same workload:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 160440 pds       20   0   17.8g   1.1g  18112 S   0.0   6.9   0:51.25 myprog
 160477 pds       20   0   17.8g 982.2m  29824 S   0.7   6.2   0:53.46 myprog

Here are the top-line jemalloc stats retrieved from both (I added the commas myself for readability :) ):

Intel:                                      ARM:
active:                    1,069,473,792    active:                    1,027,932,160
mapped:                    1,137,229,824    mapped:                   10,157,752,320
retained:                    289,882,112    retained:                  8,766,947,328
base:                         24,828,112    base:                         18,051,512
internal:                        663,712    internal:                      1,245,184
metadata_thp:                          0    metadata_thp:                          0
tcache_bytes:                  2,685,328    tcache_bytes:                  4,754,664
tcache_stashed_bytes:                  0    tcache_stashed_bytes:                  0
resident:                  1,097,093,120    resident:                  1,049,624,576
abandoned_vm:                          0    abandoned_vm:                          0
extent_avail:                        187    extent_avail:                          9

I can get full stats if useful.
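
For context, here's a minimal sketch of how these top-line figures can be read programmatically. The mallctl names ("epoch", "stats.active", "stats.mapped", "stats.retained", "stats.resident") are from the jemalloc stats API; the surrounding program is just an illustration, not our actual code, and assumes jemalloc was built without a symbol prefix.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <jemalloc/jemalloc.h>

    /* Print a few top-line jemalloc counters.  Writing to "epoch"
     * refreshes the cached statistics before they are read. */
    static void print_topline_stats(void) {
        uint64_t epoch = 1;
        size_t esz = sizeof(epoch);
        mallctl("epoch", &epoch, &esz, &epoch, esz);

        const char *keys[] = { "stats.active", "stats.mapped",
                               "stats.retained", "stats.resident" };
        for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
            size_t val = 0;
            size_t vsz = sizeof(val);
            if (mallctl(keys[i], &val, &vsz, NULL, 0) == 0) {
                printf("%-15s %zu\n", keys[i], val);
            }
        }
    }

    int main(void) {
        void *p = malloc(1 << 20);  /* touch the allocator a little */
        print_topline_stats();
        free(p);
        return 0;
    }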

For reference, I rebuilt my ARM program against the system allocator (glibc) and got numbers from top similar to the Intel run (obviously no jemalloc stats):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 233542 pds       20   0 2323968   1.0g  17280 S   0.0   6.5   1:17.63 myprog
 233578 pds       20   0 2125632 937216  26944 S   4.1   5.8   1:14.95 myprog

The important details:

  • Both are using JEMalloc 5.3.0, but I also pulled down the latest Git dev HEAD as of this week, tried it on ARM, and saw the same behavior.
  • The ARM system is running Red Hat EL 8, which means it's using a 64k page size. The Intel system is running Ubuntu 20.04 LTS, with (obviously) a 4k page size. Unfortunately I don't have an easy way to test on ARM with a 4k page size.
  • The Intel version is built with memory profiling enabled. The ARM version is built with memory profiling disabled.

Is there a reason we should expect the ARM implementation to have such higher mapped (9x Intel) and retained (30x Intel!) values while most other values are similar? (I guess tcache_bytes is also almost 2x, but I'm not sure that's relevant.)

I posted something on Gitter, but I'm not sure people hang out there anymore, so I filed this issue.

@interwq
Member

interwq commented Mar 27, 2024

Can you share the full malloc stats output from both x64 and arm? The higher page size could contribute a bit to the VSIZE but I wouldn't expect it to cause this much difference.

@madscientist
Contributor Author

I am adding two files, one for intel and one for arm. I have compressed them; I hope that's not too annoying?
intel-memstats-20240325.txt.gz
arm64-memstats-20240325.txt.gz

@interwq
Member

interwq commented Mar 28, 2024

@madscientist thanks for sharing the stats. I think I know what went wrong -- can you help check the value of HUGEPAGE on the arm config? I suspect it's 512M according to https://wiki.debian.org/Hugepages

In some places, such as when reserving extra VM space (in a batched way, to avoid issuing mmaps too often), we use the huge page size as a heuristic:

exp_grow->next = sz_psz2ind(HUGEPAGE);

which isn't great when the hugepage size is as large as it is in this case.

As you observed, the impact on RSS should be minimal, given that it's mostly unused VM. However, the VSIZE is rather confusing, and the core dump will take longer. The good news is that the fix should be straightforward.
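
To double-check that suspicion without touching the jemalloc build, one quick way is to read the Hugepagesize field from /proc/meminfo; this standalone check is just an illustration of the Linux procfs field, not part of jemalloc's configure:

    #include <stdio.h>
    #include <string.h>

    /* Print the default huge page size the kernel reports, e.g.
     * "Hugepagesize:       2048 kB" on a 4k-page x86-64 kernel versus
     * "Hugepagesize:     524288 kB" (512M) on a 64k-page arm64 kernel. */
    int main(void) {
        FILE *f = fopen("/proc/meminfo", "r");
        if (f == NULL) {
            perror("fopen /proc/meminfo");
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "Hugepagesize:", 13) == 0) {
                fputs(line, stdout);
                break;
            }
        }
        fclose(f);
        return 0;
    }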

@madscientist
Contributor Author

You are exactly right; I checked on Intel and LG_HUGEPAGE is 21 == 2M, while on ARM LG_HUGEPAGE is 29 == 512M!

I'll retry after adding --with-lg-hugepage=21 and see how it goes. Just curious: is there some reason for the much larger default on ARM?

@madscientist
Contributor Author

There's also a configure setting for lg-page, which is obviously different (intel = 12 == 4k, arm = 16 == 64k), but I thought this value was tied to the kernel's page size setting and couldn't be changed. Or is that just for profiling purposes? What is the impact of changing this via the --with-lg-page configure option?

@madscientist
Contributor Author

Rebuilding with lg-hugepage=21 gives me good results for VSIZE:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1015480 pds       20   0 1783936   1.1g  17984 S   2.7   7.0   0:55.12 myprog
1015516 pds       20   0 1449536 988.6m  29824 S   0.0   6.3   0:56.11 myprog

Top-line stats:

active:                    1,028,784,128
mapped:                    1,080,557,568
retained:                    185,597,952
base:                         15,962,352
internal:                      1,245,184
metadata_thp:                          0
tcache_bytes:                  5,352,248
tcache_stashed_bytes:                  0
resident:                  1,046,675,456
abandoned_vm:                          0
extent_avail:                          8

Core file sizes are now equivalent to Intel.

@interwq
Member

interwq commented Mar 28, 2024

What is the impact of changing this via the --with-lg-page configure option?

This is useful for cross-compilation, or when you want to allow higher page-size compatibility (the specified page size can be higher than the kernel page size, but not the other way around).
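
For example, a build configured with --with-lg-page=16 (64k pages) can still run on a 4k-page kernel, while the reverse is not supported. Here's a small sketch of comparing the compiled-in page size against the kernel's, using the "arenas.page" mallctl and sysconf; the program itself is only an illustration:

    #include <stdio.h>
    #include <unistd.h>
    #include <jemalloc/jemalloc.h>

    /* Compare the page size jemalloc was built with ("arenas.page")
     * against the page size the running kernel uses. */
    int main(void) {
        size_t jemalloc_page = 0;
        size_t sz = sizeof(jemalloc_page);
        if (mallctl("arenas.page", &jemalloc_page, &sz, NULL, 0) != 0) {
            fprintf(stderr, "mallctl(\"arenas.page\") failed\n");
            return 1;
        }
        long kernel_page = sysconf(_SC_PAGESIZE);
        printf("jemalloc page size: %zu\n", jemalloc_page);
        printf("kernel page size:   %ld\n", kernel_page);
        return 0;
    }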

Rebuilding with lg-hugepage=21 gives me good results for VSIZE

Okay, this confirms the exp_grow default value issue. It should be a safe workaround for you, as long as you don't enable any huge page features (they are all off by default; also, we haven't evaluated them on arm yet).

I'll fix it on the dev branch soon. Thanks for reporting and helping with the investigation @madscientist !
