JEMalloc on ARM uses 10x more VIRT than the same work on Intel..? #2624

Open
madscientist opened this issue Mar 27, 2024 · 7 comments · May be fixed by #2628

@madscientist
Contributor

I received a bug report that our core dumping procedure on ARM was many times slower than on Intel. A quick investigation shows that the VIRT memory for our processes on ARM is ~11x larger than the same workload on Intel, although the resident sizes are equivalent. Here's the result of top on Intel:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  92905 pds       20   0 1721744   1.2g  22716 S   0.0   3.8   0:50.32 myprog
  92945 pds       20   0 1605004   1.0g  32900 S   8.7   3.3   0:51.80 myprog

Here's the top output for ARM, with the same workload:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 160440 pds       20   0   17.8g   1.1g  18112 S   0.0   6.9   0:51.25 myprog
 160477 pds       20   0   17.8g 982.2m  29824 S   0.7   6.2   0:53.46 myprog

Here are the top-line jemalloc stats retrieved from both (I added the commas myself for readability :) ):

Intel:                                      ARM:
active:                    1,069,473,792    active:                    1,027,932,160
mapped:                    1,137,229,824    mapped:                   10,157,752,320
retained:                    289,882,112    retained:                  8,766,947,328
base:                         24,828,112    base:                         18,051,512
internal:                        663,712    internal:                      1,245,184
metadata_thp:                          0    metadata_thp:                          0
tcache_bytes:                  2,685,328    tcache_bytes:                  4,754,664
tcache_stashed_bytes:                  0    tcache_stashed_bytes:                  0
resident:                  1,097,093,120    resident:                  1,049,624,576
abandoned_vm:                          0    abandoned_vm:                          0
extent_avail:                        187    extent_avail:                          9

I can get full stats if useful.
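
For context, here's a minimal sketch of how these top-line figures can be read programmatically. The mallctl names ("epoch", "stats.active", "stats.mapped", "stats.retained", "stats.resident") are from the jemalloc stats API; the surrounding program is just an illustration, not our actual code, and assumes jemalloc was built without a symbol prefix.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <jemalloc/jemalloc.h>

    /* Print a few top-line jemalloc counters.  Writing to "epoch"
     * refreshes the cached statistics before they are read. */
    static void print_topline_stats(void) {
        uint64_t epoch = 1;
        size_t esz = sizeof(epoch);
        mallctl("epoch", &epoch, &esz, &epoch, esz);

        const char *keys[] = { "stats.active", "stats.mapped",
                               "stats.retained", "stats.resident" };
        for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
            size_t val = 0;
            size_t vsz = sizeof(val);
            if (mallctl(keys[i], &val, &vsz, NULL, 0) == 0) {
                printf("%-15s %zu\n", keys[i], val);
            }
        }
    }

    int main(void) {
        void *p = malloc(1 << 20);  /* touch the allocator a little */
        print_topline_stats();
        free(p);
        return 0;
    }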

For reference, I rebuilt my ARM program against the system allocator (glibc) and got numbers from top similar to the Intel run (obviously no jemalloc stats):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 233542 pds       20   0 2323968   1.0g  17280 S   0.0   6.5   1:17.63 myprog
 233578 pds       20   0 2125632 937216  26944 S   4.1   5.8   1:14.95 myprog

The important details:

  • Both are using JEMalloc 5.3.0, but I also pulled down the latest Git dev HEAD as of this week, tried it on ARM, and saw the same behavior.
  • The ARM system is running Red Hat EL 8, which means it's using a 64k page size. The Intel system is running Ubuntu 20.04 LTS, with (obviously) a 4k page size. Unfortunately I don't have an easy way to test on ARM with a 4k page size.
  • The Intel version is built with memory profiling enabled. The ARM version is built with memory profiling disabled.

Is there a reason we should expect the ARM implementation to have such higher mapped (9x Intel) and retained (30x Intel!) values while most other values are similar? (I guess tcache_bytes is also almost 2x, but I'm not sure that's relevant.)

I posted something on Gitter, but I'm not sure people hang out there anymore, so I filed this issue.

@interwq
Member

interwq commented Mar 27, 2024

Can you share the full malloc stats output from both x64 and arm? The higher page size could contribute a bit to the VSIZE but I wouldn't expect it to cause this much difference.

@madscientist
Contributor Author

I am adding two files, one for intel and one for arm. I have compressed them; I hope that's not too annoying?
intel-memstats-20240325.txt.gz
arm64-memstats-20240325.txt.gz

@interwq
Member

interwq commented Mar 28, 2024

@madscientist thanks for sharing the stats. I think I know what went wrong -- can you help check the value of HUGEPAGE on the arm config? I suspect it's 512M according to https://wiki.debian.org/Hugepages

In some places, such as when reserving extra VM space (in a batched way, to avoid issuing mmaps too often), we use the huge page size as a heuristic:

exp_grow->next = sz_psz2ind(HUGEPAGE);

which isn't great when the hugepage size is as large as it is in this case.

As you observed, the impact on RSS should be minimal, given that it's mostly unused VM. However, the VSIZE is rather confusing, and the core dump will take longer. The good news is that the fix should be straightforward.
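
To double-check that suspicion without touching the jemalloc build, one quick way is to read the Hugepagesize field from /proc/meminfo; this standalone check is just an illustration of the Linux procfs field, not part of jemalloc's configure:

    #include <stdio.h>
    #include <string.h>

    /* Print the default huge page size the kernel reports, e.g.
     * "Hugepagesize:       2048 kB" on a 4k-page x86-64 kernel versus
     * "Hugepagesize:     524288 kB" (512M) on a 64k-page arm64 kernel. */
    int main(void) {
        FILE *f = fopen("/proc/meminfo", "r");
        if (f == NULL) {
            perror("fopen /proc/meminfo");
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "Hugepagesize:", 13) == 0) {
                fputs(line, stdout);
                break;
            }
        }
        fclose(f);
        return 0;
    }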

@madscientist
Contributor Author

You are exactly right; I checked on Intel and LG_HUGEPAGE is 21 == 2M, while on ARM LG_HUGEPAGE is 29 == 512M!

I'll retry after adding --with-lg-hugepage=21 and see how it goes. Just curious: is there some reason for the much larger default on ARM?

@madscientist
Contributor Author

There's also a configure setting for lg-page, which is obviously different (intel = 12 == 4k, arm = 16 == 64k), but I thought this value was tied to the kernel's page size setting and couldn't be changed. Or is that just for profiling purposes? What is the impact of changing this via the --with-lg-page configure option?

@madscientist
Contributor Author

Rebuilding with lg-hugepage=21 gives me good results for VSIZE:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1015480 pds       20   0 1783936   1.1g  17984 S   2.7   7.0   0:55.12 myprog
1015516 pds       20   0 1449536 988.6m  29824 S   0.0   6.3   0:56.11 myprog

Top-line stats:

active:                    1,028,784,128
mapped:                    1,080,557,568
retained:                    185,597,952
base:                         15,962,352
internal:                      1,245,184
metadata_thp:                          0
tcache_bytes:                  5,352,248
tcache_stashed_bytes:                  0
resident:                  1,046,675,456
abandoned_vm:                          0
extent_avail:                          8

Core file sizes are now equivalent to Intel.

@interwq
Member

interwq commented Mar 28, 2024

What is the impact of changing this via the --with-lg-page configure option?

This is useful for cross-compilation, or when you want to allow higher page-size compatibility (the specified page size can be higher than the kernel page size, but not the other way around).
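
For example, a build configured with --with-lg-page=16 (64k pages) can still run on a 4k-page kernel, while the reverse is not supported. Here's a small sketch of comparing the compiled-in page size against the kernel's, using the "arenas.page" mallctl and sysconf; the program itself is only an illustration:

    #include <stdio.h>
    #include <unistd.h>
    #include <jemalloc/jemalloc.h>

    /* Compare the page size jemalloc was built with ("arenas.page")
     * against the page size the running kernel uses. */
    int main(void) {
        size_t jemalloc_page = 0;
        size_t sz = sizeof(jemalloc_page);
        if (mallctl("arenas.page", &jemalloc_page, &sz, NULL, 0) != 0) {
            fprintf(stderr, "mallctl(\"arenas.page\") failed\n");
            return 1;
        }
        long kernel_page = sysconf(_SC_PAGESIZE);
        printf("jemalloc page size: %zu\n", jemalloc_page);
        printf("kernel page size:   %ld\n", kernel_page);
        return 0;
    }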

Rebuilding with lg-hugepage=21 gives me good results for VSIZE

Okay, this confirms the exp_grow default value issue. It should be a safe workaround for you, as long as you don't enable any huge page features (they are all off by default; also, we haven't evaluated them on arm yet).

I'll fix it on the dev branch soon. Thanks for reporting and helping with the investigation @madscientist !
