Improve the virtual memory consumption on Linux #733

Open
ctk21 opened this issue Nov 5, 2021 · 10 comments

@ctk21
Collaborator

ctk21 commented Nov 5, 2021

A design assumption in multicore was that we could reserve address space, without committing it, and that this would not impact users. There are examples (e.g. #732) where reserving large amounts of address space does have an impact, even though most of that address space is never committed.

As things currently stand, the parallel minor collector needs a single region containing the minor heaps of all domains; this enables a fast Is_young check based on the top and bottom of that region. Hence, at startup, multicore reserves the maximum required area and then commits it as domains need it.
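
For reference, the speed of that check comes from the contiguity of the reservation: Is_young reduces to a two-comparison range test, roughly like the sketch below (variable names are illustrative, not the exact runtime definitions).

```c
/* Sketch of a range-based Is_young over one contiguous reservation that
 * holds every domain's minor heap. Names are illustrative. */
extern char *minor_heaps_start;   /* base of the reserved region */
extern char *minor_heaps_end;     /* end of the reserved region  */

#define Is_young(val) \
  ((char *)(val) >= minor_heaps_start && (char *)(val) < minor_heaps_end)
```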

Other designs for orchestrating the minor heaps of domains are possible, but should consider:

  • the virtual memory performance (even those reserved but not committed)
  • performance for domain spawn (and termination)
  • performance (and safety) of Is_young as used in the runtime
  • what happens when the minor heap size is changed via Gc set

To throw some options into the ring:

  • mmap (or malloc) the minor heap area at the end of each minor collection when inside a STW segment; domain spawn gives the new domain a 0-size minor heap and so forces a STW minor collection
  • perform a more dynamic reserve-and-commit strategy with mmap in the simple strategy above; e.g. only reserve for the current minor heap size, and in blocks of 1, 2, 4, 8, etc. domains at each STW minor collection, to trade off spawn cost against reserved space (see the reserve/commit sketch after this list)
  • a revival of domain-local allocation buffers that irons out the performance issues seen previously (see Thread local allocation buffers #484, here and here)
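
For the reserve-and-commit variants above, the usual Linux pattern is to reserve address space with PROT_NONE, which consumes no physical memory, and to commit sub-ranges later with mprotect when a domain actually needs them. A minimal sketch, with sizes, alignment and error handling left out:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Reserve address space without committing it: nothing is readable or
 * writable, and no physical memory is used until ranges are committed
 * and touched. */
static void *reserve(size_t size)
{
  void *p = mmap(NULL, size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}

/* Commit a sub-range of a reservation when a domain actually needs it. */
static int commit(void *addr, size_t size)
{
  return mprotect(addr, size, PROT_READ | PROT_WRITE);
}
```

Note, though, that even a PROT_NONE reservation still counts towards the process's virtual size (and towards address-space limits such as AFL's -m), so commit-on-demand alone does not address complaints like #732; the size of the reservation itself has to shrink.
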
@gadmm
Contributor

gadmm commented Nov 5, 2021

Another option is to implement Is_young with a 1- or 2-level bibop and manage minor heap allocation with a page allocator, as experimented with here: https://gitlab.com/gadmm/ocaml-large-pages-experiment. Performance is good (in a quick test, seemingly better than the current Is_young in a stress test of the write barrier in nnp mode), and it meets all your requirements. The bibop is straightforward to implement efficiently, and the page allocator (already useful for solving other issues in the management of pools, as mentioned elsewhere) is also quite simple.
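
For readers unfamiliar with the term: a bibop ("BIg Bag Of Pages") here is just a table indexed by the high bits of an address, recording what kind of memory each large chunk holds. Ignoring synchronisation for a moment, Is_young then becomes a single table lookup. A rough sketch with an illustrative 256MB granularity (this is not the experiment's code):

```c
#include <stdint.h>

/* 1-level bibop sketch: one byte per 256MB chunk of a 48-bit address
 * space, so the table has 2^48 / 2^28 = 2^20 entries, i.e. 1MB. */
#define CHUNK_BITS 28                         /* 256MB chunks */
#define TABLE_SIZE (1UL << (48 - CHUNK_BITS))

enum chunk_kind { CHUNK_UNKNOWN = 0, CHUNK_MINOR_HEAP, CHUNK_MAJOR_HEAP };

extern uint8_t bibop[TABLE_SIZE];

static inline int is_young(uintptr_t p)
{
  return bibop[p >> CHUNK_BITS] == CHUNK_MINOR_HEAP;
}
```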

@jmid
Collaborator

jmid commented Nov 8, 2021

I don't think it is unreasonable to expect that 80-90% of code hitting 5.00 upon release will be sequential (for a start at least). This indicates that the single-threaded-to-multicore transition is central. A "pay-as-you-go" approach, where a single-threaded "hello world" does not have to reserve virtual memory for the worst case of 128 domains with a 4GB minor heap each, is preferable IMHO.

With this in mind one idea is:

  • mmap for just 1 domain to start with (to cater to sequential users),
    and have a run-time hook in Domain.spawn that "remaps", triggered only the first time the runtime goes beyond 1 domain.

This will take at most 1 "remap", so it should be less costly for multi-domain programs than paying at each Domain.spawn. There could be an OCAMLRUNPARAM option for users who do not want to pay for the remap and prefer to start up in "multi-domain mode" immediately.
A similar effect can be obtained with @ctk21's second proposal and a larger "fan-out" (1-8-64 or 1-16-256 domains).

@jmid
Collaborator

jmid commented Nov 8, 2021

  • An option going in a different direction is to query the number of CPU cores at start-up via a bit of platform-dependent logic (nproc, sysctl, ...) and mmap accordingly (a sketch of the query follows). This could help align the domain allocation with the current machine's resources.
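
On Linux (and POSIX systems generally) the query itself is a one-liner via sysconf; a minimal sketch, leaving out the sysctl path needed on the BSDs/macOS and the Windows equivalent:

```c
#include <unistd.h>

/* Number of CPU cores currently online; falls back to 1 if the query
 * is unsupported on this platform. */
static long online_cores(void)
{
  long n = sysconf(_SC_NPROCESSORS_ONLN);
  return n > 0 ? n : 1;
}
```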

@jmid
Collaborator

jmid commented Nov 8, 2021

@kayceesrk suggested we track a list of things broken by the additional memory requirements:

For AFL users, any invocation of afl-fuzz without a -m none option (or a large enough -m value) will fail because of AFL's default 50MB virtual memory limit.

@gadmm
Contributor

gadmm commented Nov 8, 2021

@jmid Reading more about the memory limit implemented by afl-fuzz, it seems that it is common to have to adjust the limit, but that you rarely want to disable it entirely in order to guard against memory-hungry inputs it might find. Do I understand this correctly? So the following two issues can be considered distinctly: 1) make virtual memory consumption an indication of actual memory consumption, so that the -m option can be used with a meaningful value, and 2) existing documentation and tests using afl do not need to be adapted by keeping a low default virtual memory consumption.

@jmid
Collaborator

jmid commented Nov 8, 2021

@jmid Reading more about the memory limit implemented by afl-fuzz, it seems that it is common to have to adjust the limit, but that you rarely want to disable it entirely in order to guard against memory-hungry inputs it might find. Do I understand this correctly?

Yes, I think so. Few of the above links adjust the default limit when fuzzing OCaml code, though.
My concern is newcomers who try to follow one of the above instructions.

So the following two issues can be considered distinctly: 1) make virtual memory consumption an indication of actual memory consumption, so that the -m option can be used with a meaningful value, and 2) existing documentation and tests using afl do not need to be adapted by keeping a low default virtual memory consumption.

I agree that they can be considered separate.
Keeping a list can however help decide whether this issue should be prioritized.

@kayceesrk
Contributor

have a run-time hook in Domain.spawn that "remaps" and only triggers the first time the runtime goes beyond 1 domain.

We have a Domain.at_first_spawn function that runs at the first spawn. I'm ok with making the first domain spawn more expensive.

My other thought is that if all the failures due to virtual memory reservation are caused by AFL alone, we may go for an alternative strategy. Given that enabling AFL is a configure-time option (the compiler has to be built with afl enabled), and, IIUC, AFL has stability problems with multi-threaded programs (I don't know how bad this is), we could make 5.00+afl serial-only: domain spawn would be a fatal error. With that, we could reserve a small minor heap area for just 1 domain, something like 16 MB max, and requesting a larger minor heap via OCAMLRUNPARAM (https://ocaml.org/manual/runtime.html#s:ocamlrun-options) would be a fatal error.

That said, I suspect that virtual memory reservation would be an issue when running OCaml 5.00 on MirageOS, such as running bare metal on RPI 4: https://github.com/dinosaure/gilbraltar. ocaml-freestanding has a simple memory allocator and doesn't have mmap support. CC @dinosaure @avsm.

@gadmm Regarding the 2-level bibop experiment that you'd mentioned here, have you factored in the cost of synchronization?

I haven't read your code, but if the design is such that domains request additional pages on demand for the minor heap (until they hit some minor heap limit, at which point we'd trigger the minor GC), then the Is_young check needs synchronisation, as the major GC marking uses Is_young to check whether it needs to mark an OCaml value. See https://github.com/ocaml-multicore/ocaml-multicore/blob/5.00/runtime/major_gc.c#L530-L534.
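
For context, the linked marking code consults Is_young when deciding whether a field should be marked; schematically it does something like the following paraphrase (the mark and caml_is_young names here are stand-ins, not the runtime's actual identifiers):

```c
#include <caml/mlvalues.h>            /* value, Is_block */

extern int  caml_is_young(value v);   /* stand-in for the runtime's Is_young */
extern void mark(value v);            /* stand-in: push v onto the mark stack */

/* Values still sitting in some domain's minor heap are skipped here and
 * left to the minor GC; only major-heap blocks are marked. If minor heaps
 * can grow concurrently, this test races with the page table updates. */
static void mark_field(value child)
{
  if (Is_block(child) && !caml_is_young(child))
    mark(child);
}
```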

The alternative may be to pre-allocate the pages for each domain at the domain startup, and make domain startup a stop-the-world operation. As a result, we will have the guarantee that the bibop will not change when domains are executing in non-stop-the-world sections.

Both of these need careful implementation, optimisation and performance analysis with respect to our parallel benchmarks.

@gadmm
Contributor

gadmm commented Nov 9, 2021

That said, I suspect that virtual memory reservation would be an issue for running OCaml 5.00 for MirageOS, such as running baremetal on RPI 4: https://github.com/dinosaure/gilbraltar. ocaml-freestanding has a simple memory allocator and doesn't have mmap support. CC @dinosaure @avsm.

WASM has the same issue, where you can only grow memory linearly. The page allocator lets you abstract away from mmap; it is straightforward to adapt mine to work with sbrk, but you lose things like returning memory to the OS (which I guess does not make sense for bare metal anyway).
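
To make the sbrk point concrete, the system-facing hook of such a page allocator can be backed by sbrk where mmap is unavailable; a rough sketch with a hypothetical hook name, ignoring the fact that nothing can ever be handed back to the system:

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical backend hook for a page allocator on sbrk-only targets:
 * grow the program break and return an aligned block. Freed pages can be
 * recycled inside the allocator but never returned to the system. */
static void *alloc_pages_sbrk(size_t size, size_t align)
{
  uintptr_t cur   = (uintptr_t)sbrk(0);
  uintptr_t start = (cur + align - 1) & ~(align - 1);
  if (sbrk((intptr_t)((start + size) - cur)) == (void *)-1)
    return NULL;
  return (void *)start;
}
```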

@gadmm Regarding the 2-level bibop experiment that you'd mentioned here, have you factored in the cost of synchronization?

I haven't read your code, but if the design is such that the domains request additional pages on demand for the minor heap (until they hit some minor heap limit at which point we'd trigger the minor GC), then the Is_young check needs to have synchronisation as the major GC marking uses Is_young to check whether it needs to mark an OCaml value. See https://github.com/ocaml-multicore/ocaml-multicore/blob/5.00/runtime/major_gc.c#L530-L534.

Yes, the point is to allow dynamic growth while being efficient. The page entries are monotone: they can only go from undefined to some defined value. This way you do not need to synchronise in the hot path. Then you arrange for the slow-path check (when you hit an undefined value) to be free using branch prediction (e.g. the compiler does it for you with an expect hint). The slow path can do various things, but for Is_young and a 2-level bibop you only need two acquire loads. (In addition, an implementation without a slow path is probably available for Is_young on x86, where acquire loads are free.)
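
A minimal sketch of that hot-path/slow-path split, assuming C11 atomics and the byte-sized table from the earlier 1-level sketch (again illustrative, not the experiment's code): the fast path is a relaxed load plus a branch the predictor learns to skip, and only the rare undefined case falls into a slow path that re-reads with acquire ordering.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CHUNK_BITS 28                       /* 256MB chunks, as above */
#define TABLE_SIZE (1UL << (48 - CHUNK_BITS))
enum chunk_kind { CHUNK_UNKNOWN = 0, CHUNK_MINOR_HEAP, CHUNK_MAJOR_HEAP };

/* Entries are monotone: they only ever go from CHUNK_UNKNOWN to a defined
 * value, so the hot path can use a relaxed load. The rarely taken slow
 * path re-reads with acquire ordering to synchronise with the page
 * allocator that published the entry (and, in a 2-level scheme, with the
 * second-level table it points to). */
extern _Atomic uint8_t bibop[TABLE_SIZE];

int is_young_slow(uintptr_t p);             /* acquire re-read, extend mapping */

static inline int is_young(uintptr_t p)
{
  uint8_t e = atomic_load_explicit(&bibop[p >> CHUNK_BITS],
                                   memory_order_relaxed);
  if (__builtin_expect(e == CHUNK_UNKNOWN, 0))
    return is_young_slow(p);                /* slow path: almost never taken */
  return e == CHUNK_MINOR_HEAP;
}
```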

To cover the 48-bit (or more) address space, you need bibop entries that each cover a large range (256MB in my 1-level bibop, and from 4MB to 64MB in Go's 2-level bibop). This has two performance benefits:

  • the page allocator very seldom has to extend the mapping, which means that the slow path is almost never taken.
  • it is very cache-local (down to the same two cache lines in my experiment; it would be only one cache line if you somehow managed to place the heaps next to the static data).

The alternative may be to pre-allocate the pages for each domain at the domain startup, and make domain startup a stop-the-world operation. As a result, we will have the guarantee that the bibop will not change when domains are executing in non-stop-the-world sections.

This is another option (though you might now be convinced that synchronisation is not an issue). It is not clear to me how fast you want domain spawning to be.

Both of these need careful implementation, optimisation and performance analysis with respect to our parallel benchmarks.

More details on my experiment:

  • The baseline for the observed good performance is Stephen's super-optimised marking loop that uses prefetching, to make the experiment relevant (and challenging). This indeed required careful implementation, optimisation (down to assembly code and details of memory management) and performance analysis.
  • My experiment was with a multicore-ready bibop, but running a single thread. I reasoned that this is indicative of multicore performance by a theoretical argument (the slow path is taken so rarely that it can be ignored). With a single thread I could already measure the speed of the check including the synchronisation code, with its various potential costs such as instruction latency and throughput, memory loads, branch (mis)predictions, instruction decoding, etc.
  • The performance difference between a 1-level bibop and a 2-level bibop is that with a 1-level bibop you can cache the address of the table in a register. But I saw little impact from this optimisation in practice, and it can only be used in hot loops (essentially the marking loop). It is possible that a 2-level bibop gives the same good performance. A 2-level bibop is probably more robust, since it avoids the paging tricks I used for the 1-level bibop.
  • The page allocator is called infrequently and does not need to be ultra-fast; I got a very simple and performant best-fit allocator out of skiplists (one skiplist ordered by starting address of the free block, and one ordered by block size). I did not investigate the cost of synchronisation, but I count on the fact that it is called infrequently.

So the main performance risk would be the implementation of Is_young. If you want to look further into the bibop option, as a next step I can show you an implementation of Is_young with bibop and synchronisation in godbolt, so you can look at the generated assembly on various platforms. I could also extend my experiment with a 2-level bibop to ensure that the good performance remains the same (but it would take me more time).

@Gbury
Contributor

Gbury commented Nov 9, 2021

Given that enabling AFL is a configure time option

I'm not sure whether that has changed with multicore, but as far as I recall, on current regular OCaml all the afl configure option does is enable afl instrumentation by default. In particular, even on a switch not configured with afl, it is possible to instrument programs with afl, though in that case the stdlib will not be instrumented, and instrumenting the dependencies of a project might be complicated. In any case, it is currently possible to use afl on a switch not configured with afl (you can verify this by following the example from the manual on a regular switch).

@kayceesrk
Contributor

@Gbury Thanks for clarifying how AFL fits in the compilation.
