Improve the virtual memory consumption on Linux #733
Another option is to implement |
I don't think it is unreasonable to expect that 80-90% of code hitting 5.00 upon release will be sequential (for a start at least). This indicates that the single-threaded-to-multicore transition is central. A "pay-as-you-go" approach, where a single-threaded "hello world" does not have to reserve virtual memory for a worst case of 128 domains with a 4GB minor heap each, is preferable IMHO. With this in mind one idea is:
This will take at most 1 "remap", so it should be less costly for multi-domain programs than paying at each |
@kayceesrk suggested we track a list of things broken by the additional memory requirements: For AFL-users any invocation of
@jmid Reading more about the memory limit implemented by |
Yes, I think so. Few of the above links adjust the default limit when fuzzing OCaml code, though.
I agree that they can be considered separate. |
We have a

My other thought is that if all the failures due to virtual memory management are only due to AFL, we may go for an alternative strategy. Given that enabling AFL is a configure time option (the compiler will have to be built with afl enabled), and IIUC, AFL for multi-threaded programs has stability problems (I don't know how bad this is), we can make the

That said, I suspect that virtual memory reservation would be an issue for running OCaml 5.00 for MirageOS, such as running baremetal on RPI 4: https://github.com/dinosaure/gilbraltar.

@gadmm Regarding the 2-level bibop experiment that you'd mentioned here, have you factored in the cost of synchronization? I haven't read your code, but if the design is such that the domains request additional pages on demand for the minor heap (until they hit some minor heap limit at which point we'd trigger the minor GC), then the

The alternative may be to pre-allocate the pages for each domain at the domain startup, and make domain startup a stop-the-world operation. As a result, we will have the guarantee that the bibop will not change when domains are executing in non-stop-the-world sections. Both of these need careful implementation, optimisation and performance analysis with respect to our parallel benchmarks. |
WASM has the same issue where you can only grow memory linearly. The page allocator lets you abstract away from mmap; it is straightforward to adapt mine to work with sbrk, but you will miss things like returning memory to the OS (which I guess does not make sense for bare metal anyway).
Yes, the point is to allow dynamic growth while being efficient. The page entries are monotonic: they can only go from undefined to some defined value. This way you do not need to synchronise in the hot path. Then you arrange that the check for the slow path (when you hit an undefined value) is for free using branch prediction (e.g. the compiler does it for you with an expect). The slow path can do various things but for Is_young and a 2-level bibop you only need two acquire loads. (In addition, an implementation without a slow path is probably available for Is_young on x86 where acquire loads are for free.) To cover the 48-bit (or more) address space you need bibop entries to cover large ranges (256MB in my 1-level bibop, and from 4MB to 64MB in Go's 2-level bibop). This has two performance benefits:
This is another option (though you might now be convinced that synchronisation is not an issue). It is not clear to me how fast you want domain spawning to be.
More details on my experiment:
So the main performance risk would be the implementation of Is_young. If you want to look further into the bibop option, as a next step I can show you an implementation of Is_young with bibop and synchronisation in godbolt, so you can look at the generated assembly on various platforms. I could also extend my experiment with a 2-level bibop to ensure that the good performance remains the same (but it would take me more time). |
I'm not sure if that has changed with multicore, but at least as far as I recall on current regular OCaml, all the |
@Gbury Thanks for clarifying how AFL fits in the compilation. |
A design assumption in multicore was that we could reserve address space and not commit that address space for use without it impacting users. There are examples (e.g. #732) where reserving large amounts of address space has an impact; note that most of the address space isn't committed.
As things currently stand, the parallel minor collector needs to have a region containing the minor heaps of all domains; this enables a fast `Is_young` check based on the top and bottom of the region. Hence at startup multicore reserves the maximum required area and then commits it as needed by domains.

Other designs for orchestrating the minor heaps of domains are possible, but should consider:
- `Is_young` as used in the runtime
- `Gc` set

To throw some options into the ring:
- `mmap` (or `malloc`) the minor heap area at the end of each minor collection when inside a STW segment; domain spawn gives the new domain a 0-size minor heap and so forces a STW minor collection
- `mmap` in the simple strategy above; e.g. only reserve for the current minor heap size and in blocks of 1, 2, 4, 8, etc. domains at each STW minor collection to trade off spawn vs reserved space