Fixing remarking to be safe with parallel domains #474

ctk21 · 2021-02-19T15:48:28Z

This PR implements a solution to the problem of remarking pools owned by another domain as in #466.

Remarking occurs when a domain's mark stack overflows its allowable space in realloc_mark_stack. Pools are determined to need remarking and all entries removed from that domain's mark stack for those pools. There is however a race bug should a domain attempt to remark a pool while another domain is allocating into that pool, because all allocations start as being MARKED even when their fields are not safe to use.

The solution implemented pushes the responsibility for remarking a pool to the domain that owns the pool. This solves the race by making sure that the domain doing the remarking either owns the pool or knows that nobody else can use the pool for allocation.

This PR introduces a mechanism for doing this and handles the domain lifecycle interactions needed to make it correct. In particular:

there are now remark_stacks which hold pools that require remarking. There is a stack for each domain and also a global_remark_stack for orphaned pools that need remarking. These stacks live in a static area to make the domain lifecycle implications easier; every domain has a valid stack even while it is terminating.
when a domain determines a pool needs remarking, it attempts to push it to the domain that owns it. If there isn't an owner or the domain that owns it is no longer accepting entries because it is terminating, the pool ends up in the global_remark_stack. When a domain has pushed a pool to a remark stack, it is guaranteed that it will be remarked by somebody before the current phase of the major GC is complete.
when a domain finds work in its remark stack (or the global_remark_stack) it calls pop_from_remark_stack in major_gc.c which: (i) if we own the pool, we remark; (ii) if another domain owns the pool, we push it to the correct accepting remark stack (that domain or global); (iii) if it is an orphaned pool, we lock the orphaned pool adoption lock to ensure that nobody can take ownership of the pool, then remark the pool and then release the adoption lock.
there is an additional global atomic num_domains_to_remark in the major GC which checks that all remark stacks are empty before allowing phases to move forwards.
on domain termination we have to be careful at what point we allow the domain to cease servicing its remark stack. It needs to happen at exactly the point that we know all its pools will be orphaned, otherwise there is the potential for deadlock where the domain ceases servicing its remarking stack, but remarking is needed to complete a major GC STW segment.
while debugging I discovered we were not correctly tracking pool ownership in the adoption paths; I've fixed this and added asserts when we orphan pools to check the ownership tracking is correct.

I've tested the implementation with the testsuite on a bunch of machines and also a big overnight loop of the troublesome domain_serial_spawn_burn (bytecode edition & native).

There are some niggles with how this has ended up:

I think the static remark_stack structures belong in major_gc.c not domain.c to keep it in the same place as the GC; there isn't a strong reason it needs to be in domain.
I'm not overly fond of the efficiency of what is going on with Caml_state->marking_done combined with the remarking stack checks we have to push in front of the guard. I was afraid of changing Caml_state->marking_done because sometimes the Caml_state->marking_done guard is an efficiency optimization, but in other places it is used for correct operation. I'm minded to leave the inefficiency there until there is a known problem, but may have a flash of insight on how to make it all a bit cleaner.

However I'd like feedback before refactoring it.

…ng to domain stacks implemented; setup and teardown working

…adoption process for remarking

…h; addition of asserts to make sure ownership is correct at the point we move pools

… in the remark stack

… and atomically with regard to accepting new work

Sudha247

I went through the PR and am convinced that this solves the bug with remarking pools!
I think it is a good idea to move the remark_stack related structures to major_gc.c from domain.c.

I just had some minor questions.

runtime/domain.c

runtime/major_gc.c

ctk21 added 6 commits February 19, 2021 15:13

Move to global remark stacks in the domain structures; push and pulli…

fee6843

…ng to domain stacks implemented; setup and teardown working

handle remark stack polling, pushing from other domains, locking the …

1a4e4d9

…adoption process for remarking

re-enable mark stack overflow

ee01d12

Fix pool ownership tracking in the case of adopting in allocation pat…

b393667

…h; addition of asserts to make sure ownership is correct at the point we move pools

Avoid getting stuck in finish_major_cycle_callback when there is work…

73a2129

… in the remark stack

attempt shutdown of remark stack at the point we are sure to shutdown…

21bcf03

… and atomically with regard to accepting new work

Sudha247 reviewed Mar 3, 2021

View reviewed changes

runtime/domain.c Show resolved Hide resolved

runtime/domain.c Show resolved Hide resolved

runtime/major_gc.c Show resolved Hide resolved

ctk21 mentioned this pull request Apr 21, 2022

Re-enable mark stack compression ctk21/ocaml#2

Closed

This was referenced May 30, 2022

Mark stack compression ctk21/ocaml#3

Closed

Re-enabling mark stack pruning ocaml/ocaml#11284

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing remarking to be safe with parallel domains #474

Fixing remarking to be safe with parallel domains #474

ctk21 commented Feb 19, 2021

Sudha247 left a comment

Fixing remarking to be safe with parallel domains #474

Are you sure you want to change the base?

Fixing remarking to be safe with parallel domains #474

Conversation

ctk21 commented Feb 19, 2021

Sudha247 left a comment

Choose a reason for hiding this comment