Fixing remarking to be safe with parallel domains #474
This PR implements a solution to the problem of remarking pools owned by another domain, as described in #466.

Remarking occurs when a domain's mark stack overflows its allowable space in `realloc_mark_stack`: pools are determined to need remarking, and all entries for those pools are removed from that domain's mark stack. There is, however, a race bug should a domain attempt to remark a pool while another domain is allocating into that pool, because all allocations start as `MARKED` even when their fields are not yet safe to use.

The solution implemented here pushes the responsibility for remarking a pool to the domain that owns it. This resolves the race by ensuring that the domain doing the remarking either owns the pool or knows that nobody else can use the pool for allocation.
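To illustrate the hazard, here is a minimal single-threaded toy model (not the runtime's actual code; every name in it is invented for illustration) of why scanning `MARKED` objects races with allocation:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: allocation sets the mark bit first (objects start MARKED)
   and fills in the fields afterwards, so a concurrent remark pass can
   observe a MARKED object whose fields are still garbage.  All names
   here are invented; this is not the runtime's code. */

enum { UNINITIALISED = -1 };

typedef struct {
  int marked;  /* set at allocation time, before the fields are valid */
  int field;   /* filled in by the allocating domain afterwards       */
} obj_t;

/* Step 1 of allocation: reserve the slot; the object is already MARKED. */
static void alloc_reserve(obj_t *o) {
  o->marked = 1;
  o->field = UNINITIALISED;  /* stands in for "garbage" */
}

/* Step 2 of allocation: initialise the fields. */
static void alloc_init(obj_t *o, int v) { o->field = v; }

/* A remark pass scans every MARKED object in the pool; this is only safe
   if the pool's owner cannot be between steps 1 and 2 of an allocation. */
static int remark_is_safe(const obj_t *pool, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (pool[i].marked && pool[i].field == UNINITIALISED)
      return 0;  /* would scan an uninitialised object */
  return 1;
}
```

Having only the owning domain do the remarking (or holding the adoption lock for orphaned pools) guarantees that nobody can be between the two allocation steps while the scan runs.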
This PR introduces a mechanism for doing this and handles the domain lifecycle interactions needed to make it correct. In particular:
- Each domain has its own remark stack, and there is a `global_remark_stack` for orphaned pools that need remarking. These stacks live in a static area to make the domain lifecycle implications easier; every domain has a valid stack even while it is terminating.
- When a domain terminates, any remaining entries on its remark stack are moved to the `global_remark_stack`. Once a domain has pushed a pool to a remark stack, it is guaranteed that the pool will be remarked by somebody before the current phase of the major GC is complete.
- When a domain pops a pool from a remark stack (its own or the `global_remark_stack`), it calls `pop_from_remark_stack` in `major_gc.c`, which: (i) if we own the pool, remarks it; (ii) if another domain owns the pool, pushes it to the correct accepting remark stack (that domain's or the global one); (iii) if it is an orphaned pool, takes the orphaned-pool adoption lock to ensure that nobody can take ownership of the pool, remarks the pool, and then releases the adoption lock.
- There is a counter, `num_domains_to_remark`, in the major GC which checks that all remark stacks are empty before allowing phases to move forwards.

I've tested the implementation with the testsuite on a bunch of machines, and also with a big overnight loop of the troublesome
`domain_serial_spawn_burn` test (both bytecode and native).

There are some niggles with how this has ended up:
- The remark stack machinery lives in `major_gc.c`, not `domain.c`, to keep it in the same place as the GC; there isn't a strong reason it needs to be in `domain.c`.
- There's some inefficiency around `Caml_state->marking_done` combined with the remarking stack checks we have to push in front of the guard. I was afraid of changing `Caml_state->marking_done` because sometimes the `Caml_state->marking_done` guard is an efficiency optimization, but in other places it is used for correct operation. I'm minded to leave the inefficiency there until there is a known problem, but may have a flash of insight on how to make it all a bit cleaner.

However, I'd like feedback before refactoring it.
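For reference, the three-way dispatch described above can be sketched as a toy model. Only the name `pop_from_remark_stack` and the (i)/(ii)/(iii) split come from this PR's description; the types, helpers, and lock representation below are assumptions made for illustration:

```c
#include <assert.h>

/* Toy model of the dispatch performed when a pool is popped from a
   remark stack.  Only pop_from_remark_stack and the three-way split
   are from the PR; everything else is invented for illustration. */

enum { NO_OWNER = -1, NOT_PUSHED = -2 };

typedef struct {
  int owner_id;   /* owning domain, or NO_OWNER for an orphaned pool */
  int remarked;   /* set once the pool has actually been remarked    */
  int pushed_to;  /* domain whose remark stack we handed the pool to */
} pool_t;

static int adoption_lock_held = 0;  /* models the orphaned-pool adoption lock */

static void remark_pool(pool_t *p) { p->remarked = 1; }

static void push_to_remark_stack(int target_domain, pool_t *p) {
  p->pushed_to = target_domain;
}

/* Dispatch, as run by domain `self`: */
static void pop_from_remark_stack(int self, pool_t *p) {
  if (p->owner_id == self) {
    remark_pool(p);                        /* (i) our own pool: remark it  */
  } else if (p->owner_id != NO_OWNER) {
    push_to_remark_stack(p->owner_id, p);  /* (ii) hand to the owner's (or
                                              the global) remark stack    */
  } else {
    adoption_lock_held = 1;                /* (iii) orphaned: block        */
    remark_pool(p);                        /*      adoption while scanning */
    adoption_lock_held = 0;
  }
}
```

In each of the three cases, either the remarking domain owns the pool or no other domain can allocate into it while the scan runs, which is exactly the invariant the fix relies on.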