Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing remarking to be safe with parallel domains #474

Draft
wants to merge 6 commits into
base: parallel_minor_gc
Choose a base branch
from

Conversation

ctk21
Copy link
Collaborator

@ctk21 ctk21 commented Feb 19, 2021

This PR implements a solution to the problem of remarking pools owned by another domain as in #466.

Remarking occurs when a domain's mark stack overflows its allowable space in realloc_mark_stack. Pools are determined to need remarking and all entries removed from that domain's mark stack for those pools. There is however a race bug should a domain attempt to remark a pool while another domain is allocating into that pool, because all allocations start as being MARKED even when their fields are not safe to use.

The solution implemented pushes the responsibility for remarking a pool to the domain that owns the pool. This solves the race by making sure that the domain doing the remarking either owns the pool or knows that nobody else can use the pool for allocation.

This PR introduces a mechanism for doing this and handles the domain lifecycle interactions needed to make it correct. In particular:

  • there are now remark_stacks which hold pools that require remarking. There is a stack for each domain and also a global_remark_stack for orphaned pools that need remarking. These stacks live in a static area to make the domain lifecycle implications easier; every domain has a valid stack even while it is terminating.
  • when a domain determines a pool needs remarking, it attempts to push it to the domain that owns it. If there isn't an owner or the domain that owns it is no longer accepting entries because it is terminating, the pool ends up in the global_remark_stack. When a domain has pushed a pool to a remark stack, it is guaranteed that it will be remarked by somebody before the current phase of the major GC is complete.
  • when a domain finds work in its remark stack (or the global_remark_stack) it calls pop_from_remark_stack in major_gc.c which: (i) if we own the pool, we remark; (ii) if another domain owns the pool, we push it to the correct accepting remark stack (that domain or global); (iii) if it is an orphaned pool, we lock the orphaned pool adoption lock to ensure that nobody can take ownership of the pool, then remark the pool and then release the adoption lock.
  • there is an additional global atomic num_domains_to_remark in the major GC which checks that all remark stacks are empty before allowing phases to move forwards.
  • on domain termination we have to be careful at what point we allow the domain to cease servicing its remark stack. It needs to happen at exactly the point that we know all its pools will be orphaned, otherwise there is the potential for deadlock where the domain ceases servicing its remarking stack, but remarking is needed to complete a major GC STW segment.
  • while debugging I discovered we were not correctly tracking pool ownership in the adoption paths; I've fixed this and added asserts when we orphan pools to check the ownership tracking is correct.

I've tested the implementation with the testsuite on a bunch of machines and also a big overnight loop of the troublesome domain_serial_spawn_burn (bytecode edition & native).

There are some niggles with how this has ended up:

  • I think the static remark_stack structures belong in major_gc.c not domain.c to keep it in the same place as the GC; there isn't a strong reason it needs to be in domain.
  • I'm not overly fond of the efficiency of what is going on with Caml_state->marking_done combined with the remarking stack checks we have to push in front of the guard. I was afraid of changing Caml_state->marking_done because sometimes the Caml_state->marking_done guard is an efficiency optimization, but in other places it is used for correct operation. I'm minded to leave the inefficiency there until there is a known problem, but may have a flash of insight on how to make it all a bit cleaner.

However I'd like feedback before refactoring it.

Copy link
Collaborator

@Sudha247 Sudha247 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the PR and am convinced that this solves the bug with remarking pools!
I think it is a good idea to move the remark_stack related structures to major_gc.c from domain.c.

I just had some minor questions.

runtime/domain.c Show resolved Hide resolved
runtime/domain.c Show resolved Hide resolved
runtime/major_gc.c Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants