
Remove the coordinator (controller) thread #1062

Closed

Conversation

wks commented Jan 5, 2024

DRAFT: Needs polishing, performance evaluation, and updates to the bindings.

This PR removes the coordinator (a.k.a. controller) thread. GC worker threads now coordinate themselves.

Major changes include:

  • GCController is removed. All synchronization mechanisms for the communication between the controller and workers are removed, too.
  • The mechanism for triggering GC is redesigned. Mutators now communicate with workers directly, and a variable WorkerMonitorSync::should_schedule_gc is added as the communication channel. To trigger a GC, a mutator acquires the mutex WorkerMonitor::sync, sets sync.should_schedule_gc to true, and notifies one worker. The last parked worker always checks this variable before blocking. (A sketch of this handshake follows the list.)
  • The last parked worker (instead of the coordinator) is now responsible for adding the ScheduleCollection work packet when a GC is requested.
  • The last parked worker (instead of the coordinator) is now responsible for opening work packets when open buckets are drained.
  • The work packet EndOfGC is removed. The code is now executed by the last parked worker when GC finishes.
  • GlobalState::gc_status is now Atomic, and it is read by the last parked worker to decide whether a GC is in progress. Previously it was unused in the Rust MMTk, and was only used for assertions in the JikesRVM version of MMTk.
  • Added GlobalState::gc_start_time to record the start time of a GC. It was a local variable in GCController.
  • Added method WorkBucket::add_no_notify to add work packets while holding the mutex WorkerMonitor::sync.
  • Updated comments and removed comments that are no longer relevant after this change.
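
To make the new handshake concrete, here is a minimal, self-contained sketch of the idea. This is not the actual mmtk-core code: apart from the names quoted above (WorkerMonitor::sync, WorkerMonitorSync::should_schedule_gc, ScheduleCollection, WorkBucket::add_no_notify), the types and methods below are illustrative only.

```rust
use std::sync::{Condvar, Mutex};

/// Simplified stand-in for WorkerMonitorSync: state guarded by the monitor mutex.
struct MonitorSync {
    should_schedule_gc: bool, // set by mutators to request a GC
    parked_workers: usize,    // number of workers currently parked
}

/// Simplified stand-in for WorkerMonitor.
struct Monitor {
    sync: Mutex<MonitorSync>,
    cond: Condvar,
}

impl Monitor {
    /// A mutator triggers a GC: set the flag under the mutex and wake one worker.
    /// There is no coordinator thread to notify any more.
    fn request_gc(&self) {
        let mut sync = self.sync.lock().unwrap();
        if !sync.should_schedule_gc {
            sync.should_schedule_gc = true;
            self.cond.notify_one();
        }
    }

    /// A worker parks when it finds no work. The last worker to park checks the
    /// request flag before blocking; if a GC was requested, it (rather than a
    /// coordinator) adds the ScheduleCollection packet, e.g. via something like
    /// WorkBucket::add_no_notify, since the monitor mutex is already held.
    fn park<F: FnOnce()>(&self, num_workers: usize, add_schedule_collection: F) {
        let mut sync = self.sync.lock().unwrap();
        sync.parked_workers += 1;
        if sync.parked_workers == num_workers && sync.should_schedule_gc {
            sync.should_schedule_gc = false;
            add_schedule_collection();
        } else {
            // Block until new work arrives or a GC is requested.
            sync = self.cond.wait(sync).unwrap();
        }
        sync.parked_workers -= 1;
    }
}
```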

Fixes: #1053
Closes: #1061

wks commented Jan 8, 2024

I ran lusearch from DaCapo Chopin on bobcat.moma, 2.5x min heap size w.r.t. OpenJDK's G1, 40 invocations, 5 iterations each, comparing master (build1) and this PR (build2).

The following plot is normalised to build1:
[plot image]
plotty link: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|bobcat-2024-01-08-Mon-075256&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

The following plot is not normalised:
[plot image]
plotty link: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|bobcat-2024-01-08-Mon-075256&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|41&Histogram%20(with%20CI)^build^mmtk_gc&

From the plots, it looks like this PR reduces the STW time when the absolute (not normalised) number of GCs is large. There is a 1% to 2% STW time reduction for GenCopy, GenImmix, SemiSpace and StickyImmix. For Immix, the STW time increased by 0.3%, but the error bar is much larger than the difference. From the individual data, it looks like some outlier invocations triggered more GCs than others. The following plot shows the number of GCs executed in each invocation when running Immix, sorted and grouped by build. It shows that the distribution of the number of GCs is the same, but "build2" (yellow) contains some outliers at the right end of the curve.

[plot image]

We see that this PR makes GC slightly faster, and it mainly impacts the plans that trigger GC often (mainly generational plans). I think the main reason is that this PR saves one context switch every time all workers stop (because there is no longer a need to switch to the coordinator). If a plan does more GCs but less work in each GC (mostly nursery GCs), the synchronisation overhead becomes significant. However, it is hard to explain why SemiSpace becomes faster this way, because it is not generational. Maybe SemiSpace simply triggered GC too many times at this heap size.

@caizixian requested review from wenyuzhao and caizixian and removed the request for caizixian on January 9, 2024 09:56

wks commented Jan 10, 2024

I ran the same setup (lusearch, 2.5x min heap size, 40 invocations) on skunk.moma. This Coffee Lake machine does not have small ("efficiency") cores, so it is expected to have smaller variance.

Normalised to build1:
[plot image]

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|skunk-2024-01-10-Wed-093645&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

STW times are reduced for all plans, and Mutator (other) times are increased for all plans.

I realised that there is one problem with the measurement: the definition of "STW time" changed in this PR. On master, the STW time starts when the coordinator thread receives a message from the mutator (i.e. when the "requested" flag is set and the coordinator thread is notified via the CondVar), but in this PR, the STW time starts at the beginning of the ScheduleCollection work packet. For this reason, a greater proportion of time is counted as mutator time. This would explain why the greater the speed-up in STW time, the greater the slow-down in mutator time.

Update: No. The GC time printed in the log at the end of GC changed for that reason, but the STW time recorded by Stats didn't change. The recorded STW time always starts at mmtk.set_gc_status(GcStatus::GcPrepare), which happens inside ScheduleCollection for both builds.

The total time reflects the actual speed-up. There is a 0.3% slow-down for Immix, likely because random effects slightly increased the number of GCs.

The same data without normalisation:
[plot image]
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|skunk-2024-01-10-Wed-093645&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|41&Histogram%20(with%20CI)^build^mmtk_gc&

At this heap size, GenCopy and GenImmix spent more time in STW than in the mutator (other). This would explain why their normalised time.other values apparently increased so much: we are dividing the same absolute difference by a smaller denominator. Update: Since the way of recording STW time didn't change, we need a different explanation.

wks commented Jan 10, 2024

Here are results from running more benchmarks, at 3.0x min heap size with 20 invocations. I split the tests across two machines (with identical hardware and identical binaries).

From the geometric means of the total time, this PR speeds up all the plans tested, but only by 0.1% to 0.3%.

wks commented Jan 12, 2024

I repeated the lusearch run on three machines:

Re-ran on skunk.moma:
[plot image]
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|skunk-2024-01-11-Thu-100400&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

fisher.moma:
[plot image]
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|fisher-2024-01-11-Thu-100421&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

mole.moma:
[plot image]
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2024-01-11-Thu-100437&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

On skunk.moma and fisher.moma, the mutator time (time.other) increased considerably for almost all plans, but on mole.moma it decreased. This shows that the measured increase in mutator time is not random, because it is repeatable. I still don't have a good explanation for it yet.

The GC time (time.stw) decreased for all plans on all machines, except Immix on fisher.

The total time (time) decreased for all plans on all machines, but not by much.

wks commented Jan 14, 2024

I added a build3, which is #1067. It further refactors build2 so that it is easy to express the idea of "letting all workers do one thing" (i.e. work towards one "goal"), which can be either "GC" or "stopping for fork" at this time. I did the refactoring because I think the current code in this PR complicates things a bit (although it already greatly simplifies the synchronisation by removing the coordinator) and makes it hard to add forking support.
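
For reference, the "goal" idea can be summarised by a small enum like the sketch below. The names are paraphrased from the description above and are not necessarily the exact identifiers used in #1067.

```rust
/// Workers collectively work towards at most one goal at a time;
/// the last parked worker decides which requested goal to pursue next.
enum WorkerGoal {
    /// Perform a garbage collection.
    Gc,
    /// Bring all GC worker threads to a stop so the process can fork safely.
    StopForFork,
}
```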

Results on skunk.moma, with the same settings as before except for the added build3:

[plot image]

build2 and build3 have the same performance characteristics: time.other increased, time.stw decreased, and time decreased. The geomean shows build3 performs slightly better than build2, probably due to the simplified synchronisation and one removed mutex. But it is still only a 0.5% improvement over master in total time.

wks commented Jan 31, 2024

I am closing this PR in favour of #1067 (Remove coordinator and support forking). The reason is that after we remove the coordinator thread, we need a way for workers to coordinate themselves, and the concrete way they coordinate is dictated by the requirement of supporting forking. That is, we must implement a coordination mechanism that (1) does not need a dedicated coordinator thread, and (2) makes it easy to bring down all GC workers when forking. The code in this PR accomplishes (1) but is clearly not suitable for (2), and #1067 has already done non-trivial refactoring on top of this PR. Therefore, it would be a waste of effort to merge this PR before applying #1067.

wks closed this on Jan 31, 2024
Successfully merging this pull request may close these issues:

  • Remove or refactor GcStatus
  • Remove the coordinator (controller) thread