Do new pools "own" the current domain? #77

Open
favonia opened this issue Jul 19, 2022 · 30 comments · Fixed by #78

Comments

@favonia
Contributor

favonia commented Jul 19, 2022

I could not tell whether a newly created pool "owns" the current domain. In one of the test files, there are consecutive calls to setup_pool:

let pool1 = Task.setup_pool ~num_additional_domains:2 ~name:"pool1" () in
let pool2 = Task.setup_pool ~num_additional_domains:2 ~name:"pool2" () in

Therefore, it seems okay to have multiple pools sharing the same domain (that is, the current domain). However, the documentation gave me the impression that I have to manually spawn new domains before setting up pools to avoid overlapping:

domainslib/lib/task.mli

Lines 10 to 13 in 15f04f3

val setup_pool : ?name:string -> num_additional_domains:int -> unit -> pool
(** Sets up a task execution pool with [num_additional_domains + 1] domains
including the current domain. If [name] is provided, the pool is mapped to
[name] which can be looked up later with [lookup_pool name].

I wonder what the intended semantics are. What is the ownership relation between pools and domains? This might be related to #55.

@kayceesrk
Contributor

kayceesrk commented Jul 21, 2022

Thanks for bringing this up. The documentation attached to setup_pool is not correct. The current domain does not get attached to the pool that it sets up. The num_additional_domains terminology is also wrong and should be changed to num_domains to reflect how pools are really set up.

The confusion arises from the fact that during T.run pool task, the calling domain also participates in the execution of the task, waiting for the task to run to completion. If the aim is to only submit a task but not wait for completion, one may use T.async pool task, which adds a task to the pool and the domains in the pool execute the task. The result of executing the task can be obtained later using T.await.
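
A minimal sketch of that flow, assuming the domainslib Task API (the pool-size parameter has been called both num_additional_domains and num_domains depending on the version; num_domains and the concrete values are illustrative):

module T = Domainslib.Task

let () =
  (* create a pool of worker domains; the calling domain is not one of them *)
  let pool = T.setup_pool ~num_domains:3 () in
  T.run pool (fun () ->
    (* submit a task; the caller gets a promise back immediately *)
    let p = T.async pool (fun () -> 40 + 2) in
    (* wait on the promise; while waiting, the calling domain may also
       service other tasks from the pool *)
    Printf.printf "result = %d\n" (T.await pool p));
  T.teardown_pool pool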

CC @Sudha247.

@gasche
Contributor

gasche commented Oct 18, 2022

I came here because I was bitten by the new API, which I found confusing, and I was about to open an issue saying "hey, maybe we should rename this parameter to num_additional_domains". Running in circles!

The code I wrote would do

let num_domains = Domain.recommended_domain_count () in
let pool = Domainslib.setup_pool ~num_domains () in
(* ... *)

and @Sudha247 sharp-eyedly remarked that this was wrong. It may be wrong but it is also very easy to write, which suggests that the current API is problematic.

Can we do another round of brainstorming on how to propose a clear API here?

Remark 1: there is a fundamental difference between the status of pool domains ("worker domains") and the status of the "caller domain". Worker domains loop doing pool work forever. The caller domain wants to do the smallest amount of pool computation to get its result. It is correct to say that the tasks run on [num_domains + 1] domains, but the status of both sides of the addition is very different.

Remark 2: currently the documentation and implementation of get_num_domains is inconsistent with the ~num_domains parameter:

val get_num_domains : pool -> int
(** [get_num_domains pool] returns the total number of domains in [pool]
    including the parent domain. *)

let get_num_domains pool =
  let pd = get_pool_data pool in
  Array.length pd.domains + 1

Short-term suggestions:

  • num_worker_domains?
  • Have a code example in setup_pool that uses recommended_domain_count in the ... recommended way.
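
For the second suggestion, a sketch of what such an example might look like, assuming the parameter is named num_domains and counts only worker domains (so one is subtracted for the calling domain):

(* hypothetical recommended usage; num_domains is the naming proposed in this thread *)
let pool =
  Domainslib.Task.setup_pool
    ~num_domains:(Domain.recommended_domain_count () - 1) ()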

Medium-term suggestion. Taking a step back, I think that the problem comes from the fact that Domainslib.Task mixes two things together, an abstraction for a pool of worker domains and an abstraction for asynchronous tasks. I think that the API could be reworked to separate the two: on one side an asynchronous-task abstraction with explicit schedulers, and on the other a domain pool abstraction with an explicit (concurrent) work-channel (the interface for other domains to send work to the domain pool). The connection between the two is that the library offers a task scheduler that expects a pool work-channel as a parameter.

@gasche
Contributor

gasche commented Oct 18, 2022

(If others agree that the current state is not really satisfying, I think the present issue could be reopened.)

@haesbaert

haesbaert commented Oct 18, 2022

Would just add a small nit here: IMHO it's easier to only expose an API that gives you the "maximum number of workers", and to not write code that expects to be a participating worker. As in:
D0 -> forks recommended_domain_count domains -> waits for completion
It's OK to waste one extra domain, no one is going to jail, the cost is negligible.

@gasche
Contributor

gasche commented Oct 18, 2022

I agree that it's a simpler model to think about. If we wanted to go this way (present this as the expected usage mode), we could probably also simplify the implementation of Task (which currently I find a bit hairy) to follow this model.

On the other hand: I can see how to use Domainslib to build "parallel building blocks" (parallel_for, parmap etc.) that should remain low-overhead in the case where they are given small inputs and it's not worth it to parallelize computations. (This way most users can call them unconditionally instead of having to implement their own fast-path-on-small-inputs logic.) Currently doing this with Domainslib always involves setting up an effect handler (for Task.run), but no inter-domain communication (if we end up in the sequential path of parallel_for where no fork-join is performed). With your proposed approach, if used in the implementation, we always pay for at least one inter-domain communication.
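
As an illustration (not the actual domainslib implementation), one way such a building block could look, with the small-input fast path built in so callers can use it unconditionally; parmap_array and the threshold are illustrative names and values:

module T = Domainslib.Task

(* parallel map over an array with a sequential fast path; sketch only *)
let parmap_array ?(threshold = 1024) pool f a =
  let n = Array.length a in
  if n = 0 then [||]
  else if n < threshold then Array.map f a  (* small input: no pool involved *)
  else begin
    let out = Array.make n (f a.(0)) in
    T.run pool (fun () ->
      T.parallel_for pool ~start:1 ~finish:(n - 1)
        ~body:(fun i -> out.(i) <- f a.(i)));
    out
  end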

@haesbaert

haesbaert commented Oct 18, 2022

I agree that it's a simpler model to think about. If we wanted to go this way (present this as the expected usage mode), we could probably also simplify the implementation of Task (which currently I find a bit hairy) to follow this model.

On the other hand: I can see how to use Domainslib to build "parallel building blocks" (parallel_for, parmap etc.) that should remain low-overhead in the case where they are given small inputs and it's not worth it to parallelize computations. (This way most users can call them unconditionally instead of having to implement their own fast-path-on-small-inputs logic.) Currently doing this with Domainslib always involves setting up an effect handler (for Task.run), but no inter-domain communication (if we end up in the sequential path of parallel_for where no fork-join is performed). With your proposed approach, if used in the implementation, we always pay for at least one inter-domain communication.

And I think this is fine: if you can't afford to pay one IPC or whatever to do a parallel computation on N domains (where N is typically around 8 in 2022), I'd say something is already wrong. If that's too expensive then the computation should not be parallelized (it wouldn't pay off anyway).

(full disclaimer, I'm not using Domainslib, so I might be talking out of line here)

@gasche
Contributor

gasche commented Oct 18, 2022

On an unrelated note: one benefit I would expect from rethinking the API a bit is that the Task operations (async, await, parallel_for etc.) should not require a pool parameter anymore. Currently we can only call those functions under run, and we have to pass the pool to run, but also to these functions. (Note that, unlike await, async is not implemented with an effect currently, and this would have to change to be able to do this simplification.)

@favonia
Contributor Author

favonia commented Oct 19, 2022

As the "author" of this issue, I support rethinking the API. My original concern is really about the obvious inconsistencies (which are now fixed). Maybe get_num_domains indeed should not add one to the size of the pool. I understand the collision of different intuitions/views and sadly do not have a good proposal here. 😅

@gasche
Contributor

gasche commented Oct 19, 2022

There are also implementation issues arising from this API confusion. For example, currently if you share a pool between two independent domains and they both call Task.run on separate tasks, one of the domains may end up running tasks from the other domain instead of completing its own task. This is unexpected (to me), but closer to what the previous documentation would say (in a certain sense the calling domain participates in the pool like any of the worker domains for the duration of run). Many parts of the documentation are in fact consistent with this behavior, but some are not (for example the description of the behavior of pools with 0 domains).
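
A hedged sketch of that scenario (the names, pool size, and num_domains parameter are illustrative): two independent domains share one pool and each calls Task.run on its own task; while one domain waits inside run, it may pick up and execute the task submitted by the other.

module T = Domainslib.Task

let () =
  let pool = T.setup_pool ~num_domains:2 () in
  let job name =
    T.run pool (fun () ->
      let p = T.async pool (fun () -> Printf.printf "%s\n%!" name) in
      T.await pool p)
  in
  let d1 = Domain.spawn (fun () -> job "task from domain 1") in
  let d2 = Domain.spawn (fun () -> job "task from domain 2") in
  Domain.join d1;
  Domain.join d2;
  T.teardown_pool pool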

@favonia
Contributor Author

favonia commented Oct 19, 2022

This is unexpected (to me), but closer to what the previous documentation would say

Well, I think "previous documentation" here might be misleading. It was not clear (to me) before #78 whether the calling domain (which can be different from the domain setting up the pool) participates at all, and thus I made the PR to document the existing behaviors more precisely. Perhaps you meant the name num_additional_domains matches your intuition better? 🙂 Personally I think there will be confusing cases either way until we redesign the API. (BTW, I believe pools with 0 domains should still match the general description. I added that extra text in #82 just to make the documentation more complete, not because it's a special case handled differently.)

@Sudha247
Contributor

Thanks everyone for the comments! Reopening the issue to continue discussion.

@favonia is right in that the previous documentation was inconsistent with the implementation. I can see how the current one could be confusing too.

@Sudha247 Sudha247 reopened this Oct 20, 2022
@kayceesrk
Contributor

I found num_additional_domains to be smelly. It is not clear what it is an "addition" to; what's the missing part? Like @gasche, my thinking was that we should separate the notion of pools from that of asynchronous tasks being executed on the pools. Following that reasoning, a pool is an abstraction that has N "worker" domains. Hence, creating the pool should take num_domains and not num_additional_domains.

Given the above, we should fix the buggy get_num_domains.

I envision the task abstraction being separate from the pool. You can submit tasks to the pool to be executed. Task.async pool task submits the task to the pool. The caller returns immediately with a promise. Task.await pool promise waits for the promise to be resolved. If the promise is not resolved at the time of the call, rather than sitting idle, the current domain also services tasks on the pool.

I also agree with @gasche that, now that we have a top-level handler in Task.run, the pool argument for everything except Task.run could be made an optional argument. I said something to this effect earlier this week to Sudha.

I'm in support of revising the API. The types pool and promise may be in their own modules Pool and Promise to indicate that they are separate? Similarly, the parallel iterators may be moved to their own module Iterator?
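
A purely hypothetical interface sketch of that separation (none of these modules or signatures exist in domainslib today; they only illustrate the proposed split):

module type POOL = sig
  type t
  val make : num_domains:int -> t
  val teardown : t -> unit
end

module type PROMISE = sig
  type 'a t
  (* no pool argument: the enclosing run's pool would be found via its effect handler *)
  val async : (unit -> 'a) -> 'a t
  val await : 'a t -> 'a
end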

@favonia
Contributor Author

favonia commented Oct 20, 2022

If the promise is not resolved at the time of the call, rather than sitting idle, the current domain also services tasks on the pool.

What I understand is that this could lead to the somewhat surprising behavior @gasche observed, that is, if a pool is assigned two tasks by two different domains, say Domains 1 and 2, then Domain 1 could work on the task assigned by Domain 2. As mentioned above by others, I think it's conceptually easier to let the waiting domain idle (at least after ocaml/ocaml#11589) at some cost of efficiency, treating the pool as some cloud service. This would also rule out the arguably confusing "0-sized pools", which could not run any task in this model and would thus be rejected during setup.


By the way, I believe there are many other alternative designs for the argument pool:

  1. Use a functor to generate a module for each pool---each pool has its own run (a sketch follows after this list). I tried this in my own development, and it seems quite clean in terms of programming. The downside is that one needs n runs for n pools. See the 5th item below.
  2. Make the pool argument to run mandatory, but others optional.
  3. Make all pool arguments optional, but err when no one specifies pool.
  4. Remove the pool argument from run, and keep the argument mandatory for others.
  5. Use a functor to generate a module for each pool, but with a global top-level run to handle all the effects. This is the functorized version of 4. It shares some similarities with 1 but can save a lot of top-level runs.
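
A hedged sketch of option 1, assuming the current Task API underneath (the MakePool name and the num_domains parameter naming are illustrative):

module MakePool (C : sig val num_domains : int end) = struct
  module T = Domainslib.Task
  let pool = T.setup_pool ~num_domains:C.num_domains ()
  (* each generated module carries its own run/async/await,
     so no per-call pool argument is needed *)
  let run f = T.run pool f
  let async f = T.async pool f
  let await p = T.await pool p
  let teardown () = T.teardown_pool pool
end

(* possible usage:
   module P = MakePool (struct let num_domains = 2 end)
   let _ = P.run (fun () -> P.await (P.async (fun () -> 1 + 1))) *)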

@gasche
Contributor

gasche commented Oct 20, 2022

I propose an API sketch in #92. It's not magically solving problems, but I tried to separate the various concerns, and I believe that it would make things easier to think about.

Note: it's interesting to wonder whether it is possible, in the middle of a Task computation, to run certain tasks on a dedicated pool. (This is the Tezos use-case of using separate pools for IO vs. crypto, when the two can be found in the same computation.) I'm not sure it's worth bending over backwards to support this use-case, but I realized after the fact that my API may support it by taking the not-so-intuitive step of nesting several Task handlers.

@kayceesrk
Contributor

As mentioned above by others, I think it's conceptually easier to let the waiting domain idle (at least after ocaml/ocaml#11589) at some cost of efficiency, treating the pool as some cloud service.

This is sub-optimal and I am worried about this choice. Remember that idle domains don't sit idle and must participate in GC. So if I have 4 cores, and I create a pool with 4 domains with the initial domain sitting idle, then during a GC all 5 domains are expected to work on the GC. Given that the minor GC is a stop-the-world parallel GC, the cores will be overcommitted with 5 domains on 4 cores. This would lead to a slower minor GC. OTOH, if you only create the pool with 3 domains, then while the cores are not overcommitted during a GC, they will be under-committed during actual work.

@gasche
Contributor

gasche commented Oct 21, 2022

This is somewhat of a tangent, but do you have a rough idea of the performance cost of this "over-committing during the minor GC" issue that you described? It sounds like something interesting to keep in mind when writing multi-domain OCaml programs, specific to the current runtime choice. What are the ballpark figures to keep in mind? (In general, can we quantify the cost of having more domains than cores for typical workloads?)

(Should I open an issue somewhere else than here to track this question?)

Note that I would expect the issue to matter only for pooled computations that are short relative to the time between two minor GCs. If the pool computation takes several minor GC cycles, then the forced-idle calling domain will not have allocated anything on the GC cycles after the first one, so its minor GC should return immediately and overcommitting should not have a cost, right? (Or maybe the domain would do "idle work" even when doing nothing, and this is currently not overcommitting-friendly either?)

@kayceesrk
Contributor

kayceesrk commented Oct 21, 2022

When all the cores actively perform “work”, we’ve observed a precipitous drop in performance when the cores are overcommitted. With an idle domain, I don’t have a good sense. We may try to run some of the parallel benchmarks from Sandmark with an extra idle domain to quantify the costs. My expectation is that the hit will not be insignificant. The idle domain at least needs to scan its roots.

Relatedly, I would also like any newly proposed API not to compromise parallel performance. I would like domainslib to be fast even if the API is hard to use.

@gasche
Contributor

gasche commented Oct 21, 2022

The idle domain at least needs to scan its roots.

We could shortcut this part whenever the minor heap is completely empty. (But there is still other shared work to do, on the remembered set etc.; I don't have a sense of whether root-scanning is really the dominant cost for compute-intensive multi-domain programs.)

I don't regret asking you this here, as I realize that I don't have any intuition for these aspects of the multicore runtime performance characteristics. My own intuition would rather have been that overcommitting is generally fine, unless your workflow and hardware make you care a lot about memory/core affinity (say on manycore systems with a complex memory hierarchy).

Will someone from the Multicore team write a blog/Discuss post to give ballpark figures on this question by the time of the 5.0 release? :-) :-)

@haesbaert

The idle domain at least needs to scan its roots.

We could shortcut this part whenever the minor heap is completely empty. (But there is still other shared work to do, on the remembered set etc.; I don't have a sense of whether root-scanning is really the dominant cost for compute-intensive multi-domain programs.)

I don't regret asking you this here, as I realize that I don't have any intuition for these aspects of the multicore runtime performance characteristics. My own intuition would rather have been that overcommitting is generally fine, unless your workflow and hardware make you care a lot about memory/core affinity (say on manycore systems with a complex memory hierarchy).

I just came to say I'm glad you asked; I had the same bias/intuition, and reading KC's response I realize I'm very likely wrong.

Will someone from the Multicore team write a blog/Discuss post to give ballpark figures on this question by the time of the 5.0 release? :-) :-)

@Sudha247
Contributor

I've also observed the slowdown introduced by idle domains, presumably due to the shared GC work, as noted by @kayceesrk. I tried to quantify it by benchmarking with idle domains present during the entire course of the program, and with the idle domains doing no mutator work, as would be the case were the waiting domain kept idle. I picked three examples for this: game_of_life (low GC activity), minilight (moderate GC activity), and binarytrees (high GC activity). The results are below:

Game of Life

GC Activity

No Idle Domain:

allocated_words: 2113083
minor_words: 11800
promoted_words: 592
major_words: 2101875
minor_collections: 6
major_collections: 4
forced_major_collections: 0
heap_words: 2179078
top_heap_words: 2162694
mean_space_overhead: 6477.178613

One idle Domain:

allocated_words: 2113093
minor_words: 11800
promoted_words: 588
major_words: 2101881
minor_collections: 6
major_collections: 4
forced_major_collections: 0
heap_words: 2174982
top_heap_words: 2162694
mean_space_overhead: 6477.178613

Two idle Domains:

allocated_words: 2113384
minor_words: 12096
promoted_words: 782
major_words: 2102070
minor_collections: 9
major_collections: 4
forced_major_collections: 0
heap_words: 2195462
top_heap_words: 2162694
mean_space_overhead: 6477.178613

Running Time

Cores | No Idle Domain | One Idle Domain | Two Idle Domains
-- | -- | -- | --
1 | 10.128s | 10.130s | 10.137s
2 | 5.118s | 5.116s | 5.114s
12 | 0.950s | 0.948s | 0.949s
24 | 0.536s | 0.537s | 0.538s

Minilight

GC Activity

No Idle Domain:

allocated_words: 3103998375
minor_words: 3103932991
promoted_words: 3924479
major_words: 3989863
minor_collections: 11860
major_collections: 18
forced_major_collections: 0
heap_words: 1478706
top_heap_words: 1491258
mean_space_overhead: 42.938824

One Idle Domain:

allocated_words: 3102822177
minor_words: 3102756823
promoted_words: 3925971
major_words: 3991325
minor_collections: 23737
major_collections: 30
forced_major_collections: 0
heap_words: 1171506
top_heap_words: 774458
mean_space_overhead: 33.120843

Two Idle Domains:

allocated_words: 3094857010
minor_words: 3094791635
promoted_words: 3920534
major_words: 3985909
minor_collections: 35526
major_collections: 34
forced_major_collections: 0
heap_words: 1085490
top_heap_words: 589874
mean_space_overhead: 28.858118

Running Time

Cores | No Idle Domain | One Idle Domain | Two Idle Domains
-- | -- | -- | --
1 | 12.01s | 5m20.24s | 15m56.52s
2 | 6.3s | 4m14.51s | 8m54.50s
12 | 1.35s | 20.38s | 30.82s
24 | 0.85s | 7.5s | 11.84s

Binary Trees

GC Activity

No Idle Domain:

allocated_words: 2034239109
minor_words: 2034239043
promoted_words: 757778323
major_words: 757778389
minor_collections: 7820
major_collections: 60
forced_major_collections: 0
heap_words: 52527104
top_heap_words: 64036864
mean_space_overhead: 48.821538

One Idle Domain:

allocated_words: 2034239514
minor_words: 2034239448
promoted_words: 666544045
major_words: 666544111
minor_collections: 15445
major_collections: 62
forced_major_collections: 0
heap_words: 49963008
top_heap_words: 61202432
mean_space_overhead: 0.230622

Two Idle Domains:

allocated_words: 2034239784
minor_words: 2034239744
promoted_words: 670593951
major_words: 670593991
minor_collections: 23095
major_collections: 67
forced_major_collections: 0
heap_words: 43044864
top_heap_words: 57495552
mean_space_overhead: 0.293219

Running Time

Cores | No Idle Domain | One Idle Domain | Two Idle Domains
-- | -- | -- | --
1 | 27.712s | 1m39.861s | 12m42.918s
2 | 15.579s | 2m33.516s | 3m34.283s
12 | 6.217s | 13.192s | 17.361s
24 | 4.584s | 8.256s | 8.877s

Idle domains introduce more overhead in programs with high GC activity; the effect is practically negligible when allocation is low, as observed in game_of_life. I don't have a complete sense of what's causing the slowdowns; I'll dig further into this. I'm guessing applications using domainslib lean more towards the former in their behaviour than the latter. I second KC's take that it'd be good not to compromise on performance, and ideally it'd be nice not to have domainslib users think about runtime activities.

We could shortcut this part whenever the minor heap is completely empty.

This can probably work. It will be interesting to see results with this on, if someone decides to hack on it. I recall @ctk21 also suggested something similar when we were discussing the idle-domain-introduced GC overheads. But, would the waiting domain's minor heap be entirely empty in this case? IIUC, it may not be the case when something else is going on before the waiting domain reached domainslib operations.

@gasche
Copy link
Contributor

gasche commented Oct 25, 2022

Thanks! Be careful when interpreting the GC statistics above: I think the GC-statistics code is currently buggy. In fact many of the stats you show exhibit the bug that top_heap_words is smaller than heap_words, which I found yesterday (while... trying my own experiment on the side to understand the influence of domain counts) and reported as ocaml/ocaml#11663. So let's take those statistics with a grain of salt and focus on time and externally-measured memory measures for now.

When you say "idle domains", what does the idle domain actually do? (Is it blocked on a lock, or looping with cpu_relax, or something else?) Do you have your test code somewhere? Also, what OS are you using to run these experiments?

(I'll upload what I was playing with shortly, in case it can help. My idle domains were blocked on a lock, and I didn't observe a notable slowdown. On the other hand I measured around a 2x slowdown when moving from recommended_domain_count to recommended_domain_count + 1 active domains.)

@gasche
Contributor

gasche commented Oct 25, 2022

I pushed my little benchmark at https://gitlab.com/gasche/idle-domains-bench

On my Linux machine with "4 cores but 8 threads" (recommended_domain_count is 8), with Firefox and what not running at the same time, the results look as follows:

w01: [i00 2.26s (x1.00)] [i01 2.30s (x1.02)] [i02 2.51s (x1.12)] [i05 2.39s (x1.06)] [i10 2.61s (x1.16)]
w02: [i00 2.70s (x1.20)] [i01 2.58s (x1.15)] [i02 2.77s (x1.23)] [i05 2.79s (x1.24)] [i10 3.08s (x1.37)]
w03: [i00 3.08s (x1.37)] [i01 3.02s (x1.34)] [i02 3.16s (x1.40)] [i05 3.67s (x1.63)] [i10 4.41s (x1.96)]
w04: [i00 4.17s (x1.85)] [i01 4.10s (x1.82)] [i02 3.98s (x1.77)] [i05 4.54s (x2.02)] [i10 4.88s (x2.17)]
w05: [i00 4.74s (x2.11)] [i01 4.94s (x2.20)] [i02 4.95s (x2.20)] [i05 5.13s (x2.28)] [i10 5.54s (x2.46)]
w06: [i00 5.60s (x2.49)] [i01 6.64s (x2.95)] [i02 5.95s (x2.64)] [i05 6.08s (x2.70)] [i10 6.66s (x2.96)]
w07: [i00 6.05s (x2.69)] [i01 6.40s (x2.84)] [i02 6.82s (x3.03)] [i05 7.40s (x3.29)] [i10 12.75s (x5.67)]
w08: [i00 8.12s (x1.80)] [i01 9.31s (x2.07)] [i02 9.72s (x2.16)] [i05 11.64s (x2.59)] [i10 13.87s (x3.08)]
w09: [i00 26.88s (x5.97)] [i01 29.07s (x6.46)] [i02 27.96s (x6.21)] [i05 31.32s (x6.96)] [i10 33.25s (x7.39)]
w10: [i00 37.62s (x8.36)] [i01 37.84s (x8.41)] [i02 38.16s (x8.48)] [i05 41.08s (x9.13)] [i10 45.01s (x10.00)]
w11: [i00 45.36s (x10.08)] [i01 46.39s (x10.31)] [i02 46.96s (x10.44)] [i05 48.82s (x10.85)] [i10 52.46s (x11.66)]

Each line represents a series of runs with the same number of worker domains doing the same computational work (the Knuth-Bendix benchmark of the compiler testsuite). For example, the w05: line tells us about the runs with 5 worker domains. The runs on the same line correspond to different numbers of "idle domains" -- domains trying to take a lock that will not get released before all worker domains are finished. i00 has no idle domain, i01 has one, then 2, 5, 10.

The "x1.37" number reported corresponds to the ratio between the actual runtime of the run and the "expected time" for the run. The "expected time" is the same time as the 1-worker 0-idle run (the "reference run") for all tests up to 8 worker domains, and twice that time for all runs with strictly more than 8 worker domains (because at least one hardware thread needs to run two domains to completion).

What I see in the benchmarks is a stark performance cliff when we move from 8 worker domains (my recommended_domain_count) to 9 worker domains, as predicted (from 8s to 26s). On the other hand, idle domains appear to have very little cost; even having 10 idle domains is barely noticeable.

My guess is that the Linux thread scheduler does an excellent job with idle domains (really their backup thread), scheduling them often enough that they do not make the STW barrier much longer.

I don't know why the results are so different from @Sudha247's; it could be a mistake in the benchmarking code, or a difference due to the workload of the worker domains, or a difference in the schedulers of our operating systems.

@gasche
Contributor

gasche commented Oct 25, 2022

But, would the waiting domain's minor heap be entirely empty in this case? IIUC, it may not be the case when something else is going on before the waiting domain reached domainslib operations.

My reasoning is that if the minor heap is not too large and minor collection runs often enough (which is not necessarily how your OCaml application is configured), and the task itself takes a long time to compute, then a blocked domain would see one non-empty minor collection and then many empty minor collections, so it would benefit from an empty-minor-heap optimization.

@kayceesrk
Contributor

Thanks for the numbers. They are interesting to read.

@Sudha247 did you run them on the "tuned" machines? If so, these machines change the scheduling policy in order to schedule the programs on isolated cores. This may be the reason for the increased costs. We use chrt -r 1 IIRC. This number looks suspicious:

Cores | No Idle Domain | One Idle Domain | Two Idle Domains
-- | -- | -- | --
1 | 27.712s | 1m39.861s | 12m42.918s

I assume you are running with the rest of the isolated cores (24 in total) available for the idle domain. Hence, it is odd that a 1-core program with 1 idle domain would run so much slower.

@gasche the numbers look good. We haven't clearly measured or optimised for idle domains while working on the multicore GC. Also, Simon Marlow mentioned at ICFP that they had to put in a bunch of work in GHC to address the problem that the program may have more HECs (GHC's equivalent of domains) than available cores. This can happen when, as you have mentioned, Firefox is running alongside the program. There are probably some lessons that we can borrow from GHC on this.


With idle domains, what I am uneasy about is that we've recommended to our users that the number of domains spawned must be equal to the number of available cores. This recommendation has influenced how we think about the runtime system design as well as the design of domainslib. I don't have a specific example in mind here, but I suspect this assumption will have influenced the design. We're proposing to break this recommendation by having idle domains. I'm happy to be proved wrong and told that there is nothing to worry about with idle domains. But it feels odd that we're choosing to take a (possibly only small) runtime overhead just because the API is nicer with idle domains.

@gasche
Contributor

gasche commented Oct 25, 2022

Just to be clear, I don't want to insist on having idle domains, and I think that your idea of having the API that maximizes performance, possibly at the cost of some complication, is sound. (I also wouldn't assume that other OSes have schedulers that are also good with idle domains; the assumption that they don't cost too much may not be portable.) But I was curious to quantify the cost to understand what is going on -- and it's very interesting to see two different benchmarks showing such different results.

@kayceesrk
Contributor

We should confirm that @Sudha247's results are sound.

With our benchmarking machines, we're doing what one might expect a parallel scientific job run to do: isolate cores, allocate 1 domain per isolated core, and use round-robin scheduling. Under the assumption that the cores are not overcommitted, the scheduler does not have to do any work. This method was able to get 5-20% more performance than running on non-isolated cores on an otherwise idle machine. The performance also remains robust.

In the past experiments, I've struggled with getting robust results with overcommitted and non-isolated cores with some domains doing a little work (they're not idle). I should also say that this experiment was not done systematically, and hence, if there is interest, we should do this properly.

@gasche
Contributor

gasche commented Oct 26, 2022

I am thinking that I could post my benchmark sources on Discuss and encourage interested people to run their own experiments. (We would get 50 results of 50 shoddy experiments, instead of one well-designed experiment, but it's better than my one shoddy experiment at least :-) What do you think?

I've struggled with getting robust results with overcommitted and non-isolated cores with some domains doing a little work (they're not idle).

I'm interested in understanding how to write code that does "a little work"; it's not obvious. I have tried replacing my waiting-on-a-lock idle domains with looping-on-cpu-relax idle domains, and the results are extremely similar. I think that what you have in mind is closer to programs that don't max out the CPU (they spend enough time blocked in syscalls), and don't allocate too heavily, but still do a bit of both. I might try to write something like that.

@gasche
Contributor

gasche commented Oct 26, 2022

Here are the results of my benchmarks with an implementation of "idle domains" that still does a bit of work:

w01: [i00 2.41s (x1.04)] [i01 2.79s (x1.20)] [i02 2.81s (x1.21)] [i05 3.32s (x1.43)] [i10 4.03s (x1.74)]
w02: [i00 4.11s (x1.77)] [i01 5.15s (x2.22)] [i02 4.63s (x2.00)] [i05 4.28s (x1.84)] [i10 4.42s (x1.91)]
w03: [i00 3.33s (x1.44)] [i01 3.52s (x1.52)] [i02 4.00s (x1.72)] [i05 5.07s (x2.19)] [i10 5.31s (x2.29)]
w04: [i00 4.17s (x1.80)] [i01 4.93s (x2.12)] [i02 4.55s (x1.96)] [i05 4.92s (x2.12)] [i10 5.92s (x2.55)]
w05: [i00 5.08s (x2.19)] [i01 5.24s (x2.26)] [i02 5.29s (x2.28)] [i05 5.54s (x2.39)] [i10 6.27s (x2.70)]
w06: [i00 5.59s (x2.41)] [i01 5.50s (x2.37)] [i02 5.68s (x2.45)] [i05 6.13s (x2.64)] [i10 7.00s (x3.02)]
w07: [i00 6.94s (x2.99)] [i01 7.82s (x3.37)] [i02 7.61s (x3.28)] [i05 7.54s (x3.25)] [i10 9.40s (x4.05)]
w08: [i00 8.33s (x1.80)] [i01 10.15s (x2.19)] [i02 9.32s (x2.01)] [i05 12.01s (x2.59)] [i10 15.93s (x3.43)]
w09: [i00 28.26s (x6.09)] [i01 27.38s (x5.90)] [i02 26.55s (x5.72)] [i05 26.51s (x5.71)] [i10 28.35s (x6.11)]
w10: [i00 37.87s (x8.16)] [i01 40.20s (x8.66)] [i02 38.57s (x8.31)] [i05 38.73s (x8.35)] [i10 37.36s (x8.05)]
w11: [i00 44.52s (x9.59)] [i01 45.70s (x9.85)] [i02 43.87s (x9.45)] [i05 43.62s (x9.40)] [i10 45.40s (x9.78)]

Idle domain code:

      while Atomic.get idle_run do
        List.init 10 (fun _ -> ())
        |> Sys.opaque_identity
        |> ignore;
        Unix.sleepf 1E-5
      done

The results are worse than with waiting-on-a-lock idle domains -- we typically see around 20% slowdown caused by the presence of a mostly-idle domain -- but still not that bad. In particular having more domains than hardware threads is much less costly when some of those domains are idle than when they are heavy-work domains.

@kayceesrk
Contributor

I am thinking that I could post my benchmark sources on Discuss and encourage interested people to run their own experiments. (We would get 50 results of 50 shoddy experiments, instead of one well-designed experiment, but it's better than my one shoddy experiment at least :-) What do you think?

It may be better to do rigorous benchmarking in a controlled setting than to let the community run their own experiments. At the end of the day, one has to collect the results and interpret them. This would be much easier if the experiments were run in a controlled setting.

Taking a step back, I believe that moving the API forward and doing performance evaluation in parallel may not be ideal. In the other issue (#92), there are several useful improvements over the current API that do not impact performance. I would be interested in getting those merged first before going too deep into the performance results. Moreover, there are known issues with GC scheduling (ocaml/ocaml#11589) which will affect any of the experiments we do with idle domains. I would prefer to wait until at least 11589 is addressed before we spend time on the experiments.

@Sudha247
Contributor

I assume you are running with the rest of the isolated cores (24 in total) available for the idle domain. Hence, it is odd that 1 core program with 1 idle domain would run so much slower.

IIUC the question was about when we were spawning more domains than Domain.recommended_domain_count, so I ended up creating more domains than the cores the benchmark was allowed to use. I don't have the numbers for running them on all available CPUs, but I expect it to be slightly faster with idle domains in that case.

I think the GC-statistics code is currently buggy.

Right, it was just to illustrate the difference in allocation/GC activity. I've run into some of these issues myself - not sure if it's the expected behaviour, but I think every domain used to increment the minor GC count separately (this might be fixed now, or be the expected behaviour; I haven't checked yet).

When you say "idle domains", what does the idle domain actually do? (Is it blocked on a lock, or looping with cpu_relax, or something else?) Do you have your test code somewhere? Also, what OS are you using to run these experiments?

Okay, I'll admit I did the easiest possible thing and created a new pool to act as the extra domain(s). It waits on a condition variable to receive tasks, and I believe it doesn't do much work itself (it only sets up a worker). This was run on an Intel Xeon Gold machine with 28 cores, on Ubuntu 20.04 (5.4.0-125-generic kernel).
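
If that setup is understood correctly, it corresponds roughly to the following (hedged sketch; the pool size is illustrative and num_domains assumes the renamed parameter):

let () =
  (* an extra pool whose workers are spawned but never given any tasks *)
  let idle = Domainslib.Task.setup_pool ~num_domains:2 () in
  (* ... run the actual benchmark on its own, separate pool ... *)
  Domainslib.Task.teardown_pool idle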


Just wanted to clarify some things for reference; overall I agree with KC that spending time on this makes sense after the GC issues are addressed.
