
Make async runtime scale better on SMT machines #850

Open · Minoru opened this issue May 30, 2021 · 15 comments

@Minoru (Collaborator) commented May 30, 2021

#844 added a new async runtime, but on SMT machines (e.g. with Intel's Hyper-Threading), it doesn't scale well past the number of physical cores. Details are in #844 (comment), and there are some ideas further down the thread.

For now, the workaround is to pass +RTS -Nx (with x set to the number of physical cores) to limit the number of threads.
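
For reference, the same cap can also be set programmatically in a site binary compiled with -threaded. A minimal sketch, assuming two hardware threads per physical core; the halving heuristic is an assumption, not a detected topology:

```haskell
-- Cap capabilities at startup instead of passing +RTS -Nx by hand.
-- getNumProcessors reports *logical* cores, so on a 2-way SMT machine
-- halving it approximates the physical core count.
import GHC.Conc (getNumProcessors, setNumCapabilities)

main :: IO ()
main = do
  logical <- getNumProcessors
  setNumCapabilities (max 1 (logical `div` 2))
  -- ... then hand control to the Hakyll site as usual
```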

@fishtreesugar commented

According to the benchmark from scheduler, unliftio's pooledMapConcurrently performs quite a bit better than async's mapConcurrently; maybe we could give it a try.
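
For reference, a minimal sketch of what the suggested swap could look like; compileItem is a hypothetical stand-in for Hakyll's per-item work, and the pool size of 4 is arbitrary:

```haskell
import UnliftIO.Async (pooledMapConcurrentlyN)

-- Hypothetical per-item work standing in for Hakyll's compilation step.
compileItem :: FilePath -> IO Int
compileItem = pure . length

-- pooledMapConcurrentlyN bounds the worker pool, whereas async's
-- mapConcurrently forks one thread per element.
compileAll :: [FilePath] -> IO [Int]
compileAll = pooledMapConcurrentlyN 4 compileItem

main :: IO ()
main = compileAll ["a.md", "posts/b.md"] >>= print
```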

@Minoru (Collaborator, Author) commented May 31, 2021

Thanks for the pointer! That benchmark focuses on speed, not scalability, so I'm sceptical that it'll make any difference. I don't have the energy to invest in this right now, but if you do, please try it and report back the results!

@vaibhavsagar (Contributor) commented

I investigated using pooledMapConcurrently with this patch on top of #863 and it didn't seem to improve the SMT scaling (my laptop uses an Intel i7-8550U with 4 cores and 8 threads):

[nix-shell:~/code/website]$ hyperfine --parameter-scan threads 1 10 --prepare './result/bin/site clean' './result/bin/site build +RTS -N{threads}' 
Benchmark #1: ./result/bin/site build +RTS -N1
  Time (mean ± σ):      3.409 s ±  0.074 s    [User: 3.279 s, System: 0.138 s]
  Range (min … max):    3.295 s …  3.525 s    10 runs
 
Benchmark #2: ./result/bin/site build +RTS -N2
  Time (mean ± σ):      2.196 s ±  0.053 s    [User: 3.799 s, System: 0.362 s]
  Range (min … max):    2.113 s …  2.265 s    10 runs
 
Benchmark #3: ./result/bin/site build +RTS -N3
  Time (mean ± σ):      1.886 s ±  0.060 s    [User: 4.571 s, System: 0.583 s]
  Range (min … max):    1.790 s …  1.963 s    10 runs
 
Benchmark #4: ./result/bin/site build +RTS -N4
  Time (mean ± σ):      1.885 s ±  0.049 s    [User: 5.487 s, System: 0.897 s]
  Range (min … max):    1.833 s …  1.976 s    10 runs
 
Benchmark #5: ./result/bin/site build +RTS -N5
  Time (mean ± σ):      2.164 s ±  0.098 s    [User: 7.585 s, System: 1.639 s]
  Range (min … max):    2.014 s …  2.294 s    10 runs
 
Benchmark #6: ./result/bin/site build +RTS -N6
  Time (mean ± σ):      2.348 s ±  0.096 s    [User: 9.346 s, System: 2.417 s]
  Range (min … max):    2.174 s …  2.506 s    10 runs
 
Benchmark #7: ./result/bin/site build +RTS -N7
  Time (mean ± σ):      2.487 s ±  0.047 s    [User: 11.058 s, System: 3.255 s]
  Range (min … max):    2.414 s …  2.536 s    10 runs
 
Benchmark #8: ./result/bin/site build +RTS -N8
  Time (mean ± σ):      2.746 s ±  0.132 s    [User: 13.610 s, System: 4.138 s]
  Range (min … max):    2.565 s …  3.064 s    10 runs
 
Benchmark #9: ./result/bin/site build +RTS -N9
  Time (mean ± σ):      3.251 s ±  0.138 s    [User: 16.528 s, System: 4.896 s]
  Range (min … max):    3.097 s …  3.506 s    10 runs
 
Benchmark #10: ./result/bin/site build +RTS -N10
  Time (mean ± σ):      3.668 s ±  0.240 s    [User: 19.285 s, System: 5.075 s]
  Range (min … max):    3.385 s …  4.166 s    10 runs
 
Summary
  './result/bin/site build +RTS -N4' ran
    1.00 ± 0.04 times faster than './result/bin/site build +RTS -N3'
    1.15 ± 0.06 times faster than './result/bin/site build +RTS -N5'
    1.17 ± 0.04 times faster than './result/bin/site build +RTS -N2'
    1.25 ± 0.06 times faster than './result/bin/site build +RTS -N6'
    1.32 ± 0.04 times faster than './result/bin/site build +RTS -N7'
    1.46 ± 0.08 times faster than './result/bin/site build +RTS -N8'
    1.73 ± 0.09 times faster than './result/bin/site build +RTS -N9'
    1.81 ± 0.06 times faster than './result/bin/site build +RTS -N1'
    1.95 ± 0.14 times faster than './result/bin/site build +RTS -N10'

@frasertweedale (Contributor) commented

I am also experiencing this scaling issue. The increased userland CPU time when using more capabilities is strange: I would expect small (certainly sublinear) increases in CPU time for additional capabilities. Instead, per @vaibhavsagar's benchmark above, CPU time grows superlinearly (user time rises from ~3.3 s at -N1 to ~19.3 s at -N10, almost a 6× increase) and quickly overwhelms the advantage gained by parallel execution.

Profiling didn't reveal anything interesting; the profiles look overwhelmingly similar across different numbers of capabilities, apart from total time.

I'm beginning to wonder if this issue might be in the GHC RTS, rather than Hakyll...

@vaibhavsagar (Contributor) commented

@frasertweedale I did some investigation with ThreadScope afterwards that wasn't especially insightful (which is why I didn't mention it here), but it did show that some of the overhead was GC-related. When I minimised garbage collection using some of the suggestions here, the observed performance did seem to scale better. I'm a relative novice when it comes to parallel Haskell, so it's entirely possible that there's something simple I'm missing.

@frasertweedale (Contributor) commented

@vaibhavsagar thanks for the additional info. It is always helpful to mention the dead ends in the investigation. That way, people will know it has been done, and won't waste their time doing the same thing :)

@frasertweedale (Contributor) commented

When using multiple capabilities, on GHC 8.8 I get the best results with +RTS -N -qg1, which disables parallel GC for the first generation. On my site this achieves ~70% productivity, compared to ~40% for the default (parallel GC for all generations).
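
To check whether -qg1 helps on a given site, one could reuse the hyperfine setup from the benchmark above (paths taken from that run; adjust as needed):

```
hyperfine --prepare './result/bin/site clean' \
    './result/bin/site build +RTS -N4' \
    './result/bin/site build +RTS -N4 -qg1'
```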

There must be something about Hakyll's design that makes parallel GC particularly inefficient. When actually using multiple capabilities there was an improvement in the wall-clock time spent GCing the second generation, although productivity still decreased considerably. For the first generation, parallel GC performance is quite terrible.

I'd be interested to see how GHC 8.10+'s --nonmoving-gc RTS option performs, but it cannot be used for the first generation.

I'm suspending my investigation at this point. Single-threaded performance is good enough for me and even with -qg1 I gain little advantage from using multiple capabilities. I've only done these measurements on my Hakyll blog site. YMMV.

@gwern (Contributor) commented May 3, 2022

FWIW, I ran into severe performance problems apparently related to these changes when I recently upgraded Hakyll after a while. My writeup: https://groups.google.com/g/hakyll/c/5_evK9wCb7M/m/3oQYlX9PAAAJ

@jaspervdj (Owner) commented

I would like to look into this during ZuriHac 2022; I'm not sure if I'll have time before that. My current suspicion is that the combination of an MVar (I'm pretty sure an IORef + strict atomicModifyIORef would be enough) and the big maps stored in this value is contributing to this, but I haven't tested anything out yet.
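
For concreteness, a rough sketch of that swap; the SiteState type and function names are assumptions for illustration, not Hakyll's actual internals:

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef)
import qualified Data.Map.Strict as Map

-- Stand-in for the big map guarded by the MVar today.
type SiteState = Map.Map String Int

newSiteState :: IO (IORef SiteState)
newSiteState = newIORef Map.empty

-- A single strict atomic modification; unlike takeMVar/putMVar, this
-- never blocks other capabilities, and the strict variant forces the
-- new map eagerly, avoiding a thunk build-up under contention.
recordResult :: IORef SiteState -> String -> Int -> IO ()
recordResult ref key val =
  atomicModifyIORef' ref (\m -> (Map.insert key val m, ()))
```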

@Minoru (Collaborator, Author) commented May 20, 2022

@jaspervdj #903 is much more pressing, if you're in the mood to dig into hard issues :) Sadly, I didn't have enough energy to do that myself, even though I promised. It looks like the Store is not suitable for multithreaded use (which is not entirely surprising) and causes us some random problems. I've been dragging my feet, but it looks like the async runtime has to be yanked altogether while we look for a fix to the Store and investigate Gwern's report.

@jaspervdj (Owner) commented

Yeah, I wonder if we should just roll back the concurrent runtime for now given these issues. Is the slight speedup for some sites worth the overhead for others? I’m not sure.

A concurrent runtime still seems doable and worthwhile, and I think we can get it with minimal overhead, but it requires a bit more investigation to update or remove some existing abstractions like Store.

@jaspervdj (Owner) commented

I have an implementation in https://github.com/jaspervdj/hakyll/tree/async-scheduler which is a bit rough but should generally work and allow us to scale much better. A few things like error handling and checking for cyclic deps still need to be improved though.

@vaibhavsagar (Contributor) commented

Does #946 resolve this issue?

@Minoru (Collaborator, Author) commented Aug 26, 2023

@vaibhavsagar Not really, see the benchmark results here: #946 (review)

@gwern (Contributor) commented May 8, 2024

> FWIW, I ran into severe performance problems

Update: I've been using a fork all this time, as mentioned, and so haven't seen any effect of the new scheduler. My Threadripper workstation has died, so I can no longer test high core counts. I've been restarting on an Ubuntu 24 laptop with just 8 virtual cores (4 real, IIRC), and running with 5-7 threads has not shown any major issues with the 4.14.0.0 HEAD (GHC 9.4.7).
