
Compute time doesn't seem to scale well with increasing number of threads past a certain point #92

Open
vlandau opened this issue Mar 9, 2021 · 7 comments
Labels
performance Related to compute and memory efficiency

Comments

@vlandau
Member

vlandau commented Mar 9, 2021

I have noticed scaling issues with Omniscape's multithreading, where once the number of threads gets to be high enough, compute time actually starts to increase. The problem I was using is quite large, and I'm running it on an expensive VM, so below, instead of recording the actually compute times, I'm showing the projected compute time from ProgressMeter.jl after letting the job run for a while until the ETA stabilizes. These benchmarks were run on an Azure VM with 64 logical cores (and 32 physical cores) and 256GB RAM:

I have noticed scaling issues with Omniscape's multithreading: once the number of threads gets high enough, compute time actually starts to increase. The problem I was using is quite large, and I'm running it on an expensive VM, so below, instead of recording the actual compute times, I'm showing the projected compute time from ProgressMeter.jl after letting the job run until the ETA stabilizes. These benchmarks were run on an Azure VM with 64 logical cores (32 physical cores) and 256GB RAM:

  • 63 threads: ~5hr 15m
  • 32 threads: ~3hr 45m

☝🏻 This mostly makes sense, as using only the physical cores can make for more efficient use of the processors.

It gets a bit stranger when switching to a 32-logical-core VM (16 physical cores) with 128GB RAM. Both VMs use Intel Xeon processors, so there shouldn't be any difference in single-thread processor speed. I would expect using 63 logical cores to be faster than using 31, and I'd also expect, based on the above, that on a machine with 32 logical cores and 16 physical cores, using 16 threads would similarly outperform using 31 threads. However, that is not the case: using 16 threads ran about as fast as 31 threads, not faster. The 16-thread job is also not much slower than the 32-thread job above.

  • 31 threads: ~4hr 45m
  • 16 threads: ~4hr 35m
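
For reference, a minimal sketch of how a run like this can be launched at a fixed thread count (`run_omniscape` is Omniscape's exported entry point; the config filename is a placeholder):

```julia
# Launched as e.g. `julia --threads=32 job.jl`, or with
# JULIA_NUM_THREADS=32 set before starting Julia
using Omniscape

@info "solving with $(Threads.nthreads()) threads"
run_omniscape("config.ini")  # "config.ini" is a placeholder filename
```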

This Omniscape run used a moving window size of 668, or about 1.4M pixels per Circuitscape solve, which means that Circuitscape solve time should vastly exceed (>>>) any overhead from parallel processing.
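
(Checking the arithmetic: this matches if the 668 is the radius, in pixels, of a circular moving window, since π · 668² ≈ 1.4M:)

```julia
# Assuming the 668 is the radius of a circular moving window:
radius = 668
pixels_per_solve = π * radius^2  # ≈ 1.40e6, matching the ~1.4M figure above
```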

I'm hoping there may be ways to make Omniscape scale more favorably with an increasing number of threads. Things like continental-scale analyses may not be possible at this time given these numbers. The best solution may involve hierarchical parallel processing (see the sketch below), but maybe there are some simpler steps that could be taken to improve scaling.
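
For concreteness, here is one possible shape hierarchical parallelism could take, purely as a hedged sketch and not anything Omniscape implements today: an outer level of worker processes, each running one moving-window solve at a time with a few BLAS threads of its own. `solve_window` and `window_centers` are hypothetical stand-ins.

```julia
using Distributed
addprocs(8)  # e.g. one worker per group of physical cores

@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(4)  # threads *within* each solve

# `solve_window` is a placeholder for one Circuitscape solve
@everywhere solve_window(center) = sum(abs2, rand(500, 500))  # dummy work

window_centers = 1:64  # placeholder list of moving-window centers

# Outer level: distribute windows across worker processes
results = pmap(solve_window, window_centers)
```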

cc @ViralBShah @ranjanan

@ViralBShah
Member

There are only 16 physical cores, right? So using more than 16 threads will not help for compute-bound workloads. That sort of thing helps for I/O-bound workloads, where the threads are not all busy at the same time.
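
(Side note: if it helps to verify the topology, logical vs. physical core counts can be queried from Julia; `num_physical_cores` comes from the third-party Hwloc.jl package, since Base only reports logical cores:)

```julia
using Hwloc  # third-party package

println("logical cores:  ", Sys.CPU_THREADS)
println("physical cores: ", Hwloc.num_physical_cores())
println("Julia threads:  ", Threads.nthreads())
```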

@vlandau
Member Author

vlandau commented Jun 14, 2021

This was on a 16-core/32-thread machine.

@vlandau
Member Author

vlandau commented Jun 14, 2021

Primarily, I would like to see better performance when increasing from 16 threads on a 16-physical-core machine to 32 threads on a 32-physical-core machine. By doubling the threads (and the total number of physical cores), I was only getting a modest speedup of ~20%.

@ViralBShah
Member

Perhaps memory contention when reading from the same memory locations? Or is more GC happening?
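
One quick way to test the GC hypothesis: wrap a representative chunk of work in `@timed`, which reports GC time alongside wall time. A minimal sketch, with a dummy workload standing in for a real solve:

```julia
# Stand-in workload; replace with one representative Circuitscape solve
workload() = sum(rand(2_000, 2_000) * rand(2_000, 2_000))

stats = @timed workload()
println("GC fraction of runtime: ",
        round(100 * stats.gctime / stats.time; digits=1), "%")
```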

@vlandau
Member Author

vlandau commented Jun 15, 2021

Hmm, that's a good thought about memory being read from the same locations... that will certainly be happening sometimes. I do randomize the solves, which should help in that adjacent regions won't often be getting solved at the same time (so overlapping portions of the inputs are less likely to be read simultaneously). There's also plenty of GC for all of the inputs for each Circuitscape solve.
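
(For readers following along, the randomization described above amounts to something like the following sketch; this is illustrative only, not Omniscape's actual code, and `solve_one_target` is a hypothetical placeholder:)

```julia
using Random

n_targets = 1_000              # placeholder problem size
solve_one_target(t) = nothing  # placeholder for a single solve

# Shuffle the solve order so threads rarely hit adjacent (overlapping)
# regions of the inputs at the same time
targets = shuffle(1:n_targets)

Threads.@threads for t in targets
    solve_one_target(t)
end
```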

@jessjaco
Contributor

I'm seeing this too. A JULIA_NUM_THREADS of 8–12 seems to be the sweet spot for large processes. Larger values (on machines with enough cores, of course) don't appear to give a performance gain, and in some cases reduce performance.
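
A hedged sketch of how one might sweep thread counts empirically to find that sweet spot, where "job.jl" is a placeholder for the real workload script:

```julia
# Time the same job at several thread counts
for n in (4, 6, 8, 12, 16)
    t = @elapsed run(`julia --threads=$n job.jl`)
    println("$n threads: $(round(t; digits=1))s")
end
```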

@jessjaco
Contributor

After further real-world testing, a JULIA_NUM_THREADS of 6 seems to be quicker than 8 on large (multi-day) processes.
