
Compute time doesn't seem to scale well with increasing number of threads past a certain point #92

Open
vlandau opened this issue Mar 9, 2021 · 7 comments
Labels
performance Related to compute and memory efficiency

Comments

@vlandau
Member

vlandau commented Mar 9, 2021

I have noticed scaling issues with Omniscape's multithreading, where once the number of threads gets to be high enough, compute time actually starts to increase. The problem I was using is quite large, and I'm running it on an expensive VM, so below, instead of recording the actually compute times, I'm showing the projected compute time from ProgressMeter.jl after letting the job run for a while until the ETA stabilizes. These benchmarks were run on an Azure VM with 64 logical cores (and 32 physical cores) and 256GB RAM:

I have noticed scaling issues with Omniscape's multithreading: once the number of threads gets high enough, compute time actually starts to increase. The problem I was using is quite large, and I'm running it on an expensive VM, so below, instead of recording the actual compute times, I'm showing the projected compute time from ProgressMeter.jl after letting the job run until the ETA stabilizes. These benchmarks were run on an Azure VM with 64 logical cores (32 physical cores) and 256GB RAM:

  • 63 threads: ~5hr 15m
  • 32 threads: ~3hr 45m

☝🏻 This mostly makes sense, as using only the physical cores can make for more efficient use of the processors.

It gets a bit stranger when switching to a 32-logical-core VM (16 physical cores) with 128GB RAM. Both VMs use Intel Xeon processors, so there shouldn't be any difference in single-thread processor speed. I would expect using 63 logical cores to be faster than using 31, and I'd also expect, based on the above, that on a machine with 32 logical cores and 16 physical cores, using 16 threads would similarly outperform using 31 threads. However, that is not the case: using 16 threads ran about as fast as 31 threads, not faster. The 16-thread job is also not much slower than the 32-thread job above.

  • 31 threads: ~4hr 45m
  • 16 threads: ~4hr 35m
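
For reference, a minimal sketch of how a run like this can be launched at a fixed thread count (`run_omniscape` is Omniscape's exported entry point; the config filename is a placeholder):

```julia
# Launched as e.g. `julia --threads=32 job.jl`, or with
# JULIA_NUM_THREADS=32 set before starting Julia
using Omniscape

@info "solving with $(Threads.nthreads()) threads"
run_omniscape("config.ini")  # "config.ini" is a placeholder filename
```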

This Omniscape run used a moving window size of 668, or about 1.4M pixels per Circuitscape solve, which means that Circuitscape solve time should vastly exceed (>>>) any overhead from parallel processing.
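
(Checking the arithmetic: this matches if the 668 is the radius, in pixels, of a circular moving window, since π · 668² ≈ 1.4M:)

```julia
# Assuming the 668 is the radius of a circular moving window:
radius = 668
pixels_per_solve = π * radius^2  # ≈ 1.40e6, matching the ~1.4M figure above
```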

I'm hoping there may be ways to make Omniscape scale more favorably with an increasing number of threads. Things like continental-scale analyses may not be possible at this time given these numbers. The best solution may involve hierarchical parallel processing (see the sketch below), but maybe there are some simpler steps that could be taken to improve scaling.
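
For concreteness, here is one possible shape hierarchical parallelism could take, purely as a hedged sketch and not anything Omniscape implements today: an outer level of worker processes, each running one moving-window solve at a time with a few BLAS threads of its own. `solve_window` and `window_centers` are hypothetical stand-ins.

```julia
using Distributed
addprocs(8)  # e.g. one worker per group of physical cores

@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(4)  # threads *within* each solve

# `solve_window` is a placeholder for one Circuitscape solve
@everywhere solve_window(center) = sum(abs2, rand(500, 500))  # dummy work

window_centers = 1:64  # placeholder list of moving-window centers

# Outer level: distribute windows across worker processes
results = pmap(solve_window, window_centers)
```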

cc @ViralBShah @ranjanan

@ViralBShah
Member

There are only 16 physical cores, right? So using more than 16 threads will not help for compute-bound workloads. That sort of thing helps for I/O-bound workloads, where the threads are not all busy at the same time.
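
(Side note: if it helps to verify the topology, logical vs. physical core counts can be queried from Julia; `num_physical_cores` comes from the third-party Hwloc.jl package, since Base only reports logical cores:)

```julia
using Hwloc  # third-party package

println("logical cores:  ", Sys.CPU_THREADS)
println("physical cores: ", Hwloc.num_physical_cores())
println("Julia threads:  ", Threads.nthreads())
```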

@vlandau
Member Author

vlandau commented Jun 14, 2021

This was on a 16-core/32-thread machine.

@vlandau
Member Author

vlandau commented Jun 14, 2021

Primarily, I would like to see better performance when increasing from 16 threads on a 16-physical-core machine to 32 threads on a 32-physical-core machine. By doubling the threads (and the total number of physical cores), I was only getting a modest speedup of ~20%.

@ViralBShah
Member

Perhaps memory contention when reading from the same memory locations? Or is more GC happening?
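
One quick way to test the GC hypothesis: wrap a representative chunk of work in `@timed`, which reports GC time alongside wall time. A minimal sketch, with a dummy workload standing in for a real solve:

```julia
# Stand-in workload; replace with one representative Circuitscape solve
workload() = sum(rand(2_000, 2_000) * rand(2_000, 2_000))

stats = @timed workload()
println("GC fraction of runtime: ",
        round(100 * stats.gctime / stats.time; digits=1), "%")
```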

@vlandau
Member Author

vlandau commented Jun 15, 2021

Hmm, that's a good thought about memory being read from the same locations... that will certainly be happening sometimes. I do randomize the solves, which should help in that adjacent regions won't often be getting solved at the same time (so overlapping portions of the inputs are less likely to be read simultaneously). There's also plenty of GC for all of the inputs for each Circuitscape solve.
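
(For readers following along, the randomization described above amounts to something like the following sketch; this is illustrative only, not Omniscape's actual code, and `solve_one_target` is a hypothetical placeholder:)

```julia
using Random

n_targets = 1_000              # placeholder problem size
solve_one_target(t) = nothing  # placeholder for a single solve

# Shuffle the solve order so threads rarely hit adjacent (overlapping)
# regions of the inputs at the same time
targets = shuffle(1:n_targets)

Threads.@threads for t in targets
    solve_one_target(t)
end
```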

@jessjaco
Contributor

I'm seeing this too. A JULIA_NUM_THREADS of 8–12 seems to be the sweet spot for large processes. Larger values (on machines with enough cores, of course) don't appear to give a performance gain, and in some cases reduce performance.
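
A hedged sketch of how one might sweep thread counts empirically to find that sweet spot, where "job.jl" is a placeholder for the real workload script:

```julia
# Time the same job at several thread counts
for n in (4, 6, 8, 12, 16)
    t = @elapsed run(`julia --threads=$n job.jl`)
    println("$n threads: $(round(t; digits=1))s")
end
```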

@jessjaco
Contributor

After further real-world testing, a JULIA_NUM_THREADS of 6 seems to be quicker than 8 on large (multi-day) processes.
