Thread scalability is suboptimal #38

Open

ogrisel opened this issue Nov 6, 2018 · 13 comments
Labels
perf Computational performance issue or improvement

Comments

@ogrisel
Owner

ogrisel commented Nov 6, 2018

As reported in #30 (comment), the scalability of pygbm is not as good as LightGBM's.

Here are some results on a machine with the following CPUs:

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz: 2 sockets with 12 cores each, i.e. 48 hyperthreads in total.

1 thread (sequential)

NUMBA_NUM_THREADS=1 OMP_NUM_THREADS=1 python benchmarks/bench_higgs_boson.py  --n-trees 100 --learning-rate 0.1 --n-leaf-nodes 255
Model     Time    AUC     Speed up
LightGBM  1045s   0.8282  1x
pygbm     1129s   0.8192  1x

8 threads

NUMBA_NUM_THREADS=8 OMP_NUM_THREADS=8 python benchmarks/bench_higgs_boson.py  --n-trees 100 --learning-rate 0.1 --n-leaf-nodes 255
Model     Time    AUC     Speed up
LightGBM  160s    0.8282  6.53x
pygbm     356s    0.8193  3.2x

48 (hyper)threads

python benchmarks/bench_higgs_boson.py  --n-trees 100 --learning-rate 0.1 --n-leaf-nodes 255
Model     Time    AUC     Speed up
LightGBM  91s     0.8282  11.5x
pygbm     130s    0.8193  8.7x

All of those pygbm runs used numba 0.40 from anaconda with the tbb threading layer (which is the fastest backend for pygbm).
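
For reference, a minimal sketch of how the threading layer can be selected from Python rather than via environment variables (toy kernel, not pygbm code; it assumes the TBB runtime is installed, e.g. the anaconda tbb package):

```python
import numpy as np
from numba import njit, prange, config, threading_layer

# Request the tbb backend; this must be set before the first parallel
# function is compiled, and assumes the TBB runtime is installed.
config.THREADING_LAYER = "tbb"

@njit(parallel=True)
def parallel_sum(x):
    acc = 0.0
    for i in prange(x.shape[0]):
        acc += x[i]
    return acc

parallel_sum(np.random.rand(10_000_000))
print("threading layer used:", threading_layer())
```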

ogrisel added the perf label on Nov 6, 2018
@ogrisel
Owner Author

ogrisel commented Dec 10, 2018

Recent changes in master parallelize more things and scalability is not as bad as reported anymore. pygbm tends to stay within roughly 1.5x the duration of LightGBM at worst.

@stuartarchibald

@ogrisel would it be worth taking a look with the new parallel diagnostics output http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics to check what is/isn't parallelized?
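
For anyone following along, a minimal sketch of the per-function diagnostics API on a toy kernel (not pygbm's actual code): once a first call has compiled the function, .parallel_diagnostics() prints which loops were parallelized and fused.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sum_of_squares(x):
    # toy kernel standing in for one of pygbm's jitted functions
    acc = 0.0
    for i in prange(x.shape[0]):
        acc += x[i] * x[i]
    return acc

sum_of_squares(np.random.rand(1_000_000))     # first call triggers compilation
sum_of_squares.parallel_diagnostics(level=3)  # levels 1-4, higher is more verbose
```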

@ogrisel
Owner Author

ogrisel commented Dec 13, 2018

Thanks for the feedback @stuartarchibald. This code makes many calls to several jitted functions. Is there a way to get the diagnostic reports for all the functions jitted by numba at the end of the benchmark script?

@stuartarchibald

Would setting the NUMBA_PARALLEL_DIAGNOSTICS environment variable work for that purpose?
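
A hedged sketch of what that could look like on a toy kernel (not pygbm code); the variable needs to be visible before Numba is imported, so either export it on the command line like the other variables in the benchmark, or set it at the very top of the script:

```python
import os
# Diagnostics level 1-4; must be set before Numba is imported (exporting
# NUMBA_PARALLEL_DIAGNOSTICS=4 on the command line works the same way).
os.environ["NUMBA_PARALLEL_DIAGNOSTICS"] = "4"

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def scaled_sum(x, alpha):
    acc = 0.0
    for i in prange(x.shape[0]):
        acc += alpha * x[i]
    return acc

# Diagnostics for every parallel=True function are printed as it compiles.
scaled_sum(np.random.rand(1_000_000), 0.5)
```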

@ogrisel
Owner Author

ogrisel commented Dec 13, 2018

Thanks, this is exactly what I was looking for. Sorry for not reading the doc carefully enough.

@ogrisel
Owner Author

ogrisel commented Dec 13, 2018

Actually, what we really need is to do two runs under a profiler, one with NUMBA_NUM_THREADS=1 and one with NUMBA_NUM_THREADS=8 (for instance). Then, for each numba function in the critical path, compute the speed-up ratio, spot the functions that benefit least from parallel=True, and look at the detailed parallel diagnostics for those.

It's also possible that we have a function in the critical path that is not parallelized at all for some reason.
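
A rough sketch of that per-function comparison on a toy kernel (not pygbm's actual code). Note that numba.set_num_threads only exists in much newer Numba releases; with the Numba 0.40/0.41 used here, each thread count would need its own process driven by NUMBA_NUM_THREADS as in the commands above.

```python
import time
import numpy as np
from numba import njit, prange, set_num_threads

@njit(parallel=True)
def kernel(x):
    acc = 0.0
    for i in prange(x.shape[0]):
        acc += x[i] * x[i]
    return acc

x = np.random.rand(50_000_000)
kernel(x)  # warm up: compile once before timing

def mean_duration(n_threads, n_repeats=5):
    # n_threads must not exceed the number of threads Numba launched at startup
    set_num_threads(n_threads)
    tic = time.perf_counter()
    for _ in range(n_repeats):
        kernel(x)
    return (time.perf_counter() - tic) / n_repeats

t1, t8 = mean_duration(1), mean_duration(8)
print(f"1 thread: {t1:.3f}s, 8 threads: {t8:.3f}s, speed up: {t1 / t8:.1f}x")
```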

@stuartarchibald

I think this is related to numba/numba#3438, as setting the thread count to one is not the same as just switching parallelism off (parallel transformations and scheduling still take place). There are potentially cases where adding more than one thread causes the code to slow down (parallel kernels with negligible per-thread work, but all the overhead of scheduling), and, further, kernels that cost more to schedule and execute on a worker thread than to simply run on the executing thread.
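
To make that concrete, a toy illustration (not pygbm code) of a kernel whose per-thread work is so small that the parallel version can end up slower than the serial one once scheduling overhead is paid:

```python
import time
import numpy as np
from numba import njit, prange

@njit
def tiny_serial(x):
    acc = 0.0
    for i in range(x.shape[0]):
        acc += x[i]
    return acc

@njit(parallel=True)
def tiny_parallel(x):
    acc = 0.0
    for i in prange(x.shape[0]):
        acc += x[i]
    return acc

x = np.random.rand(2_000)  # deliberately tiny: negligible per-thread work
tiny_serial(x)             # compile the serial version
tiny_parallel(x)           # compile the parallel version

def bench(f, n_calls=20_000):
    tic = time.perf_counter()
    for _ in range(n_calls):
        f(x)
    return time.perf_counter() - tic

print("serial  :", bench(tiny_serial))
print("parallel:", bench(tiny_parallel))  # often slower: scheduling cost dominates
```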

@ogrisel
Owner Author

ogrisel commented Dec 13, 2018

I did a quick bench on the current master on a machine with many cores (without profiling for now):

NUMBA_NUM_THREADS=1 OMP_NUM_THREADS=1 python benchmarks/bench_higgs_boson.py --n-trees 100 --n-leaf-nodes 255 --learning-rate=0.5
Model     Time    AUC     Speed up
LightGBM  431s    0.7519  1x
pygbm     460s    0.7522  1x

NUMBA_NUM_THREADS=8 OMP_NUM_THREADS=8 python benchmarks/bench_higgs_boson.py --n-trees 100 --n-leaf-nodes 255 --learning-rate=0.5
Model     Time    AUC     Speed up
LightGBM  83s     0.7519  5.2x
pygbm     146s    0.7536  2.9x

Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz with 2x10 physical cores.

This is not the same machine as before, but the scalability is still suboptimal, so the profiling effort is required to understand where the scalability bottlenecks are.

@Laurae2
Contributor

Laurae2 commented Dec 15, 2018

@ogrisel You can apply for a free Intel VTune license for profiling your code if you do research.

It will be much better than the numba profiler.

@stuartarchibald

@Laurae2 I'm not sure what the "numba profiler" is, please could you clarify?! Numba has a built-in parallel diagnostics tool which tracks the transforms made to its own IR of the Python source as it converts serial code to parallel code, but that's a compile-time diagnostic tool, not a performance profiler.

Further, as of Numba 0.41.0, JIT profiling works with Intel VTune: set the NUMBA_ENABLE_PROFILING environment variable to a non-zero value and Numba will register the LLVM JIT event listener for Intel VTune.

@Laurae2
Contributor

Laurae2 commented Dec 31, 2018

@stuartarchibald You can use the numba profiler here: https://github.com/numba/data_profiler (in reality it just adds the signatures). It incurs an overhead penalty.

Still better to use Intel VTune for real profiling though (way more details and easier to pinpoint the issues).

@stuartarchibald

@Laurae2 Ah, so that's what you are referring to, thanks. Yes, indeed, they have different purposes...

@Laurae2
Contributor

Laurae2 commented Jan 3, 2019

@ogrisel Note that LightGBM's thread scalability depends on the number of columns. The Higgs dataset does not have enough columns to keep 48 threads busy, so it underestimates the scalability (giving you a lower scaling target than a wider dataset would).
