Fix some issues with dynamic algorithm selection in coll/tuned #8186
The MCA parameters coll_tuned_*_algorithm are ignored unless coll_tuned_use_dynamic_rules is true, so mention that in the description.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
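The dependency described in this commit can be illustrated with a small helper that sets both variables together. This is a minimal sketch, not code from the PR: it assumes the usual Open MPI convention that MCA variables can be supplied through environment variables prefixed with `OMPI_MCA_`, and the function name `force_tuned_algorithm` is hypothetical. The algorithm id is a placeholder; the valid ids for each collective should be taken from the variable descriptions printed by `ompi_info`.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: request a specific coll/tuned algorithm for one
 * collective (e.g. "bcast") via environment variables.
 * Assumption: Open MPI reads MCA variables from OMPI_MCA_<name>.
 * Returns 0 on success, -1 on failure.
 */
int force_tuned_algorithm(const char *collective, int algorithm_id)
{
    assert(collective != NULL);

    char name[128];
    char value[16];

    int n = snprintf(name, sizeof name,
                     "OMPI_MCA_coll_tuned_%s_algorithm", collective);
    if (n < 0 || (size_t)n >= sizeof name) {
        return -1; /* collective name too long for the buffer */
    }
    snprintf(value, sizeof value, "%d", algorithm_id);

    /* Without this switch the per-collective setting is ignored. */
    if (setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 1) != 0) {
        return -1;
    }
    return setenv(name, value, 1);
}
```

Such a helper would have to run before `MPI_Init`, since the variables are read during initialization. Setting only the per-collective variable without the dynamic-rules switch is exactly the case the updated descriptions warn about.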
…d fall back to linear
Bcast: scatter_allgather and scatter_allgather_ring expect N_elem >= N_procs
Allreduce: rabenseifner expects N_elem >= pow2 nearest to N_procs
In all cases, the implementations will fall back to a linear implementation, which will most likely yield the worst performance (noted for 4B bcast on 128 ranks).
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
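The element-count conditions from this commit message can be written down as simple predicates. The following is a sketch under stated assumptions, not the code in coll/tuned: the function names are hypothetical, and "pow2 nearest to N_procs" is interpreted here as the largest power of two not exceeding the number of processes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest power of two <= n, for n > 0.
 * Assumed reading of "pow2 nearest to N_procs". */
int nearest_pow2_below(int n)
{
    assert(n > 0);
    int p = 1;
    while (p <= n / 2) {
        p *= 2;
    }
    return p;
}

/* Bcast scatter_allgather / scatter_allgather_ring: N_elem >= N_procs.
 * If this is false, the implementation falls back to linear. */
bool bcast_scatter_allgather_usable(size_t count, int comm_size)
{
    assert(comm_size > 0);
    return count >= (size_t)comm_size;
}

/* Allreduce rabenseifner: N_elem >= pow2 nearest to N_procs.
 * If this is false, the implementation falls back to linear. */
bool allreduce_rabenseifner_usable(size_t count, int comm_size)
{
    return count >= (size_t)nearest_pow2_below(comm_size);
}
```

For the case mentioned above, a 4-byte broadcast on 128 ranks has at most 4 elements, so `bcast_scatter_allgather_usable(4, 128)` is false and a decision rule selecting scatter_allgather would end up in the linear fallback.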
Looks fine. Since most of the tuning data was collected on power-of-2 process counts, we should have considered the non-power-of-2 fallbacks and done slightly finer-grained tuning. This likely applies to other collectives.
Are we going to need this for 4.1.x? It does seem to fix a serious performance regression.
Yes, I will backport that to 4.1.x later today.
@devreal are these the only collectives you saw regressions with? I saw a similar issue with Allgatherv in the 4.1.x branch. Will redo my tests this afternoon to verify.
…lgather
These selections seem harmful in my measurements and don't seem to be motivated by previous measurement data.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@rajachan There is indeed a problem with allgatherv: I believe the decisions were generated based on the output of the OSU benchmark, which reports the number of bytes sent by each process. However, the decision logic uses the number of bytes to be received by each process. I'm working on a quick fix based on my measurements. Unfortunately, the v4.1.x backport of this PR is already merged, so I will create new PRs for master and v4.1.x.
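The mismatch between the two size conventions can be made concrete with a short sketch. It assumes that "bytes to be received by each process" means the sum of all ranks' contributions, and that the benchmark uses the same send size on every rank; both function names are hypothetical and not part of Open MPI or the OSU benchmarks.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes one rank receives in an allgatherv: the sum of every
 * rank's contribution (assumed to be what the decision logic sees). */
size_t allgatherv_total_recv_bytes(const int *recvcounts, int comm_size,
                                   size_t type_size)
{
    assert(recvcounts != NULL && comm_size > 0);
    size_t total = 0;
    for (int i = 0; i < comm_size; i++) {
        total += (size_t)recvcounts[i] * type_size;
    }
    return total;
}

/* Uniform case: convert a per-process send size, as reported by a
 * benchmark, into the corresponding total receive size. */
size_t uniform_total_recv_bytes(size_t per_rank_send_bytes, int comm_size)
{
    assert(comm_size > 0);
    return per_rank_send_bytes * (size_t)comm_size;
}
```

Under these assumptions, a threshold derived from a 1024-byte benchmark message on 128 ranks corresponds to 131072 received bytes, so applying the benchmark number directly as a decision threshold would be off by a factor equal to the communicator size.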
Great, thanks!
We should address this in the collectives tuning scripts so we don't run into this again the next time we tune the defaults (although it is not clear to me right now how we would account for this). Perhaps an issue against https://github.com/open-mpi/ompi-collectives-tuning/ is in order.
This PR addresses a potential performance issue with the algorithm selection in coll/tuned and some minor issues found while digging into it:
The coll_tuned_*_algorithm MCA variables should mention that they only take effect if the coll_tuned_use_dynamic_rules variable is set to true.
const
.