
Fix some issues with dynamic algorithm selection in coll/tuned #8186

Merged
rajachan merged 5 commits into open-mpi:master from devreal:fix-tuned-dynamic on Nov 12, 2020

Conversation

@devreal (Contributor) commented Nov 5, 2020

This PR addresses a potential performance issue with the algorithm selection in coll/tuned and some minor issues found while digging into it:

  1. Performance: some bcast and allreduce algorithms require the number of elements to be at least as large as the number of ranks and fall back to a linear implementation otherwise. This hit me hard with a 4 B bcast on 128 ranks on a single node, where latency increased 10x compared to 127 ranks.
  2. The documentation for the coll_tuned_*_algorithm MCA variables should mention that they only take effect if the coll_tuned_use_dynamic_rules variable is set to true.
  3. Mark the global static array used for the MCA parameters as const (a brief sketch follows this list).
  4. Fix some glitches in comments.
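
Regarding item 3, a minimal sketch of what the const-qualification amounts to (the header path, type name, and entries follow the usual Open MPI MCA enum-registration pattern but are written from memory and should be treated as illustrative assumptions, not as the actual coll/tuned source):

```c
#include "opal/mca/base/mca_base_var_enum.h"  /* assumed location of mca_base_var_enum_value_t */

/* Illustrative sketch only: an algorithm name/value table of the kind
 * registered as an MCA enum parameter.  The table is never modified after
 * registration, so it can be declared const. */
static const mca_base_var_enum_value_t bcast_algorithms[] = {
    {0, "ignore"},
    {1, "basic_linear"},
    {2, "chain"},
    {0, NULL}   /* terminator */
};
```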

The MCA parameters coll_tuned_*_algorithm are ignored unless coll_tuned_use_dynamic_rules is true, so mention that in their descriptions.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
…d fall back to linear

Bcast: scatter_allgather and scatter_allgather_ring expect N_elem >= N_procs
Allreduce: rabenseifner expects N_elem >= pow2 nearest to N_procs

In all of these cases the implementations fall back to a linear variant, which will most likely yield the worst performance (observed for a 4 B bcast on 128 ranks); see the sketch below.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
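
For the conditions described in the commit message above, here is a minimal sketch of the kind of guard the fix implies (the function, enum, and parameter names are hypothetical, not the actual ompi/mca/coll/tuned symbols):

```c
#include <stddef.h>

/* Hypothetical algorithm identifiers, for illustration only. */
enum bcast_alg { BCAST_BINOMIAL, BCAST_SCATTER_ALLGATHER };

/* Sketch of a guarded selection: pick scatter_allgather only when its
 * precondition (N_elem >= N_procs) holds, so the implementation never hits
 * its internal linear fallback. */
static enum bcast_alg choose_bcast_algorithm(size_t n_elems, int comm_size)
{
    if (n_elems >= (size_t)comm_size) {
        /* Large enough: scatter_allgather can run as intended. */
        return BCAST_SCATTER_ALLGATHER;
    }
    /* Small messages (e.g. a 4-byte bcast on 128 ranks): prefer a tree-based
     * algorithm rather than one that silently degrades to linear. */
    return BCAST_BINOMIAL;
}
```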
@wckzhang (Contributor) left a comment

Looks fine. Since most of the tuning data was collected on power-of-2 process counts, we should have considered the non-power-of-2 fallbacks and done slightly finer-grained tuning. This likely applies to other collectives.

@rajachan (Member) commented

Are we going to need this for 4.1.x? It does seem to fix a serious performance regression.

@devreal (Contributor, Author) commented Nov 10, 2020

Yes, I will backport that to 4.1.x later today.

@rajachan (Member) commented Nov 10, 2020

@devreal are these the only collectives you saw regressions with? I saw a similar issue with Allgatherv in the 4.1.x branch. Will redo my tests this afternoon to verify.

msg_size (B) latency (usecs)
1 434.62
2 431.92
4 10256.13
8 10139.07
16 10212.06
32 10087.31
64 9979.71
128 10036.12
256 10285.58
512 10510.48
1024 10599.45
2048 12227.23
4096 15412.86
8192 29796.99
16384 34672.19
32768 44542.43
65536 66748.59
131072 120687.02
262144 221731.62

…lgather

These selections seem harmful in my measurements and don't seem to be
motivated by previous measurement data.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@devreal (Contributor, Author) commented Nov 12, 2020

@rajachan There is indeed a problem with allgatherv: I believe the decisions were generated based on the output of the OSU benchmark, which reports the number of bytes sent by each process. However, the decision logic uses the number of bytes to be received by each process. I'm working on a quick fix based on my measurements. Unfortunately, the v4.1.x backport of this PR has already been merged, so I will create new PRs for master and v4.1.x.
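
(A rough worked example of the mismatch, assuming standard MPI_Allgatherv semantics with equal counts: with 128 ranks and an OSU "message size" of 4 B, each rank sends 4 B but receives roughly 128 × 4 B = 512 B, so a decision rule keyed on received bytes is evaluated at a size about 128 times larger than the message size reported by the benchmark.)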

@rajachan (Member) commented

Great, thanks!

rajachan merged commit 30831fb into open-mpi:master on Nov 12, 2020
@rajachan (Member) commented

I believe decisions were generated based on the output of the OSU benchmark, which reports the number of bytes sent by each process. However, the decision logic uses the number of bytes to be received by each process.

We should address this in the collectives tuning scripts so we don't run into this again the next time we tune the defaults (although it is not clear to me right now how we would account for this). Perhaps an issue against https://github.com/open-mpi/ompi-collectives-tuning/ is in order.

@devreal (Contributor, Author) commented Nov 12, 2020

devreal deleted the fix-tuned-dynamic branch on October 3, 2022, 15:52