coll_tuned_dynamic_rules_filename option #8157

Open
miharulidze opened this issue Oct 30, 2020 · 21 comments
Comments

@miharulidze

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Distribution tarball from open-mpi.org.
./configure --prefix=$(pwd)/build --with-ucx=/path-to-ucx-installation/ --enable-orterun-prefix-by-default

Please describe the system on which you are running

  • Operating system/version: CentOS 7.2
  • Computer hardware: x86
  • Network type: Mellanox CX-4 HCAs

Details of the problem

Dear OpenMPI developers,

I'm trying to tune the selection of collective algorithms for the tuned component.
There seem to be two ways:

  1. Through command-line options such as --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_<COLL NAME>_algorithm <ALGORITHM ID>. This method works fine, and I see a big difference between algorithms when running the OSU benchmarks. However, it does not allow fine-grained tuning, such as specifying a communicator size or message-size thresholds.
  2. Through a rules file, using the options --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_dynamic_rules_filename <PATH TO ALGORITHM RULES>. This paper (actually, the only example of a rules file I found via Google) shows Alltoall algorithm selection tuned for different use cases. I also ran some experiments with Alltoall, but they showed no difference at all between several algorithms/thresholds.

Here are my questions:

  1. Is this feature still supported in the tuned component?
  2. Are there any restrictions on its use? For example, does it work only with a subset of the collectives implemented in the tuned component?
  3. Are there any examples of rules files for other operations?

I'll be grateful for any help.
Thank you in advance!

@bosilca
Member

bosilca commented Oct 30, 2020

Yes, both of these features are fully supported. Let me first talk about the second one, the configuration file passed through the coll_tuned_dynamic_rules_filename MCA parameter. The format is described in the paper you pointed to, and there are many examples on our mailing list. It supports all collectives provided by tuned, as long as tuned is the module selected for a particular collective (you can enforce this by setting the tuned priority to 100). For the sake of completeness, here is another example that fiddles with the MPI_Alltoall and MPI_Allreduce collectives:

2 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
  # 0: ignore, 1: basic linear, 2: pairwise, 3: modified bruck,
  # 4: linear with sync, 5: two proc only
1 # number of com sizes
64 # comm size 64
2 # number of msg sizes
1024 3 0 0 # message size 1024 and up: modified bruck (3), topo 0, no segmentation
8192 2 0 0 # message size 8192 and up: pairwise (2), topo 0, no segmentation
# end of first collective

2  # ID = 2 Allreduce collective (ID in coll_tuned.h)
1  # number of com sizes
8  # comm size 8
2  # number of msg sizes
0 1 0 0  # message size 0 and up: basic linear (1), topo 0, no segmentation
1024 2 0 0 # message size 1024 and up: nonoverlapping (2), topo 0, no segmentation
# end of collective rule

The first approach, via coll_tuned_<COLL NAME>_algorithm, offers less flexibility to select an algorithm precisely for ranges of process counts or message sizes, but it can be changed dynamically during execution by setting the corresponding MCA parameter right before creating a new communicator. As a result, you can finely tailor the algorithm behind each collective on the communicators that matter most for your application.

@miharulidze
Author

@bosilca Thank you very much for fast reply! Your examples are very useful.

As far as I understand, collective operation IDs are not defined explicitly in coll_tuned.h, but they match the order in which the collectives are listed in that file, e.g. Allgather has ID 0, Barrier has ID 6, etc.

Also, in your example with Alltoall operation:

1 # number of com sizes
64 # comm size 8

Is this a mistake? Should the comment read # comm size 64, or am I missing something?
The same goes for the Allreduce example:

1  # number of com sizes
1  # comm size 2

@bosilca
Member

bosilca commented Oct 30, 2020

You're right, we moved the collective IDs to base/coll_base_functions.h.

As for the comments, they might indeed not be very accurate: I played with the files and forgot to update the comments. I'll fix them in my answer.

@mkurnosov
Contributor

@miharulidze all algorithms must specify a rule for message size of zero (https://github.com/open-mpi/ompi/blob/master/ompi/mca/coll/tuned/coll_tuned_dynamic_file.c#L200). Otherwise, coll/tuned will switch to fixed rules.
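
The zero-size requirement is easy to check before launching a job. The sketch below is a simplified Python model of the file layout discussed in this thread, not Open MPI's actual parser. It assumes that everything after a # is a comment and that the remaining tokens are whitespace-separated integers; the names tokenize and parse_rules are made up for illustration.

```python
def tokenize(text):
    """Drop '#' comments and return the remaining integers in file order."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())
    return [int(t) for t in tokens]


def parse_rules(text):
    """Parse a rules file into {coll_id: {comm_size: [(msg, alg, topo, segm), ...]}}.

    Raises ValueError when the first message-size rule of a communicator
    block does not start at 0 (the condition described in the comment above).
    """
    it = iter(tokenize(text))
    rules = {}
    for _ in range(next(it)):                 # number of collectives
        coll_id = next(it)
        comms = rules.setdefault(coll_id, {})
        for _ in range(next(it)):             # number of communicator sizes
            comm_size = next(it)
            n_msg = next(it)                  # number of message-size rules
            msg_rules = [(next(it), next(it), next(it), next(it))
                         for _ in range(n_msg)]
            if not msg_rules or msg_rules[0][0] != 0:
                raise ValueError(
                    f"collective {coll_id}, comm size {comm_size}: "
                    "first message-size rule must start at 0")
            comms[comm_size] = msg_rules
    return rules
```

Calling parse_rules(open(path).read()) on a candidate file (path being your own rules file) either returns the nested dictionary or raises ValueError naming the offending block. A file that passes this check can still be rejected by Open MPI for other reasons.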

@miharulidze
Author

@bosilca, @mkurnosov Thank you for your support!

Maybe it would be a good idea to add some sort of generic template for such a rules file to the documentation?
Something like this:

#######################################################################################
############################## RULES FILE TEMPLATE ####################################
#######################################################################################

COLLS_NUM             # num of collectives for which rules are specified

#######################################################################################
############################## FIRST COLLECTIVE RULES #################################
#######################################################################################

# Start first collective rules

COLL_ID_1               # First collective operation ID
COMMS_NUM               # number of communicator sizes for which you want to define rules

############################## FIRST COMMUNICATOR #####################################

COMM_SIZE_1             # Size of first communicator
MSG_SIZES_NUM           # How many thresholds do you want to specify for COMM_SIZE_1?
                        # Should be at least 1 (for msg_size >= 0),
                        # otherwise the rules file will be ignored

# Thresholds for COMM_SIZE_1

0 ALG_NUM TOPO SEGM     # Use ALG_NUM for msg_size >= 0
M ALG_NUM TOPO SEGM     # Use ALG_NUM for msg_size >= M
N ALG_NUM TOPO SEGM     # Use ALG_NUM for msg_size >= N

# End of first communicator

############################## SECOND COMMUNICATOR #####################################

COMM_SIZE_2             # Size of second communicator
MSG_SIZES_NUM           # How many thresholds do you want to specify for COMM_SIZE_2?
                        # Should be at least 1 (for msg_size >= 0),
                        # otherwise the rules file will be ignored

# Thresholds for COMM_SIZE_2

0 ALG_NUM TOPO SEGM    # Use ALG_NUM for msg_size >= 0
M ALG_NUM TOPO SEGM    # Use ALG_NUM for msg_size >= M
N ALG_NUM TOPO SEGM    # Use ALG_NUM for msg_size >= N

# End of second communicator

############################## Nth COMMUNICATOR #######################################

COMM_SIZE_N            # Size of last (COMMS_NUMth) communicator
MSG_SIZES_NUM          # How many thresholds do you want to specify for COMM_SIZE_N?
                       # Should be at least 1 (for msg_size >= 0),
                       # otherwise the rules file will be ignored

# Thresholds for COMM_SIZE_N

0 ALG_NUM TOPO SEGM    # Use ALG_NUM for msg_size >= 0
M ALG_NUM TOPO SEGM    # Use ALG_NUM for msg_size >= M
N ALG_NUM TOPO SEGM    # Use ALG_NUM for msg_size >= N

# End of Nth communicator
# End of COLL_ID_1

#######################################################################################
############################## Nth COLLECTIVE RULES ###################################
#######################################################################################

# Define rules for next collective operation
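
To make the "msg_size >= M" comments in the template concrete, here is a small Python model of how a rule might be picked once the file is loaded. It is a sketch under assumptions, not the coll/tuned lookup code: it assumes the communicator block with the largest COMM_SIZE not exceeding the actual communicator size is used, and, within it, the rule with the largest threshold not exceeding the message size. What Open MPI does when the communicator is smaller than every listed COMM_SIZE is not covered in this thread; the sketch returns None in that case. The function name select_rule and the nested-dict layout are hypothetical.

```python
def select_rule(rules, coll_id, comm_size, msg_size):
    """Return (alg, topo, segm) for a call, or None if no rule applies.

    rules: {coll_id: {comm_size: [(threshold, alg, topo, segm), ...]}}
    """
    comms = rules.get(coll_id)
    if not comms:
        return None                       # no rules for this collective
    eligible = [size for size in comms if size <= comm_size]
    if not eligible:
        return None                       # assumption: nothing matches
    best = None
    # Walk thresholds in ascending order; the last one that fits wins.
    for threshold, alg, topo, segm in sorted(comms[max(eligible)]):
        if threshold <= msg_size:
            best = (alg, topo, segm)
    return best


# Example: one Bcast (COLL_ID 7) block for COMM_SIZE 16 with two thresholds.
example = {7: {16: [(0, 6, 0, 0), (65536, 3, 0, 8192)]}}
small = select_rule(example, 7, comm_size=32, msg_size=1024)     # (6, 0, 0)
large = select_rule(example, 7, comm_size=32, msg_size=131072)   # (3, 0, 8192)
```

Here a 32-rank communicator falls back to the COMM_SIZE 16 block, and the message size then decides between the two listed algorithms.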

@mkurnosov
Contributor

mkurnosov commented Nov 3, 2020

@miharulidze @bosilca I suggest adding a list of the algorithms.

# List of collective communication algorithms (coll/tuned only): 
#   COMM_SIZE -- a communicator size
#   COUNT -- an argument of a collective operation
#   TYPESIZE -- a datatype size
#   SEGSIZE -- a segment size (SEGM in file-based rules)
#   FANINOUT -- a degree of a tree (TOPO in file-based rules)
#
# MPI_ALLGATHER (COLL_ID 0)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear
# 2       Bruck
# 3       Recursive doubling (if non-power-of-two number of processes, it will switch to Bruck)
# 4       Ring
# 5       Neighbor Exchange (if odd number of processes, it will switch to Ring)
# 6       Two procs (COMM_SIZE = 2 only)
#
# ALLGATHERV (COLL_ID 1)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear
# 2       Bruck
# 3       Ring
# 4       Neighbor Exchange (if odd number of processes, it will switch to Ring)
# 5       Two procs (COMM_SIZE = 2 only)
#
# ALLREDUCE (COLL_ID 2)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear (reduce linear + bcast linear)
# 2       Nonoverlapping (reduce + bcast)
# 3       Recursive doubling
# 4       Ring (commutative ops only; if COUNT < COMM_SIZE, it will switch to Recursive doubling)
# 5       Segmented ring (commutative ops only; if COUNT < COMM_SIZE * SEGSIZE / TYPESIZE, it will switch to Ring)
# 6       Rabenseifner (if op is non-commutative or COUNT < pow(2, floor(log2(COMM_SIZE))), it will switch to Linear)
#
# ALLTOALL (COLL_ID 3)
# Note: if sbuf = MPI_IN_PLACE, algorithms will switch to Linear inplace
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear
# 2       Pairwise
# 3       Modified Bruck
# 4       Linear sync (with limited number of outstanding requests)
# 5       Two proc (COMM_SIZE = 2 only)
#
# ALLTOALLV (COLL_ID 4)
# Note: if sbuf = MPI_IN_PLACE, algorithms will switch to Linear Inplace
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Basic linear
# 2       Pairwise
#
# ALLTOALLW (COLL_ID 5) -- not yet implemented
#
# BARRIER (COLL_ID 6)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear
# 2       Double ring
# 3       Recursive doubling
# 4       Bruck
# 5       Two proc (COMM_SIZE = 2 only)
# 6       Tree
#
# BCAST (COLL_ID 7)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Basic linear
# 2       Chain (FANINOUT chains/pipelines with message segment of SEGSIZE bytes)
# 3       Pipeline (segment of SEGSIZE bytes)
# 4       Split binary tree (segment of SEGSIZE bytes; if SEGSIZE > COUNT/2*TYPESIZE, it will switch to Chain with FANINOUT=1)
# 5       Binary tree (segment of SEGSIZE bytes)
# 6       Binomial tree (segment of SEGSIZE bytes)
# 7       Knomial tree (segment of SEGSIZE bytes)
# 8       Scatter-allgather (if COUNT < COMM_SIZE, it will switch to Linear)
# 9       Scatter-allgather-ring (if COUNT < COMM_SIZE, it will switch to Linear)
#
# EXSCAN (COLL_ID 8)
# Alg ID  Algorithm
# 0       Linear
# 1       Linear
# 2       Recursive doubling
#
# GATHER (COLL_ID 9)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear
# 2       Binomial tree
# 3       Linear sync (segment of SEGSIZE bytes)
#
# GATHERV (COLL_ID 10) -- not yet implemented
#
# REDUCE (COLL_ID 11)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear
# 2       Chain (FANINOUT chains/pipelines with message segment of SEGSIZE bytes)
# 3       Pipeline (segment of SEGSIZE bytes)
# 4       Binary tree (segment of SEGSIZE bytes)
# 5       Binomial tree (segment of SEGSIZE bytes)
# 6       In-order binary tree (segment of SEGSIZE bytes)
# 7       Rabenseifner (if op is non-commutative or COUNT < pow(2, floor(log2(COMM_SIZE))), it will switch to Linear)
#
# REDUCESCATTER (COLL_ID 12)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Non-overlapping (reduce + scatterv)
# 2       Recursive halving (commutative ops only)
# 3       Ring (commutative ops only)
# 4       Butterfly
#
# REDUCESCATTERBLOCK (COLL_ID 13)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear (reduce + scatter)
# 2       Recursive doubling
# 3       Recursive halving (if op is non-commutative, it will switch to Linear)
# 4       Butterfly
#
# SCAN (COLL_ID 14)
# Alg ID  Algorithm
# 0       Linear
# 1       Linear
# 2       Recursive doubling (commutative ops only)
#
# SCATTER (COLL_ID 15)
# Alg ID  Algorithm
# 0       Use fixed rules
# 1       Linear
# 2       Binomial tree
# 3       Linear nb
#
# SCATTERV (COLL_ID 16) -- not yet implemented
# NEIGHBOR_ALLGATHER (COLL_ID 17) -- not yet implemented
# NEIGHBOR_ALLGATHERV (COLL_ID 18) -- not yet implemented
# NEIGHBOR_ALLTOALL (COLL_ID 19) -- not yet implemented
# NEIGHBOR_ALLTOALLV (COLL_ID 20) -- not yet implemented
# NEIGHBOR_ALLTOALLW (COLL_ID 21) -- not yet implemented
#

@bosilca
Member

bosilca commented Nov 3, 2020

Based on prior experience, we are not really good at investing time in maintaining the documentation. Instead of listing the algorithms themselves, I would add text explaining how a user can list all algorithms for each collective using ompi_info.

@yqin

yqin commented Sep 1, 2021

Not sure if it's just me or actually a bug in the file processing part of the code. I'm trying to play with self-defined dynamic rules for Scatter, because the current fixed decision has the following logic,

https://github.com/open-mpi/ompi/blob/master/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c#L1551-L1555

        if (total_dsize < 512) {
            alg = 2;
        } else {
            alg = 3;
        }

I'm simply trying to raise the switch point from 512B to 8192B. So I have the following definition file.

1               # num of collectives

# first collective
15              # ID = 15 Scatter collective (ID in coll_tuned.h)
1               # number of com sizes
64              # comm size 64
2               # number of msg sizes
0 2 0 0         # for message size 0, binomial, no topo or segmentation
8192 3 0 0      # for message size 8k+, linear nb, no topo or segmentation
# end of first collective

However, when I run it, it looks to me as if only the first character of the message size was parsed, i.e., 8192->8. For example,

$ mpirun -mca pml ucx -mca coll_hcoll_enable 0 -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_dynamic_rules_filename $PWD/scatter_dyn.conf osu_scatter -f

# OSU MPI Scatter Latency Test v5.6.2
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                       2.23              0.88              3.45        1000
2                       2.74              1.02              4.14        1000
4                       3.60              1.63              5.72        1000
8                     178.76              0.18            372.34        1000
16                    178.93              0.20            373.37        1000
32                    184.07              0.31            381.15        1000
64                    191.70              0.30            398.27        1000
128                   210.57              0.51            421.86        1000
256                   246.15              0.56            493.67        1000
512                   259.43              0.68            518.93        1000
1024                  284.47              0.92            564.12        1000
2048                  338.04              1.27            661.85        1000
4096                  468.41              1.94            909.64        1000
8192                  725.54              3.35           1402.59        1000
16384                1273.38              4.32           2612.10         100
32768                2578.37             11.49           5301.16         100
65536                5637.87             20.81          11665.03         100
131072               6671.02             24.77          13351.72         100
262144              13379.54             30.36          26665.75         100

I did multiple tries and they all showed the same behavior. Am I missing something here?

@wckzhang
Contributor

wckzhang commented Sep 1, 2021 via email

@yqin

yqin commented Sep 1, 2021

@wckzhang Well, apparently it is using the rules, if you look at the osu_scatter output I attached; it just changes the switch point from 512B to 8B instead of the 8192B I intended. Also, my understanding is that the comm size works the same way as the message size, i.e. it should be interpreted as "anything larger than comm size 64". I could be wrong, though, since I have not read the code. BTW, I forgot to mention that I also tried multiple comm sizes, with no effect on the above behavior.

@wckzhang
Contributor

wckzhang commented Sep 1, 2021 via email

@yqin

yqin commented Sep 1, 2021

> What comm size are you using?

My comm size is 1280 for the above testing.

> The parsing code is fairly sensitive and if you have slight formatting errors it will disregard the file completely.

This is exactly what I thought as well. If the file were dropped completely, I would know my format is wrong and could fix it. But if you take a look at the result I posted, the rules apparently worked for the 2 message size regions that I set. It is just that the second region started from 8B instead of 8192B.

@yqin

yqin commented Sep 1, 2021

I played a bit more with this, and it looks like my earlier comment about only the first character of the message size being parsed was wrong. It appeared to behave that way because my comm size is 1280, so that was just a coincidence. The config file that actually gives me what I need looks like this:

1               # num of collectives
# first collective
15              # ID = 15 Scatter collective (ID in coll_tuned.h)
1               # number of com sizes
64              # comm size 64
2               # number of msg sizes
0 2 0 0         # for message size 0, binomial, no topo or segmentation
10485760 3 0 0  # ???
# end of first collective

The second threshold, 10485760, is calculated as 8192 * 1280, where 8192 is the message size I'd like the range to start from and 1280 is the comm size. Why is that? I don't know, and it still looks like a bug to me. My understanding was that this value is comm-size agnostic.

@wckzhang
Contributor

wckzhang commented Sep 1, 2021

Ah... you're hitting this issue. Message size has a vague definition, and I opened an issue about the discrepancies at one point. Let me see if I can find it.

@wckzhang
Contributor

wckzhang commented Sep 1, 2021

See: #7672

@wckzhang
Contributor

wckzhang commented Sep 1, 2021

I don't really like where we're at with the message sizes, but there's a table in that issue you can refer to for correct sizing.

@wckzhang
Contributor

wckzhang commented Sep 1, 2021

There's also an issue in the collectives-tuning repo, open-mpi/ompi-collectives-tuning#24, which describes the same discrepancy. The code that generates tuning files doesn't take this into account (I do remember writing that code, but I suppose I never merged it).

@wckzhang
Contributor

wckzhang commented Sep 1, 2021

Oh, the table is a little outdated now, since I revised scatter and gather to use datatype size * comm size * scount.
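
The arithmetic behind the numbers in this thread can be written out in a few lines of Python. This is only a sketch of the convention stated above (for Scatter and Gather, the size compared against the file threshold is datatype size * comm size * scount); the helper names are hypothetical, and the assumption that a rule applies once the total reaches its threshold follows the ">=" comments in the template earlier in the thread.

```python
def scatter_file_threshold(typesize, scount, comm_size):
    """Value to put in the rules file so that the switch happens when each
    rank receives typesize * scount bytes (Scatter/Gather convention above)."""
    return typesize * comm_size * scount


def apparent_switch_bytes(file_threshold, comm_size):
    """Smallest per-rank byte count whose total reaches file_threshold."""
    return -(-file_threshold // comm_size)   # integer ceiling division


# 8192 bytes per rank on 1280 ranks: the value that worked in the thread.
wanted = scatter_file_threshold(typesize=1, scount=8192, comm_size=1280)

# Writing 8192 directly on 1280 ranks switches after about 7 bytes per rank.
observed = apparent_switch_bytes(8192, 1280)
```

With these inputs, wanted is 10485760 and observed is 7. The first osu_scatter message size at or above 7 bytes is 8, which matches the latency jump at 8 bytes in the output posted earlier.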

@yqin

yqin commented Sep 2, 2021

So this explains it. But I have to say that it is very counter-intuitive and inconsistent with existing fixed decision rules. For example, take a look at the code snippet that I posted for scatter.

https://github.com/open-mpi/ompi/blob/master/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c#L1551-L1555

        if (total_dsize < 512) {
            alg = 2;
        } else {
            alg = 3;
        }

Suppose I just want to do something similar with a dynamic rules file but change 512 to 8192, so that for every comm size >= 64, algorithm 2 is used when total_dsize < 8192 and algorithm 3 is used otherwise. How can I achieve that? With the current restrictions of the file format, do we have to list every comm size and calculate the msg size for each?

@wckzhang
Contributor

wckzhang commented Sep 2, 2021

Yeah, I also think it's counter-intuitive and inconsistent: since comm size is already taken into account in the tuning file, why does it need to be taken into account again? Unfortunately, there isn't a way to do that with the dynamic code today. @bosilca has major interests in this area; should we re-discuss the message size issue?
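
Until that changes, one possible workaround is to generate the file, with one block per communicator size of interest and the threshold scaled by that size. The Python sketch below does this for Scatter. It is an illustration under assumptions, not a supported tool: the function name scatter_rules_text is hypothetical, the defaults (collective ID 15, algorithms 2 and 3, topo and segmentation 0) are taken from the example file posted in this thread, and the scaling follows the datatype size * comm size * scount convention mentioned above.

```python
def scatter_rules_text(comm_sizes, per_rank_switch_bytes,
                       small_alg=2, large_alg=3, coll_id=15):
    """Build a rules file that switches algorithms at the same per-rank
    byte count for every listed communicator size."""
    sizes = sorted(set(comm_sizes))
    lines = [
        "1  # num of collectives",
        f"{coll_id}  # collective ID",
        f"{len(sizes)}  # number of comm sizes",
    ]
    for size in sizes:
        threshold = per_rank_switch_bytes * size   # scaled by comm size
        lines += [
            f"{size}  # comm size",
            "2  # number of msg sizes",
            f"0 {small_alg} 0 0",
            f"{threshold} {large_alg} 0 0",
        ]
    return "\n".join(lines) + "\n"


rules_text = scatter_rules_text([64, 1280], per_rank_switch_bytes=8192)
```

The switch point is exact only for the communicator sizes that are listed. A communicator of any other size presumably picks up a neighbouring block, so its effective per-rank switch point will differ.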

@yqin

yqin commented Sep 2, 2021

Agreed.
