Ember: CommSplitevent is failing #2252

Open
saichennaintel opened this issue Oct 20, 2023 · 2 comments

saichennaintel commented Oct 20, 2023

1 - Detailed description of problem or enhancement

The enQ_commSplit event is not functioning as expected. The subsequent enQ_rank() and enQ_size() calls, which are used to obtain each process's rank in the new communicator and the size of that communicator, return stale or wrong values.

2 - Describe how to reproduce

I have included the test motif used to check the enQ_commSplit event. The motif puts each process into two different communicators and prints the local rank and size for each of them. I have also included the test configuration script that runs this motif. Update Makefile.am to include the two files ("mpi/motifs/embercommsplittest.h" and "mpi/motifs/embercommsplittest.cc") before building Ember.

Finally, I have also included a sample MPI program that implements the same logic as the motif, for validation. Compiling and running mpicommsplittest.cc prints the expected output.

Steps to reproduce:

  1. Update Makefile.am to include embercommsplittest.cc and embercommsplittest.h files.
  2. make all && make install
  3. cd tests && sst -v dragon_32_commsplittest.py
  4. Actual output from SST:

Rank: 31 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1897463408 dense_allreduce_group_size: 1897463472 mp_group_rank: 32592 mp_group_size: 1897463536
Rank: 30 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1718187119 dense_allreduce_group_size: 3440 mp_group_rank: 0 mp_group_size: 80
Rank: 29 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 109 dense_allreduce_group_size: 4048 mp_group_rank: 0 mp_group_size: 80
Rank: 28 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 23 dense_allreduce_group_size: 23 mp_group_rank: 0 mp_group_size: 1953451378
Rank: 27 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1118797312 dense_allreduce_group_size: 0 mp_group_rank: 21958 mp_group_size: 0
Rank: 26 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1107322473 dense_allreduce_group_size: 1109473616 mp_group_rank: 21958 mp_group_size: 81
Rank: 25 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1953451346 dense_allreduce_group_size: 1098640784 mp_group_rank: 26217 mp_group_size: 81
Rank: 24 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1769238349 dense_allreduce_group_size: 1280 mp_group_rank: 102 mp_group_size: 80
Rank: 23 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1711302249 dense_allreduce_group_size: 1110017616 mp_group_rank: 26112 mp_group_size: 81
Rank: 22 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 2077162720 dense_allreduce_group_size: 0 mp_group_rank: 32592 mp_group_size: 0
Rank: 21 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1113782176 dense_allreduce_group_size: 0 mp_group_rank: 21958 mp_group_size: 0
Rank: 20 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1953451380 dense_allreduce_group_size: 1101149040 mp_group_rank: 26217 mp_group_size: 145
Rank: 19 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797346712 dense_allreduce_group_size: 1797346848 mp_group_rank: 32592 mp_group_size: 1797346984
Rank: 18 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1110162408 dense_allreduce_group_size: 1118800832 mp_group_rank: 21958 mp_group_size: 81
Rank: 17 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 161 dense_allreduce_group_size: 1118800416 mp_group_rank: 0 mp_group_size: 1118805504
Rank: 16 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1718187119 dense_allreduce_group_size: 1099305984 mp_group_rank: 32512 mp_group_size: 81
Rank: 7 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797259366 dense_allreduce_group_size: 1109398144 mp_group_rank: 32592 mp_group_size: 161
Rank: 6 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 10 dense_allreduce_group_size: 1280524622 mp_group_rank: 0 mp_group_size: 1795188329
Rank: 5 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 161 dense_allreduce_group_size: 1118784752 mp_group_rank: 0 mp_group_size: 2077162720
Rank: 4 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1869182049 dense_allreduce_group_size: 560 mp_group_rank: 29550 mp_group_size: 81
Rank: 3 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 115201 dense_allreduce_group_size: 1696 mp_group_rank: 0 mp_group_size: 80
Rank: 0 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 18 dense_allreduce_group_size: 18 mp_group_rank: 0 mp_group_size: 6711668
Rank: 1 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1118812624 dense_allreduce_group_size: 560 mp_group_rank: 21958 mp_group_size: 80
Rank: 2 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1118784768 dense_allreduce_group_size: 1113049072 mp_group_rank: 21958 mp_group_size: 17
Rank: 8 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1118815344 dense_allreduce_group_size: 1118815312 mp_group_rank: 21958 mp_group_size: 9
Rank: 9 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1797383704 dense_allreduce_group_size: 1797383840 mp_group_rank: 32592 mp_group_size: 1797383976
Rank: 10 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1797420968 dense_allreduce_group_size: 1797421104 mp_group_rank: 32592 mp_group_size: 1797421240
Rank: 11 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797403560 dense_allreduce_group_size: 560 mp_group_rank: 32592 mp_group_size: 80
Rank: 12 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 1114383536 mp_group_rank: 0 mp_group_size: 20
Rank: 13 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1797458776 dense_allreduce_group_size: 1797458912 mp_group_rank: 32592 mp_group_size: 1797459048
Rank: 14 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1797496040 dense_allreduce_group_size: 1797496176 mp_group_rank: 32592 mp_group_size: 1797496312
Rank: 15 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797478632 dense_allreduce_group_size: 560 mp_group_rank: 32592 mp_group_size: 80

  5. Expected output (obtained by: mpicxx mpicommsplittest.cpp -o commsplittest && mpirun -np 32 commsplittest):

Rank: 2 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 4 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 5 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 12 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 18 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 19 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 21 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 22 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 0 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 1 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 3 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 6 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 7 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 8 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 9 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 10 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 11 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 13 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 14 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 15 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 16 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 17 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 20 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 23 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 24 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 25 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 26 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 27 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 28 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 29 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 30 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 31 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
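
For reference, the pattern in the MPI output above can be reproduced with plain arithmetic. The sketch below is inferred only from the printed values (32 ranks, groups of 4 and 8); it is not taken from the attached test files, and the `SplitInfo` type, the function names, and the assumption that ranks inside each new communicator are ordered by world rank are all hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper type; field names mirror the labels printed above.
struct SplitInfo {
    int dense_color;  // color used for the dense_allreduce split
    int dense_rank;   // rank inside the dense_allreduce communicator
    int dense_size;   // size of the dense_allreduce communicator
    int mp_rank;      // rank inside the mp communicator
    int mp_size;      // size of the mp communicator
};

// Assumption: world_size is a multiple of mp_size, and ranks inside each
// new communicator are ordered by world rank.
SplitInfo expected_split(int world_rank, int world_size, int mp_size) {
    assert(mp_size > 0 && world_size % mp_size == 0);
    SplitInfo s;
    s.dense_color = world_rank % mp_size;   // same color every mp_size ranks
    s.dense_rank  = world_rank / mp_size;   // position among ranks sharing that color
    s.dense_size  = world_size / mp_size;   // number of ranks sharing that color
    s.mp_rank     = world_rank % mp_size;   // position inside a block of consecutive ranks
    s.mp_size     = mp_size;
    return s;
}

// Prints one line per rank in the same field order as the MPI output.
void print_expected(int world_size, int mp_size) {
    for (int r = 0; r < world_size; ++r) {
        SplitInfo s = expected_split(r, world_size, mp_size);
        std::printf("Rank: %d dense_allreduce_group_color: %d "
                    "dense_allreduce_group_rank: %d dense_allreduce_group_size: %d "
                    "mp_group_rank: %d mp_group_size: %d\n",
                    r, s.dense_color, s.dense_rank, s.dense_size,
                    s.mp_rank, s.mp_size);
    }
}
```

Calling `print_expected(32, 4)` from a `main` function gives a quick reference to diff against the SST output, which is printed in a different rank order.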

3 - What Operating system(s) and versions

Ubuntu 22.04.3 LTS

dragon_32_commsplittest.txt
embercommsplittest.txt
embercommsplittest_header.txt
mpicommsplittest.txt

@saichennaintel
Author

Any update on this issue?

@feldergast
Contributor

I took a quick look at the code. It's been a while since I've worked with Ember and its quirks, but I think what's happening is that the enqueued events haven't actually executed at the time you print the values. The enQ_* functions just add events to a queue, and none of them actually run until after the generate function returns. If you move the printout so that it runs the second time generate is called, all the enqueued events will have run and the values should be correct at that point. You could do the print when m_loopIndex == 1, or you could put it at the top of the if (m_loopIndex == m_iterations) block, and it will print before the last iteration.
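
To illustrate the timing described above, here is a toy model of deferred execution. It is only a sketch and does not use Ember's real classes or signatures: `DeferredQueue`, `enqueueSetRank`, `drain`, and `ToyMotif` are made-up names, and the output variable starts at a sentinel of -1, whereas an uninitialized variable in a real motif would hold arbitrary garbage like the values in the SST output.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Hypothetical stand-in for a motif's event queue (not Ember's implementation).
class DeferredQueue {
public:
    // Records a request to write `value` into *out; nothing is written yet.
    void enqueueSetRank(int* out, int value) {
        assert(out != nullptr);
        events_.push([out, value] { *out = value; });
    }

    // Executes every event enqueued so far and returns how many ran.
    int drain() {
        int executed = 0;
        while (!events_.empty()) {
            events_.front()();
            events_.pop();
            ++executed;
        }
        return executed;
    }

private:
    std::queue<std::function<void()>> events_;
};

// Toy motif: generate() enqueues work on its first call and reads the
// result on its second call.
struct ToyMotif {
    DeferredQueue q;
    int rank = -1;              // sentinel meaning "not written yet"
    int seenAtFirstCall = -1;   // value observed inside the first generate()
    int seenAtSecondCall = -1;  // value observed inside the second generate()
    int loopIndex = 0;

    void generate() {
        if (loopIndex == 0) {
            q.enqueueSetRank(&rank, 7);
            seenAtFirstCall = rank;   // still the sentinel: the event has not run
        } else if (loopIndex == 1) {
            seenAtSecondCall = rank;  // the event ran between the two calls
        }
        ++loopIndex;
    }
};
```

A driver that calls `generate()`, then `q.drain()`, then `generate()` again mimics the simulator running the queued events between calls: the first call sees the sentinel and the second sees the written value.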
