Erroneous MPI_Reduce Results with Data Sets larger than 2KB in Multi-Server Configurations #6893
Comments
What is your output?
test.txt
Through testing, I found two ways to achieve the correct output:
I am not able to reproduce your results yet. What is your output of
MPICH Version: 4.1
I reproduced the bug in mpich-4.1. The current main branch doesn't have this issue. It appears we have fixed it at some point. I'll try to determine where we have fixed it.
You should be able to use this environment variable to disable the problematic reduce algorithm.
Still, I will pick up the fix for 4.1.3 if/when that release is made.
The 4.1.3 release included the fix for this issue.
Description
I am seeing MPI_Reduce return incorrect results for sum (addition) reductions when the data set is larger than 2KB. The problem only appears in a distributed run spanning multiple servers, and only when the root process (rank 0) runs alone on its own server and at least one of the other servers hosts more than one MPI process.
Environment
MPICH version: 4.1
Operating System: CentOS 7.4.1708
Configuration and code
Machine file
Program Execution Command
mpirun -machinefile machinefile ./mpiexe > test.txt
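The original test program and machine file are not shown on this page. As a rough illustration only, the sketch below shows one way such a check could be written: each rank fills a buffer with a known pattern, rank 0 receives the MPI_SUM reduction, and the result is compared against a closed-form expected value. The helper names (`fill_buffer`, `expected_sum`, `count_mismatches`), the `long` element type, the count of 1024 elements, and the `USE_MPI` compile switch are all assumptions made for this sketch, not details taken from the reporter's code.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fill pattern: rank r contributes (r + i) at index i. */
static void fill_buffer(long *buf, size_t count, int rank) {
    for (size_t i = 0; i < count; i++)
        buf[i] = (long)rank + (long)i;
}

/* Closed-form sum over ranks 0..nprocs-1 of (r + i). */
static long expected_sum(int nprocs, size_t i) {
    long n = nprocs;
    return n * (n - 1) / 2 + n * (long)i;
}

/* Count elements of a reduced buffer that differ from the expected sum. */
static size_t count_mismatches(const long *recv, size_t count, int nprocs) {
    size_t bad = 0;
    for (size_t i = 0; i < count; i++)
        if (recv[i] != expected_sum(nprocs, i))
            bad++;
    return bad;
}

#ifdef USE_MPI /* assumed switch: build with mpicc -DUSE_MPI */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1024 longs is typically 8 KB on 64-bit Linux, above the 2 KB
       size mentioned in the report (assumed value, not from the issue). */
    size_t count = 1024;
    long *send = malloc(count * sizeof *send);
    long *recv = malloc(count * sizeof *recv);
    if (send == NULL || recv == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    fill_buffer(send, count, rank);

    /* Sum reduction to rank 0, the operation described in the report. */
    MPI_Reduce(send, recv, (int)count, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("mismatches: %zu of %zu\n",
               count_mismatches(recv, count, nprocs), count);

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}
#endif
```

To exercise the reported condition, the program would need to run with a machine file that places rank 0 alone on one host and two or more processes on another host. With a correct MPI_Reduce, rank 0 should report zero mismatches.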