I am facing an odd issue: the LL protocol appears to be used even though I set the protocol to SIMPLE in my nccl.conf file. To confirm this, I added a basic print statement to collectives.cc that prints the protocol (shown below) and then profiled my execution with Nsight. My debug file lists SIMPLE as the protocol set by the environment, but neither Nsight nor the value of info.protocol agrees with it.
if (info.protocol != -1) {
  std::cout << ncclProtoToString(info.protocol) << std::endl;
}
In an attempt to fix this, I went to line 270 of tuning.cc and set the value to int protoEnable[NCCL_NUM_PROTOCOLS] = { 0, 0, 1 }; which in theory should have disabled the LL and LL128 protocols. When that did not work, I went to the for loop on line 303 of tuning.cc and manually added if statements so that pEnable = 1 only for NCCL_PROTO_SIMPLE. That still did not work...
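For readers following along, here is a minimal sketch of how a protocol string such as Simple could map onto an enable mask like the { 0, 0, 1 } array above. It is not NCCL's actual code: parseProtoEnable, kNumProtocols and kProtoNames are hypothetical names, the index order (LL, LL128, Simple) is inferred from the array in this post, and the "everything enabled when unset" default is an assumption made for illustration only.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical stand-ins; the order (LL, LL128, Simple) is assumed from the
// { 0, 0, 1 } array quoted in the post.
constexpr int kNumProtocols = 3;
const char* const kProtoNames[kNumProtocols] = {"LL", "LL128", "SIMPLE"};

// Illustrative helper, NOT part of NCCL: turns a comma-separated protocol
// list into an enable mask. A leading '^' means "everything except the list".
std::array<int, kNumProtocols> parseProtoEnable(const std::string& spec) {
  std::array<int, kNumProtocols> enable = {1, 1, 1};  // assumed default
  if (spec.empty()) return enable;

  const bool exclude = (spec[0] == '^');
  enable.fill(exclude ? 1 : 0);

  std::stringstream ss(exclude ? spec.substr(1) : spec);
  std::string token;
  while (std::getline(ss, token, ',')) {
    // Compare case-insensitively so "Simple" and "SIMPLE" both match.
    std::transform(token.begin(), token.end(), token.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    for (int p = 0; p < kNumProtocols; ++p) {
      if (token == kProtoNames[p]) enable[p] = exclude ? 0 : 1;
    }
  }
  return enable;
}
```

Under these assumptions, parseProtoEnable("Simple") yields the same mask as the hard-coded { 0, 0, 1 }, so hard-coding the array and setting the variable would be expected to have the same effect on which protocols are enabled.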
I am using NCCL version 2.21.5+cuda12.4. If I just misunderstood how the source code works, I apologize in advance.
You should not rely on the NCCL kernel name; it means nothing. We tried to rename it to ncclGenericKernel at some point, but users complained that having some name, even an incorrect one, that was somewhat related to the operation helped them read their profiles. So we went back to the specific name, even though it says "LL" regardless of the protocol we actually use.
We could probably just remove the LL in the name, although it might make the code even more complicated.
Yes. I have a .nccl.conf file, and when that didn't work I also tried exporting NCCL_PROTO=Simple in the environment. My debug file reports that the proto is set to SIMPLE by the environment.
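As a sketch of the environment-based setup described here, the variables can also be set from inside the process before the first NCCL call. configureNcclProtoDebug is a hypothetical helper, not an NCCL function; NCCL_PROTO, NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL environment variables, but exactly what the TUNING subsystem logs in a given version is an assumption to verify against your own debug output.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of NCCL. Sets the variables for the current
// process only; it must run before the first NCCL call so that the library
// sees the values when it initializes.
bool configureNcclProtoDebug(const std::string& proto) {
  // setenv is POSIX; the final argument 1 overwrites any existing value.
  if (setenv("NCCL_PROTO", proto.c_str(), 1) != 0) return false;
  // INFO-level logging with the ENV subsystem reports which variables NCCL
  // picked up; TUNING is assumed to report the algorithm/protocol choice.
  if (setenv("NCCL_DEBUG", "INFO", 1) != 0) return false;
  if (setenv("NCCL_DEBUG_SUBSYS", "ENV,TUNING", 1) != 0) return false;
  return true;
}
```

Calling configureNcclProtoDebug("Simple") at the top of main mirrors the export described above, and the resulting log may be a more reliable indicator of the protocol than the kernel name shown in the profiler.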
Could you point me to where the NCCL_PROTO is finally selected? I can find where different protos are enabled / disabled, but I cannot find the final selection.