AllReduce info.proto returns 0 (LL protocol) despite both setting env variable to SIMPLE / Tuning.cc protoEnable array disabling LL/LL128 #1246

Open
OckermanSethGVSU opened this issue Apr 7, 2024 · 5 comments

Comments

@OckermanSethGVSU

I am facing an odd issue where the LL protocol appears to be used even though I set the protocol to SIMPLE in my nccl.conf file. To confirm this, I added a basic print statement to collectives.cc that prints the protocol (shown below) and then profiled my run with Nsight. My debug file lists SIMPLE as the protocol set by the environment, but neither Nsight nor the value of info.protocol agrees.

if (info.protocol != -1) {
  // Debug print: show the protocol recorded in the info struct.
  std::cout << ncclProtoToString(info.protocol) << std::endl;
}

In an attempt to fix this, I went to line 270 of tuning.cc and set int protoEnable[NCCL_NUM_PROTOCOLS] = { 0, 0, 1 };, which in theory should have disabled the LL and LL128 protocols. When that did not work, I went to the for loop on line 303 of tuning.cc and used if statements so that pEnable = 1 only for NCCL_PROTO_SIMPLE. That still did not work...
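To make the intent of the { 0, 0, 1 } edit concrete, here is a minimal standalone sketch of how an enable array could gate protocol selection. It is not NCCL's code: `pickProtocol`, the `kProto*` constants, and the `estimatedTime` input are hypothetical names introduced for illustration. The only thing taken from the issue is the index order (LL, LL128, Simple).

```cpp
#include <array>

// Index order implied by the issue's { 0, 0, 1 } initializer: LL, LL128, Simple.
constexpr int kProtoLL = 0;
constexpr int kProtoLL128 = 1;
constexpr int kProtoSimple = 2;
constexpr int kNumProtocols = 3;

// Hypothetical helper (not an NCCL function): among the enabled protocols,
// return the one with the lowest estimated time, or -1 if none is enabled.
int pickProtocol(const std::array<int, kNumProtocols>& protoEnable,
                 const std::array<double, kNumProtocols>& estimatedTime) {
  int best = -1;
  for (int p = 0; p < kNumProtocols; ++p) {
    if (!protoEnable[p]) continue;  // a disabled protocol is never a candidate
    if (best == -1 || estimatedTime[p] < estimatedTime[best]) best = p;
  }
  return best;
}
```

Under this model, an enable array of { 0, 0, 1 } can only ever yield index 2 (Simple), no matter how the estimated times compare.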

I am using NCCL version 2.21.5+cuda12.4. If I just misunderstood how the source code works I apologize in advance.

@sjeaugey
Member

sjeaugey commented Apr 8, 2024

You should not rely on the NCCL kernel name; it means nothing. We tried to rename it to ncclGenericKernel at some point, but users complained that having some name related to the operation, even an incorrect one, helped them read their profiles. So we went back to the specific name, even though it's called "LL" regardless of the protocol we actually use.

We could probably just remove the LL in the name, although it might make the code even more complicated.

@OckermanSethGVSU
Author

OckermanSethGVSU commented Apr 8, 2024

Is that true also of the info call in AllReduce itself?

I added a print statement to AllReduce that prints "info.proto" and it returns 0, potentially indicating the LL protocol.

@sjeaugey
Member

sjeaugey commented Apr 8, 2024

info should be correct. Did you try to set NCCL_PROTO=SIMPLE in the environment? That's what will eventually set the proto enable flags.
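As a rough illustration of "the environment value sets the enable flags", the following sketch turns a comma-separated protocol list into a three-entry flag array. It is a simplified model, not NCCL's parser: `parseProtoList` and `toUpper` are hypothetical, and the case-insensitive matching is an assumption of this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <sstream>
#include <string>

// Uppercase a copy of the input so that "Simple" and "SIMPLE" compare equal.
std::string toUpper(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  return s;
}

// Hypothetical parser: every protocol named in `value` is enabled, all others
// are disabled. Index order is LL, LL128, Simple; unknown tokens are ignored.
std::array<int, 3> parseProtoList(const std::string& value) {
  static const char* const names[3] = {"LL", "LL128", "SIMPLE"};
  std::array<int, 3> enable = {0, 0, 0};
  std::stringstream ss(value);
  std::string token;
  while (std::getline(ss, token, ',')) {
    const std::string upper = toUpper(token);
    for (int p = 0; p < 3; ++p) {
      if (upper == names[p]) enable[p] = 1;
    }
  }
  return enable;
}
```

In this model, `parseProtoList("SIMPLE")` produces the same { 0, 0, 1 } array discussed earlier in the thread. A caller would typically feed it the result of `std::getenv("NCCL_PROTO")` after checking for a null pointer; note that on Linux the variable name itself is case-sensitive, even if the value is matched case-insensitively.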

@OckermanSethGVSU
Author

Yes. I have an .nccl_conf file, and when that didn't work I also tried exporting nccl_proto=Simple (with the correct capitalization) in the environment. My debug file reports that the proto is set to SIMPLE by the environment.

@OckermanSethGVSU
Author

Could you point me to where the NCCL_PROTO is finally selected? I can find where different protos are enabled / disabled, but I cannot find the final selection.
