
[Improvement] Allow multiple CCL inits from same process but from different threads #13

Open
umamaheswararao opened this issue Feb 21, 2020 · 3 comments

Comments

@umamaheswararao

Currently we can initialize multiple XGBoost Rabit instances from the same process but from different threads. In Spark, it is possible to have multiple tasks run on the same executor. An executor is a single JVM process, and each of its tasks runs in a separate thread.

It is not a very critical requirement for us at this stage, but users can run that way, since Spark allows it. So it would be good to support initializing oneCCL from multiple threads.
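
For illustration, here is a minimal sketch (not from this issue; rank count, thread count, and buffer sizes are assumed example values) of what per-thread initialization could look like with the oneCCL C++ API: one process spawns several threads, and each thread would own its own rank and communicator. The API names follow the oneCCL C++ specification; whether `ccl::create_communicator` can actually be called this way from several threads of one process is exactly what this issue asks for.

```cpp
// Hypothetical sketch only: "one CCL rank per thread inside one process".
// Whether oneCCL supports this usage is the subject of this issue;
// size/rank/buffer values below are example numbers.
#include <thread>
#include <vector>
#include "oneapi/ccl.hpp"

void task(int rank, int size, ccl::shared_ptr_class<ccl::kvs> kvs) {
    // Desired behavior: each thread (e.g. each Spark task) owns its own rank.
    auto comm = ccl::create_communicator(size, rank, kvs);

    std::vector<float> grad(1024, 1.0f), reduced(1024, 0.0f);
    // Each thread would then join collectives as an independent rank.
    ccl::allreduce(grad.data(), reduced.data(), grad.size(),
                   ccl::reduction::sum, comm).wait();
}

int main() {
    ccl::init();
    auto kvs = ccl::create_main_kvs();  // one key-value store shared in-process

    const int size = 4;                 // e.g. 4 tasks running in one executor
    std::vector<std::thread> threads;
    for (int rank = 0; rank < size; ++rank)
        threads.emplace_back(task, rank, size, kvs);
    for (auto& t : threads)
        t.join();
    return 0;
}
```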

@umamaheswararao umamaheswararao changed the title [Improvement] Allow multiple CCL inits from same process [Improvement] Allow multiple CCL inits from same process but from different threads Feb 21, 2020
@XapaJIaMnu

This is also an issue for us. We have a neural network "one thread, one worker" model that we want to keep consistent with our GPU implementation (based on NCCL). Neural network toolkits usually perform data-level parallelism, and when running on the CPU we use each CPU thread as a single worker.

Normally, on each machine we start a single process that initializes all the available CPU threads (or GPUs) as workers, and we use a communication backend to exchange gradients. NCCL allows us to have multiple GPUs attached to the same process as workers, but we can't achieve that with oneCCL/MPI, as they work at the process level and are oblivious to any threading within the process.

With oneCCL, as a workaround, we have to start N separate processes on each machine (where N corresponds to the number of CPU threads) and use oneCCL to handle communication that could otherwise be done in-process.

We would rather not rewrite our full communication model in order to make use of the more efficient in-process communication.
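
To make the comparison concrete, below is a short sketch (device count and device IDs are example values, not from this thread) of the single-process, multi-worker initialization NCCL already offers on GPU; the request here is the analogous pattern for CPU threads in oneCCL.

```cpp
// Sketch: NCCL's single-process, multi-worker initialization. One call
// creates one communicator per GPU, all owned by the same process; each
// communicator then acts as an independent worker/rank.
#include <vector>
#include <nccl.h>

int main() {
    const int ndev = 4;                       // e.g. 4 GPUs on one machine
    std::vector<int> devices = {0, 1, 2, 3};  // example device IDs
    std::vector<ncclComm_t> comms(ndev);

    // All ndev communicators live inside this single process.
    ncclCommInitAll(comms.data(), ndev, devices.data());

    // ... launch per-device work, run collectives on each communicator ...

    for (auto& comm : comms)
        ncclCommDestroy(comm);
    return 0;
}
```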

@mshiryaev
Contributor

@XapaJIaMnu - CCL relies on multiple threads (CCL workers) to parallelize the communication of the current process. Mapping every CPU core to a CCL rank (regardless of whether all ranks live in a single process or each rank is a separate process) leaves no core budget for those CCL workers. The typical scenario for training on CPU (e.g. with PyTorch/DDP or with Horovod) is one process per CPU socket, with the socket's cores split between compute and communication threads. The whole CPU socket is thus treated as a single compute device and mapped to a single CCL rank.
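
A minimal sketch (not from this thread; a 2-socket node and all sizes are assumed) of the layout described above: one rank per CPU socket, with the rank's compute cores working on one model replica via OpenMP and a few cores left for the CCL worker threads, which oneCCL lets you configure through environment variables such as CCL_WORKER_COUNT and CCL_WORKER_AFFINITY. In a real job the second rank runs as a separate process, typically pinned to the other socket, and joins via the kvs address (omitted here).

```cpp
// Sketch of "one rank per CPU socket": compute uses the socket's cores via
// OpenMP; communication is a single collective per rank, driven by the CCL
// worker threads on their reserved cores (configured externally, e.g. via
// CCL_WORKER_COUNT / CCL_WORKER_AFFINITY -- example settings, not advice).
#include <vector>
#include "oneapi/ccl.hpp"

int main() {
    ccl::init();
    auto kvs = ccl::create_main_kvs();
    // Rank 0 of 2 on this node; the peer rank is a separate process pinned
    // to the other socket, joining via the kvs address (not shown).
    auto comm = ccl::create_communicator(/*size*/ 2, /*rank*/ 0, kvs);

    std::vector<float> grad(1 << 20, 1.0f), reduced(grad.size(), 0.0f);

    // Compute phase: the whole socket acts as one compute device.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(grad.size()); ++i)
        grad[i] *= 0.5f;

    // Communication phase: one allreduce per socket-rank.
    ccl::allreduce(grad.data(), reduced.data(), grad.size(),
                   ccl::reduction::sum, comm).wait();
    return 0;
}
```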

@XapaJIaMnu

@mshiryaev our use case is machine translation models using the transformer architecture. Our input sequences are long, our models are big, our computational dependencies are strictly sequential, and our matrices are not large enough to benefit from OMP parallelism. Typically, for a single mini-batch, a worker will spend multiple seconds (or even minutes with very large mini-batches) computing the forward and backward passes, and then only fractions of a second communicating the gradients. For this scenario it is wasteful not to allocate every single CPU core to an individual worker.
