build: UCC integration #575

Open
3 of 5 tasks
nirandaperera opened this issue Mar 7, 2022 · 8 comments

@nirandaperera
Collaborator

nirandaperera commented Mar 7, 2022

References:

  1. UCC
  2. torch-ucc
  3. torch-ucc fb research

Note: UCX must be version 1.11 or later (the current conda package is 1.12, which works).

Roadmap:

@vibhatha
Collaborator

vibhatha commented Mar 8, 2022

@nirandaperera I did the first step too. Building along with MPI didn't work.
I reported it here: openucx/ucc#436

@nirandaperera
Collaborator Author

They've now added a comprehensive example, which we can directly use.
https://github.com/openucx/ucc/wiki/UCC-Allreduce-example

@Sergei-Lebedev

FYI, torch_ucc has moved to another repo: https://github.com/facebookresearch/torch_ucc

@vibhatha
Collaborator

Thanks a lot for the pointer @Sergei-Lebedev

@esaliya
Collaborator

esaliya commented Mar 15, 2022 via email

@Sergei-Lebedev

Hi @esaliya, subranks info might be useful, but I think it can be reconstructed using the prefix store without adding any options to the PG constructor. In UCC we don't need this info because a UCC team allgather is used instead.
The bigger challenge for us is that the PyTorch world group is not strictly defined. For instance, it's allowed to create the default PG with backend A and then create a subgroup with backend B (see example below). Because of this, it's hard to utilize resource sharing within UCC and stay fully compatible with PyTorch semantics.

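import os
import torch
import torch.distributed as dist
import torch_ucc  # makes the 'ucc' backend available to torch.distributed

# Rendezvous settings; rank and world size are taken from the
# environment variables set by Open MPI's launcher.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12345'
os.environ['RANK']        = os.environ['OMPI_COMM_WORLD_RANK']
os.environ['WORLD_SIZE']  = os.environ['OMPI_COMM_WORLD_SIZE']

# The default (world) process group uses backend A (gloo) ...
dist.init_process_group('gloo')
# ... while the subgroup of ranks 0 and 1 uses backend B (ucc).
# Every rank calls new_group, including ranks outside the subgroup.
sg = dist.new_group(ranks=[0, 1], backend='ucc')
if dist.get_rank() in [0, 1]:
    sg.barrier()  # barrier on the ucc subgroup only
dist.barrier()    # barrier on the gloo world group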
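
The remark above that subranks info can be reconstructed from the prefix store could look roughly like the sketch below. It is only an illustration, assuming each subgroup member publishes its global rank under a key derived from its group-local rank. `InMemoryStore`, `PrefixedStore`, `publish_rank`, and `collect_ranks` are hypothetical names, and the in-memory dictionary stands in for a real shared store; this is not how torch_ucc or PyTorch implements it.

```python
class InMemoryStore:
    """Hypothetical stand-in for a key-value store shared by all ranks."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # A real distributed store would have to wait for a missing key;
        # this stand-in simply raises KeyError.
        return self._data[key]


class PrefixedStore:
    """Namespaces every key with a prefix so that groups do not collide."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def set(self, key, value):
        self._store.set(f"{self._prefix}/{key}", value)

    def get(self, key):
        return self._store.get(f"{self._prefix}/{key}")


def publish_rank(store, group_rank, global_rank):
    """Each subgroup member records its global rank under its group rank."""
    store.set(f"rank{group_rank}", str(global_rank))


def collect_ranks(store, group_size):
    """Rebuild the group-rank -> global-rank mapping from the store."""
    return [int(store.get(f"rank{i}")) for i in range(group_size)]


base = InMemoryStore()
group_store = PrefixedStore("subgroup0", base)

# Simulate global ranks 2 and 5 forming a two-member subgroup.
for group_rank, global_rank in enumerate([2, 5]):
    publish_rank(group_store, group_rank, global_rank)

subranks = collect_ranks(group_store, 2)
print(subranks)  # [2, 5]
```

In a real multi-process setting each rank would call `publish_rank` once for itself and then `collect_ranks`, so no extra constructor argument is needed to learn which global ranks belong to the subgroup.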

@nirandaperera
Collaborator Author

nirandaperera commented May 29, 2022

@kaiyingshan these are the steps needed to build UCC.

  • Install conda (miniconda would be the easiest)
  • Create a conda env using conda/environments/cylon.yml (this will install ucx 1.12 to the environment)
  • Install UCC as follows
      git clone --single-branch -b v1.0.0 https://github.com/openucx/ucc.git $HOME/ucc
      cd $HOME/ucc
      ./autogen.sh
      ./configure --prefix=$HOME/ucc/install --with-ucx=$CONDA/envs/cylon_dev
      make install
  • Build cylon with UCX and UCC
      python build.py -cmake-flags="-DCYLON_UCX=1 -DCYLON_UCC=1 -DUCC_INSTALL_PREFIX=$HOME/ucc/install" -ipath="$HOME/cylon/install" --cpp --python --test

If you are running ucc_example.cpp locally, make sure to add conda libs and UCC libs to the LD_LIBRARY_PATH
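
As a rough illustration of the LD_LIBRARY_PATH note, the sketch below builds an environment with both library directories prepended, which could then be passed to a subprocess that runs the example binary. It assumes the libraries live in a `lib` subdirectory of the conda env and of the UCC install prefix, and that `$CONDA` falls back to `~/miniconda3` when unset; `with_library_paths` is a hypothetical helper name, and the commented-out binary path is a placeholder.

```python
import os


def with_library_paths(env, conda_env_prefix, ucc_prefix):
    """Return a copy of env with the conda and UCC lib dirs prepended
    to LD_LIBRARY_PATH (assumes both use a 'lib' subdirectory)."""
    lib_dirs = [
        os.path.join(conda_env_prefix, "lib"),
        os.path.join(ucc_prefix, "lib"),
    ]
    existing = env.get("LD_LIBRARY_PATH", "")
    parts = lib_dirs + ([existing] if existing else [])
    new_env = dict(env)
    new_env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return new_env


home = os.path.expanduser("~")
# Assumption: fall back to ~/miniconda3 if $CONDA is not set.
conda_root = os.environ.get("CONDA", os.path.join(home, "miniconda3"))

env = with_library_paths(
    os.environ,
    os.path.join(conda_root, "envs", "cylon_dev"),
    os.path.join(home, "ucc", "install"),
)
print(env["LD_LIBRARY_PATH"])

# Placeholder invocation; the actual location of the built example may differ.
# import subprocess
# subprocess.run(["./ucc_example"], env=env, check=True)
```

The same effect can be had by exporting LD_LIBRARY_PATH in the shell before launching the example; the helper only makes the ordering of the two directories explicit.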

@kaiyingshan
Collaborator

It seems to fail to build nondeterministically on my computer, maybe because I'm using WSL. I'll try to figure out the cause.

@laszewsk laszewsk changed the title UCC integration build: UCC integration Aug 2, 2023
6 participants