Hello,
Thanks for sharing the PyTorch code for embedding training.
If we look at pytorch_xvectors/pytorch_run.sh,
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 train_xent.py exp/xvector_nnet_1a/egs/
If we look at the above line, it seems like you are training the DNN using a single GPU. Is it possible to train using multiple GPUs?
Further, if we look at the train_utils.py script:
def prepareModel(args):
    ...
    elif args.trainingMode == 'init':
        net.to(device)
        net = torch.nn.parallel.DistributedDataParallel(net,
                                                        device_ids=[0],
                                                        output_device=0)
        if torch.cuda.device_count() > 1:
            print("Using ", torch.cuda.device_count(), "GPUs!")
            net = nn.DataParallel(net)
Why are we using both torch.nn.parallel.DistributedDataParallel and nn.DataParallel(net)?
When I tried to train, it only used a single GPU. How does it need to be modified to train on multiple GPUs?
I look forward to hearing from you.
Thanks.
K. Ahilan
I think the code can be run on multiple GPUs using DataParallel, but I haven't worked out how, since I did not have access to a node with multiple GPUs on my university cluster.
I use DistributedDataParallel because it lets me run multiple processes on a single GPU, which greatly improves training time. This was particularly useful since I had access to a single V100 node, and each process used ~4GB.
I included the if statement that checks for multiple GPUs as a debug option in case I ever got access to a multi-GPU node, but that never happened 😄
I'll leave this issue open in case someone figures out how to do this.
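For anyone landing here with a multi-GPU node, here is a minimal sketch of how the wrapping step might look with one process per GPU. It has not been tested against this repository. The helper names (`local_rank_from_env`, `ddp_device_kwargs`, `wrap_for_multi_gpu`) are hypothetical, and it assumes the launcher exports `LOCAL_RANK` to each process. The idea is to replace the hard-coded `device_ids=[0]` with the per-process GPU index and to use DistributedDataParallel on its own, without also wrapping the model in nn.DataParallel.

```python
import os


def local_rank_from_env(default=0):
    """Return this process's GPU index, read from LOCAL_RANK (assumed to be set by the launcher)."""
    return int(os.environ.get("LOCAL_RANK", default))


def ddp_device_kwargs(local_rank):
    """Keyword arguments that pin one DistributedDataParallel replica to one GPU."""
    return {"device_ids": [local_rank], "output_device": local_rank}


def wrap_for_multi_gpu(net):
    """Hypothetical replacement for the DDP wrapping step in prepareModel."""
    # Imported here so the two helpers above can be used without a GPU stack.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    if not dist.is_initialized():
        # NCCL is the usual backend for CUDA tensors; rank and world size
        # are read from the environment prepared by the launcher.
        dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    net = net.to(torch.device("cuda", local_rank))
    # One replica per process: no additional nn.DataParallel wrapper.
    return torch.nn.parallel.DistributedDataParallel(
        net, **ddp_device_kwargs(local_rank)
    )
```

With something like this in place, the launch line would need `--nproc_per_node` set to the number of GPUs, and `CUDA_VISIBLE_DEVICES` would have to list all of them rather than only `0`. Each process would also need to read a different shard of the training data (for example via `torch.utils.data.distributed.DistributedSampler`), otherwise every GPU trains on the same batches.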