Training on multi-gpu very slow #237

Open
sun-peach opened this issue Jul 15, 2020 · 4 comments

@sun-peach

I am training my ASR model with pytorch-kaldi, and I notice that training is very slow: 10% of one chunk takes 10 minutes. I have 10 chunks and will run 15 epochs, which works out to about 10 days of training.

My dataset has about 2k hours of audio, which I split into 10 chunks. I use multi-GPU; each GPU has 32 GB of memory. I am following cfg/librispeech_liGRU_fmllr.cfg, except that I use Adam and 4 liGRU layers (instead of the 5 layers set originally).

I have searched the existing issues and learned that the developers have already optimized the multi-GPU training process. But I still see my GPU utilization at around 30%, which means the GPUs are not fully used. Is there any way I can speed up the training a little bit?
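For reference, this is roughly how I check utilization while training runs (a small nvidia-smi polling sketch, nothing specific to pytorch-kaldi):

```python
import subprocess
import time

# Poll per-GPU utilization once a second while training runs elsewhere.
# Sustained ~30% across all GPUs usually points at a data-loading or
# single-process Python bottleneck rather than the GPUs themselves.
for _ in range(10):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(result.stdout.strip())
    time.sleep(1)
```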

Thank you very much!

@TParcollet
Collaborator

Hi! This is quite a hard problem in itself. 2k hours is a lot, and 10 days of training on a single GPU sounds reasonable to me. You can:

1. Consider something other than the liGRU (LSTM and GRU are faster thanks to cuDNN, but they also give worse performance).
2. Accept that multi-GPU with DataParallel is bottlenecked by Python, and the only solution is to go with DistributedDataParallel (which is impossible to adapt to pytorch-kaldi, I think). So you should just set multi_gpu=True and then set batch_size = max_batch_size_for_one_gpu * number_of_gpus_you_have. Training time doesn't scale linearly with the number of GPUs, but you can easily go down to 3 days with 4 GPUs.
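To make the scaling advice concrete, here is a generic PyTorch sketch of what DataParallel does with the enlarged batch (the model and tensor shapes are placeholders for illustration, not pytorch-kaldi's actual code):

```python
import torch
import torch.nn as nn

# Placeholder network; in pytorch-kaldi this would be the liGRU/GRU/LSTM stack.
model = nn.Sequential(nn.Linear(40, 550), nn.ReLU(), nn.Linear(550, 2000))

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across the visible GPUs, replicates
    # the model, runs the forward passes in parallel, and gathers the outputs
    # back on GPU 0. That scatter/gather inside a single Python process is
    # what limits scaling.
    model = nn.DataParallel(model)
model = model.cuda()

# Scale the batch so each GPU still receives its single-GPU maximum:
# batch_size = max_batch_size_for_one_gpu * n_gpus, e.g. 16 * 4 = 64.
batch = torch.randn(64, 40).cuda()
out = model(batch)  # each of the 4 GPUs processes a 16-sample slice
```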

@sun-peach
Author

Thank you. I use the settings listed below:

use_cuda=True
multi_gpu=True
N_epochs_tr=15
N_chunks=50
batch_size_train=16
max_seq_length_train=1500
increase_seq_length_train=True
start_seq_len_train=300
multply_factor_seq_len_train=5
batch_size_valid=8
max_seq_length_valid=1400

It seems that it will take about 12 days (my sequences are long). If you think all my settings are reasonable, then I will just wait.
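If I read the options correctly (an assumption based on the option names, not checked against the pytorch-kaldi source), increase_seq_length_train=True makes the training sequence length start at start_seq_len_train and multiply by multply_factor_seq_len_train after each epoch, capped at max_seq_length_train. A short sketch of that schedule:

```python
# Sketch of the assumed sequence-length warm-up schedule (see caveat above).
start_seq_len_train = 300
multply_factor_seq_len_train = 5
max_seq_length_train = 1500
N_epochs_tr = 15

seq_len = start_seq_len_train
for epoch in range(1, N_epochs_tr + 1):
    print(f"epoch {epoch:2d}: max training sequence length = {seq_len}")
    seq_len = min(seq_len * multply_factor_seq_len_train, max_seq_length_train)
# Epoch 1 runs at 300 frames; epochs 2-15 already run at the full 1500,
# so nearly the whole run happens at the long (slow) sequence length.
```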

@TParcollet
Collaborator

How many GPUs do you have?

@sun-peach
Author

4 GPUs.
