Horovod discrepancies on eval_during_training_every_epochs #396

Open
taipin opened this issue Jun 25, 2019 · 1 comment

taipin commented Jun 25, 2019

When I ran tf_cnn_benchmarks with and without horovod, I got different evaluation sequences.

Without horovod:
python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 6
I got two evaluation points: one for each epoch, as specified in the command line options.
...
Running evaluation at global_step 1679
...
Running final evaluation at global_step 3347

However, with Horovod, there was only one evaluation point, at the end of the whole training run (2 epochs).
mpirun -np 6 -H p10login1:6 python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 1 --variable_update horovod
...
Running final evaluation at global_step 3347

It did not run an evaluation at the end of the first epoch, which is step number 1679. I traced the cause to the use of self.batch_size on lines 1553 and 1554 of benchmark_cnn.py: under Horovod that value covers only one worker, not all six. Replacing it with (self.batch_size * self.num_workers) seemed to work.
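To illustrate why the per-worker batch size would suppress the mid-training evaluation, here is a minimal sketch of the step arithmetic. It is not the code from benchmark_cnn.py; the function names are hypothetical, and the ImageNet training-set size of 1,281,167 images is an assumption on my part. It also ignores any warm-up or other offsets the benchmark applies, so the step numbers will not match the logs above exactly.

```python
import math

# Assumed values for illustration only (not read from benchmark_cnn.py).
NUM_EXAMPLES = 1281167      # assumed ImageNet training-set size
PER_WORKER_BATCH = 128      # --batch_size with --num_gpus 1
NUM_WORKERS = 6             # mpirun -np 6
TOTAL_EPOCHS = 2            # --num_epochs
EVAL_EVERY_N_EPOCHS = 1     # --eval_during_training_every_n_epochs


def steps_for_epochs(num_epochs, examples_per_step):
    """Training steps needed to consume `num_epochs` of the dataset."""
    return int(math.ceil(num_epochs * NUM_EXAMPLES / examples_per_step))


def mid_training_eval_steps(examples_per_step_for_eval):
    """Hypothetical helper: steps at which a mid-training eval would fire.

    The total number of training steps is always derived from the global
    batch size; only the eval schedule uses the value passed in.
    """
    global_batch = PER_WORKER_BATCH * NUM_WORKERS
    total_steps = steps_for_epochs(TOTAL_EPOCHS, global_batch)
    eval_steps = []
    k = 1
    while True:
        step = steps_for_epochs(k * EVAL_EVERY_N_EPOCHS,
                                examples_per_step_for_eval)
        if step >= total_steps:
            # At or past the last step: only the final evaluation remains.
            break
        eval_steps.append(step)
        k += 1
    return eval_steps, total_steps


# Eval schedule computed from the per-worker batch size (suspected bug).
buggy_steps, total = mid_training_eval_steps(PER_WORKER_BATCH)
# Eval schedule computed from the global batch size (proposed fix).
fixed_steps, _ = mid_training_eval_steps(PER_WORKER_BATCH * NUM_WORKERS)

print("total training steps:", total)
print("eval steps with per-worker batch size:", buggy_steps)
print("eval steps with global batch size:", fixed_steps)
```

Under these assumptions, the per-worker batch size places the first "epoch boundary" well beyond the last training step, so no mid-training evaluation is ever scheduled, while the global batch size places it roughly halfway through the run.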

reedwm (Member) commented Jan 17, 2020

Unfortunately Horovod is not well tested. Since tf_cnn_benchmarks is unmaintained and I don't know how to run with Horovod, this will likely not be fixed.
