When I ran tf_cnn_benchmarks with and without horovod, I got different evaluation sequences.
Without horovod:
python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 6
I got two evaluation points: one for each epoch, as specified in the command line options.
...
Running evaluation at global_step 1679
...
Running final evaluation at global_step 3347
However, with horovod, there was only one evaluation point, at the end of the whole training run (2 epochs).
mpirun -np 6 -H p10login1:6 python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 1 --variable_update horovod
...
Running final evaluation at global_step 3347
It did not evaluate at the end of the first epoch, which is step 1679. I found the cause to be self.batch_size on lines 1553 and 1554 of benchmark_cnn.py. Replacing it with (self.batch_size * self.num_workers) seemed to work.
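The arithmetic behind the bug can be sketched as follows. This is a hypothetical illustration, not the actual benchmark_cnn.py code: under Horovod each process sees only its per-worker batch size, so dividing the dataset size by self.batch_size alone overestimates the number of steps per epoch, and the mid-training evaluation threshold is never crossed before training ends.

```python
# Hypothetical sketch of the steps-per-epoch calculation that drives
# the eval_during_training_every_n_epochs trigger. Names are illustrative,
# not taken from benchmark_cnn.py.

def steps_per_epoch(dataset_size, batch_size, num_workers=1):
    # The global batch processed per step is the per-worker batch
    # times the number of workers.
    return dataset_size // (batch_size * num_workers)

DATASET_SIZE = 1281167      # ImageNet training images
PER_WORKER_BATCH = 128

# Without Horovod: one process owns all 6 GPUs, so its batch_size
# is already the global batch (128 * 6 = 768).
single_process = steps_per_epoch(DATASET_SIZE, PER_WORKER_BATCH * 6)

# With Horovod (buggy): each of the 6 processes divides by its
# local batch of 128 only, inflating steps-per-epoch 6x, so the
# first-epoch evaluation point is never reached.
horovod_buggy = steps_per_epoch(DATASET_SIZE, PER_WORKER_BATCH)

# With the reported fix: multiply by num_workers, recovering the
# same epoch length as the single-process run.
horovod_fixed = steps_per_epoch(DATASET_SIZE, PER_WORKER_BATCH,
                                num_workers=6)

assert horovod_fixed == single_process
assert horovod_buggy > horovod_fixed
```

With the fix, each worker computes the same epoch boundary as the single-process run, so the per-epoch evaluation fires at the expected global steps.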
Unfortunately, the Horovod path is not well tested. Since tf_cnn_benchmarks is unmaintained and I don't know how to run it with Horovod, this will likely not be fixed.