When I ran tf_cnn_benchmarks with and without horovod, I got different evaluation sequences.
Without horovod:
python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 6
I got two evaluation points: one for each epoch, as specified in the command line options.
...
Running evaluation at global_step 1679
...
Running final evaluation at global_step 3347
However, with horovod, there was only one evaluation point, at the end of the whole training run (2 epochs).
mpirun -np 6 -H p10login1:6 python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 1 --variable_update horovod
...
Running final evaluation at global_step 3347
It did not evaluate at the end of the first epoch, which is step 1679. I found the cause to be self.batch_size on lines 1553 and 1554 of benchmark_cnn.py. Replacing it with (self.batch_size * self.num_workers) seemed to work.
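The arithmetic behind the bug can be sketched as follows. This is a hypothetical illustration, not the actual benchmark_cnn.py code: under Horovod each process sees only its per-worker batch size, so dividing the dataset size by self.batch_size alone overestimates the number of steps per epoch, and the mid-training evaluation threshold is never crossed before training ends.

```python
# Hypothetical sketch of the steps-per-epoch calculation that drives
# the eval_during_training_every_n_epochs trigger. Names are illustrative,
# not taken from benchmark_cnn.py.

def steps_per_epoch(dataset_size, batch_size, num_workers=1):
    # The global batch processed per step is the per-worker batch
    # times the number of workers.
    return dataset_size // (batch_size * num_workers)

DATASET_SIZE = 1281167      # ImageNet training images
PER_WORKER_BATCH = 128

# Without Horovod: one process owns all 6 GPUs, so its batch_size
# is already the global batch (128 * 6 = 768).
single_process = steps_per_epoch(DATASET_SIZE, PER_WORKER_BATCH * 6)

# With Horovod (buggy): each of the 6 processes divides by its
# local batch of 128 only, inflating steps-per-epoch 6x, so the
# first-epoch evaluation point is never reached.
horovod_buggy = steps_per_epoch(DATASET_SIZE, PER_WORKER_BATCH)

# With the reported fix: multiply by num_workers, recovering the
# same epoch length as the single-process run.
horovod_fixed = steps_per_epoch(DATASET_SIZE, PER_WORKER_BATCH,
                                num_workers=6)

assert horovod_fixed == single_process
assert horovod_buggy > horovod_fixed
```

With the fix, each worker computes the same epoch boundary as the single-process run, so the per-epoch evaluation fires at the expected global steps.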
Unfortunately, the Horovod path is not well tested. Since tf_cnn_benchmarks is unmaintained and I don't know how to run it with Horovod, this will likely not be fixed.