
main.py: TensorBoard in case of Multi-processing Distributed Data Parallel Training #1190

jecampagne opened this issue Sep 13, 2023 · 0 comments

Comments

@jecampagne
Copy link

Dear developers,
It is great that you provide the examples/imagenet/main.py script; it looks amazing.
I am looking into how to set up Multi-processing Distributed Data Parallel Training, for instance on 8 GPUs of a single node, though I could also use multiple nodes with multiple GPUs each. I must say I have never had access to such great infrastructure before, and I am discovering it as I go.

Now, I am used to watching the evolution of the accuracies (Top 1, Top 5, train/val) during training (rather common, isn't it?), but looking at the code (main.py) I do not see anything like

from torch.utils.tensorboard import SummaryWriter
...
    writer = SummaryWriter(logs_dir)
...

nor the corresponding calls in the train/validate routines, such as

    if writer is not None:
        suffix = "train"
        writer.add_scalar(f'top5_{suffix}', top5.avg, global_step=epoch)
        writer.add_scalar(f'top1_{suffix}', top1.avg, global_step=epoch)

Now, with multi-GPU processing I imagine one has to decide which GPU (or rather which process/rank) among the whole set should do the logging, but I am pretty sure many experts do this routinely.
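
For what it's worth, here is roughly what I have in mind (just a sketch on my side, not taken from main.py; the helper names is_main_process, make_writer and log_epoch are mine): only rank 0 creates the SummaryWriter, every other process simply gets writer = None, and the add_scalar calls are guarded exactly as in the snippet above.

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def is_main_process():
    # Log from a single process only, so the ranks do not write
    # duplicate (or clobbered) event files into the same log directory.
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

def make_writer(logs_dir):
    # Rank 0 owns the writer; all other ranks get None and skip logging.
    return SummaryWriter(logs_dir) if is_main_process() else None

def log_epoch(writer, epoch, top1_avg, top5_avg, suffix="train"):
    # Called at the end of train()/validate() with the epoch averages.
    if writer is not None:
        writer.add_scalar(f'top1_{suffix}', top1_avg, global_step=epoch)
        writer.add_scalar(f'top5_{suffix}', top5_avg, global_step=epoch)

I guess one could reuse the same rank test that main.py already applies before save_checkpoint, and perhaps all_reduce the meters first so that rank 0 logs the average over all GPUs rather than only its own shard, but I may be missing something.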

Is a new version of main.py foreseen that would integrate such TensorBoard features for Multi-processing Distributed Data Parallel Training? In the meantime, maybe someone can help me set up these modifications.
