
main.py: TensorBoard in case of Multi-processing Distributed Data Parallel Training #1190

jecampagne opened this issue Sep 13, 2023 · 0 comments

Comments

@jecampagne
Copy link

Dear developers,
It is great that you provide the examples/imagenet/main.py script; it looks amazing.
I am looking into how to set up Multi-processing Distributed Data Parallel Training, for instance on 8 GPUs of a single node, though I could also use multiple nodes with multiple GPUs each. I must say I have never had access to such great infrastructure before, and I am discovering it as I go.

Now, I am used to watching the evolution of the accuracies (Top 1, Top 5, train/val) during training (rather common, isn't it?), but looking at the code (main.py) I do not see anything like

from torch.utils.tensorboard import SummaryWriter
...
    writer = SummaryWriter(logs_dir)
...

nor the corresponding calls in the train/validate routines, such as

    if writer is not None:
        suffix = "train"
        writer.add_scalar(f'top5_{suffix}', top5.avg, global_step=epoch)
        writer.add_scalar(f'top1_{suffix}', top1.avg, global_step=epoch)

Now, with multi-GPU processing I imagine one has to decide which GPU (or rather which process/rank) among the whole set should do the logging, but I am pretty sure many experts do this routinely.
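
For what it's worth, here is roughly what I have in mind (just a sketch on my side, not taken from main.py; the helper names is_main_process, make_writer and log_epoch are mine): only rank 0 creates the SummaryWriter, every other process simply gets writer = None, and the add_scalar calls are guarded exactly as in the snippet above.

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def is_main_process():
    # Log from a single process only, so the ranks do not write
    # duplicate (or clobbered) event files into the same log directory.
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

def make_writer(logs_dir):
    # Rank 0 owns the writer; all other ranks get None and skip logging.
    return SummaryWriter(logs_dir) if is_main_process() else None

def log_epoch(writer, epoch, top1_avg, top5_avg, suffix="train"):
    # Called at the end of train()/validate() with the epoch averages.
    if writer is not None:
        writer.add_scalar(f'top1_{suffix}', top1_avg, global_step=epoch)
        writer.add_scalar(f'top5_{suffix}', top5_avg, global_step=epoch)

I guess one could reuse the same rank test that main.py already applies before save_checkpoint, and perhaps all_reduce the meters first so that rank 0 logs the average over all GPUs rather than only its own shard, but I may be missing something.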

Is a new version of main.py foreseen that would integrate such TensorBoard features for Multi-processing Distributed Data Parallel Training? In the meantime, maybe someone can help me set up these modifications.
