
Long training time for ResNet50 on ImageNet-1k #1236

Open
iamsh4shank opened this issue Feb 27, 2024 · 1 comment

Comments

@iamsh4shank

Context

I am training a ResNet50 on ImageNet-1k using this script. One epoch takes around 2 hours, and since I have to train for 90 epochs, the full run takes a very long time. I even tried distributing the training across 4 GPUs, but got the same results.

  • PyTorch version: 2.20
  • Operating System and version: Ubuntu 20.04
@shreyaannshh

1. Make sure you've got the latest version of CUDA and cuDNN installed, along with the latest NVIDIA GPU drivers.
2. Install the CUDA toolkit, and make sure its version matches the one supported by your PyTorch installation.
3. Consider using Anaconda or Miniconda to manage your Python environment, as they help avoid conflicts with system packages.
4. Install PyTorch with GPU support, using the build appropriate for your CUDA installation.
5. If you're using multiple GPUs, consider installing NVIDIA NCCL (NVIDIA Collective Communication Library) for optimized GPU communication.
6. Set the required environment variables in your training script to enable multi-GPU training.
7. Execute your training script with the necessary commands to utilize multiple GPUs.
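A minimal sketch of the version-check and multi-GPU steps above, assuming the setup in the issue (PyTorch with `torch.distributed`/`torchrun` for multi-GPU launches). The helper names here are hypothetical illustrations, not from the issue or the poster's script:

```python
# Sketch: verify the PyTorch/CUDA/cuDNN stack and set the standard
# environment variables torch.distributed reads for multi-GPU (DDP) runs.
# Helper names are hypothetical, for illustration only.
import importlib.util
import os


def cuda_stack_report():
    """Return a dict describing the detected PyTorch/CUDA stack,
    or None if PyTorch is not installed in this environment."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch": torch.__version__,                  # e.g. "2.2.0+cu121"
        "cuda_available": torch.cuda.is_available(),
        "cuda_build": torch.version.cuda,            # CUDA version torch was built against
        "cudnn": torch.backends.cudnn.version(),     # cuDNN version, or None
        "gpu_count": torch.cuda.device_count(),
    }


def set_ddp_env(rank, world_size,
                master_addr="127.0.0.1", master_port="29500"):
    """Set the environment variables torch.distributed's env://
    initialization reads when forming a multi-GPU process group."""
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)


if __name__ == "__main__":
    print(cuda_stack_report())
    # In practice, torchrun sets these variables for each worker process, e.g.:
    #   torchrun --nproc_per_node=4 train.py
```

If the CUDA build reported by `torch.version.cuda` does not match your installed toolkit, reinstall PyTorch with the matching build before debugging anything else.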

Hope it helps!
