
Multi-GPU training on my own dataset hangs at the first batch #397

Open
zxs23130 opened this issue Jan 11, 2024 · 6 comments

@zxs23130

[Two screenshots of the training console attached: 微信图片_20240111100004, 微信图片_20240111100012]

@layumi
Owner

layumi commented Jan 17, 2024

Hello,

  1. What GPU are you using? On the 4090, P2P is blocked, which can cause this problem (see the workaround sketched below).
  2. Does it run fine on a single GPU?
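
If it does turn out to be the 4090 P2P issue, a common workaround is to disable NCCL peer-to-peer transfers before training starts. A minimal sketch, assuming the NCCL environment-variable route (not something specific to this repo):

import os

# Assumption: disabling NCCL peer-to-peer is a common workaround for multi-GPU
# hangs on RTX 4090 cards, where driver-level P2P is blocked. These variables
# must be set before torch creates any NCCL communicators.
os.environ["NCCL_P2P_DISABLE"] = "1"  # fall back to copies through host memory
os.environ["NCCL_IB_DISABLE"] = "1"   # optional: skip InfiniBand transport too

import torch
print(torch.cuda.device_count(), "GPUs visible")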

@zxs23130
Author

zxs23130 commented Jan 17, 2024 via email

@layumi
Owner

layumi commented Jan 24, 2024

Hello @zxs23130,
Thanks! I found the cause: it is most likely a torch.compile() compatibility issue as well.

For now, you can comment out torch.compile().
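
A minimal sketch of the change (the model below is a placeholder, not the actual model built in this repo's training script):

import torch
import torch.nn as nn

model = nn.Linear(512, 751)       # placeholder for the real model

# model = torch.compile(model)    # <- comment this line out for now;
                                  #    compile + DataParallel hangs at the first batch
if torch.cuda.is_available():
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model = model.cuda()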

@zxs23130
Author

zxs23130 commented Jan 24, 2024 via email

@layumi
Owner

layumi commented Jan 24, 2024

It should be the same as this case: DP (DataParallel) currently does not support compile, pytorch/pytorch#94636

I will upload a DDP version later. You will be able to run it with the line below.

bash DDP.sh 
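
Until that script is uploaded, here is a rough sketch of what a DDP entry point looks like (placeholder model and data, not the actual contents of DDP.sh), launched with e.g. torchrun --nproc_per_node=2 train_ddp.py:

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each spawned process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(512, 751).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # Unlike DataParallel, DDP works with torch.compile, so this is optional:
    # model = torch.compile(model)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    for step in range(10):                         # placeholder training loop
        x = torch.randn(32, 512, device=local_rank)
        y = torch.randint(0, 751, (32,), device=local_rank)
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

In a real script the DataLoader would also use a torch.utils.data.DistributedSampler so that each process sees a different shard of the dataset.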

@layumi
Owner

layumi commented Jan 24, 2024

Also, PyTorch's support for DP is rather poor these days. When I tried it, I ran into NaN-like losses:
https://discuss.pytorch.org/t/nan-loss-with-dataparallel/26501
