
[Bug] With multiple GPUs, the accuracy of eval after training and of offline test is not guaranteed to match #1536

Open
whlook opened this issue Apr 28, 2024 · 0 comments
Labels: bug Something isn't working
whlook commented Apr 28, 2024

Prerequisite

Environment

(environment details were provided as a screenshot)

Reproduces the problem - code sample

If a model containing plain BN (not SyncBN) is trained on multiple GPUs (2 GPUs) and then evaluated right after training, each rank holds different BN running statistics, so the final evaluation accuracy does not match the offline test. The offline test reloads the same .pth on every rank, so every test run produces the same result. A minimal reproduction sketch follows.
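A minimal sketch (not code from the repo) showing how plain BatchNorm running statistics diverge across ranks under DDP: each rank sees a different data distribution, DDP only synchronizes gradients, and the BN buffers are never averaged. The file name and data offsets are illustrative assumptions.

```python
# Launch with: torchrun --nproc_per_node=2 bn_divergence.py  (assumes 2 GPUs)
import torch
import torch.distributed as dist
import torch.nn as nn


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)).cuda()
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Give each rank a different input distribution. DDP syncs gradients but
    # never BN buffers, so running_mean/running_var drift apart per rank.
    x = torch.randn(4, 3, 32, 32, device="cuda") + rank * 10.0
    model.train()
    model(x)

    bn = model.module[1]
    print(f"rank {rank}: running_mean[:3] = {bn.running_mean[:3].tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```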

Reproduces the problem - command or script

Always reproducible: it occurs whenever training runs under DDP and the model contains BatchNorm layers.

Reproduces the problem - error message

None

Additional information

  1. eval after train should be as reliable as the offline test.
  2. During the offline test, every rank uses the same weight parameters (loaded from the same .pth).
  3. During the eval that follows training, each rank uses different BN parameters (running statistics).
  4. The model should be synchronized across ranks before val (TODO); one possible approach is sketched after this list.
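A minimal sketch (a suggestion, not code from this repo) of one way to address item 4: broadcast all module buffers, which include BN running_mean/running_var and num_batches_tracked, from rank 0 to every rank right before validation, so all ranks evaluate with identical statistics. The helper name is hypothetical.

```python
import torch.distributed as dist
import torch.nn as nn


def sync_buffers_from_rank0(model: nn.Module) -> None:
    """Broadcast every buffer (e.g. BN running stats) from rank 0 to all ranks.

    The model's buffers must already live on the current CUDA device when the
    NCCL backend is used.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return
    for buf in model.buffers():
        dist.broadcast(buf, src=0)


# Usage inside the training loop, just before the validation step:
#   sync_buffers_from_rank0(model)
#   run_validation(model)
```

Alternatively, converting the model with torch.nn.SyncBatchNorm.convert_sync_batchnorm before wrapping it in DDP keeps the statistics identical across ranks during training in the first place.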
@whlook whlook added the bug Something isn't working label Apr 28, 2024