
Question about the benchmark experiment results #421

Open

frankxyy opened this issue Nov 7, 2022 · 2 comments

Comments

@frankxyy

frankxyy commented Nov 7, 2022

[two screenshots of benchmark result tables]

Under the same 1n1g (one node, one GPU) machine resources, why does tensor model parallel run with a larger bs yet achieve a lower samples/s?

@chengtbf

chengtbf commented Nov 7, 2022

  1. Take a look at the ac parameter (activation checkpointing). This is backward recomputation: it runs the forward pass again during the backward pass, which greatly reduces memory usage (so a larger batch size fits), by roughly 40%, at the cost of roughly 20% in performance.

The exact cost depends on the forward pass's share of total compute; in an acc scenario the share is larger, about 1/3 = forward / (forward + backward), since for typical networks the backward pass costs about twice the forward pass.

tensor model parallel uses ac, which is why it can run a bs as large as 128; the price is one extra forward pass.

  2. Increasing bs does not always increase speed. Once GPU utilization is saturated, increasing bs further does not increase throughput.
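The ratios quoted above (forward ≈ 1/3 of a step, one extra forward under ac) can be sketched with a toy cost model. This is only an illustration of the arithmetic: the 1x forward / 2x backward split is the rule of thumb from the comment, and the real measured overhead (~20%) is lower than this upper bound because not all step time is compute.

```python
# Rough compute-cost model for activation checkpointing (ac).
# All numbers are assumed ratios, not measurements.

FORWARD = 1.0   # relative cost of one forward pass
BACKWARD = 2.0  # backward is roughly 2x forward for typical networks

def step_cost(ac: bool) -> float:
    """Relative cost of one training step, with or without ac."""
    extra_forward = FORWARD if ac else 0.0  # ac replays the forward pass
    return FORWARD + BACKWARD + extra_forward

baseline = step_cost(ac=False)      # 1 + 2 = 3.0
with_ac = step_cost(ac=True)        # 1 + 2 + 1 = 4.0
overhead = with_ac / baseline - 1.0 # upper bound on the ac slowdown

print(f"forward share of a step: {FORWARD / baseline:.2f}")  # 0.33
print(f"ac compute overhead upper bound: {overhead:.0%}")    # 33%
```

So recomputation adds at most 1/3 to step compute; the observed ~20% suggests part of the step is not bound by this recomputed work.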

@frankxyy
Author

frankxyy commented Nov 8, 2022

Oh, I see. Viewed that way, tensor parallel brings no benefit for BERT, then.
