
Question: when pretraining LLaMA2-7B on a single node with 8× A800 80G GPUs, why do the TP4-PP1-DP2 and TP1-PP1-DP8 timings seem inconsistent? #24

Open
13416157913 opened this issue Sep 18, 2023 · 1 comment


@13416157913

Both runs use the same data. The average token/sec/GPU of TP4-PP1-DP2 (tensor parallelism 4, data parallelism 2) is 6247.2, which is lower than the 8707.8 of TP1-PP1-DP8 (tensor parallelism 1, data parallelism 8). By that measure, TP4-PP1-DP2 should be the slower configuration, since it processes fewer tokens per unit time. So why, in this example, is the elapsed time per iteration shorter for TP4-PP1-DP2 (5245.2 ms) and longer for TP1-PP1-DP8 (15088.2 ms)?

Training log for TP4-PP1-DP2 (tensor parallelism 4, data parallelism 2):
iteration 20/ 1000 | consumed samples: 1280 | elapsed time per iteration (ms): 5245.2 | average overall token/sec : 49977.6 | average token/sec/GPU : 6247.2 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.193359E+00 | loss scale: 1.0 | grad norm: 7.081 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 21/ 1000 | consumed samples: 1344 | elapsed time per iteration (ms): 5240.6 | average overall token/sec : 50021.7 | average token/sec/GPU : 6252.7 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.044922E+00 | loss scale: 1.0 | grad norm: 23.787 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 22/ 1000 | consumed samples: 1408 | elapsed time per iteration (ms): 5223.7 | average overall token/sec : 50184.0 | average token/sec/GPU : 6273.0 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.158203E+00 | loss scale: 1.0 | grad norm: 91.750 | number of skipped iterations: 0 | number of nan iterations: 0 |

====================================================================================

Training log for TP1-PP1-DP8 (tensor parallelism 1, data parallelism 8):
iteration 20/ 1000 | consumed samples: 5120 | elapsed time per iteration (ms): 15088.2 | average overall token/sec : 69496.2 | average token/sec/GPU : 8687.0 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.751221E+00 | loss scale: 1.0 | grad norm: 4.944 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 21/ 1000 | consumed samples: 5376 | elapsed time per iteration (ms): 15052.3 | average overall token/sec : 69662.3 | average token/sec/GPU : 8707.8 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.908936E+00 | loss scale: 1.0 | grad norm: 9.224 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 22/ 1000 | consumed samples: 5632 | elapsed time per iteration (ms): 15057.8 | average overall token/sec : 69636.9 | average token/sec/GPU : 8704.6 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.847168E+00 | loss scale: 1.0 | grad norm: 8.277 | number of skipped iterations: 0 | number of nan iterations: 0 |
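
For reference, the throughput fields in these logs follow from the global batch size and the sequence length by simple arithmetic. A minimal sketch (assuming a sequence length of 4096, which is what the logged numbers imply: global batch size × 4096 ÷ elapsed time per iteration reproduces the reported overall token/sec):

```python
# Relate the Megatron-style log fields to each other.
# Assumption: sequence length 4096, inferred from the logs
# (64 * 4096 tokens / 5.2452 s ≈ 49978 tok/s, matching the first run).

SEQ_LEN = 4096   # assumed sequence length
NUM_GPUS = 8     # single node, 8x A800

def throughput(global_batch_size: int, iter_time_ms: float) -> tuple[float, float]:
    """Return (overall token/sec, token/sec/GPU) for one logged iteration."""
    tokens_per_iter = global_batch_size * SEQ_LEN
    overall = tokens_per_iter / (iter_time_ms / 1000.0)
    return overall, overall / NUM_GPUS

print(throughput(64, 5245.2))    # TP4-PP1-DP2 -> (~49978, ~6247), matches the log
print(throughput(256, 15088.2))  # TP1-PP1-DP8 -> (~69496, ~8687), matches the log
```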

@li-yi-dong
Collaborator

global batch size
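
Presumably the hint is that the two runs use different global batch sizes (64 vs. 256), so the elapsed time per iteration measures different amounts of work and cannot be compared directly. Normalizing to the same number of tokens shows that TP1-PP1-DP8 is in fact the faster configuration, consistent with its higher token/sec/GPU. A rough sketch of that comparison (again assuming sequence length 4096, inferred from the logs):

```python
# Compare the two runs at equal work (tokens processed), not per iteration.
SEQ_LEN = 4096  # assumed sequence length, inferred from the logs above

runs = {
    "TP4-PP1-DP2": {"global_batch": 64,  "iter_time_s": 5.2452},
    "TP1-PP1-DP8": {"global_batch": 256, "iter_time_s": 15.0882},
}

target_tokens = 256 * SEQ_LEN  # one TP1-PP1-DP8 iteration's worth of tokens

for name, r in runs.items():
    tokens_per_iter = r["global_batch"] * SEQ_LEN
    iters_needed = target_tokens / tokens_per_iter
    wall_time = iters_needed * r["iter_time_s"]
    print(f"{name}: {wall_time:.1f} s to process {target_tokens} tokens")

# TP4-PP1-DP2: 4 iterations x 5.2452 s ≈ 21.0 s
# TP1-PP1-DP8: 1 iteration  x 15.0882 s ≈ 15.1 s  -> faster, as token/sec/GPU indicates
```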
