We ran the test on the same data. Why is the average token/sec/GPU of TP4-PP1-DP2 (tensor parallelism 4, data parallelism 2), 6247.2, lower than the 8707.8 of TP1-PP1-DP8 (tensor parallelism 1, data parallelism 8)? By that measure, TP4-PP1-DP2 should train more slowly than TP1-PP1-DP8, since it processes fewer tokens per unit time. Why, then, does this example show the opposite for per-iteration time (see the sketch after the logs below)?
TP4-PP1-DP2 (tensor parallelism 4, data parallelism 2): elapsed time per iteration (ms): 5245.2 (shorter)
TP1-PP1-DP8 (tensor parallelism 1, data parallelism 8): elapsed time per iteration (ms): 15088.2 (longer)
Below is the run log for TP4-PP1-DP2 (tensor parallelism 4, data parallelism 2):
iteration 20/ 1000 | consumed samples: 1280 | elapsed time per iteration (ms): 5245.2 | average overall token/sec : 49977.6 | average token/sec/GPU : 6247.2 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.193359E+00 | loss scale: 1.0 | grad norm: 7.081 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 21/ 1000 | consumed samples: 1344 | elapsed time per iteration (ms): 5240.6 | average overall token/sec : 50021.7 | average token/sec/GPU : 6252.7 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.044922E+00 | loss scale: 1.0 | grad norm: 23.787 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 22/ 1000 | consumed samples: 1408 | elapsed time per iteration (ms): 5223.7 | average overall token/sec : 50184.0 | average token/sec/GPU : 6273.0 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.158203E+00 | loss scale: 1.0 | grad norm: 91.750 | number of skipped iterations: 0 | number of nan iterations: 0 |
Below is the run log for TP1-PP1-DP8 (tensor parallelism 1, data parallelism 8):
iteration 20/ 1000 | consumed samples: 5120 | elapsed time per iteration (ms): 15088.2 | average overall token/sec : 69496.2 | average token/sec/GPU : 8687.0 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.751221E+00 | loss scale: 1.0 | grad norm: 4.944 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 21/ 1000 | consumed samples: 5376 | elapsed time per iteration (ms): 15052.3 | average overall token/sec : 69662.3 | average token/sec/GPU : 8707.8 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.908936E+00 | loss scale: 1.0 | grad norm: 9.224 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 22/ 1000 | consumed samples: 5632 | elapsed time per iteration (ms): 15057.8 | average overall token/sec : 69636.9 | average token/sec/GPU : 8704.6 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.847168E+00 | loss scale: 1.0 | grad norm: 8.277 | number of skipped iterations: 0 | number of nan iterations: 0 |
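
Note that the two runs use different global batch sizes (64 vs 256), so the per-iteration times are not directly comparable: the DP8 run processes 4x as many tokens per iteration. Below is a minimal Python sketch that reconstructs the logged throughput figures and normalizes both runs to a per-sample cost. The sequence length of 4096 is an assumption inferred from the logged numbers (64 samples x 4096 tokens / 5.2452 s ≈ 49978 token/sec, matching the log), not something the logs state directly; adjust it if your `--seq-length` differs.

```python
# A minimal sketch, assuming seq_len = 4096 (inferred from the logs, see above)
# and 8 GPUs per run (TP * PP * DP = 8 in both configurations).
SEQ_LEN = 4096
NUM_GPUS = 8

runs = {
    "TP4-PP1-DP2": {"global_batch_size": 64,  "iter_ms": 5245.2},
    "TP1-PP1-DP8": {"global_batch_size": 256, "iter_ms": 15088.2},
}

for name, r in runs.items():
    # Tokens processed in one optimizer step.
    tokens_per_iter = r["global_batch_size"] * SEQ_LEN
    # Overall and per-GPU throughput, as reported in the Megatron log lines.
    overall_tps = tokens_per_iter / (r["iter_ms"] / 1000.0)
    per_gpu_tps = overall_tps / NUM_GPUS
    # Per-sample cost: the batch-size-independent way to compare the runs.
    ms_per_sample = r["iter_ms"] / r["global_batch_size"]
    print(f"{name}: tokens/iter={tokens_per_iter}, "
          f"overall token/sec={overall_tps:.1f}, "
          f"token/sec/GPU={per_gpu_tps:.1f}, "
          f"ms/sample={ms_per_sample:.2f}")

# Output (approximately):
# TP4-PP1-DP2: tokens/iter=262144, overall token/sec=49977.9, token/sec/GPU=6247.2, ms/sample=81.96
# TP1-PP1-DP8: tokens/iter=1048576, overall token/sec=69496.4, token/sec/GPU=8687.1, ms/sample=58.94
```

Normalized this way, the DP8 run is in fact faster (~59 ms/sample vs ~82 ms/sample, a ratio of ~1.39, matching 8687.0 / 6247.2). Elapsed time per iteration alone cannot rank the two configurations because they do different amounts of work per iteration; token/sec/GPU already accounts for that.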