Hello,

I recently tried training LLaMA-7B with both DistributedOptimizer and OverlappedDistributedOptimizer, using the configuration tp1-pp1-dp8-zero1, and found that the losses of the two runs begin to diverge after the third iteration:

| iter | OverlappedDistributedOptimizer | DistributedOptimizer |
|------|--------------------------------|----------------------|
| 1    | 1.764603                       | 1.764603             |
| 2    | 1.693627                       | 1.693627             |
| 3    | 1.751950                       | 1.751948             |
| 4    | 1.745651                       | 1.745644             |
| 5    | 1.781831                       | 1.781826             |

The only difference between the two runs is that the OverlappedDistributedOptimizer run used `--no-contiguous-buffers-in-local-ddp`; both runs used `--gradient-accumulation-fusion`.

What could be causing this?
The two optimizers accumulate and average gradients in different orders, which produces slight differences under BF16.
The gaps across the 5 steps you posted are within one unit of BF16 precision at the current value range; you could run a few more steps and see whether the difference keeps growing.
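To make the two points above concrete, here is a minimal sketch in plain PyTorch (the gradient values are random stand-ins, not taken from either optimizer): it prints the ULP spacing of BF16 and FP16 in the [1, 2) range where the reported losses sit, and shows that reduced-precision accumulation depends on summation order.

```python
import torch

# One ULP of BF16 / FP16 in the range [1, 2), where the reported losses fall.
# The largest reported gap, |1.745651 - 1.745644| = 7e-6 at iter 4, is well
# inside both spacings, i.e., within one unit of precision.
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125
print(torch.finfo(torch.float16).eps)   # 0.0009765625

# Reduced-precision addition is not associative: accumulating the same values
# in two different orders can disagree in the last bit.
torch.manual_seed(0)
grads = torch.randn(64).to(torch.bfloat16)  # stand-ins for per-step gradients

fwd = torch.zeros((), dtype=torch.bfloat16)
for g in grads:
    fwd += g
rev = torch.zeros((), dtype=torch.bfloat16)
for g in grads.flip(0):
    rev += g

print(fwd.item(), rev.item())  # may differ slightly, despite identical inputs
```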
Hello, my runs were in FP16; would small differences still show up there? Also, what exactly are the accumulation and averaging orders of the two optimizers? Could you elaborate?
The gap also looks to be within one unit of FP16 precision. For the exact order, you can read the code directly, mainly the gradient accumulation part.
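For readers following this thread, here is a schematic sketch of the kind of ordering difference being described. It is not the actual Megatron-LLaMA code (read the gradient-accumulation path in the repository for the real order, as suggested above); the micro-batch count and gradient values are made up for illustration.

```python
import torch

torch.manual_seed(0)
N = 6  # hypothetical number of micro-batches (not a power of two on purpose)
grads = [torch.randn(4, dtype=torch.bfloat16) for _ in range(N)]

# Variant A: accumulate all gradients first, divide once at the end.
acc_a = torch.zeros(4, dtype=torch.bfloat16)
for g in grads:
    acc_a += g
mean_a = acc_a / N

# Variant B: pre-scale each gradient by 1/N, then accumulate.
acc_b = torch.zeros(4, dtype=torch.bfloat16)
for g in grads:
    acc_b += g / N
mean_b = acc_b

# Mathematically identical, but each variant rounds at different points, so
# the results typically differ by about one ULP of the reduced-precision type.
print((mean_a - mean_b).abs().max().item())
```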