
Loss alignment #31

Open
wuziyou199217 opened this issue Sep 22, 2023 · 3 comments

Comments

@wuziyou199217

Hello,

I recently tried training LLaMA-7B with both DistributedOptimizer and OverlappedDistributedOptimizer, using tp1-pp1-dp8-zero1, and found that a gap between their losses appears starting from the third iteration, as shown below:

OverlappedDistributedOptimizer

iter 1: 1.764603
iter 2: 1.693627
iter 3: 1.751950
iter 4: 1.745651
iter 5: 1.781831


DistributedOptimizer

iter 1: 1.764603
iter 2: 1.693627
iter 3: 1.751948
iter 4: 1.745644
iter 5: 1.781826

The only difference between the two runs is that the OverlappedDistributedOptimizer run uses --no-contiguous-buffers-in-local-ddp. Both runs use --gradient-accumulation-fusion.

What could be causing this?

@li-yi-dong
Collaborator

The two optimizers accumulate and average gradients in a different order, which leads to slight differences under BF16.

For the 5 steps you posted, the difference is within one unit of BF16 precision at the current value range. You could run a few more steps and see whether the gap keeps growing.
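For intuition, here is a minimal standalone sketch (assuming PyTorch; hypothetical, not taken from this repository's optimizer code) showing that "accumulate then average" and "scale each micro-batch gradient then accumulate" round differently in bf16:

```python
# Minimal sketch (hypothetical, not the repository's optimizer code): in bf16,
# "accumulate then average" and "scale each micro-batch then accumulate" round
# differently, so the results differ slightly even on identical inputs.
import torch

torch.manual_seed(0)
num_microbatches = 8
grads = [torch.randn(1024, dtype=torch.bfloat16) for _ in range(num_microbatches)]

# Order A: accumulate all micro-batch gradients, then divide once at the end.
acc_a = torch.zeros(1024, dtype=torch.bfloat16)
for g in grads:
    acc_a += g
avg_a = acc_a / num_microbatches

# Order B: divide each micro-batch gradient first, then accumulate.
avg_b = torch.zeros(1024, dtype=torch.bfloat16)
for g in grads:
    avg_b += g / num_microbatches

# Typically nonzero: the two orders are not bit-identical in bf16.
print((avg_a.float() - avg_b.float()).abs().max())
```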

@wuziyou199217
Author


Hello, I was testing with FP16; would there still be a small difference there too? Also, could you elaborate on the accumulation and averaging orders of these two optimizers?

@li-yi-dong
Collaborator

[image attachment]
That also looks within one unit of FP16 precision. For the exact order, you can read the code directly, mainly the gradient accumulation part.
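As a quick sanity check (assuming PyTorch; a hypothetical helper, not part of this repository), one can compute the spacing of fp16/bf16 values near the reported loss magnitude and compare it with the observed gap:

```python
# Quick check (hypothetical helper, not from the repository): spacing of
# fp16 / bf16 representable values near the reported loss magnitude.
import math
import torch

def ulp_near(value: float, dtype: torch.dtype) -> float:
    # finfo.eps is the spacing at 1.0; scale it by the power of two of the exponent.
    return torch.finfo(dtype).eps * 2.0 ** math.floor(math.log2(abs(value)))

print(ulp_near(1.75, torch.float16))   # ~9.8e-4
print(ulp_near(1.75, torch.bfloat16))  # ~7.8e-3
# The largest reported gap (1.745651 vs 1.745644, about 7e-6) is far below one
# fp16 ulp near 1.75, so it is consistent with a rounding-order difference.
```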
