Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adafactor sum instead of mean? #5290

Open
isamu-isozaki opened this issue Aug 20, 2023 · 0 comments 路 May be fixed by #5291
Open

Adafactor sum instead of mean? #5290

isamu-isozaki opened this issue Aug 20, 2023 · 0 comments 路 May be fixed by #5291

Comments

@isamu-isozaki
Copy link

isamu-isozaki commented Aug 20, 2023

馃悰 Bug

Hi! I was looking at the pseudocode for the adafactor paper, and I noticed that it is slightly different from the fairseq implementation in that, in fairseq's adafactor we use mean to reduce

r_factor = (
    (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1, keepdim=True))
    .rsqrt_()
    .unsqueeze(-1)
)

and

if factored:
    exp_avg_sq_row = state["exp_avg_sq_row"]
    exp_avg_sq_col = state["exp_avg_sq_col"]
    
    exp_avg_sq_row.mul_(beta2t).add_(
        update.mean(dim=-1), alpha=1.0 - beta2t
    )
    exp_avg_sq_col.mul_(beta2t).add_(
        update.mean(dim=-2), alpha=1.0 - beta2t
    )

when they are all supposed to be sum instead? Here's the pseudo-code from the paper

pseudocode

I made a pr to fix this issue if needed!

@isamu-isozaki isamu-isozaki linked a pull request Aug 20, 2023 that will close this issue
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant