WandbCallback always (!) uploads entire model checkpoint to wandb #30896

Closed
2 of 4 tasks
mgerstgrasser opened this issue May 19, 2024 · 6 comments · Fixed by #30897

@mgerstgrasser
Contributor

System Info

transformers==4.41.0

Who can help?

@pacman100 @muellerzr @amyeroberts

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Enable --report_to wandb in any script that uses Trainer.
  2. The initial model is uploaded to wandb in its entirety, with no way to disable this.

Expected behavior

I would expect this to either (a) not upload the initial model checkpoint at all, or (b) only do this if explicitly configured.
As it is, it seems that every run in 4.41.0 that logs to wandb will upload the entire initial model checkpoint to wandb.

This seems to be caused by #30135

@mgerstgrasser mgerstgrasser changed the title WandbCallback always (!) logs entire model checkpoint to wandb WandbCallback always (!) uploads entire model checkpoint to wandb May 19, 2024
@qubvel
Member

qubvel commented May 20, 2024

Hi @mgerstgrasser, thanks for reporting.

Not sure I got the issue, did you mean that once enabled, wandb will upload the model in all subsequent script runs even if --report_to wandb is not specified?

@mgerstgrasser
Contributor Author

mgerstgrasser commented May 20, 2024

> Hi @mgerstgrasser, thanks for reporting.
>
> Not sure I got the issue, did you mean that once enabled, wandb will upload the model in all subsequent script runs even if --report_to wandb is not specified?

No, I mean that right now in 4.41.0, enabling --report_to wandb will upload the entire initial model checkpoint to wandb right at the start of training, irrespective of whether you've enabled WANDB_LOG_MODEL or not. I.e. for any training run in 4.41.0 that has wandb enabled, the callback will upload gigabytes of data to wandb immediately.
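
To make the expected behaviour concrete, here is a minimal sketch of the kind of gate described above: upload only when `WANDB_LOG_MODEL` explicitly asks for it. The helper name `should_upload_initial_model` and the `UPLOAD_MODES` set are hypothetical and exist only for illustration; this is not the callback's actual code, and it assumes `"end"` and `"checkpoint"` are the values that enable model logging.

```python
import os

# Assumption: "end" and "checkpoint" are the WANDB_LOG_MODEL values that
# request model uploads; anything else is treated here as disabled.
UPLOAD_MODES = {"end", "checkpoint"}


def should_upload_initial_model(environ=None):
    """Hypothetical helper: allow the upload only if it was explicitly enabled."""
    env = os.environ if environ is None else environ
    mode = env.get("WANDB_LOG_MODEL", "false").strip().lower()
    return mode in UPLOAD_MODES


# Plain dicts stand in for the process environment in these checks.
print(should_upload_initial_model({}))
print(should_upload_initial_model({"WANDB_LOG_MODEL": "checkpoint"}))
```

With a gate like this, a run that only sets --report_to wandb would log metrics but would not push any model weights unless the user opted in.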

@wongjingping

+1 on this observation - I recently upgraded my transformers version and noticed this issue. Can we turn off uploading of model weights by default and require an explicit parameter to enable it? As it stands, this consumes a lot of bandwidth without the end user knowing, and it was quite an unpleasant surprise that took me a day to figure out, unfortunately :\

@amyeroberts
Collaborator

@mgerstgrasser Would you like to add a flag in your PR #30897 to control this behaviour? As we haven't heard from @parambharat, this issue is now being flagged by others (thanks @wongjingping!), and it seems like something we generally might not want to do, I'd say let's add it; we can always change the default behaviour in the future if necessary.

@mgerstgrasser
Contributor Author

> @mgerstgrasser Would you like to add a flag in your PR #30897 to control this behaviour? As we haven't heard from @parambharat, this issue is now being flagged by others (thanks @wongjingping!), and it seems like something we generally might not want to do, I'd say let's add it; we can always change the default behaviour in the future if necessary.

Done!

@amyeroberts
Collaborator

Update: we're going to revert #30135 in a patch release, which will be out soon.

The callback changes will stay on main, so:

  • Stable releases will respect previous wandb behaviour
  • Development branch will have the current behaviour + @mgerstgrasser's PR on top once merged

This will give us time to review/test and make sure this integration is working well for all. Thanks for iterating on this so quickly @mgerstgrasser!
