Fix accelerate failing tests #30836

Merged: 16 commits merged into main from fix-accelerate-tests on May 23, 2024
Conversation

SunMarc (Member) commented May 15, 2024

What does this PR do?

This PR fixes some of the tests (~30) that are gated behind the accelerate_tests marker. Recently, I ran the following workflow to get an overview of the failing tests: https://github.com/huggingface/transformers/actions/runs/9081904684 — equivalently, the following CLI: RUN_SLOW=True pytest -m accelerate_tests tests/
Related PR: #30808

Models fixed:

  • BertModel
  • ClipModel
  • GPTSanJapaneseModel
  • M2M100Model
  • MambaModel - same fix as Jamba
  • T5Model/MT5Model/UMT5Model - same fix, since the modeling code is copied between these models
  • Whisper
  • SiglipVisionModel for model parallelism (MP). As for offload, I skipped the tests since fixing them would require introducing a new attribute in PretrainedConfig (_preload_module_classes, to be used in dispatch_model); not sure it is worth it if this is the only model affected for now. See the sketch after this list.
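For context, a minimal sketch of what such a mechanism would build on, using accelerate's existing preload_module_classes argument to dispatch_model. The checkpoint, memory limits, and module class below are purely illustrative assumptions, not part of this PR:

```python
from accelerate import dispatch_model, infer_auto_device_map
from transformers import SiglipVisionModel

model = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")

# Illustrative memory limits that push part of the model off the GPU.
device_map = infer_auto_device_map(model, max_memory={0: "200MB", "cpu": "2GB"})

# Classes listed in preload_module_classes are reloaded onto the execution
# device as a whole before their forward runs, even when some of their
# submodules are offloaded. A hypothetical `_preload_module_classes` attribute
# on PretrainedConfig would let transformers fill this argument automatically.
model = dispatch_model(
    model,
    device_map=device_map,
    preload_module_classes=["SiglipVisionEmbeddings"],  # illustrative choice
)
```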

Tests skipped:

  • DbrxModel: offload is not working, and enabling it is not simple (it touches core modeling code). However, model parallelism works.

See the run here with most of the tests passing. Only 3 tests were still failing, and I skipped them in the latest commit (SiglipVisionModel).
cc @ydshieh

ArthurZucker (Collaborator) left a comment:

Thanks. Approving, but for the SSM cache I find it a bit weird that we have to do these x.to().
Also, in general, accelerate is supposed to handle allllllll the devices, no? 👿

@@ -244,6 +244,7 @@ def slow_forward(self, input_states, cache_params: Optional[MambaCache]=None):
# 2. Convolution sequence transformation
if cache_params is not None:
ssm_state = cache_params.ssm_states[self.layer_idx].clone()
Collaborator:

If the cache param is always on the wrong device, we are always going to have to do the transfer here, no?

SunMarc (Member, Author):

Yes, that's right!
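A minimal sketch of the pattern under discussion (a simplified, hypothetical helper rather than the PR's exact diff): when the model is split across devices, the cached SSM state can live on a different device than the current activations, so it is moved explicitly before being used.

```python
import torch

def fetch_ssm_state(hidden_states: torch.Tensor, cache_params, layer_idx: int) -> torch.Tensor:
    # The cache may have been allocated on another device than the one this
    # layer runs on, so move the cloned state next to the activations.
    ssm_state = cache_params.ssm_states[layer_idx].clone()
    return ssm_state.to(hidden_states.device)
```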

@@ -1388,7 +1388,7 @@ def forward(
             inputs_embeds, past_key_values_length=past_key_values_length, position_ids=position_ids
         )

-        hidden_states = inputs_embeds + positions
+        hidden_states = inputs_embeds + positions.to(inputs_embeds.device)
Collaborator:

What was the issue with this? It's a bit weird here: is device placement of the inputs not automatic? Or is the embedding layer alone placed on a different device than the embed_positions?

SunMarc (Member, Author) commented May 17, 2024:

inputs_embeds and positions can be on different devices (this could be fixed if we found a way to make sure they are on the same device, but we would probably need to add another attribute to PretrainedConfig). We can't really move these inputs to the right device automatically, since we do the ops in the forward itself. Wrapping them in an nn.Module to perform the op would work, but that's not a good solution.

Collaborator:

Alright got it!
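A standalone illustration of the situation described above (hypothetical modules, assuming a CUDA device is available; this is not the M2M100 code itself): two sibling modules end up on different devices under a device_map, and because the addition happens directly in forward() rather than inside a submodule, accelerate's hooks cannot move the operands, so an explicit .to() is needed.

```python
import torch
import torch.nn as nn

# Two sibling embedding modules placed on different devices,
# as an "auto" device_map might do for a large model.
embed_tokens = nn.Embedding(100, 16).to("cuda:0")
embed_positions = nn.Embedding(100, 16)  # stays on CPU

input_ids = torch.arange(4, device="cuda:0")
position_ids = torch.arange(4)  # CPU tensor

inputs_embeds = embed_tokens(input_ids)    # lives on cuda:0
positions = embed_positions(position_ids)  # lives on cpu

# The add is done in forward() itself, outside any module's pre-forward hook,
# so the operands must be aligned manually.
hidden_states = inputs_embeds + positions.to(inputs_embeds.device)
```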

@@ -962,7 +962,7 @@ class BertModel(BertPreTrainedModel):
     `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
     """

-    _no_split_modules = ["BertEmbeddings"]
+    _no_split_modules = ["BertEmbeddings", "BertLayer"]
Collaborator:

Is this a general solution that will work for everyone? (I mean, why was this failing the test?)

SunMarc (Member, Author):

Yes, that will work for everyone! I don't know why these layers were not added to _no_split_modules; this is what we do for all models.
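For reference, a minimal sketch of how _no_split_modules is consumed when a device map is computed, calling accelerate's underlying utility directly; the memory limits are illustrative assumptions:

```python
from accelerate import infer_auto_device_map
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# transformers forwards the class-level _no_split_modules as
# no_split_module_classes, so no single BertLayer (or BertEmbeddings)
# is ever split across two devices.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "300MB", "cpu": "2GB"},  # illustrative limits
    no_split_module_classes=model._no_split_modules,
)
print(device_map)
```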

ArthurZucker (Collaborator) left a comment:

Long overdue! Thanks for updating 😄

ydshieh (Collaborator) commented May 21, 2024:

Just want to know: have those been failing for quite some time, or is it a recent change (in either transformers or accelerate) that is causing those failures? If it's the latter case, could you elaborate a bit more?

SunMarc (Member, Author) commented May 21, 2024:

> Just want to know: have those been failing for quite some time, or is it a recent change (in either transformers or accelerate) that is causing those failures? If it's the latter case, could you elaborate a bit more?

I'm not sure how long these tests have been failing. However, most of them were failing for a good reason (the right _no_split_modules weren't set, the modeling code wasn't compatible).

ydshieh (Collaborator) left a comment:

ok, thanks~

SunMarc reopened this on May 22, 2024
SunMarc merged commit 8366b57 into main on May 23, 2024
24 checks passed
SunMarc deleted the fix-accelerate-tests branch on May 23, 2024 at 15:18
itazap pushed a commit that referenced this pull request May 24, 2024
* Fix accelerate tests

* fix clip

* skip dbrx tests

* fix GPTSan

* fix M2M100Model

* same fix as jamba

* fix mt5

* Fix T5Model

* Fix umt5 model

* fix switch_transformers

* fix whisper

* fix gptsan again

* fix siglip recent test

* skip siglip tests

* wrong place fixed