Issue with pad_token == eos_token : model not "learning when to stop" #33

Open
antmaiorino opened this issue Jan 22, 2024 · 1 comment


antmaiorino commented Jan 22, 2024

Hey @mlabonne, thanks a lot for the great resources!

I have been reading the Fine_tune_Llama_2_in_Google_Colab.ipynb notebook and I am encountering an issue.

Just to play around, I tried adapting your notebook to fine-tune a model for PII masking using this dataset. To do it quickly, I adapted the format so that examples look like this:

<s>[INST] Mise à jour : l'heure de début de la thérapie physique a été modifiée à 8:46 AM. Lieu : Suite 348 Iva Junctions. Veuillez nous excuser pour le désagrément. [/INST] Mise à jour : l'heure de début de la thérapie physique a été modifiée à [TIME_1]. Lieu : [SECONDARYADDRESS_1] [STREET_1]. Veuillez nous excuser pour le désagrément. </s>
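Concretely, something like this small helper produces that format (the argument names are my own placeholders, not the dataset's actual fields):

def to_llama2_example(source_text: str, masked_text: str) -> str:
    # Wrap the raw text and its PII-masked version in the Llama-2 style
    # template shown above, including the explicit <s> and </s> markers.
    return f"<s>[INST] {source_text} [/INST] {masked_text} </s>"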

After fine-tuning, I noticed that the model kept generating text continuously: it never produced the EOS token and only stopped at the maximum sequence length.

From looking online, this might be related to the default DataCollatorForLanguageModeling (which SFTTrainer uses by default). During training with that collator, I think the PAD tokens are masked out and excluded from the loss computation, so the model never "learns when to stop". I also see that you set the PAD token to be the same as the EOS token with the following lines:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Fix weird overflow issue with fp16 training

Do you know if this might actually be the issue here, and do you have an idea for a fix? I tried commenting out the line where you set the two tokens to be the same, but in that case my model trains for a while and then the loss suddenly drops to 0, so something must be wrong!
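To make the suspected interaction concrete, here is a minimal, self-contained check (my own sketch, not code from the notebook; the checkpoint id is just an example): with pad_token == eos_token, DataCollatorForLanguageModeling with mlm=False masks every EOS position to -100, i.e. excludes it from the loss.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token  # the line from the notebook

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# A toy example that ends with an explicit EOS token id.
ids = tokenizer("Hello world", add_special_tokens=True)["input_ids"]
ids = ids + [tokenizer.eos_token_id]

batch = collator([{"input_ids": ids}])
# Because pad_token_id == eos_token_id, the collator sets the label at the EOS
# position to -100, so the model is never trained to predict EOS.
print(batch["labels"][0][-1].item())  # expected: -100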


antmaiorino commented Jan 25, 2024

Update on the issue above:
I have been digging into it a bit more, and since I think I managed to solve it, I'm adding a comment in case it's useful for someone else.

Basically, I noticed that the line tokenizer.pad_token = tokenizer.eos_token is not actually needed for this model, since the pad_token is already set to <unk> for the llama2-7b checkpoints.
Moreover, when using the SFTTrainer class, the tokenization step automatically adds the special tokens (e.g. the BOS token <s> at the start of the sequence). This made my preprocessing redundant, since the dataset examples already started with a <s> token, resulting in a doubled BOS (see the quick check below).
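As a quick sanity check of that point (again a sketch, with an example checkpoint id), tokenizing a sample that already starts with <s> while special tokens are still added automatically should show two BOS ids at the front:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")  # example checkpoint
ids = tok("<s>[INST] hello [/INST] world </s>")["input_ids"]
# The tokenizer prepends BOS on its own, and the literal "<s>" in the text is
# parsed as the BOS special token as well, so the sequence starts with two BOS ids.
print(ids[:2], tok.bos_token_id)  # expected: [1, 1] 1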

So the modifications I applied that fixed my issues are:

  • I dropped the line tokenizer.pad_token = tokenizer.eos_token
  • I modified my dataset so that examples start directly with [INST] instead of also adding the <s> token at the very beginning

These two fixes together made the fine-tuning process work. Applying only one of them, or neither, either caused the model to generate endless text or brought the loss to 0.0 after just a couple of batches.
If anybody has more insight into why this was happening I'd be glad to hear it, but in the meantime I'd say it works fine now!
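For reference, a minimal sketch of the setup after both changes (the checkpoint id and the formatting helper are placeholders, not copied verbatim from the notebook):

from transformers import AutoTokenizer

model_name = "NousResearch/Llama-2-7b-chat-hf"  # example checkpoint, may differ from the notebook

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# No `tokenizer.pad_token = tokenizer.eos_token` here: the checkpoint's pad_token
# stays <unk>, so EOS positions are no longer masked out of the loss by the collator.
tokenizer.padding_side = "right"

# Examples now start directly with [INST]; SFTTrainer's tokenization adds the
# leading <s> (BOS) by itself, so it is left out of the text.
def to_llama2_example(source_text: str, masked_text: str) -> str:
    return f"[INST] {source_text} [/INST] {masked_text} </s>"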

antmaiorino changed the title from "Issue with pad_token == eos_token : model not 'learning when to stop'" to "Loss randomly dropping to 0 during fine-tuning" on Jan 29, 2024
antmaiorino changed the title from "Loss randomly dropping to 0 during fine-tuning" back to "Issue with pad_token == eos_token : model not 'learning when to stop'" on Jan 29, 2024