Hey @mlabonne thanks a lot for the great resources!
I have been reading the Fine_tune_Llama_2_in_Google_Colab.ipynb notebook and I am encountering an issue.
Just to play around, I have tried adapting your notebook to fine-tune a model to perform PII masking using this dataset. To do it very quickly, I adapted the format so that examples look like this:
<s>[INST] Mise à jour : l'heure de début de la thérapie physique a été modifiée à 8:46 AM. Lieu : Suite 348 Iva Junctions. Veuillez nous excuser pour le désagrément. [/INST] Mise à jour : l'heure de début de la thérapie physique a été modifiée à [TIME_1]. Lieu : [SECONDARYADDRESS_1] [STREET_1]. Veuillez nous excuser pour le désagrément. </s>
After fine-tuning the model, I noticed that it was continuously generating text, effectively never producing the EOS token and thus only stopping at the max sequence length.
From looking online, it seems this might be related to the default DataCollatorForLanguageModeling (which gets passed to the SFTTrainer class by default). During training with that collator, I think the PAD tokens are getting masked out and excluded from the loss computation, which leads the model to never "learn when to stop". I also see that you set the PAD token to be the same as the EOS token with the following lines:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
Do you know if this might actually be the issue here, or do you have an idea for a fix? I tried commenting out the line where you set the two tokens to be the same, but in that case my model trains for a while and then the loss suddenly drops to 0, so something must be wrong!
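For context, here is a minimal sketch of what I think is happening with the default collator (not taken from the notebook; the checkpoint name is just a stand-in for whichever Llama-2 model the notebook loads). DataCollatorForLanguageModeling with mlm=False copies input_ids into labels and replaces every pad_token_id with -100, so once pad_token == eos_token the real EOS positions are ignored by the loss as well:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
tok = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")  # stand-in checkpoint
tok.pad_token = tok.eos_token  # now pad_token_id == eos_token_id
collator = DataCollatorForLanguageModeling(tok, mlm=False)
batch = collator([tok("[INST] bonjour [/INST] salut </s>")])
# The collator sets every position equal to pad_token_id to -100 in the labels.
# Since pad == eos here, the real </s> position gets label -100 too, so the
# model never gets a gradient signal telling it when to stop.
print(batch["input_ids"][0])
print(batch["labels"][0])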
Update on the issue above:
I have been digging into the issue a bit more, and since I think I managed to solve it I'm going to add a comment in case it's useful for someone else.
Basically I noticed that the line tokenizer.pad_token = tokenizer.eos_token was not really useful for this model since the pad_token is already set to <unk> for the llama2-7b models.
Moreover, when using the SFTTrainer class the tokenization process automatically adds the special tokens (i.e. the BOS token <s> at the start of the sentence). This was also redundant as the preprocessed dataset already has a <s> token at the beginning.
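To double check both points, this is roughly what I looked at (a quick sketch; the checkpoint name is a stand-in for the one loaded in the notebook):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")  # stand-in checkpoint
# The pad token is already defined for the checkpoint I am using, so overriding
# it with the EOS token is not needed.
print(tok.pad_token, tok.pad_token_id)  # prints <unk> on my setup
# Tokenizing text that already starts with "<s>" while the tokenizer also adds
# special tokens produces a duplicated BOS token (id 1) at the start.
print(tok("<s>[INST] bonjour [/INST]")["input_ids"][:3])  # [1, 1, ...]
print(tok("[INST] bonjour [/INST]")["input_ids"][:3])     # [1, ...]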
So the modifications I applied that fixed my issues are:
I dropped the line tokenizer.pad_token = tokenizer.eos_token
I modified my dataset so that each example starts directly with [INST], instead of also including the <s> token at the very beginning
These 2 "fixes" made the fine-tuning process work, as opposed to what happened by applying either one of these 2 or none, which was either causing the model to generate endless text or bringing the loss to 0.0 after just a couple of batches.
If anybody has more insight into why this might have been happening, I'd be glad to hear it, but in the meantime I'd say it works fine now!
antmaiorino changed the title from Issue with pad_token == eos_token : model not "learning when to stop" to Loss randomly dropping to 0 during fine-tuning (Jan 29, 2024)
antmaiorino changed the title back to Issue with pad_token == eos_token : model not "learning when to stop" (Jan 29, 2024)