
Tokenizer's padding_side is not validated to be "right" in trainer_sft.py #3657

Open · theblackcat102 opened this issue Aug 17, 2023 · 1 comment
Labels: bug (Something isn't working), ml
theblackcat102 (Collaborator) commented Aug 17, 2023:

from transformers import AutoTokenizer
AutoTokenizer.from_pretrained("OpenAssistant/llama2-13b-orca-8k-3319").padding_side
>> 'left'
AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-fp16").padding_side
>> 'left'
AutoTokenizer.from_pretrained("mosaicml/mpt-7b").padding_side
>> 'right'
AutoTokenizer.from_pretrained("huggyllama/llama-7b").padding_side
>> 'left'
AutoTokenizer.from_pretrained("OpenAssistant/llama-30b-sft-v8.2-2.4k-steps-system").padding_side
>> 'left'

Since llama tokenizers default to left padding, the supervised training DialogueDataCollator pads label_mask in the opposite direction from tokenizer.pad (which pads input_ids and attention_mask): stacking the label masks effectively applies a right-padding strategy.
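A minimal sketch of the mismatch, in pure Python (no transformers), assuming the tokenizer pads input_ids on the left while the collator right-pads label_mask:

```python
# Simulate tokenizer.pad with padding_side='left' vs. a collator that
# right-pads label_mask: for any example shorter than max_len, the mask
# ends up selecting pad tokens and dropping real ones.

PAD = 0

def pad_left(seq, length, pad=PAD):
    return [pad] * (length - len(seq)) + seq

def pad_right(seq, length, pad=False):
    return seq + [pad] * (length - len(seq))

# two "tokenized" examples of different lengths
input_ids = [[11, 12, 13], [21, 22]]
label_mask = [[True, True, True], [True, True]]

max_len = max(len(s) for s in input_ids)
batch_ids = [pad_left(s, max_len) for s in input_ids]    # tokenizer.pad style
batch_mask = [pad_right(m, max_len) for m in label_mask]  # collator style

# For the shorter example the mask now covers a pad token instead of 22:
selected = [t for t, m in zip(batch_ids[1], batch_mask[1]) if m]
print(batch_ids[1])   # [0, 21, 22]
print(batch_mask[1])  # [True, True, False]
print(selected)       # [0, 21] -- pad token included, real token 22 dropped
```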

Printing out the dataloader results in trainer_sft.py also confirms the issue:

    # `train`, `train_collate_fn`, and `tokenizer` come from the trainer_sft.py pipeline
    train_dataloader = DataLoader(train, collate_fn=train_collate_fn, batch_size=9, shuffle=True)
    for batch in train_dataloader:
        for idx, question in enumerate(batch['input_ids']):
            print('-------')
            # decode only the tokens selected by the label mask
            print(tokenizer.decode(question[batch['label_masks'][idx]]).replace('</s>', '') + '\n')

I don't think padding_side is ever set to "right" anywhere in the trainer_sft.py pipeline, so the llama models we have trained so far are, by default, slightly faulty.

theblackcat102 added the bug (Something isn't working) label on Aug 17, 2023
theblackcat102 (Collaborator, Author) commented:
An easy fix would be to set padding_side = 'right' in the DialogueDataCollator __post_init__ function:

@dataclass
class DialogueDataCollator:
    ...
    def __post_init__(self):
        assert self.tokenizer.eos_token
        self.tokenizer.padding_side = 'right'
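To illustrate why this fixes the alignment, here is the same pure-Python sketch as above with both input_ids and label_mask padded on the right:

```python
# With padding_side='right' on the tokenizer, input_ids and the
# right-padded label_mask stay aligned token-for-token.

def pad_right(seq, length, pad=0):
    return seq + [pad] * (length - len(seq))

input_ids = [[11, 12, 13], [21, 22]]
label_mask = [[True, True, True], [True, True]]

max_len = max(len(s) for s in input_ids)
batch_ids = [pad_right(s, max_len) for s in input_ids]              # padding_side='right'
batch_mask = [pad_right(m, max_len, pad=False) for m in label_mask]

selected = [t for t, m in zip(batch_ids[1], batch_mask[1]) if m]
print(selected)  # [21, 22] -- the mask now covers exactly the real tokens
```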
