Support llama3 fine-tune #3289
base: main
Conversation
Currently, the tokenizer automatically adds another `bos_token_id` to the input prompt. Since the prompt already contains the `bos_token_id`, the resulting `input_ids` contain two `bos_token_id`s.
Oh, I see. So actually we need to change this part of the code to avoid `input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id`?
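The duplicate-BOS guard discussed above can be sketched in plain Python. The mock tokenizer below is hypothetical and only mimics a tokenizer that unconditionally prepends `bos_token_id`; it is not FastChat's or Hugging Face's actual code, and the token id is illustrative:

```python
BOS_TOKEN_ID = 128000  # Llama 3's <|begin_of_text|> id (illustrative)


def mock_tokenize(prompt_ids):
    """Mimic a tokenizer that always prepends bos_token_id,
    even when the prompt already starts with it."""
    return [BOS_TOKEN_ID] + prompt_ids


def strip_duplicate_bos(input_ids, bos_token_id=BOS_TOKEN_ID):
    """Drop the extra BOS when the first two ids are both bos_token_id."""
    if len(input_ids) >= 2 and input_ids[0] == input_ids[1] == bos_token_id:
        return input_ids[1:]
    return input_ids


# A rendered prompt whose template already contains the BOS token:
prompt_ids = [BOS_TOKEN_ID, 9906, 1917]
ids = mock_tokenize(prompt_ids)
assert ids[0] == ids[1] == BOS_TOKEN_ID  # the bug: two BOS tokens
fixed = strip_duplicate_bos(ids)
assert fixed == prompt_ids  # exactly one BOS remains
```

An alternative to stripping after the fact is to prevent the duplication at tokenization time (e.g. Hugging Face tokenizers accept `add_special_tokens=False`), which avoids touching `input_ids` downstream.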
@MrZhengXin I think we can add
But I don't know why it automatically adds the `bos_token`. I checked the config file, and it is not explicitly set to auto-add the `bos_token`. Does anybody know why the tokenizer auto-adds the `bos_token`?
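One possible explanation (an assumption, worth verifying against the actual tokenizer files): for Llama 3, the BOS token is typically injected by the tokenizer's post-processor defined in `tokenizer.json`, not by an `add_bos_token` flag in `tokenizer_config.json`, which is why the config file looks unremarkable. The relevant fragment of `tokenizer.json` looks roughly like this (values illustrative):

```json
{
  "post_processor": {
    "type": "TemplateProcessing",
    "single": [
      { "SpecialToken": { "id": "<|begin_of_text|>", "type_id": 0 } },
      { "Sequence": { "id": "A", "type_id": 0 } }
    ]
  }
}
```

If this is the case, the BOS is added on every `tokenizer(...)` call regardless of the config, unless `add_special_tokens=False` is passed.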
@MrZhengXin @Oscarjia It works, but it seems to cause another issue in the following function (https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L144) when we use
Why are these changes needed?
Support llama3 fine-tune, which is an extension of #3259.
Also, the length-1 tokenization mismatch is fixed.
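The length-1 mismatch can be reproduced with the same kind of mock tokenizer: tokenizing a conversation turn-by-turn adds one BOS per call, so each per-turn length over-counts by one relative to the turn's span inside the full tokenization. A minimal sketch (pure Python, hypothetical mock; not FastChat's actual code):

```python
BOS_TOKEN_ID = 128000  # illustrative id


def mock_tokenize(text):
    """Mimic a tokenizer that prepends BOS on every call;
    one fake 'token' per whitespace-separated word."""
    return [BOS_TOKEN_ID] + [len(w) for w in text.split()]


def turn_len(turn_text):
    # Off by one: the mock tokenizer prepends BOS to this call too.
    return len(mock_tokenize(turn_text))


def turn_len_fixed(turn_text):
    # Discount the per-call BOS so lengths line up with the full sequence.
    return len(mock_tokenize(turn_text)) - 1


full = mock_tokenize("system prompt user turn")
# Naive per-turn lengths over-count by one BOS each:
assert turn_len("system prompt") + turn_len("user turn") == len(full) + 1
# Corrected lengths tile the full sequence exactly (one shared BOS up front):
assert 1 + turn_len_fixed("system prompt") + turn_len_fixed("user turn") == len(full)
```

This is why target masks computed from per-turn lengths drift off by one unless the extra BOS is discounted.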
Related issue number (if applicable)
Checks
- [x] I've run `format.sh` to lint the changes in this PR.