Support llama3 fine-tune #3289

Open
wants to merge 3 commits into main

Conversation

MrZhengXin
Contributor

Why are these changes needed?

Support Llama 3 fine-tuning, an extension of #3259.
The length-1 tokenization mismatch is also fixed.

Related issue number (if applicable)

Checks

  • I've run format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

MrZhengXin mentioned this pull request Apr 28, 2024
@meet-cjli

Currently, the tokenizer automatically adds another bos_token_id to the input prompt. Since the prompt already contains bos_token_id, the resulting input_ids contain two bos_token_id tokens.

https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L101C1-L107C16
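
A minimal sketch of the behaviour being described (not code from the PR; the model name is only an example, and the printed ids assume the Llama 3 tokenizer):

from transformers import AutoTokenizer

# Example checkpoint; any Llama 3 model with the same tokenizer behaves alike.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The conversation template already starts with <|begin_of_text|>.
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# With the default add_special_tokens=True, the tokenizer prepends another
# bos_token_id, so the sequence starts with two <|begin_of_text|> tokens
# (id 128000 for Llama 3).
print(input_ids[0][:2], tokenizer.bos_token_id)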

@MrZhengXin
Contributor Author

Currently, the tokenizer automatically adds another bos_token_id to the input prompt. Since the prompt already contains bos_token_id, the resulting input_ids contain two bos_token_id tokens.

https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L101C1-L107C16

Oh, I see. So we actually need to change this part of the code to avoid input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id?
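
One possible guard along those lines (a sketch only, with a hypothetical helper name; not necessarily the change this PR makes) would be to drop the extra BOS right after tokenization:

def strip_duplicate_bos(input_ids, tokenizer):
    # If the tokenizer prepended a BOS on top of the BOS already present in
    # the prompt template, drop one of them.
    if (
        input_ids.shape[1] >= 2
        and input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id
    ):
        return input_ids[:, 1:]
    return input_ids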

@Oscarjia

Oscarjia commented May 5, 2024

@MrZhengXin I think we can add add_special_tokens=False (do not add special tokens); it works:

def tokenize_conversations(conversations, tokenizer):
    input_ids = tokenizer(
        conversations,
        return_tensors="pt",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        add_special_tokens=False,  # Do not add special tokens
    ).input_ids
    targets = input_ids.clone()
    return input_ids, targets

But I don't know why it automatically adds the bos_token. I checked the config file, and it is not obviously set to add the bos_token. Does anybody know why the tokenizer adds it automatically?
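
If I understand the Llama 3 tokenizer correctly, the BOS is prepended by the post-processor stored in tokenizer.json (a TemplateProcessing step) rather than by an add_bos_token flag in tokenizer_config.json, which would explain why nothing obvious shows up in the config. A quick way to check (model name again only an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(tok("hello").input_ids)                            # starts with tok.bos_token_id
print(tok("hello", add_special_tokens=False).input_ids)  # no BOS prepended
print(tok.backend_tokenizer.post_processor)              # where the BOS template lives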

@meet-cjli

@MrZhengXin @Oscarjia It works, but it seems to cause another issue in the following function

(https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L144)

When we use user_turn_separator to split the conversation, the first item will be <|begin_of_text|> if the conversation has no system prompt. However, we have already ignored <|begin_of_text|> via target[:cur_len] = IGNORE_TOKEN_ID, where cur_len is 1, and then the first iteration of the loop ignores it again. So <|begin_of_text|> is ignored twice, which causes the length-1 mismatch.
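
A toy illustration of that double counting (made-up token ids, heavily simplified from the loop in train_with_template.py; IGNORE_TOKEN_ID follows the usual Hugging Face convention of -100):

IGNORE_TOKEN_ID = -100

target = [128000, 11, 22, 33, 44, 55]   # <|begin_of_text|> plus 5 content tokens
turn_lens = [1, 3, 2]                   # turns[0] is just <|begin_of_text|>
                                        # when there is no system prompt

cur_len = 1
target[:cur_len] = [IGNORE_TOKEN_ID]    # BOS masked once up front

for turn_len in turn_lens:
    # ... the instruction part of each turn is masked here ...
    cur_len += turn_len                 # turns[0] advances cur_len over BOS again

print(cur_len, len(target))             # 7 6 -> the length-1 mismatch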
