Support llama3 fine-tune #3289

Open
wants to merge 3 commits into main

Conversation

MrZhengXin
Contributor

Why are these changes needed?

Support Llama 3 fine-tuning, an extension of #3259.
The length-1 tokenization mismatch is also fixed.

Related issue number (if applicable)

Checks

  • I've run format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

MrZhengXin mentioned this pull request Apr 28, 2024
@meet-cjli

Currently, the tokenizer automatically adds another bos_token_id to the input prompt. Since the prompt already contains bos_token_id, the resulting input_ids contain two bos_token_id tokens.

https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L101C1-L107C16
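
A minimal sketch of the behaviour being described (not code from the PR; the model name is only an example, and the printed ids assume the Llama 3 tokenizer):

from transformers import AutoTokenizer

# Example checkpoint; any Llama 3 model with the same tokenizer behaves alike.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The conversation template already starts with <|begin_of_text|>.
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# With the default add_special_tokens=True, the tokenizer prepends another
# bos_token_id, so the sequence starts with two <|begin_of_text|> tokens
# (id 128000 for Llama 3).
print(input_ids[0][:2], tokenizer.bos_token_id)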

@MrZhengXin
Contributor Author

Currently, the tokenizer automatically adds another bos_token_id to the input prompt. Since the prompt already contains bos_token_id, the resulting input_ids contain two bos_token_id tokens.

https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L101C1-L107C16

Oh, I see. So we actually need to change this part of the code to avoid input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id?
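
One possible guard along those lines (a sketch only, with a hypothetical helper name; not necessarily the change this PR makes) would be to drop the extra BOS right after tokenization:

def strip_duplicate_bos(input_ids, tokenizer):
    # If the tokenizer prepended a BOS on top of the BOS already present in
    # the prompt template, drop one of them.
    if (
        input_ids.shape[1] >= 2
        and input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id
    ):
        return input_ids[:, 1:]
    return input_ids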

@Oscarjia

Oscarjia commented May 5, 2024

@MrZhengXin I think we can add add_special_tokens=False (do not add special tokens); it works:

def tokenize_conversations(conversations, tokenizer):
    input_ids = tokenizer(
        conversations,
        return_tensors="pt",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        add_special_tokens=False,  # Do not add special tokens
    ).input_ids
    targets = input_ids.clone()
    return input_ids, targets

But I don't know why it automatically adds the bos_token. I checked the config file, and it is not obviously set to add the bos_token. Does anybody know why the tokenizer adds it automatically?
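
If I understand the Llama 3 tokenizer correctly, the BOS is prepended by the post-processor stored in tokenizer.json (a TemplateProcessing step) rather than by an add_bos_token flag in tokenizer_config.json, which would explain why nothing obvious shows up in the config. A quick way to check (model name again only an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(tok("hello").input_ids)                            # starts with tok.bos_token_id
print(tok("hello", add_special_tokens=False).input_ids)  # no BOS prepended
print(tok.backend_tokenizer.post_processor)              # where the BOS template lives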

@meet-cjli

@MrZhengXin @Oscarjia It works, but it seems to cause another issue in the following function

(https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L144)

When we use user_turn_separator to split the conversation, the first item will be <|begin_of_text|> if the conversation has no system prompt. However, we have already ignored <|begin_of_text|> via target[:cur_len] = IGNORE_TOKEN_ID, where cur_len is 1, and then the first iteration of the loop ignores it again. So <|begin_of_text|> is ignored twice, which causes the length-1 mismatch.
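
A toy illustration of that double counting (made-up token ids, heavily simplified from the loop in train_with_template.py; IGNORE_TOKEN_ID follows the usual Hugging Face convention of -100):

IGNORE_TOKEN_ID = -100

target = [128000, 11, 22, 33, 44, 55]   # <|begin_of_text|> plus 5 content tokens
turn_lens = [1, 3, 2]                   # turns[0] is just <|begin_of_text|>
                                        # when there is no system prompt

cur_len = 1
target[:cur_len] = [IGNORE_TOKEN_ID]    # BOS masked once up front

for turn_len in turn_lens:
    # ... the instruction part of each turn is masked here ...
    cur_len += turn_len                 # turns[0] advances cur_len over BOS again

print(cur_len, len(target))             # 7 6 -> the length-1 mismatch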
