Apply chat template for HF #660

Merged: 10 commits merged from chat_temp into main on May 9, 2024
Conversation

@haqishen (Collaborator) commented Apr 8, 2024

close #551

@pascal-pfeiffer (Collaborator) left a comment

Thank you for working on that issue @haqishen.

A few things I noticed:

  • The chat template is also applied to seq2seq models (see the sketch below).
  • The chat template is not applied to the downloaded models.
  • The system prompt should be optional.
  • It probably doesn't need an extra cfg yaml for the unit tests; you can reuse the existing ones.
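To illustrate the first two points, a minimal sketch of how the assignment could be guarded and shared between both export paths, assuming get_chat_template(cfg) from this PR; the helper name attach_chat_template and the call sites are hypothetical:

def attach_chat_template(tokenizer, cfg):
    # Sketch: skip seq2seq models, where the chat template is not used.
    if cfg.problem_type != "text_sequence_to_sequence_modeling":
        tokenizer.chat_template = get_chat_template(cfg)
    return tokenizer

# Reusing the same helper in both export paths keeps the pushed HF repo and the
# downloaded model zip identical:
#   attach_chat_template(tokenizer, cfg)  # before tokenizer.push_to_hub(...)
#   attach_chat_template(tokenizer, cfg)  # before tokenizer.save_pretrained(...) in the local download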

# push pipeline to hub
template_env = Environment(loader=FileSystemLoader(searchpath="llm_studio/src/"))

pipeline_template = template_env.get_template("h2oai_pipeline_template.py")
pascal-pfeiffer (Collaborator):
Please also remove that file as it is no longer needed

@@ -124,6 +215,7 @@ def publish_model_to_hugging_face(
repo_id = f"{user_id}/{hf_repo_friendly_name(model_name)}"

# push tokenizer to hub
tokenizer.chat_template = get_chat_template(cfg)
pascal-pfeiffer (Collaborator):
Also add this to the local download.
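For example (a sketch only; the exact location of the local download code in LLM Studio may differ), the same assignment before saving the tokenizer would also write the template into tokenizer_config.json:

# Hypothetical local download/export path; model_output_dir is illustrative.
tokenizer.chat_template = get_chat_template(cfg)
tokenizer.save_pretrained(model_output_dir)  # recent transformers serialize chat_template into tokenizer_config.json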

Comment on lines 110 to 136
def test_chat_template_custom_special_tokens():
    test_directory = os.path.abspath(os.path.dirname(__file__))
    cfg_path = os.path.join(test_directory, "../test_data/cfg_chat_template.yaml")
    cfg = load_config_yaml(cfg_path)
    cfg.dataset.system_column = "system"
    cfg.dataset.text_system_start = "[SYS]"
    cfg.dataset.text_prompt_start = "[USR]"
    cfg.dataset.text_answer_separator = "[ANS]"

    tokenizer = get_tokenizer(cfg)
    tokenizer.chat_template = get_chat_template(cfg)

    chat = [
        {"role": "system", "content": "[system prompt]"},
        {"role": "user", "content": "[user prompt]"},
        {"role": "assistant", "content": "[assistant response]"},
        {"role": "user", "content": "[user prompt2]"},
    ]

    input = tokenizer.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=True,
    )
    expected = "[SYS][system prompt]</s>[USR][user prompt]</s>[ANS][assistant response]</s>[USR][user prompt2]</s>[ANS]"  # noqa
    assert input == expected
pascal-pfeiffer (Collaborator):
What is the purpose of this test? It covers the exact same thing as test_chat_template_with_system_prompt, doesn't it?

@haqishen (Collaborator, Author) commented May 7, 2024

Hi @pascal-pfeiffer

Thanks for your comments, and sorry for the late reply.

I have a few questions:

chat template is not applied in the downloaded models.

The chat template is actually not used for building input text within the project (at least for now), so I don't get the point of this ↑ Am I missing something?

system prompt should be optional

https://github.com/h2oai/h2o-llmstudio/pull/660/files#diff-002a8aeff9716c32e7b21b06df53b9905cae17ae5f75c398705384b1d9507295R107

It's controlled here, and there is also a unit test for not using a system prompt here:

https://github.com/h2oai/h2o-llmstudio/pull/660/files#diff-f62f85ce93eeb67e1138700aef5af102db092272242ac76b40d7ad6b9a69e71cR47

Or do I misunderstand your comment?

@pascal-pfeiffer (Collaborator):
Thanks for asking, I'll try to clarify below:

Chat template actually is not used for building input text within the project (at least for now), so I don't get the point of this ↑ Do I miss something?

The downloaded model should be identical to the one pushed to Hugging Face. There is little benefit in maintaining different versions, especially for something as critical as the chat template. So, my point is to align the build process for the model zip with the Hugging Face model repo.

https://github.com/h2oai/h2o-llmstudio/pull/660/files#diff-002a8aeff9716c32e7b21b06df53b9905cae17ae5f75c398705384b1d9507295R107
It's controlled here, and also there is unit test for not using system prompt here:
https://github.com/h2oai/h2o-llmstudio/pull/660/files#diff-f62f85ce93eeb67e1138700aef5af102db092272242ac76b40d7ad6b9a69e71cR47
Or do I misunderstand your comment?

  1. Train a model with a system prompt.
  2. Your PR will add a required system prompt to the chat template.
  3. The user is then not able to use the model without providing a system prompt and faces the error "TemplateError: Conversation roles must alternate system/user/assistant/user/assistant/..." (a sketch of an optional-system template follows below).
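For illustration only, a minimal sketch of an optional-system template, written as the Jinja string assigned in Python; it mirrors the [SYS]/[USR]/[ANS] markers from the tests and is not necessarily what get_chat_template produces:

# Sketch: the system block only renders when a system message is present,
# so omitting it no longer raises a TemplateError.
optional_system_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}"
    "{{ '[SYS]' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'user' %}"
    "{{ '[USR]' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ '[ANS]' + message['content'] + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '[ANS]' }}{% endif %}"
)
tokenizer.chat_template = optional_system_template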

@haqishen (Collaborator, Author) commented May 8, 2024

@pascal-pfeiffer

Thanks for your answer, it makes a lot of sense.

The current logic is that if an LLM uses a system prompt during training, it must also use a system prompt during generation. Conversely, if the LLM does not use a system prompt during training, it cannot use one during generation.

Your idea is that whether or not an LLM uses a system prompt during training, there should be the option to choose whether to use a system prompt during generation. Is this understanding correct?

@pascal-pfeiffer (Collaborator):
The current logic is that if an LLM uses a system prompt during training, it must also use a system prompt during generation. Conversely, if the LLM does not use a system prompt during training, it cannot use one during generation

Your idea is that whether or not an LLM uses a system prompt during training, there should be the option to choose whether to use a system prompt during generation. Is this understanding correct?

  1. Train with system prompt -> Inference can be done either with or without a system prompt. Glancing at the code below, we even support "" as the system prompt, which removes the whole system part of the prompt. This means this case can be covered in the training data.
    def get_systems(self, cfg, df):
        if cfg.dataset.system_column != "None":
            if cfg.dataset.system_column not in df.columns:
                logger.warning(
                    f"System column {cfg.dataset.system_column} not found."
                    f"Disabling functionality."
                )
                systems = ["" for _ in range(len(self.prompts))]
            else:
                systems = df[cfg.dataset.system_column].astype(str).tolist()
        else:
            systems = ["" for _ in range(len(self.prompts))]
        return systems
  2. Train without system prompt -> Inference can only be done without a system prompt (already the case in your PR).
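As a usage example (illustrative only, assuming an optional-system template like the sketch earlier in this thread), a model trained with a system prompt could then be prompted either way:

# Both calls render without error; the first includes the [SYS] block,
# the second simply omits it.
with_system = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "[system prompt]"},
        {"role": "user", "content": "[user prompt]"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
without_system = tokenizer.apply_chat_template(
    [{"role": "user", "content": "[user prompt]"}],
    tokenize=False,
    add_generation_prompt=True,
)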

@haqishen (Collaborator, Author) commented May 9, 2024

@pascal-pfeiffer
Updated, please check!

@pascal-pfeiffer (Collaborator) left a comment

Thanks for the changes @haqishen.
Only a few nitpicks in the code comments.

Comment on lines 231 to 232
if cfg.problem_type != "text_sequence_to_sequence_modeling":
    tokenizer.chat_template = get_chat_template(cfg)
pascal-pfeiffer (Collaborator):
I don't see the template being used in classification tasks.

Comment on lines 1640 to 1641
if cfg.problem_type != "text_sequence_to_sequence_modeling":
    tokenizer.chat_template = get_chat_template(cfg)
pascal-pfeiffer (Collaborator):
I don't see the template being used in classification tasks.

Comment on lines 11 to 27
def build_expected(cfg, eos_token, chat):
    expected = ""
    for msg in chat:
        if msg["role"] == "user":
            expected += f"{cfg.dataset.text_prompt_start}{msg['content']}"
            if cfg.dataset.add_eos_token_to_prompt:
                expected += eos_token
        elif msg["role"] == "assistant":
            expected += f"{cfg.dataset.text_answer_separator}{msg['content']}"
            if cfg.dataset.add_eos_token_to_answer:
                expected += eos_token
        elif msg["role"] == "system":
            expected += f"{cfg.dataset.text_system_start}{msg['content']}"
            if cfg.dataset.add_eos_token_to_system:
                expected += eos_token
    expected += cfg.dataset.text_answer_separator
    return expected.replace("\\n", "\n")
pascal-pfeiffer (Collaborator):
I honestly liked the old test better, with explicit strings.
This function itself looks like it needs a unit test.

Not too important for this PR now. I am fine with merging it, but let's keep in mind to make the tests more explicit and to test multiple tokenizers/models, at least the big model families out there.

@pascal-pfeiffer (Collaborator) left a comment

Thank you for quickly addressing the suggestions. Looks good to me!

haqishen merged commit 38dd5b4 into main on May 9, 2024
3 checks passed
haqishen deleted the chat_temp branch on May 9, 2024, 16:20
Successfully merging this pull request may close these issues.

[FEATURE] Automatically add chat template to tokenizer_config.json