Apply chat template for HF #660
Conversation
Thank you for working on that issue @haqishen.
A few things I noticed:
- chat template is also applied to seq2seq models (see the sketch below this list).
- chat template is not applied in the downloaded models.
- system prompt should be optional.
- it probably doesn't need an extra cfg yaml for the unit tests; you can reuse the existing ones.
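A minimal sketch of what the first point could look like, assuming the template is set where the tokenizer is prepared; `get_chat_template` and the seq2seq `problem_type` string are taken from the snippets in this PR, while the helper name is hypothetical:

# Sketch only: skip the chat template for seq2seq models.
# `maybe_set_chat_template` is a hypothetical helper name.
def maybe_set_chat_template(tokenizer, cfg):
    if cfg.problem_type == "text_sequence_to_sequence_modeling":
        return tokenizer  # seq2seq models keep their default template
    tokenizer.chat_template = get_chat_template(cfg)
    return tokenizer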
# push pipeline to hub
template_env = Environment(loader=FileSystemLoader(searchpath="llm_studio/src/"))

pipeline_template = template_env.get_template("h2oai_pipeline_template.py")
Please also remove that file as it is no longer needed
@@ -124,6 +215,7 @@ def publish_model_to_hugging_face(
    repo_id = f"{user_id}/{hf_repo_friendly_name(model_name)}"

    # push tokenizer to hub
    tokenizer.chat_template = get_chat_template(cfg)
also add to local download
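A minimal sketch of what that could look like in the local download path, assuming the model zip build also goes through the tokenizer; the function name and directory argument are hypothetical, while `get_tokenizer` and `get_chat_template` follow the snippets in this PR:

# Sketch only: give the downloadable model zip the same chat template
# as the Hugging Face push. `save_model_locally` and `model_directory`
# are hypothetical names.
def save_model_locally(cfg, model, model_directory):
    tokenizer = get_tokenizer(cfg)
    tokenizer.chat_template = get_chat_template(cfg)  # identical to the HF repo
    tokenizer.save_pretrained(model_directory)
    model.save_pretrained(model_directory)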
def test_chat_template_custom_special_tokens():
    test_directory = os.path.abspath(os.path.dirname(__file__))
    cfg_path = os.path.join(test_directory, "../test_data/cfg_chat_template.yaml")
    cfg = load_config_yaml(cfg_path)
    cfg.dataset.system_column = "system"
    cfg.dataset.text_system_start = "[SYS]"
    cfg.dataset.text_prompt_start = "[USR]"
    cfg.dataset.text_answer_separator = "[ANS]"

    tokenizer = get_tokenizer(cfg)
    tokenizer.chat_template = get_chat_template(cfg)

    chat = [
        {"role": "system", "content": "[system prompt]"},
        {"role": "user", "content": "[user prompt]"},
        {"role": "assistant", "content": "[assistant response]"},
        {"role": "user", "content": "[user prompt2]"},
    ]

    input = tokenizer.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=True,
    )
    expected = "[SYS][system prompt]</s>[USR][user prompt]</s>[ANS][assistant response]</s>[USR][user prompt2]</s>[ANS]"  # noqa
    assert input == expected
What is the purpose of this test? It covers the exact same thing as test_chat_template_with_system_prompt, doesn't it?
Thanks for your comments, and sorry for the late reply. I have a few questions:
The chat template is actually not used for building the input text within the project (at least for now), so I don't get the point of this ↑. Am I missing something?
It's controlled here, and there is also a unit test for not using the system prompt here. Or do I misunderstand your comment?
Thanks for asking, I'll try to clarify below:
The downloaded model should be identical to the one pushed to Hugging Face. There is little benefit in maintaining different versions, especially for something as critical as the chat template. So my point is to align the build process for the model zip and the Hugging Face model repo.
Thanks for your answer, it makes a lot of sense. The current logic is that if an LLM uses a system prompt during training, it must also use a system prompt during generation. Conversely, if the LLM does not use a system prompt during training, it cannot use one during generation. Your idea is that regardless of whether an LLM uses a system prompt during training, there should be the option to choose whether to use a system prompt during generation. Is this understanding correct?
def get_systems(self, cfg, df):
    if cfg.dataset.system_column != "None":
        if cfg.dataset.system_column not in df.columns:
            logger.warning(
                f"System column {cfg.dataset.system_column} not found. "
                f"Disabling functionality."
            )
            systems = ["" for _ in range(len(self.prompts))]
        else:
            systems = df[cfg.dataset.system_column].astype(str).tolist()
    else:
        systems = ["" for _ in range(len(self.prompts))]
    return systems
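For illustration only, a Jinja-style chat template can leave the system prompt optional at generation time by rendering a system turn only when one is passed. The special tokens below mirror the defaults used elsewhere in this PR and are not necessarily what get_chat_template actually returns:

# Sketch only: the system turn is emitted only if a system message exists,
# so the same template works with or without a system prompt at inference.
OPTIONAL_SYSTEM_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}"
    "{{ '<|system|>' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'user' %}"
    "{{ '<|prompt|>' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ '<|answer|>' + message['content'] + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|answer|>' }}{% endif %}"
)

Because the system branch only fires when a system message is present, training with a system prompt and generating without one (or vice versa) both render cleanly.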
@pascal-pfeiffer
Thanks for the changes @haqishen.
Only a few nitpicks in the code comments.
if cfg.problem_type != "text_sequence_to_sequence_modeling":
    tokenizer.chat_template = get_chat_template(cfg)
I don't see the template being used in classification tasks.
if cfg.problem_type != "text_sequence_to_sequence_modeling":
    tokenizer.chat_template = get_chat_template(cfg)
I don't see the template being used in classification tasks.
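One way to address this, sketched under the assumption that causal language modeling is the only problem type that consumes the template (the exact problem-type string is not verified here):

# Sketch only: set the template solely for problem types that actually use it,
# instead of merely excluding seq2seq. The string below is an assumption.
if cfg.problem_type == "text_causal_language_modeling":
    tokenizer.chat_template = get_chat_template(cfg)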
def build_expected(cfg, eos_token, chat):
    expected = ""
    for msg in chat:
        if msg["role"] == "user":
            expected += f"{cfg.dataset.text_prompt_start}{msg['content']}"
            if cfg.dataset.add_eos_token_to_prompt:
                expected += eos_token
        elif msg["role"] == "assistant":
            expected += f"{cfg.dataset.text_answer_separator}{msg['content']}"
            if cfg.dataset.add_eos_token_to_answer:
                expected += eos_token
        elif msg["role"] == "system":
            expected += f"{cfg.dataset.text_system_start}{msg['content']}"
            if cfg.dataset.add_eos_token_to_system:
                expected += eos_token
    expected += cfg.dataset.text_answer_separator
    return expected.replace("\\n", "\n")
I honestly liked the old test better with an explicit string.
This function itself looks like it needs a unit test.
Not too important right now for this PR. I am fine with merging it, but let's keep in mind to make the tests more explicit and to test multiple tokenizers/models, at least the big model families out there.
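A minimal sketch of what a unit test for build_expected itself could look like, assuming build_expected is importable in the test module and using a hypothetical SimpleNamespace stub instead of the real config loader:

from types import SimpleNamespace

def test_build_expected_minimal():
    # Hypothetical cfg stub; the real tests load cfg_chat_template.yaml instead.
    cfg = SimpleNamespace(
        dataset=SimpleNamespace(
            text_system_start="<|system|>",
            text_prompt_start="<|prompt|>",
            text_answer_separator="<|answer|>",
            add_eos_token_to_system=True,
            add_eos_token_to_prompt=True,
            add_eos_token_to_answer=True,
        )
    )
    chat = [
        {"role": "system", "content": "sys"},
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"},
    ]
    expected = "<|system|>sys</s><|prompt|>hi</s><|answer|>hello</s><|answer|>"
    assert build_expected(cfg, "</s>", chat) == expected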
Thank you for quickly addressing the suggestions. Looks good to me!
close #551