Apply chat template for HF #660
Conversation
Thank you for working on that issue @haqishen.
A few things I noticed:
- chat template is also applied to seq2seq models (see the sketch below this list).
- chat template is not applied in the downloaded models.
- system prompt should be optional.
- it probably doesn't need an extra cfg yaml for the unit tests; you can reuse the existing ones.
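A minimal sketch of what the first point could look like, assuming the template is set where the tokenizer is prepared; `get_chat_template` and the seq2seq `problem_type` string are taken from the snippets in this PR, while the helper name is hypothetical:

# Sketch only: skip the chat template for seq2seq models.
# `maybe_set_chat_template` is a hypothetical helper name.
def maybe_set_chat_template(tokenizer, cfg):
    if cfg.problem_type == "text_sequence_to_sequence_modeling":
        return tokenizer  # seq2seq models keep their default template
    tokenizer.chat_template = get_chat_template(cfg)
    return tokenizer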
# push pipeline to hub
template_env = Environment(loader=FileSystemLoader(searchpath="llm_studio/src/"))

pipeline_template = template_env.get_template("h2oai_pipeline_template.py")
Please also remove that file as it is no longer needed
@@ -124,6 +215,7 @@ def publish_model_to_hugging_face(
    repo_id = f"{user_id}/{hf_repo_friendly_name(model_name)}"

    # push tokenizer to hub
    tokenizer.chat_template = get_chat_template(cfg)
also add to local download
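A minimal sketch of what that could look like in the local download path, assuming the model zip build also goes through the tokenizer; the function name and directory argument are hypothetical, while `get_tokenizer` and `get_chat_template` follow the snippets in this PR:

# Sketch only: give the downloadable model zip the same chat template
# as the Hugging Face push. `save_model_locally` and `model_directory`
# are hypothetical names.
def save_model_locally(cfg, model, model_directory):
    tokenizer = get_tokenizer(cfg)
    tokenizer.chat_template = get_chat_template(cfg)  # identical to the HF repo
    tokenizer.save_pretrained(model_directory)
    model.save_pretrained(model_directory)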
def test_chat_template_custom_special_tokens():
    test_directory = os.path.abspath(os.path.dirname(__file__))
    cfg_path = os.path.join(test_directory, "../test_data/cfg_chat_template.yaml")
    cfg = load_config_yaml(cfg_path)
    cfg.dataset.system_column = "system"
    cfg.dataset.text_system_start = "[SYS]"
    cfg.dataset.text_prompt_start = "[USR]"
    cfg.dataset.text_answer_separator = "[ANS]"

    tokenizer = get_tokenizer(cfg)
    tokenizer.chat_template = get_chat_template(cfg)

    chat = [
        {"role": "system", "content": "[system prompt]"},
        {"role": "user", "content": "[user prompt]"},
        {"role": "assistant", "content": "[assistant response]"},
        {"role": "user", "content": "[user prompt2]"},
    ]

    input = tokenizer.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=True,
    )
    expected = "[SYS][system prompt]</s>[USR][user prompt]</s>[ANS][assistant response]</s>[USR][user prompt2]</s>[ANS]"  # noqa
    assert input == expected
What is the purpose of this test? It covers the exact same thing as test_chat_template_with_system_prompt, doesn't it?
Thanks for your comments, and sorry for the late reply. I have a few questions:
The chat template is actually not used for building the input text within the project (at least for now), so I don't get the point of this ↑. Am I missing something?
It's controlled here, and there is also a unit test for not using the system prompt here. Or do I misunderstand your comment?
Thanks for asking, I'll try to clarify below:
The downloaded model should be identical to the one pushed to Hugging Face. There is little benefit in maintaining different versions, especially for something as critical as the chat template. So my point is to align the build process for the model zip and the Hugging Face model repo.
Thanks for your answer, it makes a lot of sense. The current logic is that if an LLM uses a system prompt during training, it must also use a system prompt during generation. Conversely, if the LLM does not use a system prompt during training, it cannot use one during generation. Your idea is that regardless of whether an LLM uses a system prompt during training, there should be the option to choose whether to use a system prompt during generation. Is this understanding correct?
def get_systems(self, cfg, df):
    if cfg.dataset.system_column != "None":
        if cfg.dataset.system_column not in df.columns:
            logger.warning(
                f"System column {cfg.dataset.system_column} not found. "
                f"Disabling functionality."
            )
            systems = ["" for _ in range(len(self.prompts))]
        else:
            systems = df[cfg.dataset.system_column].astype(str).tolist()
    else:
        systems = ["" for _ in range(len(self.prompts))]
    return systems
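For illustration only, a Jinja-style chat template can leave the system prompt optional at generation time by rendering a system turn only when one is passed. The special tokens below mirror the defaults used elsewhere in this PR and are not necessarily what get_chat_template actually returns:

# Sketch only: the system turn is emitted only if a system message exists,
# so the same template works with or without a system prompt at inference.
OPTIONAL_SYSTEM_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}"
    "{{ '<|system|>' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'user' %}"
    "{{ '<|prompt|>' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ '<|answer|>' + message['content'] + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|answer|>' }}{% endif %}"
)

Because the system branch only fires when a system message is present, training with a system prompt and generating without one (or vice versa) both render cleanly.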
@pascal-pfeiffer
Thanks for the changes @haqishen.
Only a few nitpicks in the code comments.
if cfg.problem_type != "text_sequence_to_sequence_modeling":
    tokenizer.chat_template = get_chat_template(cfg)
I don't see the template being used in classification tasks.
if cfg.problem_type != "text_sequence_to_sequence_modeling":
    tokenizer.chat_template = get_chat_template(cfg)
I don't see the template being used in classification tasks.
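One way to address this, sketched under the assumption that causal language modeling is the only problem type that consumes the template (the exact problem-type string is not verified here):

# Sketch only: set the template solely for problem types that actually use it,
# instead of merely excluding seq2seq. The string below is an assumption.
if cfg.problem_type == "text_causal_language_modeling":
    tokenizer.chat_template = get_chat_template(cfg)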
def build_expected(cfg, eos_token, chat):
    expected = ""
    for msg in chat:
        if msg["role"] == "user":
            expected += f"{cfg.dataset.text_prompt_start}{msg['content']}"
            if cfg.dataset.add_eos_token_to_prompt:
                expected += eos_token
        elif msg["role"] == "assistant":
            expected += f"{cfg.dataset.text_answer_separator}{msg['content']}"
            if cfg.dataset.add_eos_token_to_answer:
                expected += eos_token
        elif msg["role"] == "system":
            expected += f"{cfg.dataset.text_system_start}{msg['content']}"
            if cfg.dataset.add_eos_token_to_system:
                expected += eos_token
    expected += cfg.dataset.text_answer_separator
    return expected.replace("\\n", "\n")
I honestly liked the old test better with an explicit string.
This function itself looks like it needs a unit test.
Not too important right now for this PR. I am fine with merging it, but let's keep in mind to make the tests more explicit and to test multiple tokenizers/models, at least the big model families out there.
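A minimal sketch of what a unit test for build_expected itself could look like, assuming build_expected is importable in the test module and using a hypothetical SimpleNamespace stub instead of the real config loader:

from types import SimpleNamespace

def test_build_expected_minimal():
    # Hypothetical cfg stub; the real tests load cfg_chat_template.yaml instead.
    cfg = SimpleNamespace(
        dataset=SimpleNamespace(
            text_system_start="<|system|>",
            text_prompt_start="<|prompt|>",
            text_answer_separator="<|answer|>",
            add_eos_token_to_system=True,
            add_eos_token_to_prompt=True,
            add_eos_token_to_answer=True,
        )
    )
    chat = [
        {"role": "system", "content": "sys"},
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"},
    ]
    expected = "<|system|>sys</s><|prompt|>hi</s><|answer|>hello</s><|answer|>"
    assert build_expected(cfg, "</s>", chat) == expected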
Thank you for quickly addressing the suggestions. Looks good to me!
close #551