Support FIM for models using ChatML format #142

Open
ChrisDeadman opened this issue Feb 29, 2024 · 22 comments
Labels: enhancement (New feature or request)

@ChrisDeadman commented Feb 29, 2024

First of all: your extension is awesome, thanks for all the effort you put into constantly improving it! 👍🏻

FIM doesn't work for Mistral-7B-Instruct-v0.2-code-ft.
I know the ChatML format is mostly suited to turn-based conversations. However, except for the suggestions, you've already refactored your code to use Ollama's turn-based chat endpoint...

I understand why you have to use Ollama's generate endpoint for FIM, which is a bit unfortunate because it means supporting all the different turn templates yourself.
This is still a big issue for any client application that wants to steer a model into answering in a specific way.

If you do try out ChatML support, though, you could append the start of the expected model response after the template, e.g.:

<|im_start|>system
You are an awesome coder, auto-complete the following code:<|im_end|>
<|im_start|>user
here goes the code<|im_end|>
<|im_start|>assistant
Sure, here is the auto-completion:   <-- the model will think it answered like that
``` <-- followed by three backticks and a newline to force the model to generate code.

I have tried this out manually and it works. Basically, anything you write after <|im_start|>assistant makes the model think it started its answer that way (this works for basically all models, not just ChatML-based ones).

For reference, this is the correct Hugging Face tokenizer chat template for ChatML:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}
{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
@rjmacarthy (Owner)

Thanks @ChrisDeadman.

This is very interesting; I'll look into it for a future release.

rjmacarthy added the enhancement label on Feb 29, 2024
@rjmacarthy (Owner)

Hey @ChrisDeadman, I tried this but didn't have much luck. How can we deal with the prefix and suffix, can we write an hbs template for it?

@ChrisDeadman (Author)

Yes, that should be possible, and it would make it easier for me to experiment: if I find something usable I could open a PR with a ChatML template.
But as far as I can see in the code under src/extension/fim-templates.ts, hbs templates are not yet supported for FIM?

@ChrisDeadman (Author) commented Mar 12, 2024

On second thought, the templates need to support some kind of flag which tells your response parser that the last part of the template (e.g. the three backticks) should be prepended to the actual model response before parsing it.
Otherwise the text we primed the model's response with is missing from the parsed output.
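
To illustrate the idea, a minimal TypeScript sketch (the names here are hypothetical, not anything that exists in the extension today):

// Hypothetical sketch: the rendered template records what the assistant turn
// was primed with, and the parser glues it back on before extracting code.
function parseFimResponse(modelResponse: string, primedResponseStart: string): string {
  // e.g. primedResponseStart === "```\n" when the template ends with three backticks
  const fullResponse = primedResponseStart + modelResponse;
  // Strip the opening and closing code fence to recover the raw completion.
  return fullResponse
    .replace(/^```[^\n]*\n/, "")
    .replace(/\n?```\s*$/, "");
}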

@ChrisDeadman (Author) commented Mar 12, 2024

In Python, using Hugging Face transformers chat templates, you can get the "generation prompt" like this:

generation_prompt = self.tokenizer.apply_chat_template([], add_generation_prompt=True)

This works because [] is an empty list of messages and the template checks for the add_generation_prompt variable:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}
{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

So only the generation prompt is returned (as token IDs by default; pass tokenize=False if you want the string back instead).
It works because the template takes the same message format used by the chat endpoint and, when the list is empty, renders only the generation prompt (the part the model should start its generation with).

Maybe something similar could be done with the hbs templates.

@rjmacarthy (Owner)

Hey, sorry, but I'm still unsure about it. If you could adapt it into an hbs template it might be clearer? I did try to adapt the FIM completions to use templates, but I didn't know what format they should take.

@hafriedlander

Ollama automatically wraps whatever you pass to the /generate endpoint with a template (unless you turn it off with raw: true).

The default Mistral template is pretty boring (https://ollama.com/library/mistral:latest/blobs/e6836092461f), but a lot of models follow that same format (https://ollama.com/library/dolphincoder:latest/blobs/62fbfd9ed093).

The Ollama generate endpoint does allow overriding both the default model template and the system message. I'm not sure whether other systems (like vLLM) do, though.

@ChrisDeadman (Author) commented Mar 14, 2024

> Hey, sorry, but I'm still unsure about it. If you could adapt it into an hbs template it might be clearer? I did try to adapt the FIM completions to use templates, but I didn't know what format they should take.

If I understand the hbs syntax correctly, this should work for your existing FIM setup:

{{#if prefix}}
  <PRE> {{prefix}} 
{{/if}}

{{#if suffix}}
  <SUF> {{suffix}} 
{{/if}}

{{#if add_generation_prompt}}
  <MID>
{{/if}}

Just supply the three variables as arguments, or pass only add_generation_prompt if you want just the start of the model response (see the sketch below).
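
For illustration, a rough TypeScript sketch of how such a template could be rendered with Handlebars (how the extension actually wires this up is an assumption on my part):

import Handlebars from "handlebars";

// The FIM template from above as a single string, using triple braces so
// Handlebars does not HTML-escape the code.
const fimTemplate =
  "{{#if prefix}}<PRE> {{{prefix}}} {{/if}}" +
  "{{#if suffix}}<SUF> {{{suffix}}} {{/if}}" +
  "{{#if add_generation_prompt}}<MID>{{/if}}";

const render = Handlebars.compile(fimTemplate);

// Full FIM prompt: prefix + suffix + generation prompt.
const prompt = render({
  prefix: "function add(a: number, b: number) {\n  return ",
  suffix: "\n}",
  add_generation_prompt: true,
});

// Passing only add_generation_prompt yields just the start of the model response.
const generationPrompt = render({ add_generation_prompt: true }); // "<MID>"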

@ChrisDeadman (Author) commented Mar 14, 2024

For ChatML it could be something like this (not tested; note that plain Handlebars {{#if}} doesn't support ||, so that condition would need a helper):

{{#if system}}
  <|im_start|>system\n{{system}}<|im_end|>\n
{{/if}}

{{#if prefix || suffix}}
  <|im_start|>user\nPlease generate the code between the following prefix and suffix.\n
  {{#if prefix}}
    Prefix:\n```\n{{prefix}}\n```\n
  {{/if}}
  
  {{#if suffix}}
    Suffix:\n```\n{{suffix}}\n```\n
  {{/if}}
  <|im_end|>\n
{{/if}}

{{#if add_generation_prompt}}
  <|im_start|>assistant\n```\n
{{/if}}

@hafriedlander

I believe I understand. I'll add another template to my PR (#174) in the next update.

@ChrisDeadman are you using Ollama as the backend? (Or if not, what are you using?)

@ChrisDeadman (Author)

> @ChrisDeadman are you using Ollama as the backend? (Or if not, what are you using?)

I wrote a custom server and added an Ollama-compatible API to it so I can run this extension against it.
Internally, it uses Hugging Face chat templates to tokenize the chat-completion messages.
It does not, however, apply any template to the prompt passed to the /generate endpoint.

@hafriedlander

Ah. Ollama does normally wrap the prompt passed to /generate with a model-specific template, unless raw: true is part of the request (which twinny doesn't currently set).

I think all autocomplete requests should probably use raw: true, though: at least starcoder2 requires it, and my PR currently assumes all models should use it.
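
For reference, a minimal TypeScript sketch of what a raw request to /api/generate could look like (the FIM tokens below follow the starcoder-style convention and are only illustrative):

// Minimal sketch of a raw /api/generate call: with raw: true Ollama applies
// no model template, so the prompt must already be in the model's FIM format.
async function fimCompletion(prefix: string, suffix: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "starcoder2",
      prompt: `<fim_prefix>${prefix}<fim_suffix>${suffix}<fim_middle>`,
      raw: true, // skip the model's built-in prompt template
      stream: false, // return one JSON object instead of a stream
    }),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}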

Long term, it'd be cool to be able to edit both the chat and FIM templates as hbs in VS Code, the same way command templates currently can be. For now I'll just add an extra template to the code.

@rjmacarthy (Owner) commented Apr 4, 2024

Hey,

FYI, this should now work with any ChatML endpoint as the provider, since I added the ability to edit and choose a custom FIM template. The first test would be using the OpenAI API through LiteLLM with GPT-3.5 or GPT-4. I am still unsure whether Ollama supports ChatML? Also, should I add raw: true to the options, or expose raw as a setting?

Here is a template I have been using with GPT-4, with pretty good success:

<|im_start|>system
You are an auto-completion coding assistant who uses a prefix and suffix to "fill_in_middle".<|im_end|>
<|im_start|>user
<prefix>{{{prefix}}}<fill_in_middle>{{{suffix}}}<end>
Only reply with pure code, no backticks, do not repeat code in the prefix or suffix, match brackets carefully.<|im_end|>
<|im_start|>assistant
Sure, here is the pure code auto-completion:

@ChrisDeadman (Author)

IMO this looks like a great approach 👍🏻
Regarding raw, I second what @hafriedlander said after RTFMing the docs.

@CartoonFan commented Apr 24, 2024

Sorry if I just missed it, but I don't really understand how to make this work. I'm currently running a fairly large model through Ollama (https://ollama.com/wojtek/beyonder), and it'd be great if I could use it for FIM as well.

Some additional info:

Editor: VSCodium
OS: Arch Linux
GPU: AMD Radeon RX 6800 XT (16 GB)
CPU: AMD Ryzen 7 3700X
RAM: 48 GB

Model's HuggingFace page: https://huggingface.co/mlabonne/Beyonder-4x7B-v3-GGUF

  • I haven't been able to get inline completion working at all, with any model
  • I don't know where the debug logs and config files are located
  • Chat is through Ollama model -> LiteLLM, FIM is directly through Ollama

Thanks 😅 🙏 💜

@ChrisDeadman (Author) commented May 1, 2024

So I tested this briefly with Llama-3 by selecting "custom template" in the FIM provider settings and modifying the templates like so:

fim-system.hbs

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful, respectful and honest coding assistant.
Always reply using markdown.<|eot_id|>

fim.hbs

{{{systemMessage}}}<|start_header_id|>user<|end_header_id|>

Please respond with the code that is missing here:

```{{language}}
{{{prefix}}}<MISSING CODE>
{{{suffix}}}
```<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```{{language}}

What I found is:

  • The language variable is resolved to an empty string in fim.hbs
  • The response is not trimmed
  • The trailing backticks are not cut from the response

But other than that it seems to work 😃

Here is a screenshot:

[screenshot]

It would be nice to be able to repeat the last line of the prefix as the start of the model response (e.g. thread. in my example), so the model doesn't repeat it, for instance by providing a {{getLastLine prefix}} helper; see the sketch below.
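
A rough TypeScript sketch of what such a helper could look like (the helper name is just the one proposed here, not something the extension provides yet):

import Handlebars from "handlebars";

// Hypothetical helper: returns the last non-empty line of the prefix so the
// template can prime the assistant turn with it and avoid duplicated code.
Handlebars.registerHelper("getLastLine", (text: string) => {
  const lines = (text ?? "").split("\n").filter((line) => line.trim() !== "");
  return lines.length > 0 ? lines[lines.length - 1] : "";
});

// Usage at the end of fim.hbs (illustrative):
//   ```{{language}}
//   {{getLastLine prefix}}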

@rjmacarthy (Owner)

Thanks @ChrisDeadman, I think this can be arranged. Is there any other data you'd like passed to the template?

Many thanks,

@ChrisDeadman (Author)

Thanks @rjmacarthy! I can't think of anything else that's missing at the moment; that should be enough to support ChatML and Llama-3 templates, IMO.

@rjmacarthy (Owner)

I've added the language to the template now, but I think there are still some inconsistencies in how it works which I need to iron out. I still had mixed results with llama3:8b.

@ChrisDeadman (Author)

Did you also add an option to get the last line of the prefix in the template? With that added to the end of the template, the results should be much better.

@rjmacarthy (Owner)

No I didn't, actually. I can add it.

@ChrisDeadman (Author)

That would be awesome, I'll do some tests when it's ready 😃
