
Support GQA export, better run.c, Support tinyllama-1.1B #410

Open
wants to merge 4 commits into master

Conversation

magician-blue

Add support for tinyllama-1.1B
Add support for converting GQA models (learned from ggerganov/llama.cpp#3364)
Better run.c

xefoci7612 pushed a commit to xefoci7612/baby-llama2.cpp that referenced this pull request Oct 2, 2023
@xefoci7612

The current chat templates in run.c are based on Llama 2:

            // render user/system prompts into the Llama 2 Chat schema
            if (pos == 0 && system_prompt[0] != '\0') {
                char system_template[] = "[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]";
                sprintf(rendered_prompt, system_template, system_prompt, user_prompt);
            } else {
                char user_template[] = "[INST] %s [/INST]";
                sprintf(rendered_prompt, user_template, user_prompt);
            }

But you may want to use TinyLlama's template instead:

<|im_start|>user
Explain huggingface.<|im_end|>
<|im_start|>assistant

In general, chat templates should be bound to the loaded pre-trained model, so maybe they should be a configuration parameter in the .bin file.
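
For illustration, a minimal sketch of what that block in run.c could look like with TinyLlama's template (an assumption: the system turn reuses the same <|im_start|>/<|im_end|> markers shown above; the exact strings should be taken from the model's own chat template):

            // a sketch only, not this PR's code: render prompts into a
            // TinyLlama-style <|im_start|>/<|im_end|> template; variable
            // names follow the Llama 2 snippet quoted above
            if (pos == 0 && system_prompt[0] != '\0') {
                char system_template[] = "<|im_start|>system\n%s<|im_end|>\n<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n";
                sprintf(rendered_prompt, system_template, system_prompt, user_prompt);
            } else {
                char user_template[] = "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n";
                sprintf(rendered_prompt, user_template, user_prompt);
            }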

@@ -368,11 +368,12 @@ def load_hf_model(model_path):
     config.dim = hf_model.config.hidden_size
     config.n_layers = hf_model.config.num_hidden_layers
     config.n_heads = hf_model.config.num_attention_heads
-    config.n_kv_heads = hf_model.config.num_attention_heads
+    config.n_kv_heads = hf_model.config.num_key_value_heads
Owner

?

Author

For an MHA model, the number of KV heads equals the number of query heads.
However, for a GQA model like Llama-2-70B or TinyLlama-1.1B, the number of KV heads differs from the number of query heads.
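
For readers unfamiliar with GQA, a small illustrative sketch (not code from this PR; the function name is hypothetical) of how the grouping works at inference time: the KV cache stores only n_kv_heads heads, and each group of query heads reads the same shared KV head.

    // illustrative sketch, not this PR's code: with grouped-query attention,
    // every (n_heads / n_kv_heads) query heads share one KV head
    int kv_head_for_query_head(int h, int n_heads, int n_kv_heads) {
        int kv_mul = n_heads / n_kv_heads; // query heads per KV head; 1 for plain MHA
        return h / kv_mul;                 // TinyLlama-1.1B: 32 query heads, 4 KV heads -> groups of 8
    }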

@@ -451,7 +455,12 @@ void safe_printf(char *piece) {
 
 int str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {
     // efficiently find the perfect match for str in vocab, return its index or -1 if not found
     TokenIndex tok = { .str = str }; // acts as the key to search for
+    char *input = "<0x0A>";
Owner

why is this delta here?

Author

I'm not sure whether I converted the tokenizer correctly. After I convert tinyllama-1.1B's tokenizer, run.c prints <0x0A> instead of \n. I'm trying to figure out how to convert the tokenizer better so I can remove this line.
Besides, I notice that our run.c cannot deal with \n in the input (for the tinystories 260K, 15M, 110M models): it treats it as \\ followed by n.
In llama.cpp, this is handled by hardcoding a conversion of \\n to \n.
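
As a possible workaround for the \n issue, a minimal sketch (an assumption, not part of this PR; the function name is hypothetical): rewrite a literal backslash-n typed by the user into a real newline before the prompt is tokenized, in the spirit of the hardcoded handling in llama.cpp.

    // sketch only: turn a literal "\n" typed at the prompt into a real '\n'
    // in place, before the prompt string is encoded
    void unescape_newlines(char *s) {
        char *src = s, *dst = s;
        while (*src) {
            if (src[0] == '\\' && src[1] == 'n') { *dst++ = '\n'; src += 2; }
            else { *dst++ = *src++; }
        }
        *dst = '\0';
    }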

@karpathy
Owner

karpathy commented Oct 9, 2023

This is cool, I wasn't aware of the TinyLlama 1.1B run. Sounds very nice and useful for this repo to support.
Are there any notable architectural changes in it?
This PR is a bit of a random combination of necessary differences, and a few side optimizations.

@magician-blue
Author

> This is cool, I wasn't aware of the TinyLlama 1.1B run. Sounds very nice and useful for this repo to support. Are there any notable architectural changes in it? This PR is a bit of a random combination of necessary differences, and a few side optimizations.

There aren't any notable architectural changes.
