Adapt to use Hugging Face models (includes streaming) #3

teticio · 2023-08-27T07:18:52Z

Hi

I thought about how to implement the streaming functionality and saw that the only way was to rewrite the generation functions in codellama, which seemed a bit messy. Simultaneously, Hugging Face released the models in their format, so I thought the easiest thing to do would be to use them.

Advantages:

Makes the project more accessible to anyone (no need to download the checkpoints manually from Meta)
It is easy to load quantized models (I added a load_in_4bit flag)
Your Flask server is simplified because the parallelization is handled by the transformers library, so you only have one instance of the server running (i.e., no need to mess with torch.distributed).
Streaming is pretty straightforward.
We could easily adapt it to use text-generation-inference in the backend, which is much faster than regular generation.
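To illustrate why streaming becomes straightforward, here is a minimal, standard-library-only sketch of the producer/consumer pattern involved: generation runs in a background thread and pushes text chunks onto a queue, while the caller iterates over them as they arrive. As far as I understand, this is the same shape that transformers' `TextIteratorStreamer` takes when passed to `generate` via `streamer=`. The names `fake_generate` and `stream_reply` are hypothetical stand-ins and are not taken from this PR.

```python
import queue
import threading
from typing import Iterator, List

_END = object()  # sentinel that marks the end of generation


def fake_generate(prompt: str, out: "queue.Queue") -> None:
    """Hypothetical stand-in for a model's generate call.

    Emits the reply one whitespace-separated chunk at a time.
    """
    for word in f"Echo: {prompt}".split(" "):
        out.put(word + " ")
    out.put(_END)


def stream_reply(prompt: str) -> Iterator[str]:
    """Yield chunks as soon as the background thread produces them."""
    out: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(prompt, out))
    worker.start()
    while True:
        chunk = out.get()  # blocks until the next chunk is ready
        if chunk is _END:
            break
        yield chunk
    worker.join()


# Consume the stream; a server would forward each chunk to the client instead.
chunks: List[str] = list(stream_reply("Tell me a joke in C"))
reply = "".join(chunks).strip()
```

Because `stream_reply` is a plain generator, a Flask route could return it wrapped in a `Response` object so that the client receives chunks incrementally; that wiring is left out here to keep the sketch dependency-free.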

Disadvantages:

For some reason, I get worse results from the Hugging Face version of the 13b instruct model (even without quantization). For example, if I ask it

Tell me a joke in C

I get responses similar to this:

A C Programmer's Buggy Journey

Sure! Here's a joke in C: Why did the C programmer go to the doctor? Because he was feeling a little "buggy"! I hope you found that joke in C to be "buggy" and "funny"!

and sometimes it spits out endless \n tokens instead of stopping when it should. When I run your code using the Meta checkpoints I get something like

"Chicken Joke: A Play on Words"
Sure, here's a joke in C:
#include <stdio.h>

int main() {
    printf("Why did the chicken cross the playground?\n");
    printf("To get to the other slide!\n");
    return 0;
}
This joke is a play on words, as "slide" can refer to both a toy slide and a software slide.

I mean, the jokes are terrible, but at least it writes them in C as instructed.

Anyway, I thought I would create this Pull Request so you could play around with it. I'd be interested to know whether you think this is a good direction to go in.

teticio added 2 commits August 27, 2023 08:06

adapt to huggingface

1b64bdb

Merge branch 'xNul:main' into main

5bb1a80
