Adapt to use Hugging Face models (includes streaming) #3

Open
wants to merge 2 commits into
base: main

teticio
Contributor

@teticio teticio commented Aug 27, 2023

Hi

I thought about how to implement the streaming functionality and saw that the only way was to rewrite the generation functions in codellama, which seemed a bit messy. Simultaneously, Hugging Face released the models in their format, so I thought the easiest thing to do would be to use those.

Advantages:

  • Makes it more accessible to everyone (no need to download the checkpoints manually from Meta)
  • It is easy to load quantized models (I added a load_in_4bit flag)
  • Your Flask server is simplified because the parallelization is handled by the transformers library, so you only have one instance of the server running (i.e., no need to mess with torch.distributed).
  • Streaming is pretty straightforward.
  • We could easily adapt it to use text-generation-inference in the backend, which is much faster than the regular generation.
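
To make the loading and streaming points above concrete, here is a minimal sketch of what this can look like with the transformers library. It is not the code in this PR: the function names, the model id `codellama/CodeLlama-13b-Instruct-hf`, and the default `max_new_tokens` are assumptions for illustration, and 4-bit loading additionally assumes that `bitsandbytes` and `accelerate` are installed.

```python
from threading import Thread


def load_model(model_id="codellama/CodeLlama-13b-Instruct-hf", load_in_4bit=False):
    """Load tokenizer and model from the Hugging Face Hub (model id is an assumption)."""
    # Imported lazily so the generic helper below works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    quantization = BitsAndBytesConfig(load_in_4bit=True) if load_in_4bit else None
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",  # let transformers/accelerate place the weights
        quantization_config=quantization,
    )
    return tokenizer, model


def iterate_while_running(target, kwargs, chunks):
    """Run target(**kwargs) in a background thread and yield items from chunks."""
    worker = Thread(target=target, kwargs=kwargs)
    worker.start()
    try:
        yield from chunks
    finally:
        worker.join()


def stream_generate(model, tokenizer, prompt, max_new_tokens=256):
    """Yield decoded text pieces while model.generate is still running."""
    from transformers import TextIteratorStreamer

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    # generate() blocks, so it runs in a thread while we iterate over the streamer.
    yield from iterate_while_running(model.generate, generation_kwargs, streamer)
```

In a Flask route, a generator like `stream_generate(...)` could be passed to `flask.Response` so that the client receives text as it is produced; how that fits the existing server is left open here.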

Disadvantages:

  • For some reason, I get worse results from the Hugging Face version of the 13b instruct model (even without quantization). For example, if I ask it

Tell me a joke in C

I get responses similar to this:

A C Programmer's Buggy Journey

Sure! Here's a joke in C: Why did the C programmer go to the doctor? Because he was feeling a little "buggy"! I hope you found that joke in C to be "buggy" and "funny"!

and sometimes it spits out endless \n tokens instead of stopping when it should. When I run your code using the Meta checkpoints I get something like

"Chicken Joke: A Play on Words"
Sure, here's a joke in C:

#include <stdio.h>

int main() {
    printf("Why did the chicken cross the playground?\n");
    printf("To get to the other slide!\n");
    return 0;
}

This joke is a play on words, as "slide" can refer to both a toy slide and a software slide.

I mean, the jokes are terrible, but at least it writes them in C as instructed.
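
As a possible stopgap for the runaway \n output mentioned above, the streamed text could be cut off once too many consecutive newlines arrive. The sketch below is only an illustration and not part of this PR: `stop_on_newline_run` and its `max_newlines` threshold are hypothetical, and it only truncates what the client sees; it does not stop the model from generating in the background, nor does it explain why the Hugging Face checkpoints behave differently.

```python
def stop_on_newline_run(chunks, max_newlines=3):
    """Yield streamed text chunks, stopping after max_newlines consecutive newlines."""
    run = 0  # length of the current run of "\n" characters, tracked across chunks
    for chunk in chunks:
        kept = []
        for ch in chunk:
            run = run + 1 if ch == "\n" else 0
            if run > max_newlines:
                # Emit what was accepted from this chunk, then end the stream.
                if kept:
                    yield "".join(kept)
                return
            kept.append(ch)
        yield "".join(kept)
```

It can wrap any iterator of text pieces, for example the streamer returned by the generation code.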

Anyway, I thought I would create this Pull Request so you could play around with it. I'd be interested to know whether you think this is a good direction to go in.
