Is perplexity correctly computed? #560

Open
halixness opened this issue Mar 15, 2024 · 4 comments

@halixness commented Mar 15, 2024

Hello. I'm struggling to replicate the reported perplexity (~6) for Llama-2-7b.
I am using this simple code snippet:

import evaluate
import datasets

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = datasets.load_dataset("wikitext",
                                    "wikitext-2-raw-v1",
                                    split="test")["text"]
input_texts = [s for s in input_texts if s!='']
results = perplexity.compute(
    model_id="sharpbai/Llama-2-7b-hf",
    batch_size=4,
    predictions=input_texts
)
print(results)

Among the results I get: 'mean_perplexity': 60.9764459149642.
In this tutorial, perplexity is computed "approximately" by flattening the dataset into a single string and averaging the perplexity over a sliding window; with that approach I still get a high value.
I also tried changing the model in the snippet to openai-community/gpt2, and the perplexity is above 600!
Does this depend on using the correct model class?
Thank you for any suggestion.

EDIT: I'm using the following versions

transformers              4.38.2
evaluate                  0.4.1
datasets                  2.18.0 

@SamSJackson

Perplexity is a measure that depends on the model used to calculate it.
Specifically, the formula for perplexity is defined entirely in terms of the probabilities that model assigns to the tokens.
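For reference, the standard definition over a tokenized sequence x_1, ..., x_t is

PPL(X) = exp( -(1/t) * Σ_{i=1..t} log p_θ(x_i | x_{<i}) )

where p_θ(x_i | x_{<i}) is the probability the model assigns to token x_i given the preceding tokens.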

So seeing different perplexities for different models is entirely expected.

You could even argue that the higher perplexity of GPT-2 compared to Llama-2-7b on the human-written wikitext data is a reflection that Llama-2-7b is the better model.

@halixness (Author) commented Mar 17, 2024

I also tested Llama-2-70b, and its perplexity on wikitext is around 22. Shouldn't I expect better performance from both the 70b and the 7b variants? Is Llama-2's training distribution really that far from wikitext?

@SamSJackson

I don't know how the perplexity numbers in the original paper were calculated (whether with Hugging Face's metric or not), but since this discussion hinges on the sliding-window size, that could be part of the problem.

Are you confident that you are using the right sliding-window (context-window) size?

If you want to be really precise, you could write your own perplexity measure; there is a good guide here: HuggingFace: Perplexity of Fixed-Length Models.
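
A rough sketch along the lines of that guide, in case it helps. The model id, max_length=4096, and stride=512 below are my own illustrative choices, not values taken from the Llama-2 paper:

import torch
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM checkpoint should work

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
model.eval()

# Flatten the whole test split into one long token stream, as the guide does.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 4096  # Llama-2's context window
stride = 512       # smaller stride = more context per scored token, but slower eval
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # only score tokens not scored in a previous window
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context-only tokens out of the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        nlls.append(outputs.loss)  # mean negative log-likelihood over the scored tokens

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print(f"Perplexity: {ppl.item():.2f}")

I believe the key difference from the evaluate metric is that here every token is scored with up to max_length tokens of preceding context, whereas the metric scores each (often very short) wikitext line on its own, which should push the number up.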

Also, it is not surprising that the 70b largely outperforms the 7b; the parameter difference is just that large.

@anu7699 commented Apr 28, 2024

Hi @halixness, were you able to resolve the perplexity issue? I am also getting a similar value (~56) for Llama-2-7b. I have tried coding up the perplexity calculation suggested by @SamSJackson and also using the Hugging Face evaluate module, but I still get similar values.
