Is perplexity correctly computed? #560

Open
halixness opened this issue Mar 15, 2024 · 4 comments

@halixness commented Mar 15, 2024

Hello. I'm struggling to replicate the reported perplexity (~6) for Llama-2-7b.
I am using this simple code snippet:

import evaluate
import datasets

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = datasets.load_dataset("wikitext",
                                    "wikitext-2-raw-v1",
                                    split="test")["text"]
input_texts = [s for s in input_texts if s!='']
results = perplexity.compute(
    model_id="sharpbai/Llama-2-7b-hf",
    batch_size=4,
    predictions=input_texts
)
print(results)

Among the results I get: 'mean_perplexity': 60.9764459149642.
In this tutorial, perplexity is computed "approximately" by flattening the dataset into a single string and averaging the perplexity over a sliding window; with that approach I still get a high value.
I also tried changing the model in the snippet to openai-community/gpt2, and the perplexity is above 600!
Does this depend on using the correct model class?
Thank you for any suggestion.

EDIT: I'm using the following versions

transformers              4.38.2
evaluate                  0.4.1
datasets                  2.18.0 

@SamSJackson

Perplexity is a measure that depends on the model used to calculate it.
Specifically, the formula for perplexity is defined entirely in terms of the probabilities that model assigns to the tokens.
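For reference, the standard definition over a tokenized sequence x_1, ..., x_t is

PPL(X) = exp( -(1/t) * Σ_{i=1..t} log p_θ(x_i | x_{<i}) )

where p_θ(x_i | x_{<i}) is the probability the model assigns to token x_i given the preceding tokens.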

So seeing different perplexities for different models is entirely expected.

You could even argue that the higher perplexity of GPT-2 compared to Llama-2-7b on the human-written wikitext data is a reflection that Llama-2-7b is the better model.

@halixness (Author) commented Mar 17, 2024

I also tested Llama-2-70b, and its perplexity on wikitext is around 22. Shouldn't I expect better performance from both the 70b and the 7b variants? Is Llama-2's training distribution really that far from wikitext?

@SamSJackson

I don't know how the perplexity numbers in the original paper were calculated (whether with Hugging Face's metric or not), but since this discussion hinges on the sliding-window size, that could be part of the problem.

Are you confident that you are using the right sliding-window (context-window) size?

If you want to be really precise, you could write your own perplexity measure; there is a good guide here: HuggingFace: Perplexity of Fixed-Length Models.
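
A rough sketch along the lines of that guide, in case it helps. The model id, max_length=4096, and stride=512 below are my own illustrative choices, not values taken from the Llama-2 paper:

import torch
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM checkpoint should work

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
model.eval()

# Flatten the whole test split into one long token stream, as the guide does.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 4096  # Llama-2's context window
stride = 512       # smaller stride = more context per scored token, but slower eval
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # only score tokens not scored in a previous window
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context-only tokens out of the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        nlls.append(outputs.loss)  # mean negative log-likelihood over the scored tokens

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print(f"Perplexity: {ppl.item():.2f}")

I believe the key difference from the evaluate metric is that here every token is scored with up to max_length tokens of preceding context, whereas the metric scores each (often very short) wikitext line on its own, which should push the number up.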

Also, it is not surprising that the 70b largely outperforms the 7b; the parameter difference is just that large.

@anu7699 commented Apr 28, 2024

Hi @halixness, were you able to resolve the perplexity issue? I am also getting a similar value (~56) for Llama-2-7b. I have tried coding up the perplexity calculation suggested by @SamSJackson and also using the Hugging Face evaluate module, but I still get similar values.
