
Adding GPT2 model evaluation on WikiText-103 with optional preprocessing in dev/model-eval/ #340

Closed

Conversation

@joeshmoe0112358 commented May 3, 2024

Ultimately in response to #246:

Regarding this specific PR, I propose adding two files in a new dev/model_eval/ folder. One file prepares WikiText-103 for model evaluation, with an optional argument that preprocesses the text. The other evaluates each GPT-2 model size on the prepared evaluation data using Hugging Face.
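
For readers skimming the thread, here is a minimal sketch of that two-step flow (not the PR's actual scripts): it loads the WikiText-103 validation split with Hugging Face `datasets`, uses a hypothetical `clean` flag as a stand-in for the optional preprocessing argument, and scores each GPT-2 size with plain non-overlapping 1024-token windows.

```python
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"

def prepare_text(clean: bool = False) -> str:
    # Load the WikiText-103 validation split from the Hugging Face hub.
    ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
    lines = ds["text"]
    if clean:
        # Stand-in for the optional preprocessing flag: just drop blank lines here.
        lines = [line for line in lines if line.strip()]
    # Lines in this dataset already carry their trailing newline, so plain concat works.
    return "".join(lines)

def evaluate(model_id: str, text: str, max_length: int = 1024) -> float:
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device).eval()
    ids = tokenizer(text, return_tensors="pt").input_ids
    total_nll, total_tokens = 0.0, 0
    # Score the text in non-overlapping windows of max_length tokens.
    for begin in range(0, ids.size(1), max_length):
        window = ids[:, begin:begin + max_length].to(device)
        if window.size(1) < 2:
            break  # need at least two tokens to compute a shifted language-model loss
        with torch.no_grad():
            loss = model(window, labels=window).loss  # mean NLL per predicted token
        total_nll += loss.item() * (window.size(1) - 1)
        total_tokens += window.size(1) - 1
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

text = prepare_text(clean=False)
for model_id in ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"):
    print(model_id, evaluate(model_id, text))
```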

Please also refer to my latest message at the bottom of #276, where I describe some contradictions I found regarding reproducing the reported numbers, and to the gpt2eval repo on my profile, which contains two folders, each with a Python notebook where I run tests and compute perplexity scores for each GPT-2 model size on different preparations of the WikiText dataset. I suggest reading the report at the bottom of that PR first, then the notebooks in the repo, and finally the code offered in this PR.

For convenience, I am providing these tables, which summarize the computations from my tests and compare them to the reported GPT-2 numbers:

Perplexity scores from the GPT-2 paper:

| Model Size | WikiText-2 | WikiText-103 |
|------------|------------|--------------|
| 117M       | 29.41      | 37.50        |
| 345M       | 22.76      | 26.37        |
| 762M       | 19.93      | 22.05        |
| 1542M      | 18.34      | 17.48        |

All of the following numbers are evaluated on the WikiText-103 validation split:

Huggingface Dataset:

| Model Size | Raw   | Bare Minimum Preprocessing |
|------------|-------|----------------------------|
| 124M       | 30.59 | 31.04                      |
| 355M       | 22.35 | 22.51                      |
| 774M       | 19.33 | 20.09                      |
| 1558M      | 17.46 | 17.91                      |

Smerity Dataset:

| Model Size | Raw   | My Extensive Preprocessing |
|------------|-------|----------------------------|
| 124M       | 30.13 | 33.19                      |
| 355M       | 21.77 | 24.31                      |
| 774M       | 18.74 | 21.39                      |
| 1558M      | 16.91 | 19.32                      |

My analysis of the results:

  • These tests seem to confirm that something unusual is going on with the numbers reported in the GPT-2 paper: as I mention in my report, the WikiText-2 and WikiText-103 evaluations should be identical, since the val/test splits of the two datasets are respectively identical (a quick check is sketched after this list).
  • Assuming that the numbers we want to replicate are those in the WikiText-2 column of the GPT-2 paper, I believe the closest results come from the bare-minimum preprocessing of the Hugging Face dataset (this is from a quick glance, without supporting calculations to verify the claim). To see exactly how the dataset preparation differs between tests, you would have to look at the code in the notebooks in my gpt2eval repo.
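
A quick way to check the identical-splits claim, sketched under the assumption that the Hugging Face `wikitext` raw configs mirror the original releases:

```python
from datasets import load_dataset

# Compare the WikiText-2 and WikiText-103 validation and test splits line by line.
for split in ("validation", "test"):
    wt2 = load_dataset("wikitext", "wikitext-2-raw-v1", split=split)
    wt103 = load_dataset("wikitext", "wikitext-103-raw-v1", split=split)
    print(split, wt2["text"] == wt103["text"])
```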

@karpathy (Owner) commented May 6, 2024

Thank you @joeshmoe0112358 for looking into this, but it looks like we're basically not able to match the paper table. In that case I'd at least try to match Alec's post from the reddit thread, which sounds a lot easier to match because it's on raw data. But here in this PR it looks like you're including a bunch of the post-processing?

@joeshmoe0112358 (Author) commented May 6, 2024

To summarize my findings:

  1. The numbers in the WikiText-2 column should be identical to those in the WikiText-103 column, because the val/test splits are identical between the two datasets. However, the reported numbers are not identical.

  2. Here is Alec's Table:

    | Model Size | WikiText-2 | WikiText-103 |
    |------------|------------|--------------|
    | 117M       | 34.63      | 37.50        |
    | 345M       | 25.63      | 26.37        |
    | 762M       | 21.85      | 22.05        |
    | 1542M      | 20.40      | 20.04        |

    The first three rows of his WikiText-103 column are identical to the first three rows of the WikiText-103 column in the paper, which should not be the case, because he allegedly tested on the raw text while the paper did more gymnastics. Additionally, Alec claims a stride length of 32, which cannot be right: testing with stride length equal to the context length gives numbers similar to the paper's, while stride length = (context length / 2) already gives a substantial improvement in scoring, so stride length = 32 must improve things at least that much, which is certainly not the case in Alec's table, where the numbers are identical to or in the same ballpark as the paper's. Thus I am very skeptical of Alec's table and his claims. (A sketch of strided evaluation follows below.)
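
To make the stride discussion concrete, here is a minimal sketch of strided (sliding-window) perplexity evaluation; the setup and variable names are mine, not the PR's or Alec's. With a smaller stride, each scored token sees more left context, so perplexity drops, at roughly (max_length / stride) times the compute.

```python
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
val = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
ids = tokenizer("".join(val["text"]), return_tensors="pt").input_ids

def perplexity(stride: int, max_length: int = 1024) -> float:
    total_nll, total_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, ids.size(1), stride):
        end = min(begin + max_length, ids.size(1))
        trg_len = end - prev_end                # only newly revealed tokens are scored
        window = ids[:, begin:end].to(device)
        if window.size(1) < 2:
            break                               # too short to compute a shifted loss
        labels = window.clone()
        labels[:, :-trg_len] = -100             # mask the overlapping context tokens
        with torch.no_grad():
            loss = model(window, labels=labels).loss
        total_nll += loss.item() * trg_len
        total_tokens += trg_len
        prev_end = end
        if end == ids.size(1):
            break
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

for stride in (1024, 512, 32):
    print(f"stride={stride}: ppl={perplexity(stride):.2f}")
```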

I hypothesize that there is something wrong with the table in the paper. If we instead take the WikiText-2 column as the one we wish to replicate (because WikiText-103 and WikiText-2 have identical val/test splits), then we can reasonably reproduce those numbers, as shown in my tables in the previous post, with percent errors like this:

| Model Size | GPT-2 Paper's WikiText-2 (PPL) | Hugging Face WikiText-103 Val Split w/ Bare Minimum Preprocessing (PPL) | Percent Error (%) |
|------------|--------------------------------|-------------------------------------------------------------------------|-------------------|
| 124M       | 29.41                          | 31.04                                                                   | 5.54              |
| 355M       | 22.76                          | 22.51                                                                   | 1.10              |
| 774M       | 19.93                          | 20.09                                                                   | 0.80              |
| 1558M      | 18.34                          | 17.91                                                                   | 2.34              |
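
For clarity, the percent error column appears to be the relative difference from the paper's WikiText-2 number; a quick check of the arithmetic:

```python
# Percent error relative to the paper's WikiText-2 perplexities
# (the reference column is an assumption inferred from the table above).
paper = {"124M": 29.41, "355M": 22.76, "774M": 19.93, "1558M": 18.34}
ours  = {"124M": 31.04, "355M": 22.51, "774M": 20.09, "1558M": 17.91}
for size in paper:
    print(f"{size}: {abs(ours[size] - paper[size]) / paper[size] * 100:.2f}%")
# -> 5.54%, 1.10%, 0.80%, 2.34%
```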

EDIT: I have been using "pre-processing", "post-processing", "processing", "cleaning", etc. rather loosely in this discussion. Let us just say that, in general, I am preparing the dataset for evaluating a model on it, and optionally cleaning the text to remove things I deem unnecessary or counterproductive to the evaluation.

@karpathy (Owner) commented:

We are abandoning WikiText-103 because it's a total mess. We'll instead look at one or a few of ARC Easy / Challenge, SQuAD, HellaSwag, TriviaQA, LAMBADA. Closing.

@karpathy closed this May 16, 2024