Adding GPT2 model evaluation on WikiText-103 with optional preprocessing in dev/model-eval/ #340
Ultimately in response to #246:
For this specific PR, I propose adding two files in a new
dev/model_eval/
folder. One file prepares WikiText-103 for model evaluation, with an optional argument that preprocesses the text. The other file evaluates each GPT2 model size on the prepared evaluation data using Hugging Face.

Please also refer to my latest message at the bottom of #276, where I describe some contradictions I found regarding reproducing the reported numbers. In addition, see the repo on my profile called gpt2eval, which contains two folders, each with a Python notebook where I run tests and compute perplexity scores for each GPT2 model size on different preparations of the WikiText dataset. I suggest reading my report at the bottom of #276 first, then the gpt2eval repo, and finally the code offered in this PR.
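For readers who want a feel for the approach before opening the files, the evaluation step can be sketched roughly as below. This is a minimal sliding-window perplexity sketch, not the PR's actual code; the Hub checkpoint names (`gpt2`, `gpt2-medium`, ...), the `wikitext-103-raw-v1` config, and the `stride` default are assumptions.

```python
# Hedged sketch: GPT2 perplexity on the WikiText-103 validation split
# via Hugging Face transformers/datasets, using a sliding window so every
# token is scored with up to n_positions tokens of context.
import math


def perplexity_from_nll(nll_sum: float, n_tokens: int) -> float:
    """Convert a summed token negative log-likelihood into perplexity."""
    return math.exp(nll_sum / n_tokens)


def evaluate_gpt2(model_name: str = "gpt2", stride: int = 512) -> float:
    # Heavy imports kept local so the pure helper above has no dependencies.
    import torch
    from datasets import load_dataset
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tok = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).to(device).eval()

    val = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
    ids = tok("\n\n".join(val["text"]), return_tensors="pt").input_ids

    max_len = model.config.n_positions  # 1024 for GPT2
    nll_sum, n_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, ids.size(1), stride):
        end = min(begin + max_len, ids.size(1))
        trg_len = end - prev_end          # score only tokens not seen before
        input_ids = ids[:, begin:end].to(device)
        targets = input_ids.clone()
        targets[:, :-trg_len] = -100      # mask the overlapping context
        with torch.no_grad():
            loss = model(input_ids, labels=targets).loss
        nll_sum += loss.item() * trg_len  # loss is a per-token mean NLL
        n_tokens += trg_len
        prev_end = end
        if end == ids.size(1):
            break
    return perplexity_from_nll(nll_sum, n_tokens)
```

The sliding window matters because different preprocessing choices change the token count, which is one reason the numbers in the table below differ across dataset preparations.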
For convenience, here is a table summarizing the computations from the tests I ran, compared against the numbers reported for GPT2:
Perplexity Scores from GPT2 Paper:
All the following numbers are evaluated on WikiText-103 validation split:
Huggingface Dataset:
Smerity Dataset:
My analysis of the results: