Degen

The Official Repository for "The Curious Case of Neural Text Degeneration"

If you want to use Nucleus Sampling, you can use the implementation in Hugging Face Transformers with many pretrained models, including GPT-2!
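
To see what Nucleus (top-p) Sampling does before reaching for a library, here is a minimal pure-Python sketch. It is not the code in gen.py or in Transformers. It keeps the smallest set of highest-probability tokens whose cumulative probability reaches p, renormalizes, and samples from that set. The function names and the toy distribution are assumptions made for illustration.

```python
import random


def nucleus_filter(probs, p):
    """Return {token_index: renormalized_prob} for the top-p nucleus."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        # Stop once the kept tokens cover at least p of the probability mass.
        if cumulative >= p:
            break
    total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / total for i in nucleus}


def nucleus_sample(probs, p, rng=random):
    """Sample one token index from the top-p nucleus of `probs`."""
    kept = nucleus_filter(probs, p)
    indices = list(kept)
    weights = [kept[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy next-token distribution over a 5-token vocabulary (illustrative only).
toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
kept = nucleus_filter(toy_probs, p=0.85)
```

In Hugging Face Transformers, the same behavior is requested through the top_p argument of generate together with do_sample=True.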

Generations

All conditional and unconditional generations are available here.

Requirements

PyTorch must be installed.

The other required modules are in requirements.txt and can be installed with:

pip install -r requirements.txt

Generating Your Own

Use gen.py to generate:

python gen.py --model_name gpt2-large --batch_size 10 -n 50 --output_path output.jsonl --gpu 0 -p 0.95 --seed 0

--help will show the options for decoding strategies and their parameters.

Formatting Data

For conditional generations, you'll need to format contexts.

Use encode_jsonl.py to tokenize data from https://github.com/openai/gpt-2-output-dataset so it can be used for conditional generation:

python encode_jsonl.py raw.jsonl tokenized.jsonl

--help will show more options
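
As a rough picture of this tokenization step, the sketch below reads JSONL records and adds a token list to each one. It is not the logic of encode_jsonl.py. The "text" and "tokens" field names and the whitespace tokenizer are assumptions; the real script would use the GPT-2 tokenizer.

```python
import json


def encode_records(lines, tokenize, text_key="text", out_key="tokens"):
    """Yield JSON strings with a token list added to each record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # `text_key` and `out_key` are hypothetical field names.
        record[out_key] = tokenize(record[text_key])
        yield json.dumps(record)


# Stand-in tokenizer: whitespace splitting, NOT GPT-2 BPE.
raw = ['{"id": 0, "text": "hello brave new world"}', ""]
encoded = list(encode_records(raw, tokenize=str.split))
```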

Use filter_for_conditional.py for creating contexts for conditional generations:

python filter_for_conditional.py tokenized.jsonl filtered.jsonl

--help will show more options

Now use sort_jsonl_by_length.py to sort the contexts by length for more efficient batching:

python sort_jsonl_by_length.py filtered.jsonl sorted.jsonl
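
Sorting helps because a batch of similar-length contexts needs little padding. The sketch below illustrates the idea on in-memory records. The "tokens" field name and the ascending order are assumptions, not a description of what sort_jsonl_by_length.py does internally.

```python
import json


def sort_by_length(lines, key="tokens"):
    """Parse JSONL lines and return the records sorted by token count."""
    records = [json.loads(line) for line in lines if line.strip()]
    records.sort(key=lambda record: len(record[key]))
    return records


def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy records; "tokens" is a hypothetical field name.
sample = [
    '{"id": "a", "tokens": [1, 2, 3]}',
    '{"id": "b", "tokens": [1]}',
    '{"id": "c", "tokens": [1, 2]}',
]
sorted_records = sort_by_length(sample)
grouped = list(batches(sorted_records, 2))
```

After sorting, each batch groups contexts of neighboring lengths, so the shortest ones are no longer padded to match the longest context in the file.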

Finally, if you'd like to use Beam Search or Stochastic Beam Search, first create a cache file by generating with a non-beam decoding algorithm and the --cache flag:

python gen.py --model_name gpt2-large --batch_size 10 -n 50 --context_path sorted.jsonl --output_path output.jsonl --gpu 0 -k 40 --seed 0 --cache first.cache

Next, we'll reprocess the cache file for Beam Search:

python rebatch_inits_for_beamsearch.py first.cache --batch_size 4 --out bs_4.cache

Now we can decode with Beam Search:

python gen.py --model_name gpt2-large --batch_size 4 -n 40 --context_path sorted.jsonl --cache bs_4.cache --output_path output.jsonl --gpu 0 -w 4
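
For readers unfamiliar with Beam Search, here is a toy sketch of the algorithm, assuming -w sets the beam width. It is not the implementation in gen.py: the step function, the token names, and the probabilities are invented for illustration, and a real decoder would score continuations with the language model.

```python
import math


def beam_search(step_fn, start, width, steps):
    """Keep the `width` highest log-probability sequences at every step."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            # Extend every surviving sequence by every possible next token.
            for token, prob in step_fn(seq).items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Sort by total log-probability and prune to the beam width.
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = candidates[:width]
    return beams


def toy_step(seq):
    """Hypothetical next-token distribution that depends on the last token."""
    if seq[-1] == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.6, "b": 0.4}


beams = beam_search(toy_step, start="<s>", width=2, steps=2)
best_score, best_seq = beams[0]
```

The returned list is ordered from best to worst, so beams[0] holds the highest-scoring sequence found.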

MTurk

mturk_form.html is the Amazon Mechanical Turk template we used for experiments.

Use from_jsonl.py to extract strings (and other attributes) from jsonl files:

python from_jsonl.py output.jsonl string output_string.txt

--help will show more options
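
The extraction step amounts to pulling one field out of every JSONL record. The sketch below shows that idea on in-memory lines. It is an illustration rather than the code of from_jsonl.py; only the "string" field name comes from the command above, and the sample records are invented.

```python
import json


def extract_field(lines, field):
    """Yield the value of `field` from each non-empty JSONL line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)[field]


# Invented sample records with a "string" field.
sample = [
    '{"string": "first generation", "length": 2}',
    '{"string": "second generation", "length": 2}',
]
strings = list(extract_field(sample, "string"))
```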

chunk4turk.py can be used to batch texts to be used with the above MTurk template:

python chunk4turk.py output_string.txt output_mturk.csv

--help will show more options
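
MTurk batch input is a CSV file whose column headers match the variables in the HIT template, with one row per HIT. The sketch below groups texts into such rows. It is not the logic of chunk4turk.py: the text1, text2, ... column names and the choice to drop an incomplete final row are assumptions, and the real header names must match the variables in mturk_form.html.

```python
import csv
import io


def chunk_for_mturk(texts, per_hit, prefix="text"):
    """Return CSV text with `per_hit` texts per row, one row per HIT."""
    # Hypothetical column names: text1, text2, ...
    header = [f"{prefix}{i + 1}" for i in range(per_hit)]
    rows = [texts[i:i + per_hit] for i in range(0, len(texts), per_hit)]
    # Drop a trailing incomplete row so every HIT has the same number of items.
    rows = [row for row in rows if len(row) == per_hit]
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = chunk_for_mturk(["a", "b", "c", "d", "e"], per_hit=2)
```

The csv module quotes fields that contain commas, quotes, or newlines, so generated text is safe to place in a cell.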
