This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

Added create_model.py to allow creation of model folder #71

Open
wants to merge 64 commits into base: finetuning

Changes from all commits (64 commits)
5b64684
update README
WuTheFWasThat Feb 18, 2019
6dab221
reorganize and add temp 0.7
WuTheFWasThat Feb 19, 2019
aae26ab
add license
WuTheFWasThat Feb 20, 2019
fc0ee6d
add conditional samples
WuTheFWasThat Feb 20, 2019
825aa3d
separate out tensorflow install
WuTheFWasThat Feb 20, 2019
92ce9f2
shuffle headings
WuTheFWasThat Feb 20, 2019
bf43e73
more warning
WuTheFWasThat Feb 20, 2019
23ed990
instructions mention git clone
WuTheFWasThat Feb 20, 2019
99af6d7
Add a Dockerfile and document usage in README
madisonmay Feb 14, 2019
2cf46d9
fixed unconditional sampling reproducibility issue
Feb 20, 2019
946facf
fixed seed arg to ensure reproducibility in conditional-samples model
Feb 20, 2019
b6f943d
update readme
WuTheFWasThat Feb 20, 2019
a3aa7de
add conditional samples with default settings
WuTheFWasThat Feb 21, 2019
68bf7a0
add .gitattributes file to ensure files copied to docker container ha…
Feb 21, 2019
c5b9c89
Minor: update readme
natemurthy Feb 21, 2019
c314dda
Minor: update readme
natemurthy Feb 27, 2019
ed49f03
Add documentation for help flags (#81)
ArmaanBhullar Feb 27, 2019
9d1e704
slight fix to batch size description
WuTheFWasThat Feb 27, 2019
0465394
updates
WuTheFWasThat Feb 28, 2019
d1fc873
Add finetuning code.
Mar 3, 2019
1fba31f
chmod +x
Mar 3, 2019
dfca3cf
Add finetuning instructions
Mar 3, 2019
9423776
Fix sample generation with batch_size greater than 1.
Mar 3, 2019
8eb6793
Python download script (#89)
webproduktion01 Mar 4, 2019
ed0dedc
update download stuff
WuTheFWasThat Mar 4, 2019
953530f
update readme with usage caveats and calls for research
WuTheFWasThat Mar 6, 2019
79a246a
add contributors md and move dev docs out
WuTheFWasThat Mar 6, 2019
8637828
fix for windows (thanks to chrothenbach)
WuTheFWasThat Mar 7, 2019
3e18729
Add training script with Horovod support
tlkh Mar 18, 2019
ec16bad
Fix typo in train command in README
tlkh Mar 18, 2019
0bad9e4
Added instructions for training using Horovod
tlkh Mar 18, 2019
d14501a
Update CONTRIBUTORS.md
WuTheFWasThat Mar 18, 2019
ef62678
Merge pull request #2 from tlkh/finetuning
nshepperd Mar 19, 2019
c465071
autoformat
Mar 4, 2019
1e32b10
Combine input text files with <|endoftext|> delimiter to ensure there…
Mar 19, 2019
3a3ce65
Write losses to summary file for tensorboard.
Mar 20, 2019
d5b387b
Add learning rate as command line flag.
Mar 20, 2019
b106d0a
Use argparse instead of fire in train.py.
Mar 20, 2019
2044d13
Fix encode.py
Mar 21, 2019
a359a34
Add gradient accumulation with default of 5 minibatches
Mar 21, 2019
8738950
Merge remote-tracking branch 'origin/master' into finetuning
Mar 25, 2019
eda8777
Turn off gradient accumulation by default, it shouldn't be needed.
May 2, 2019
0503b1b
updates for 345M model
WuTheFWasThat May 3, 2019
b5ef71a
reference dataset
WuTheFWasThat May 3, 2019
dd75299
remove samples
WuTheFWasThat May 3, 2019
47df6da
Add gradient checkpointing and another optimization necessary to allo…
May 4, 2019
c46ed99
Add "validation" loss calculation.
May 4, 2019
941a762
Add toposort to requirements
Tenoke May 5, 2019
13c5412
Merge pull request #3 from Tenoke/finetuning
May 6, 2019
3985cc7
Add option to use SGD for optimizer
May 14, 2019
7fc2a44
Record learning rate in tensorboard logs
May 14, 2019
a464925
Add text in README for --optimizer flag
May 14, 2019
ae535b6
Reduce default learning rate of train.py.
May 14, 2019
2d4fd0c
Merge remote-tracking branch 'origin/master' into finetuning
May 14, 2019
6a77a7b
New feature: add noise to network inputs to regularize against overre…
May 15, 2019
87fe3d7
Add top-p sampling
May 15, 2019
e99ee37
Add top_p to interactive_conditional_samples.py and generate_uncondit…
May 15, 2019
2b24145
fix typo in top_p
May 15, 2019
6c1f21d
Fix top_p sampling for batch_size>1
May 15, 2019
cca7144
Updated README.md
biranchi2018 Aug 15, 2019
a070f38
Merge pull request #22 from biranchi2018/biranchi2018-patch-1
Aug 27, 2019
50fa3b6
Add note to install cudnn, re https://github.com/nshepperd/gpt-2/issu…
Jun 16, 2019
b7cda3f
Add flag to set encoding for text reading and writing, defaulting to …
Jul 20, 2019
3ad485b
added create_model.py to allow creation of model folder and copying o…
babaraza Feb 6, 2021
6 changes: 6 additions & 0 deletions .gitattributes
@@ -0,0 +1,6 @@
# convert to OS line endings on checkout, back to LF on commit
* text=auto

# ensure anything copied to the container has unix style line endings
*.sh text eol=lf
requirements.txt text eol=lf
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,2 +1,5 @@
__pycache__
.mypy_cache/
models/
checkpoint
samples
17 changes: 17 additions & 0 deletions CONTRIBUTORS.md
@@ -0,0 +1,17 @@
# Contributors (alphabetically)

* **[madisonmay](https://github.com/madisonmay)**

Added Dockerfiles

* **[Margaret Mitchell et al](https://arxiv.org/abs/1810.03993)**

Our [usage](./README.md#usage) writeup was loosely inspired by the paper
[Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)
and related conversations with some of the authors.

* **[webproduktion01](https://github.com/webproduktion01)**

Ported download script to python.

**[Full code contributors list](https://github.com/openai/gpt-2/contributors).**
86 changes: 86 additions & 0 deletions DEVELOPERS.md
@@ -0,0 +1,86 @@
# Installation

Clone this repository, then `cd` into the directory for the remaining commands:
```
git clone https://github.com/openai/gpt-2.git && cd gpt-2
```

Then, follow instructions for either native or Docker installation.

## Native Installation

All steps can optionally be done in a virtual environment using tools such as `virtualenv` or `conda`.

Install TensorFlow 1.12 (with GPU support, if you have a GPU and want everything to run faster):
```
pip3 install tensorflow==1.12.0
```
or
```
pip3 install tensorflow-gpu==1.12.0
```

Install other python packages:
```
pip3 install -r requirements.txt
```

Download the model data:
```
python3 download_model.py 117M
python3 download_model.py 345M
```

## Docker Installation

Build the Dockerfile and tag the created image as `gpt-2`:
```
docker build --tag gpt-2 -f Dockerfile.gpu . # or Dockerfile.cpu
```

Start an interactive bash session from the `gpt-2` docker image.

You can opt to use the `--runtime=nvidia` flag if you have access to an NVIDIA GPU
and a valid install of [nvidia-docker 2.0](https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)).
```
docker run --runtime=nvidia -it gpt-2 bash
```

# Running

| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |

Some of the examples below may include Unicode text characters. Set the environment variable:
```
export PYTHONIOENCODING=UTF-8
```
to force the standard streams to use UTF-8.
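
As a quick sanity check that the override took effect (plain standard library, nothing repo-specific):
```
import sys

# After `export PYTHONIOENCODING=UTF-8`, both standard streams should
# report UTF-8 regardless of the terminal's locale settings.
print(sys.stdout.encoding, sys.stderr.encoding)
```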

## Unconditional sample generation

To generate unconditional samples from the small model:
```
python3 src/generate_unconditional_samples.py | tee /tmp/samples
```
There are various flags for controlling the samples:
```
python3 src/generate_unconditional_samples.py --top_k 40 --temperature 0.7 | tee /tmp/samples
```

To check flag descriptions, use:
```
python3 src/generate_unconditional_samples.py -- --help
```

## Conditional sample generation

To give the model custom prompts, you can use:
```
python3 src/interactive_conditional_samples.py --top_k 40
```

To check flag descriptions, use:
```
python3 src/interactive_conditional_samples.py -- --help
```
9 changes: 9 additions & 0 deletions Dockerfile.cpu
@@ -0,0 +1,9 @@
FROM tensorflow/tensorflow:1.12.0-py3

ENV LANG=C.UTF-8
RUN mkdir /gpt-2
WORKDIR /gpt-2
ADD . /gpt-2
RUN pip3 install -r requirements.txt
RUN python3 download_model.py 117M
RUN python3 download_model.py 345M
18 changes: 18 additions & 0 deletions Dockerfile.gpu
@@ -0,0 +1,18 @@
FROM tensorflow/tensorflow:1.12.0-gpu-py3

# nvidia-docker 1.0
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES=all \
    NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    NVIDIA_REQUIRE_CUDA="cuda>=8.0" \
    LANG=C.UTF-8

RUN mkdir /gpt-2
WORKDIR /gpt-2
ADD . /gpt-2
RUN pip3 install -r requirements.txt
RUN python3 download_model.py 117M
RUN python3 download_model.py 345M
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2019 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
99 changes: 79 additions & 20 deletions README.md
@@ -1,48 +1,107 @@

Reference: ["Beginner’s Guide to Retrain GPT-2 (117M) to Generate Custom Text Content"](https://medium.com/@ngwaifoong92/beginners-guide-to-retrain-gpt-2-117m-to-generate-custom-text-content-8bb5363d8b7f)

# gpt-2

Code from the paper ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).

We have currently released small (117M parameter) and medium (345M parameter) versions of GPT-2. While we have not released the larger models, we have [released a dataset](https://github.com/openai/gpt-2-output-dataset) for researchers to study their behaviors.

See more details in our [blog post](https://blog.openai.com/better-language-models/).

## Usage

This repository is meant to be a starting point for researchers and engineers to experiment with GPT-2.

### Some caveats

- GPT-2 models' robustness and worst case behaviors are not well-understood. As with any machine-learned model, carefully evaluate GPT-2 for your use case, especially if used without fine-tuning or in safety-critical applications where reliability is important.
- The dataset our GPT-2 models were trained on contains many texts with [biases](https://twitter.com/TomerUllman/status/1101485289720242177) and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
- To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.

### Work with us

Please [let us know](mailto:languagequestions@openai.com) if you’re doing interesting research with or working on applications of GPT-2! We’re especially interested in hearing from and potentially working with those who are studying
- Potential malicious use cases and defenses against them (e.g. the detectability of synthetic text)
- The extent of problematic content (e.g. bias) being baked into the models and effective mitigations

## Development

See [DEVELOPERS.md](./DEVELOPERS.md)

## Contributors

See [CONTRIBUTORS.md](./CONTRIBUTORS.md)

## Fine tuning on custom datasets

To retrain the GPT-2 117M model on a custom text dataset:

```
PYTHONPATH=src ./train.py --dataset <file|directory|glob>
```

If you want to precompute the dataset's encoding for multiple runs, you can instead use:

```
PYTHONPATH=src ./encode.py <file|directory|glob> /path/to/encoded.npz
PYTHONPATH=src ./train.py --dataset /path/to/encoded.npz
```
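
The encoded `.npz` is just an archive of token-id arrays, one per text chunk (`encode.py` writes it with `np.savez_compressed`). A minimal sketch for sanity-checking one, assuming the output path from the commands above:

```
import numpy as np

# np.savez_compressed names unlabeled arrays arr_0, arr_1, ...;
# each one is a chunk of BPE token ids for the training sampler.
npz = np.load('/path/to/encoded.npz')
for name in npz.files:
    print(name, npz[name].shape, npz[name][:10])
```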

Make sure `cudnn` is installed. [Some have reported](https://github.com/nshepperd/gpt-2/issues/8) that `train.py` runs without it but has worse memory usage and might OOM.

### Gradient Checkpointing

https://github.com/openai/gradient-checkpointing is included to reduce the memory requirements of the model, and can be enabled by `--memory_saving_gradients`. The checkpoints are currently chosen manually (poorly) by just adding layer 10 to the 'checkpoints' collection in model.py. `--memory_saving_gradients` is enabled by default for training the 345M model.
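
For illustration, here is a toy sketch of the mechanism (layer count and sizes are made up; in this repository the real call sits inside the transformer block loop in model.py): tensors added to the 'checkpoints' collection are the ones kept in memory, and activations between them are recomputed during the backward pass.

```
import tensorflow as tf

# Illustrative only: mark one layer's activations as a recomputation
# checkpoint for memory_saving_gradients.
x = tf.placeholder(tf.float32, [None, 768])
h = x
for layer in range(24):
    h = tf.layers.dense(h, 768, activation=tf.nn.relu, name='h%d' % layer)
    if layer == 10:
        tf.add_to_collection('checkpoints', h)
```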

### Validation loss

Set `--val_every` to a number of steps `N > 0`, and a "validation" loss against a fixed sample of the dataset will be calculated every N steps, giving a better sense of training progress; around 200 is a reasonable value. You can set `--val_dataset` to choose a separate validation dataset; otherwise it defaults to a sample from the train dataset (so it is not a true held-out loss!).
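
A toy sketch of the loop shape this describes (all names and numbers here are illustrative; the real logic lives in train.py). The important detail is that the validation batches are drawn once, up front, so successive measurements are comparable:

```
import numpy as np
import tensorflow as tf

context = tf.placeholder(tf.float32, [None])
loss = tf.reduce_mean(tf.square(context))  # stand-in for the LM loss

val_every = 200
val_batches = [np.random.randn(16) for _ in range(8)]  # fixed sample, reused

with tf.Session() as sess:
    for counter in range(1, 601):
        # ... one training step would go here ...
        if counter % val_every == 0:
            v = np.mean([sess.run(loss, {context: b}) for b in val_batches])
            print('step %d: validation loss %.4f' % (counter, v))
```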

### Optimizer

You can use SGD instead of Adam with `--optimizer sgd`. This also helps conserve memory when training the 345M model. Note: the learning rate needs to be adjusted for SGD, since it lacks Adam's gradient normalization (0.0006 seems to be a good value from some experiments).
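
A sketch of what such a flag plausibly switches between (hypothetical wiring and learning rates; see train.py for the real flag handling). The much smaller SGD rate follows from SGD lacking Adam's per-parameter step-size normalization:

```
import tensorflow as tf

def make_train_op(loss, optimizer='adam', learning_rate=0.0001):
    # Rates are illustrative: 0.0006 for SGD per the note above.
    if optimizer == 'sgd':
        opt = tf.train.GradientDescentOptimizer(learning_rate=0.0006)
    else:
        opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
    return opt.minimize(loss)

w = tf.Variable(3.0)
train_op = make_train_op(tf.square(w - 1.0), optimizer='sgd')
```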

### Multi-GPU (out of date)

To do distributed training on multiple GPUs or machines using Horovod:

```
mpirun -np 4 \
    -H localhost:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x PYTHONPATH=src \
    -mca pml ob1 -mca btl ^openib \
    /home/jovyan/gpt-2/train-horovod.py --dataset encoded.npz
```

## GPT-2 samples

| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |

While we have not yet released GPT-2 itself, you can see some samples from it in the `gpt-2-samples` folder.
We show unconditional samples with default settings (temperature 1 and no truncation), with temperature 0.7, and with truncation with top_k 40.
We show conditional samples, with contexts drawn from `WebText`'s test set, with default settings (temperature 1 and no truncation), with temperature 0.7, and with truncation with top_k 40.

## Citation

Please use the following bibtex entry:
```
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
```

## Future work

We may release code for evaluating the models on various benchmarks.

We are still considering release of the larger models.

## License

[MIT](./LICENSE)
28 changes: 28 additions & 0 deletions download_model.py
@@ -0,0 +1,28 @@
import os
import sys
import requests
from tqdm import tqdm

if len(sys.argv) != 2:
    print('You must enter the model name as a parameter, e.g.: download_model.py 117M')
    sys.exit(1)

model = sys.argv[1]

subdir = os.path.join('models', model)
if not os.path.exists(subdir):
    os.makedirs(subdir)
subdir = subdir.replace('\\','/') # needed for Windows

for filename in ['checkpoint','encoder.json','hparams.json','model.ckpt.data-00000-of-00001', 'model.ckpt.index', 'model.ckpt.meta', 'vocab.bpe']:

    r = requests.get("https://storage.googleapis.com/gpt-2/" + subdir + "/" + filename, stream=True)

    with open(os.path.join(subdir, filename), 'wb') as f:
        file_size = int(r.headers["content-length"])
        chunk_size = 1000
        with tqdm(ncols=100, desc="Fetching " + filename, total=file_size, unit_scale=True) as pbar:
            # 1k for chunk_size, since Ethernet packet size is around 1500 bytes
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                # update by the actual bytes received so the bar ends exactly at file_size
                pbar.update(len(chunk))
17 changes: 0 additions & 17 deletions download_model.sh

This file was deleted.

31 changes: 31 additions & 0 deletions encode.py
@@ -0,0 +1,31 @@
#!/usr/bin/env python3
# Usage:
# PYTHONPATH=src ./encode.py <file|directory|glob> /path/to/output.npz
# PYTHONPATH=src ./train.py --dataset /path/to/output.npz

import argparse
import numpy as np

import encoder
from load_dataset import load_dataset

parser = argparse.ArgumentParser(
    description='Pre-encode text files into tokenized training set.',
    formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--model_name', metavar='MODEL', type=str, default='117M', help='Pretrained model name')
parser.add_argument('--combine', metavar='CHARS', type=int, default=50000, help='Concatenate files with <|endoftext|> separator into chunks of this minimum size')
parser.add_argument('--encoding', type=str, default='utf-8', help='Set the encoding for reading and writing files.')
parser.add_argument('in_text', metavar='PATH', type=str, help='Input file, directory, or glob pattern (utf-8 text).')
parser.add_argument('out_npz', metavar='OUT.npz', type=str, help='Output file path')

def main():
    args = parser.parse_args()
    enc = encoder.get_encoder(args.model_name)
    print('Reading files')
    chunks = load_dataset(enc, args.in_text, args.combine, encoding=args.encoding)
    print('Writing', args.out_npz)
    np.savez_compressed(args.out_npz, *chunks)


if __name__ == '__main__':
    main()