
Sentence-level Language Model and Skip-thought Vector

The training command is as follows:

python -u train.py -g 3 --train train_data --valid valid_data --vocab vocab.t100 -u 512 --layer 1 --dropout 0.1 --batchsize 128 --out output_dir

If you add --language-model, the trained model is a sentence-level language model. Otherwise, a skip-thought model is trained by default.

Training and validation datasets should be in one-sentence-per-line format. Training a skip-thought model uses only neighboring sentences within a paragraph; paragraphs are separated by blank lines.
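
For example, a training file might look like the following (the sentences are hypothetical; tokenization should match whatever preprocessing you apply). Only sentences inside the same paragraph are paired as neighbors:

the cat sat on the mat .
it fell asleep soon after .

a new paragraph starts after a blank line .
its sentences are not treated as neighbors of the previous paragraph .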

The count-based vocabulary file vocab.t100 can be constructed with the script below:

python construct_vocab.py --data train_data -t 100 --save vocab.t100
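
As a rough illustration, count-based vocabulary construction of this kind could look like the sketch below. This is an assumption about what construct_vocab.py does (whitespace tokenization, a frequency threshold -t, and some serialized word list); the actual script and its output format may differ.

import argparse
import collections
import json

# Hypothetical sketch: count whitespace-separated tokens in the training data
# and keep those appearing at least `threshold` times.
parser = argparse.ArgumentParser()
parser.add_argument('--data', required=True)
parser.add_argument('-t', '--threshold', type=int, default=100)
parser.add_argument('--save', required=True)
args = parser.parse_args()

counts = collections.Counter()
with open(args.data, encoding='utf-8') as f:
    for line in f:
        counts.update(line.split())

# Special tokens such as <unk>/<eos> are assumptions for this sketch.
vocab = ['<unk>', '<eos>'] + [w for w, c in counts.most_common() if c >= args.threshold]
with open(args.save, 'w', encoding='utf-8') as f:
    json.dump({w: i for i, w in enumerate(vocab)}, f, ensure_ascii=False)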

For the skip-thought vector, see Skip-Thought Vectors, Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler, NIPS 2015.

Computation Cost

Sentence-level Language Model

With 128 sentence pairs per minibatch, a 512-unit LSTM with a vocabulary of 22,231 words processes about 10 iterations per second using 7.5 GB of GPU memory. On a dataset of 4,300,000 pairs (about 33,600 iterations per epoch), training for 5 epochs takes about 4.5 hours.

Skip-thought Vector

With 128 sentence pairs per minibatch, a 512-unit GRU with a vocabulary of 22,231 words processes 2-2.5 iterations per second using 7.5 GB of GPU memory. On a dataset of 4,000,000 pairs (about 31,250 iterations per epoch), training for 5 epochs takes 18-22 hours.

Using WikiText-103 as the Dataset

sh prepare_rawwikitext.sh
PYTHONIOENCODING=utf-8 python preprocess_spacy.py datasets/wikitext-103-raw/wiki.train.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.train
PYTHONIOENCODING=utf-8 python preprocess_spacy.py datasets/wikitext-103-raw/wiki.valid.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.valid
PYTHONIOENCODING=utf-8 python preprocess_spacy.py datasets/wikitext-103-raw/wiki.test.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.test
PYTHONIOENCODING=utf-8 python preprocess_after_spacy.py datasets/wikitext-103-raw/spacy_wikitext-103-raw.train > datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.after
PYTHONIOENCODING=utf-8 python preprocess_after_spacy.py datasets/wikitext-103-raw/spacy_wikitext-103-raw.valid > datasets/wikitext-103-raw/spacy_wikitext-103-raw.valid.after
PYTHONIOENCODING=utf-8 python preprocess_after_spacy.py datasets/wikitext-103-raw/spacy_wikitext-103-raw.test > datasets/wikitext-103-raw/spacy_wikitext-103-raw.test.after
python construct_vocab.py --data datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.after -t 100 --save datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.after.vocab.t100
python -u train.py -g 3 --train datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.after --valid datasets/wikitext-103-raw/spacy_wikitext-103-raw.valid.after --vocab datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.after.vocab.t100 -u 512 --layer 1 --dropout 0.1 --batchsize 128 --out outs/st.u512.l1.d01.b128
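
The two preprocessing scripts are assumed here to perform spaCy-based sentence splitting and tokenization followed by light post-cleaning. The sketch below only illustrates the first step (one tokenized sentence per line, paragraph boundaries preserved); it is not the repository's actual preprocess_spacy.py.

import sys
import spacy

# Hypothetical sketch of spaCy preprocessing: read a raw WikiText file, split
# each non-empty line into sentences, and print one whitespace-tokenized
# sentence per line, keeping blank lines so paragraph boundaries survive for
# skip-thought neighbor pairing.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')  # rule-based sentence boundaries, no parser needed

with open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            print()  # preserve the paragraph boundary
            continue
        for sent in nlp(line).sents:
            print(' '.join(token.text for token in sent))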


Efficient Softmax Approximation

Implementations of BlackOut and Adaptive Softmax for efficiently computing the word distribution in language modeling with very large vocabularies.

LSTM language models are derived from rnnlm_chainer.

Available output layers are as follows:

  • Linear + softmax with cross-entropy loss: the standard output layer.
  • --share-embedding: a variant in which the output layer shares the word embedding matrix with the input layer (see the sketch after this list).
  • --adaptive-softmax: Adaptive Softmax.
  • --blackout: BlackOut (note that BlackOut is not faster on GPU).
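
For the --share-embedding variant, the idea is plain weight tying: the same matrix maps word ids to embeddings at the input and produces logits at the output. A minimal NumPy sketch of that idea (shapes and names are illustrative, not the repository's actual Chainer code):

import numpy as np

# Weight tying sketch: one matrix E of shape (vocab, units) is used both ways.
vocab_size, units = 22231, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, units)) * 0.01

def embed(word_ids):
    return E[word_ids]      # input lookup: (batch,) -> (batch, units)

def output_logits(hidden):
    return hidden @ E.T     # output layer reuses E: (batch, units) -> (batch, vocab)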

Adaptive Softmax

  • Efficient softmax approximation for GPUs
  • Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou, ICML 2017
  • paper
  • authors' Lua code
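
Adaptive softmax splits the vocabulary by frequency: a small head softmax covers the frequent words plus one token per tail cluster, and rare words are scored by a cheaper, lower-dimensional tail softmax, so P(rare word) = P(its cluster | h) * P(word | cluster, h). A toy NumPy sketch of that factorization with a single tail cluster (the sizes and split below are made up, not the paper's or this repository's settings):

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Hypothetical split: a head of 2,000 frequent words and one tail cluster with
# the remaining 20,231 words, projected down from 512 to 128 dimensions.
d, V_head, V_tail, d_tail = 512, 2000, 20231, 128
rng = np.random.default_rng(0)
W_head = rng.standard_normal((d, V_head + 1)) * 0.01   # +1 column for the tail-cluster token
W_proj = rng.standard_normal((d, d_tail)) * 0.01       # cheap projection for rare words
W_tail = rng.standard_normal((d_tail, V_tail)) * 0.01

def log_prob(h, word_id):
    """Log-probability of word_id given a hidden state h of shape (d,)."""
    head_logp = np.log(softmax(h @ W_head))
    if word_id < V_head:                      # frequent word: head softmax only
        return head_logp[word_id]
    # rare word: P(cluster | h) * P(word | cluster, h), tail uses the smaller projection
    tail_logp = np.log(softmax((h @ W_proj) @ W_tail))
    return head_logp[V_head] + tail_logp[word_id - V_head]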

BlackOut

  • BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
  • Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, Pradeep Dubey, ICLR 2016
  • paper
  • authors' C++ code
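
BlackOut trains with a sampled, weighted softmax: for each target word, K negative words are drawn from a proposal distribution q (typically a power-raised unigram), scores are importance-weighted by 1/q, and the loss is a discriminative objective over the target and the sampled negatives only. A rough NumPy sketch of that objective (the proposal distribution, sample size, and initialization below are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
vocab_size, units, K = 22231, 512, 500

W = rng.standard_normal((vocab_size, units)) * 0.01  # output word vectors
unigram = rng.random(vocab_size)                     # stand-in for empirical unigram counts
unigram /= unigram.sum()
q = unigram ** 0.75
q /= q.sum()                                         # power-raised proposal distribution

def blackout_loss(h, target):
    """Discriminative loss over the target word plus K sampled negatives."""
    negatives = rng.choice(vocab_size, size=K, replace=False, p=q)
    negatives = negatives[negatives != target]       # drop the target if it was sampled
    ids = np.concatenate(([target], negatives))
    logits = W[ids] @ h                              # scores for target + negatives only
    weighted = np.exp(logits) / q[ids]               # importance-weight by 1/q
    p = weighted / weighted.sum()                    # weighted softmax over the sample
    # maximize log p(target) and log(1 - p(negative)) for each sampled negative
    return -(np.log(p[0]) + np.log(1.0 - p[1:]).sum())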

How to Run

python -u train.py -g 0

Datasets

  • PennTreeBank
  • Wikitext-2
  • Wikitext-103

For the wikitext datasets, run prepare_wikitext.sh to download them.
