
# Revisiting UID

## Install Dependencies

Install the requirements via pip:

```bash
$ pip install -r requirements.txt
```

## Get the Data

To get the data, run:

```bash
$ cd src
$ bash pull_data.sh
```

Note that this does not include the Dundee corpus, which can only be obtained by contacting the original authors.

## Estimate an N-gram Model

First, build the KenLM library in the `kenlm` submodule:

```bash
$ cd kenlm
$ mkdir -p build
$ cd build
$ cmake ..
$ make -j 4
```

Then estimate the model from the WikiText-103 dataset. The first command filters out the article and section headings (lines containing `=`) and blank lines from the training split:

```bash
$ cat {data-dir}/wikitext-103/wiki.train.tokens | awk '!/=\s*/' | awk NF > /tmp/wiki.train.tokens.clean
$ bin/lmplz -o 5 --skip_symbols < /tmp/wiki.train.tokens.clean > wiki.arpa
```
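Once estimated, the ARPA model can be queried for per-token log-probabilities through KenLM's Python bindings. The sketch below is only illustrative, assuming the `kenlm` package is available (e.g. via `pip install kenlm`); surprisal here is just the negative log-probability converted to bits:

```python
import math

import kenlm  # KenLM Python bindings; install with `pip install kenlm` if missing

# Load the 5-gram model estimated above (ARPA and binary formats both work).
model = kenlm.Model("wiki.arpa")

sentence = "the cat sat on the mat"

# full_scores yields one (log10 prob, matched n-gram order, is-OOV) triple per
# token, plus one for the end-of-sentence marker when eos=True.
tokens = sentence.split() + ["</s>"]
for word, (log10_prob, ngram_len, oov) in zip(
    tokens, model.full_scores(sentence, bos=True, eos=True)
):
    surprisal_bits = -log10_prob / math.log10(2)  # convert log10 prob to bits
    print(f"{word:>8}  {surprisal_bits:6.2f} bits  (order={ngram_len}, oov={oov})")
```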

You can then find the entire analysis pipeline in `src/revisiting-uid.ipynb`.
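For intuition only, and not as a substitute for the notebook: one common way to operationalize UID is the uniformity of per-token surprisal within a sentence, e.g. its variance (lower variance means information is spread more evenly). A minimal sketch under that assumption, reusing the n-gram model loaded above:

```python
import math
import statistics

import kenlm

model = kenlm.Model("wiki.arpa")

def surprisals(sentence: str) -> list[float]:
    """Per-token surprisals in bits under the n-gram model."""
    return [
        -log10_p / math.log10(2)
        for log10_p, _, _ in model.full_scores(sentence, bos=True, eos=True)
    ]

def uid_variance(sentence: str) -> float:
    """Variance of per-token surprisal: lower values = more uniform density."""
    return statistics.pvariance(surprisals(sentence))

print(uid_variance("the cat sat on the mat"))
```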