
Natural Language Understanding 2017

Project 1: Language Modeling with Recurrent Neural Networks in Tensorflow and Continuation of Sentences

The goal of the project was to build a language model based on a recurrent neural network with LSTM cells, using only TensorFlow's cell implementation. This means the RNN graph is unrolled manually and dynamically. The model is evaluated with the perplexity metric. The second part of the project requires greedy continuation of sentences given their beginnings. The text of the project assignment can be found here.
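For orientation, here is a minimal sketch of manual unrolling with TensorFlow's LSTM cell in TF 1.x style; all sizes and names below are illustrative assumptions, not the project's actual code:

    import tensorflow as tf  # TF 1.x API, as used in 2017

    # Illustrative sizes only; the project's real hyperparameters may differ.
    batch_size, max_len, emb_dim, hidden, vocab_size = 64, 30, 100, 512, 20000

    inputs = tf.placeholder(tf.float32, [batch_size, max_len, emb_dim])
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden)
    state = cell.zero_state(batch_size, tf.float32)

    outputs = []
    with tf.variable_scope("rnn"):
        for t in range(max_len):          # manual unrolling: one cell call per time step
            if t > 0:
                tf.get_variable_scope().reuse_variables()
            out, state = cell(inputs[:, t, :], state)
            outputs.append(out)

    # Project each hidden state to next-word logits over the vocabulary.
    W = tf.get_variable("W_out", [hidden, vocab_size])
    b = tf.get_variable("b_out", [vocab_size])
    logits = [tf.matmul(o, W) + b for o in outputs]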

Authors:

Data was provided by the NLU teaching staff and we cannot disclose it.

RUNNING THE CODE:

Requirements:

  • We expect all data files (sentences.train, sentences.test, sentences.continuation) to be in the ./data folder, where the current folder is the one containing the *.py scripts
  • We expect the wordembeddings-dim100.word2vec file to be in the same directory as the *.py scripts

Training:

To train the model for experiment X (A, B, or C), run the following command:

python3 main.py -x X

This will:

  • preprocess the training data (splitting sentences into words; adding <bos>, <eos>, <unk>, and <pad> tokens; removing sentences longer than 28 words); a sketch of this step follows the list below
  • serialize and write to disc the following files:
    • padded_sentences.pickle
    • vocabulary.pickle
    • word_2_index.pickle
    • index_2_word.pickle
    On the next training run, the script reuses these existing pickle files. The files are saved in the same directory as the *.py scripts.
  • save the trained graph at a given frequency. By default, the graph is saved in the same directory as the *.py scripts. The graph name has the format expX-epY-NUM.*, where X is A, B, or C, Y is the epoch, and NUM is the number of batches/steps the model has been trained on. Note that the batch counter is not reset after one sweep over the data; it keeps incrementing.
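As a rough illustration of the preprocessing bullet above, here is a hedged sketch; whitespace tokenization, the function name, and the vocab argument are assumptions, and the real scripts may differ:

    def preprocess_sentence(line, vocab, max_len=28):
        """Illustrative version of the preprocessing described above.
        Sentences longer than 28 words are dropped; the rest are wrapped
        in <bos>/<eos>, unknown words become <unk>, and everything is
        padded with <pad> to a fixed length of max_len + 2 tokens."""
        words = line.strip().split()      # assumed whitespace tokenization
        if len(words) > max_len:
            return None                   # sentence removed
        tokens = ["<bos>"] + [w if w in vocab else "<unk>" for w in words] + ["<eos>"]
        tokens += ["<pad>"] * (max_len + 2 - len(tokens))
        return tokens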

Perplexity calculations:

Perplexity scores for experiment X are obtained by running:

python3 perplexity.py -x X -c <path_to_checkpoint>

where:

  • X is substituted with A, B, or C, depending on the experiment
  • <path_to_checkpoint> is the path to the trained graph. By default, graphs are stored in the same directory as the *.py scripts.

Requirements:

  • word_2_index.pickle and index_2_word.pickle must be in the same directory as the *.py scripts.

This will output a file named "group01.perplexityX", where X is substituted with A, B, or C.
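For reference, per-sentence perplexity is the exponentiated average negative log-probability of the ground-truth words. A small numpy sketch; the base-2 logs and the exclusion of pad positions are assumptions about the exact setup:

    import numpy as np

    def sentence_perplexity(word_probs):
        """Perplexity of one sentence, given the model's probabilities for
        the ground-truth words only (positions after <eos>/<pad> excluded).
        perp = 2 ** (-(1/n) * sum_t log2 p(w_t | w_1 .. w_{t-1}))"""
        log2_probs = np.log2(np.asarray(word_probs))
        return 2.0 ** (-np.mean(log2_probs))

    # Example: a 4-word sentence predicted with these word probabilities.
    print(sentence_perplexity([0.2, 0.1, 0.5, 0.25]))   # ~4.47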

Continuation of Sentences:

Continuation of sentences is generated by running:

python3 continuation.py -x C -c <path_to_checkpoint>

where:

  • <path_to_checkpoint> is the path to the trained graph. By default, graphs are stored in the same directory as the *.py scripts.

Requirements:

  • word_2_index.pickle and index_2_word.pickle must be in the same directory as the *.py scripts.

This will output the file "group01.continuation". Note that continuation uses the model trained in experiment C.
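Greedy continuation means always appending the model's most likely next word until <eos> or a length limit. A hedged sketch, where step_fn is a hypothetical wrapper that runs the trained experiment-C model on the word IDs so far and returns a next-word probability distribution:

    import numpy as np

    def greedy_continuation(prefix_ids, step_fn, eos_id, max_len=20):
        """Extend a sentence greedily: feed the words so far, take the
        argmax of the predicted next-word distribution, and stop at
        <eos> or at max_len words. `step_fn(ids) -> np.ndarray` is a
        hypothetical hook, not part of the repository's actual API."""
        ids = list(prefix_ids)
        while len(ids) < max_len and ids[-1] != eos_id:
            next_id = int(np.argmax(step_fn(ids)))
            ids.append(next_id)
        return ids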
