galinator9000/tf_encdec_seq2seq

tf_encdec_seq2seq

A configurable Encoder-Decoder Sequence-to-Sequence model, built with TensorFlow.

Features

  • Easily configurable.
  • Unidirectional-RNN, Bidirectional-RNN
  • Attention models.
  • Bucketing.
  • Embedding models.

Requirements

pip install -r requirements.txt

Preparing Data

  • Put your TSV file under the data/ directory as all_data.txt. Each line in the file is an input-output pair, separated by a tab.
    Then run python build_data_matrix.py
    It will build your data matrices from the raw data. If a pre-trained Embedding is not used, it will also train an Embedding model.
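
As an illustration of the expected file layout, here is a minimal sketch that writes input-output pairs as one tab-separated pair per line. The example sentences, the helper names write_pairs and read_pairs, and the UTF-8 encoding are assumptions made for this sketch; they are not part of the repository.

```python
from pathlib import Path

# Hypothetical example pairs; replace with your own data.
PAIRS = [
    ("how are you", "i am fine"),
    ("what is your name", "my name is bot"),
]


def write_pairs(pairs, path):
    """Write (input, output) pairs, one tab-separated pair per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8", newline="") as f:
        for source, target in pairs:
            # A tab or newline inside a sentence would break the layout.
            if any(c in text for text in (source, target) for c in "\t\n"):
                raise ValueError("sentences must not contain tabs or newlines")
            f.write(f"{source}\t{target}\n")


def read_pairs(path):
    """Read the file back as a list of (input, output) tuples."""
    with Path(path).open(encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]


if __name__ == "__main__":
    write_pairs(PAIRS, "data/all_data.txt")
```

Running the script creates data/all_data.txt, after which build_data_matrix.py can be run as described above.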

Train

python train.py

Interactive mode & Inference through file

python interactive.py
python test.py my_input_sentences.txt

Config

  • rnn_unit | List | Specifies unit count of each layer on Encoder and Decoder.
  • rnn_cell | String | RNN cell type of Encoder and Decoder.
    [LSTM, GRU]
  • encoder_rnn_type | String | Encoder's RNN type.
    [unidirectional, bidirectional]
  • attention_mechanism | String | Attention mechanism of the model.
    [luong, bahdanau, None]
  • attention_size | int | Attention size of the model. (If not specified, will be defined as rnn_unit's last element)
  • dense_layers | List | Specifies unit count of each layer on FC.
  • dense_activation | String | Activation function to be used on FC layer.
    [relu, sigmoid, tanh, None]
  • optimizer | String | Optimizer function.
    [sgd, adam, rmsprop]
  • learning_rate | Float | Learning rate.
  • dropout_keep_prob_dense | Float | Dropout keep-prob rate on FC layer. (> 0.0, <= 1.0)
  • dropout_keep_prob_rnn_input | Float | Dropout keep-prob rate on RNN input. (> 0.0, <= 1.0)
  • dropout_keep_prob_rnn_output | Float | Dropout keep-prob rate on RNN output. (> 0.0, <= 1.0)
  • dropout_keep_prob_rnn_state | Float | Dropout keep-prob rate on RNN state. (> 0.0, <= 1.0)
  • bucket_use_padding | Bool | If true, adds <pad> tags to the input and output sentences, which reduces the number of buckets.
  • bucket_padding_input | List | Bucket sizes of input.
  • bucket_padding_output | List | Bucket sizes of output.
  • train_epochs | int | Number of epochs to train the model for. (The model is saved to disk after each epoch.)
  • train_steps | int | Number of steps to train the model for.
  • train_batch_size | int | Batch-size during training.
  • log_per_step_percent | int | Percent value used as the interval between progress log points.
  • embedding_use_pretrained | Bool | Use pre-trained Embedding or not.
  • embedding_pretrained_path | String | Path of the pre-trained Embedding files.
  • embedding_type | String | Embedding type of the model.
    [word2vec, fasttext]
  • embedding_size | int | Embedding size of the model.
  • embedding_negative_sample | int | Embedding negative sampling value.
  • vocab_limit | int | Vocabulary limit when building the Embedding model.
  • vocab_special_token | List | Special vocabulary tokens used for padding, unknown words, and the start and end of sentences.
  • ngram | int | N-gram value of the Embedding model.
  • reverse_input_sequence | Bool | If true, reverses the words of the input sentence.
  • seq2seq_loss | Bool | Use seq2seq loss, meaning tags like <pad> are ignored during loss calculation.
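
To make the bucket padding and seq2seq loss options more concrete, here is a small framework-free sketch of the two ideas: padding a sentence up to the smallest bucket that fits it, and skipping <pad> positions when averaging the loss. The function names, the dictionary-based probabilities, and the error raised for sentences longer than the largest bucket are assumptions for illustration only; the repository's TensorFlow implementation may differ.

```python
import math

PAD = "<pad>"  # assumed padding tag; the real one is set by vocab_special_token


def pad_to_bucket(tokens, bucket_sizes, pad=PAD):
    """Pad tokens up to the smallest bucket size that fits them."""
    for size in sorted(bucket_sizes):
        if len(tokens) <= size:
            return tokens + [pad] * (size - len(tokens))
    # Assumption: how over-long sentences are handled is not specified.
    raise ValueError("sequence is longer than the largest bucket")


def masked_cross_entropy(step_probs, targets, pad=PAD):
    """Average negative log-likelihood over non-pad target positions.

    step_probs holds one dict per time step, mapping token -> probability.
    """
    total, count = 0.0, 0
    for probs, target in zip(step_probs, targets):
        if target == pad:
            continue  # padded positions contribute nothing to the loss
        total += -math.log(probs[target])
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    padded = pad_to_bucket(["hello", "world"], bucket_sizes=[5, 10])
    print(padded)
    probs = [{"hello": 0.5}, {"world": 0.25}] + [{PAD: 0.1}] * 3
    print(masked_cross_entropy(probs, padded))
```

With padding enabled, sentences of length 1 to 5 share one bucket instead of needing one bucket per length, and the masked loss keeps the added <pad> positions from affecting training.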