Please refer to this paper for a more detailed explanation: Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks
- Pretrained word embeddings usually perform the best.
- Nadam (the Adam optimizer with Nesterov momentum) yields the highest performance and converges the fastest.
- Gradient clipping (clipping each gradient component to a fixed threshold) does not improve performance.
- A large improvement is observed with gradient normalization (rescaling the whole gradient when its norm exceeds a threshold).
- Two stacked recurrent layers usually perform best.
- The impact of the number of recurrent units is rather small.
- Around 100 recurrent units per LSTM-network appear to be a good rule of thumb (a model sketch combining these settings follows this list).
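The points above can be combined into a single sequence-labeling model. Below is a minimal PyTorch sketch, not the paper's own implementation: `pretrained_vectors`, `num_tags`, and the random example data are placeholders, padding/masking is omitted, and either a softmax loss or a CRF (see further down) would sit on top of the emitted tag scores.

```python
import torch
import torch.nn as nn

class SequenceTagger(nn.Module):
    """Two stacked BiLSTM layers (~100 units each) over pretrained word embeddings."""
    def __init__(self, pretrained_vectors, num_tags, hidden_size=100):
        super().__init__()
        # Pretrained embeddings, kept frozen here as a simple starting point.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.lstm = nn.LSTM(
            input_size=pretrained_vectors.size(1),
            hidden_size=hidden_size,
            num_layers=2,            # two stacked recurrent layers
            bidirectional=True,
            batch_first=True,
        )
        self.hidden2tag = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, emb_dim)
        lstm_out, _ = self.lstm(embedded)      # (batch, seq, 2 * hidden)
        return self.hidden2tag(lstm_out)       # per-token tag scores

# Example usage with random "pretrained" vectors (placeholder data):
vectors = torch.randn(5000, 300)
model = SequenceTagger(vectors, num_tags=10)
scores = model(torch.randint(0, 5000, (4, 20)))  # -> (4, 20, 10)
```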
- Optimizer: SGD has trouble navigating ravines and saddle points and is sensitive to the learning rate. To address these shortcomings of SGD, other gradient-based optimization algorithms have been proposed: Adagrad, Adadelta, RMSProp, Adam, and Nadam (an Adam variant that incorporates Nesterov momentum).
- Gradient normalization works better than gradient clipping (see the training-loop sketch after this list).
- Use a CRF instead of a softmax classifier as the output layer (a CRF sketch follows the training-loop example below).
- Variational dropout performs significantly better than naive dropout or no dropout (a sketch of a variational dropout layer is given below).
- Two stacked LSTM layers give the best performance.
- A mini-batch size in the range of 1-32 works best.
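A minimal PyTorch training-loop sketch tying the optimizer, gradient normalization, and mini-batch size recommendations together. It assumes the `SequenceTagger` model from above and a dataset of fixed-length `(token_ids, tags)` pairs; padding/masking is omitted, and the norm threshold of 1.0 and batch size of 32 are illustrative choices, not values from the paper.

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import DataLoader

def train_one_epoch(model, dataset, num_tags, lr=1e-3):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)  # mini-batch size in the 1-32 range
    optimizer = torch.optim.NAdam(model.parameters(), lr=lr)   # Adam with Nesterov momentum
    criterion = nn.CrossEntropyLoss()
    model.train()
    for token_ids, tags in loader:
        optimizer.zero_grad()
        scores = model(token_ids)  # (batch, seq, num_tags)
        loss = criterion(scores.reshape(-1, num_tags), tags.reshape(-1))
        loss.backward()
        # Gradient normalization: rescale the whole gradient if its norm exceeds 1.0.
        # (Per-component gradient clipping would be clip_grad_value_ instead.)
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```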
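For the CRF output layer: core PyTorch does not ship a linear-chain CRF, so one option is the third-party pytorch-crf package. A hedged sketch, assuming that package is installed and that the emission scores come from the BiLSTM tagger above (the shapes and random tensors are placeholders):

```python
import torch
from torchcrf import CRF  # third-party package: pip install pytorch-crf

num_tags = 10
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(4, 20, num_tags)                # per-token tag scores from the BiLSTM
tags = torch.randint(0, num_tags, (4, 20))              # gold tag sequences
mask = torch.ones(4, 20, dtype=torch.uint8)             # 1 = real token, 0 = padding

loss = -crf(emissions, tags, mask=mask)                 # negative log-likelihood to minimize
best_paths = crf.decode(emissions, mask=mask)           # Viterbi decoding: list of tag sequences
```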
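Variational dropout samples one dropout mask per sequence and reuses it at every time step, instead of drawing a fresh mask per step as naive dropout does. A minimal sketch of such a layer, which could be applied to the embeddings or between stacked LSTM layers (this is an illustrative module, not code from the paper):

```python
import torch
import torch.nn as nn

class VariationalDropout(nn.Module):
    """Applies the same dropout mask at every time step of a (batch, time, features) tensor."""
    def __init__(self, p=0.25):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # One mask per sequence, shape (batch, 1, features), broadcast over the time dimension.
        keep_prob = 1.0 - self.p
        mask = torch.bernoulli(
            torch.full((x.size(0), 1, x.size(2)), keep_prob, device=x.device)
        ) / keep_prob
        return x * mask
```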