Please refer to this paper for a more detailed explanation: Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks
- Pretrained word embeddings usually perform the best.
- Nadam (the Adam optimizer with Nesterov momentum) yields the highest performance and converges the fastest.
- Gradient clipping (clipping each gradient component to a fixed threshold) does not improve performance.
- A large improvement is observed with gradient normalization (rescaling the whole gradient when its norm exceeds a threshold).
- Two stacked recurrent layers usually perform best.
- The impact of the number of recurrent units is rather small.
- Around 100 recurrent units per LSTM-network appear to be a good rule of thumb (a model sketch combining these settings follows this list).
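The points above can be combined into a single sequence-labeling model. Below is a minimal PyTorch sketch, not the paper's own implementation: `pretrained_vectors`, `num_tags`, and the random example data are placeholders, padding/masking is omitted, and either a softmax loss or a CRF (see further down) would sit on top of the emitted tag scores.

```python
import torch
import torch.nn as nn

class SequenceTagger(nn.Module):
    """Two stacked BiLSTM layers (~100 units each) over pretrained word embeddings."""
    def __init__(self, pretrained_vectors, num_tags, hidden_size=100):
        super().__init__()
        # Pretrained embeddings, kept frozen here as a simple starting point.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.lstm = nn.LSTM(
            input_size=pretrained_vectors.size(1),
            hidden_size=hidden_size,
            num_layers=2,            # two stacked recurrent layers
            bidirectional=True,
            batch_first=True,
        )
        self.hidden2tag = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, emb_dim)
        lstm_out, _ = self.lstm(embedded)      # (batch, seq, 2 * hidden)
        return self.hidden2tag(lstm_out)       # per-token tag scores

# Example usage with random "pretrained" vectors (placeholder data):
vectors = torch.randn(5000, 300)
model = SequenceTagger(vectors, num_tags=10)
scores = model(torch.randint(0, 5000, (4, 20)))  # -> (4, 20, 10)
```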
- Optimizer: SGD has trouble navigating ravines and saddle points and is sensitive to the learning rate. To address these shortcomings of SGD, other gradient-based optimization algorithms have been proposed: Adagrad, Adadelta, RMSProp, Adam, and Nadam (an Adam variant that incorporates Nesterov momentum).
- Gradient normalization works better than gradient clipping (see the training-loop sketch after this list).
- Use a CRF instead of a softmax classifier as the output layer (a CRF sketch follows the training-loop example below).
- Variational dropout performs significantly better than naive dropout or no dropout (a sketch of a variational dropout layer is given below).
- Two stacked LSTM layers give the best performance.
- A mini-batch size in the range of 1-32 works best.
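A minimal PyTorch training-loop sketch tying the optimizer, gradient normalization, and mini-batch size recommendations together. It assumes the `SequenceTagger` model from above and a dataset of fixed-length `(token_ids, tags)` pairs; padding/masking is omitted, and the norm threshold of 1.0 and batch size of 32 are illustrative choices, not values from the paper.

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import DataLoader

def train_one_epoch(model, dataset, num_tags, lr=1e-3):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)  # mini-batch size in the 1-32 range
    optimizer = torch.optim.NAdam(model.parameters(), lr=lr)   # Adam with Nesterov momentum
    criterion = nn.CrossEntropyLoss()
    model.train()
    for token_ids, tags in loader:
        optimizer.zero_grad()
        scores = model(token_ids)  # (batch, seq, num_tags)
        loss = criterion(scores.reshape(-1, num_tags), tags.reshape(-1))
        loss.backward()
        # Gradient normalization: rescale the whole gradient if its norm exceeds 1.0.
        # (Per-component gradient clipping would be clip_grad_value_ instead.)
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```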
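For the CRF output layer: core PyTorch does not ship a linear-chain CRF, so one option is the third-party pytorch-crf package. A hedged sketch, assuming that package is installed and that the emission scores come from the BiLSTM tagger above (the shapes and random tensors are placeholders):

```python
import torch
from torchcrf import CRF  # third-party package: pip install pytorch-crf

num_tags = 10
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(4, 20, num_tags)                # per-token tag scores from the BiLSTM
tags = torch.randint(0, num_tags, (4, 20))              # gold tag sequences
mask = torch.ones(4, 20, dtype=torch.uint8)             # 1 = real token, 0 = padding

loss = -crf(emissions, tags, mask=mask)                 # negative log-likelihood to minimize
best_paths = crf.decode(emissions, mask=mask)           # Viterbi decoding: list of tag sequences
```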
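Variational dropout samples one dropout mask per sequence and reuses it at every time step, instead of drawing a fresh mask per step as naive dropout does. A minimal sketch of such a layer, which could be applied to the embeddings or between stacked LSTM layers (this is an illustrative module, not code from the paper):

```python
import torch
import torch.nn as nn

class VariationalDropout(nn.Module):
    """Applies the same dropout mask at every time step of a (batch, time, features) tensor."""
    def __init__(self, p=0.25):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # One mask per sequence, shape (batch, 1, features), broadcast over the time dimension.
        keep_prob = 1.0 - self.p
        mask = torch.bernoulli(
            torch.full((x.size(0), 1, x.size(2)), keep_prob, device=x.device)
        ) / keep_prob
        return x * mask
```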