* Use the PyTorch nn.Transformer module to build an encoder for language modeling (next-word prediction)
* PyTorch 1.2 + TorchText
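A minimal sketch of such an encoder-only language model (names are illustrative; sinusoidal positional encoding is omitted for brevity, the full tutorial adds it):

```python
import math
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """Encoder-only language model built on nn.TransformerEncoder."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)  # next-token logits

    def forward(self, src):
        # src: (seq_len, batch) token ids; the additive mask blocks
        # attention to future positions, giving a causal LM objective.
        s = src.size(0)
        mask = torch.triu(torch.full((s, s), float('-inf')), diagonal=1)
        h = self.embed(src) * math.sqrt(self.d_model)
        return self.head(self.encoder(h, mask=mask))
```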
2 & 2.1: transformer_xl_from_scratch (src)
* 2. A simple toy example showing the core idea of Transformer-XL: an additional memory of past hidden states is used to encode history
* 2.1 Build Transformer-XL with multi-head attention
* Show how to reuse previous hidden states to implement the "recurrence mechanism" (see the sketch after this list); each layer attends over two sources:
- the output of the previous layer for the current segment
- the cached output of the previous layer from the previous segment (gradient stopped)
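A minimal sketch of that recurrence, assuming a hypothetical attention layer that takes separate query/key/value tensors:

```python
import torch

def step_with_memory(attn_layer, h, mem):
    """One layer of Transformer-XL style recurrence (attn_layer is hypothetical).

    h   : (cur_len, batch, d) output of the previous layer, current segment
    mem : (mem_len, batch, d) cached output of that same layer depth,
          but from the previous segment
    """
    # Queries come from the current segment only; keys/values attend over
    # the concatenation [previous-segment memory; current segment].
    context = torch.cat([mem.detach(), h], dim=0)  # no gradient into the cache
    out = attn_layer(query=h, key=context, value=context)
    new_mem = h.detach()  # current states become the next segment's memory
    return out, new_mem
```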
* Show how to use relative positional encoding to incorporate position information (a sketch of the relative-shift trick follows)
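A sketch of the "relative shift" trick used in the original Transformer-XL code, which turns attention scores indexed by absolute key position into scores indexed by relative distance:

```python
import torch

def rel_shift(x):
    """x: (q_len, k_len, batch, n_head) raw query-to-relative-position scores.
    Pad with a zero column, reshape, and drop the first row so each query
    row ends up aligned with its own relative distances."""
    zero_pad = torch.zeros((x.size(0), 1, *x.size()[2:]), dtype=x.dtype)
    x_padded = torch.cat([zero_pad, x], dim=1)             # (q_len, k_len+1, ...)
    x_padded = x_padded.view(x.size(1) + 1, x.size(0), *x.size()[2:])
    return x_padded[1:].view_as(x)                         # shifted scores
```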
* Complete implementation of Transformer-XL
* An excellent tutorial version of XLNet from the link above
* Extra comments added for easier understanding
* Requirements: Python 3 + PyTorch v1.2
* TODO: Add GPU support
* Build BERT - Bidirectional Transformer
* The pre-training task is two-fold (see paper section 3.1; a data-preparation sketch follows this list):
1) to predict whether the second sentence actually follows the first (Next Sentence Prediction)
2) to predict the masked-out words of a sentence (Masked LM)
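A minimal, toy-level sketch of the data preparation for both tasks (whitespace tokens; the 80/10/10 masking split follows the BERT paper):

```python
import random

def make_nsp_pair(doc, all_sentences):
    """Next Sentence Prediction: 50% of the time return the true next
    sentence (label 1), otherwise a random corpus sentence (label 0)."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1
    return doc[i], random.choice(all_sentences), 0

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Masked LM: select ~15% of positions; replace 80% of those with
    [MASK], 10% with a random token, and leave 10% unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)  # None = not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                        # model must recover this
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token
            elif r < 0.9:
                inputs[i] = random.choice(vocab)   # random replacement
    return inputs, labels
```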
* step 1: generate the vocabulary file "vocab.small" in ./data (a hypothetical sketch is shown below)
* step 2: train the network
* See transformer_bert_from_scratch_5.py for more details.
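The script above defines the actual vocab format; purely as a hypothetical illustration, step 1 could be as simple as counting whitespace tokens and writing one per line:

```python
from collections import Counter

def build_vocab(corpus_path, out_path="./data/vocab.small", max_size=30000):
    # Hypothetical sketch: one token per line, most frequent first.
    counter = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.split())
    with open(out_path, "w", encoding="utf-8") as f:
        for token, _ in counter.most_common(max_size):
            f.write(token + "\n")
```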
* Build BERT - Bidirectional Transformer
* Use the official PyTorch API of the `transformers` library to load existing code and pre-trained models (usage sketch below)
* pip install transformers tb-nightly
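A minimal usage sketch (model name and return shapes as in the transformers 2.x releases that matched PyTorch 1.2):

```python
import torch
from transformers import BertModel, BertTokenizer

# Load a pre-trained checkpoint and its matching tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Encode a sentence with [CLS]/[SEP] special tokens and run the encoder.
ids = tokenizer.encode("Hello, BERT!", add_special_tokens=True)
with torch.no_grad():
    last_hidden = model(torch.tensor([ids]))[0]  # (1, seq_len, 768)
```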
* A Lite BERT (ALBERT), which reduces BERT's parameter count to ~20%
* Decouple the word embedding size from the hidden size by factorizing the embedding into two matrices (see the sketch below)
- parameters are reduced from O(V*H) to O(V*E + E*H), where E << H
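A sketch of the factorization (class name is illustrative):

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """V x H lookup replaced by V x E lookup plus E x H projection, E << H."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_size)  # O(V*E) params
        self.project = nn.Linear(embed_size, hidden_size)   # O(E*H) params

    def forward(self, token_ids):
        return self.project(self.lookup(token_ids))

# e.g. V=30000, H=768: a full table costs ~23.0M parameters,
# while E=128 costs 30000*128 + 128*768 ≈ 3.9M.
```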
* Cross-layer parameter sharing (see the sketch below)
- ALBERT's default is to share all parameters across layers (see paper section 3.1)
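A sketch of the share-all setting: one layer's weights are reused at every depth, so the parameter count no longer grows with the number of layers:

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        # A single set of layer weights, reused at every depth.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # same module, applied num_layers times
        return x
```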
* Sentence Order Prediction
- NSP (Next Sentence Prediction) in BERT is not effective (it conflates easy topic prediction with coherence prediction, so the model can solve it without learning much about sentence relationships)
- SOP focuses on inter-sentence coherence (see the sketch below):
the positive case is two consecutive sentences in their original order;
the negative case is the same two sentences with their order swapped.
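A sketch of SOP example construction; both sentences always come from the same document, so only coherence (order) separates the two classes:

```python
import random

def make_sop_example(doc):
    """doc: list of consecutive sentences from one document."""
    i = random.randrange(len(doc) - 1)
    first, second = doc[i], doc[i + 1]
    if random.random() < 0.5:
        return first, second, 1   # positive: original order
    return second, first, 0       # negative: same sentences, swapped
```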
Python = 2.7 and 3.6
PyTorch = 1.2+ (for both Python versions)
GPU: training requires 4 GB+ memory; testing requires 1 GB+ memory.