
fanchenyou/transformer-study

Tutorials for several Transformer network variants

1. transformer_encoder, paper, src, tutorial

* Use the PyTorch nn.Transformer package to build an encoder for language modeling (a minimal sketch follows below)
* PyTorch 1.2 + TorchText
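
A minimal sketch (not the repository's exact code) of how `nn.TransformerEncoder` can be wired up as a language model; the class name, dimensions, and the omission of positional encoding are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """Minimal language-model encoder built on nn.TransformerEncoder."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2, dim_ff=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_ff)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.decoder = nn.Linear(d_model, vocab_size)  # next-token logits
        self.d_model = d_model

    def forward(self, src):
        # src: (seq_len, batch) token ids; positional encoding omitted for brevity.
        # A causal (upper-triangular -inf) mask keeps each position from peeking ahead.
        seq_len = src.size(0)
        mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
        h = self.embed(src) * math.sqrt(self.d_model)
        h = self.encoder(h, mask=mask)
        return self.decoder(h)  # (seq_len, batch, vocab_size)

logits = TransformerLM(vocab_size=1000)(torch.randint(0, 1000, (35, 8)))
```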

2 & 2.1. transformer_xl_from_scratch, src

* 2: a simple toy example showing the core idea of Transformer-XL, which uses additional memory to encode history
* 2.1: build Transformer-XL with multi-head attention
* Show how to use previous hidden states to implement the "recurrence mechanism" (sketched after this list), where each layer attends to:
  - the output of the previous hidden layer of the current segment
  - the output of the previous hidden layer from the previous segment
* Show how to use relative positional encoding to incorporate position information
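
A toy sketch of the recurrence mechanism only: the previous segment's hidden states are cached, gradients are stopped through them, and they are concatenated with the current segment to form the attention keys/values. The class and variable names are assumptions, and relative positional encoding is omitted here.

```python
import torch
import torch.nn as nn

class RecurrentSegmentLayer(nn.Module):
    """Toy self-attention layer with Transformer-XL style segment recurrence."""
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h, mem=None):
        # h:   (cur_len, batch, d_model) previous hidden layer's output, current segment
        # mem: (mem_len, batch, d_model) previous hidden layer's output, previous segment
        ctx = h if mem is None else torch.cat([mem.detach(), h], dim=0)  # no grad into memory
        out = self.norm1(h + self.attn(h, ctx, ctx)[0])  # queries come from the current segment only
        out = self.norm2(out + self.ff(out))
        new_mem = h.detach()  # cache this layer's input as memory for the next segment
        return out, new_mem

layer = RecurrentSegmentLayer()
out1, mem = layer(torch.randn(16, 2, 128))              # first segment, no memory
out2, _ = layer(torch.randn(16, 2, 128), mem)           # second segment attends to cached memory
```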

3. transformer_xl full release, src, tutorial

(Figure: Transformer-XL network architecture)

* Complete implementation of Transformer-XL

4. xlnet, paper, src, tutorial

* An excellent tutorial implementation of XLNet from the link above
* Additional comments added for easier understanding
* Requirements: Python 3 + PyTorch v1.2
* TODO: Add GPU support

5. Bert from scratch, paper, src, tutorial

* Build BERT (Bidirectional Encoder Representations from Transformers)
* The pre-training task is two-fold (see paper section 3.1):
    1) predict whether the second sentence actually follows the first (Next Sentence Prediction)
    2) predict the masked words of a sentence (Masked LM)
* Step 1: generate the vocabulary file "vocab.small" in ./data
* Step 2: train the network
* See transformer_bert_from_scratch_5.py for more details; a sketch of the two objectives follows below.
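
A hedged sketch of how the two objectives shape the training examples. The special-token ids, helper names, and the -100 ignore-label convention are illustrative assumptions, not the script's actual interface.

```python
import random

MASK_ID, CLS_ID, SEP_ID = 4, 1, 2   # illustrative special-token ids

def make_mlm_example(tokens, vocab_size, mask_prob=0.15):
    """Masked LM: corrupt ~15% of tokens; labels hold the original ids."""
    inputs, labels = list(tokens), [-100] * len(tokens)   # -100 = ignored by the loss
    for i, t in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = t
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                        # 80% -> [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)   # 10% -> random token
            # remaining 10% -> keep the original token
    return inputs, labels

def make_nsp_example(sent_a, sent_b, random_sent):
    """Next Sentence Prediction: 50% true next sentence, 50% random sentence."""
    is_next = random.random() < 0.5
    second = sent_b if is_next else random_sent
    ids = [CLS_ID] + sent_a + [SEP_ID] + second + [SEP_ID]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(second) + 1)
    return ids, segment_ids, int(is_next)
```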

6. Bert from Pytorch Official Implementation, paper, src

* Build BERT (Bidirectional Encoder Representations from Transformers)
* Use the official PyTorch API to reuse the existing implementation and pre-trained models (usage sketched below)
* `pip install transformers tb-nightly`
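
A brief sketch of loading a pre-trained BERT through the `transformers` package; the model name `bert-base-uncased` is an example, and the tuple-style output indexing is chosen to work across older and newer library versions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Encode a sentence and run it through the pre-trained encoder.
input_ids = torch.tensor([tokenizer.encode("Hello, transformers!", add_special_tokens=True)])
with torch.no_grad():
    outputs = model(input_ids)
last_hidden_state = outputs[0]   # (batch, seq_len, hidden_size)
print(last_hidden_state.shape)
```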

7. ALBERT, A Lite BERT, paper, src, tutorial

* A Lite BERT that reduces BERT's parameter count to roughly 20%
* Decouple the word embedding size from the hidden size by factorizing the embedding into two projection matrices (sketched after this list)
   - parameters drop from O(V*H) to O(V*E + E*H), with E << H
* Cross-layer parameter sharing
   - ALBERT's default is to share all parameters across layers (see paper section 3.1)
* Sentence Order Prediction (SOP)
   - NSP (Next Sentence Prediction) in BERT is not very effective: its negative pairs come from different documents, so topic cues alone largely solve the task
   - SOP targets inter-sentence coherence instead:
     the positive case is two consecutive sentences in their proper order;
     the negative case is the same two sentences with their order swapped.
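
A minimal sketch of the two parameter-reduction ideas: a factorized embedding (V×E plus E×H matrices instead of one V×H table) and a single encoder layer whose weights are reused at every depth. The class name and dimensions are illustrative, not ALBERT's actual configuration code.

```python
import torch
import torch.nn as nn

class TinyALBERTEncoder(nn.Module):
    """Illustrates factorized embeddings and cross-layer parameter sharing."""
    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768,
                 nhead=12, num_layers=12):
        super().__init__()
        # Factorized embedding: O(V*E + E*H) parameters instead of O(V*H)
        self.word_embed = nn.Embedding(vocab_size, embed_size)   # V x E
        self.embed_proj = nn.Linear(embed_size, hidden_size)     # E x H
        # Cross-layer sharing: one layer's weights reused at every depth
        self.shared_layer = nn.TransformerEncoderLayer(hidden_size, nhead)
        self.num_layers = num_layers

    def forward(self, token_ids):
        # token_ids: (seq_len, batch)
        h = self.embed_proj(self.word_embed(token_ids))
        for _ in range(self.num_layers):
            h = self.shared_layer(h)   # same parameters applied at every layer
        return h

enc = TinyALBERTEncoder()
out = enc(torch.randint(0, 30000, (16, 2)))   # (16, 2, 768)
```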

Requirements

Python 2.7 and 3.6

PyTorch 1.2+ [here] for both Python versions

GPU training requires 4 GB+ memory; testing requires 1 GB+ memory.
