# AmberzzZZ/transformer

keras implementation of vision transformers
## seq2seq model

example: character-level english to french

### data
dataset format: english sentence \t french sentence \t ...
target: english
input: french
start & end tokens: '\t' + decoder_input_sentence + '\n' (tab as the start token, newline as the end token)
character vocabulary: letters, digits, spaces and punctuation; decoder_input_sentence carries both the start and end tokens, while decoder_target_sentence runs one step ahead of it, so it drops the start token but keeps the end token
one-hot: within the sentence, each character gets its own one-hot vector; the remaining padded positions use the one-hot vector of the space character

### data dim
encoder_input_data: 
    [batch_size, max_sentence_length, num_eng_characters]
    each element is a one-hot vec (standing for a specific character)

decoder_input_data: 
    [batch_size, max_sentence_length, num_eng_characters]
    each element is a one-hot vec

decoder_target_data: 
    [batch_size, max_sentence_length, num_eng_characters]
    offset from decoder_input_data by one step (see the sketch below): 
        decoder_target runs one step ahead of decoder_input; the decoder's prediction/gt at the current step becomes the input of the next step
        decoder_target_data[:, t, :] = decoder_input_data[:, t+1, :]
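
A minimal sketch of how the three arrays can be filled; the variable names (input_texts, target_texts, the char-to-index dicts, the max lengths and vocab sizes) are assumptions, not necessarily the names used in this repo:

```python
import numpy as np

# assumes input_texts / target_texts (targets already wrapped in '\t' ... '\n'),
# input_token_index / target_token_index and the max lengths / vocab sizes exist
encoder_input_data = np.zeros((len(input_texts), max_encoder_len, num_encoder_tokens), dtype="float32")
decoder_input_data = np.zeros((len(input_texts), max_decoder_len, num_decoder_tokens), dtype="float32")
decoder_target_data = np.zeros((len(input_texts), max_decoder_len, num_decoder_tokens), dtype="float32")

for i, (src, tgt) in enumerate(zip(input_texts, target_texts)):
    for t, ch in enumerate(src):
        encoder_input_data[i, t, input_token_index[ch]] = 1.0
    encoder_input_data[i, len(src):, input_token_index[" "]] = 1.0   # pad with the space one-hot
    for t, ch in enumerate(tgt):
        decoder_input_data[i, t, target_token_index[ch]] = 1.0
        if t > 0:   # target is the decoder input shifted one step ahead (drops the start token)
            decoder_target_data[i, t - 1, target_token_index[ch]] = 1.0
    decoder_input_data[i, len(tgt):, target_token_index[" "]] = 1.0
    decoder_target_data[i, len(tgt) - 1:, target_token_index[" "]] = 1.0
```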

### model
    model input: 
        encoder input: time-distributed input sequence
        decoder input: target sequence starting with \t
        expected input data shape: (batch_size, timesteps, data_dim)
    model output: predicted sequence, (batch_size, timesteps, data_dim)
    model target: target sequence, (batch_size, timesteps, data_dim)
    encoder_states: an LSTM has two states (hidden state and cell state), each with the same shape as the per-step output, (batch_size, latent_dim)
    initial_state: specifies the initial state of an RNN layer;
        at the first step the decoder reads the start token, and its hidden state comes from the information encoded by the encoder;
        at every later step the input can be the previous step's prediction/gt, and the recurrent state is the gradually accumulated context

    * with the functional-API Model class, the input layer is defined with shape=(timesteps, data_dim)
    * with a Sequential model, input_shape=(timesteps, data_dim) must be declared explicitly in the first LSTM layer
    * batch_size is set for axis-0 by default for both the methods above
    * but in a stateful LSTM layer, batch_size must be declared explicitly:
        model.add(LSTM(32, return_sequences=True, stateful=True, batch_input_shape=(batch_size, timesteps, data_dim)))
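
A minimal functional-API sketch of the training model described above (teacher forcing); latent_dim and the two vocab sizes are placeholder values:

```python
from keras.layers import Input, LSTM, Dense
from keras.models import Model

num_encoder_tokens, num_decoder_tokens, latent_dim = 70, 90, 256   # placeholders

encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]                 # each (batch_size, latent_dim)

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```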

### teacher-forcing or reinjection
The decoder's output is the predicted sequence (its ground truth is the corresponding target sequence); its input is that output sequence lagged one step behind, i.e. you can only take and reuse what has already been produced.
With teacher forcing, the decoder input is the target sequence;
with reinjection, the decoder input is the predicted sequence.
[QUESTION] why is teacher forcing used more often?

### inference
The decoder changes quite a bit at inference time (see the sketch below):
1. the hidden states of every time step must be passed around explicitly
2. the step-by-step prediction loop must be implemented explicitly
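
A step-by-step decoding sketch; it reuses the layers and names from the training sketch above and assumes the usual target_token_index / reverse_target_char_index dicts exist:

```python
import numpy as np
from keras.layers import Input
from keras.models import Model

encoder_model = Model(encoder_inputs, encoder_states)

state_h_in, state_c_in = Input(shape=(latent_dim,)), Input(shape=(latent_dim,))
dec_out, h, c = decoder_lstm(decoder_inputs, initial_state=[state_h_in, state_c_in])
decoder_model = Model([decoder_inputs, state_h_in, state_c_in],
                      [decoder_dense(dec_out), h, c])

def decode_sequence(input_seq, max_len=100):
    states = encoder_model.predict(input_seq)            # [h, c] from the encoder
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.0     # start token
    decoded = ""
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target_seq] + states)
        idx = int(np.argmax(probs[0, -1, :]))
        ch = reverse_target_char_index[idx]
        if ch == '\n':                                    # end token
            break
        decoded += ch
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, idx] = 1.0                       # reinject the previous prediction
        states = [h, c]                                   # explicitly carry the states forward
    return decoded
```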


what if:
* GRU: a GRU simply drops the cell state; its internal structure is based on the LSTM, and its output behaves like a plain RNN's
* word-level: 
    the word vocabulary is much larger than the character vocabulary, so one-hot embeddings become too high-dimensional and far too sparse;
    an Embedding layer can be added after the input layer to turn word indices into dense vectors of a fixed size (see the sketch below)
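
A small word-level sketch with an Embedding layer in front of the encoder; vocab_size / emb_dim / latent_dim are placeholders:

```python
from keras.layers import Input, Embedding, LSTM

vocab_size, emb_dim, latent_dim = 10000, 128, 256          # placeholders

word_inputs = Input(shape=(None,), dtype="int32")          # (batch, timesteps) of word indices
dense_vecs = Embedding(vocab_size, emb_dim)(word_inputs)   # (batch, timesteps, emb_dim), dense instead of one-hot
_, state_h, state_c = LSTM(latent_dim, return_state=True)(dense_vecs)
```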

further improvements:
* attention
* bi-rnn
* deeper: stacking layers

## Bi-directional RNN

Motivation:
    * the LSTM alleviates the RNN's long-range dependency problem to some extent, but does not fully solve it
    * a very common scenario is fill-in-the-blank in a sentence, which needs both the preceding and the following context; a unidirectional RNN cannot handle it
    * likewise an image task that needs global information for classification: the information a unidirectional RNN gathers is incomplete

Implementation: parse the sequence a second time in reversed order

example: MNIST classification

keras layer Bidirectional: 
    * https://github.com/keras-team/keras/blob/d71247dcd805e58110a784b03cf2fcbaa1c837c8/keras/layers/wrappers.py
    * the forward and backward layers receive the same inputs
    * the forward and backward layers can be the same kind of layer or different layers
    * their outputs are concatenated along the embedding dim (other fusion modes exist); in this example the LSTM's output dim is 128, so the bi-LSTM's output dim is 256 (see the sketch below)
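
A minimal bi-LSTM sketch for the MNIST example, reading each 28x28 image as 28 row vectors; the 128-unit LSTM gives a 256-d concatenated output:

```python
from keras.layers import Input, LSTM, Bidirectional, Dense
from keras.models import Model

inputs = Input(shape=(28, 28))                                   # 28 timesteps x 28 features
feat = Bidirectional(LSTM(128), merge_mode="concat")(inputs)     # fw + bw -> (batch, 256)
outputs = Dense(10, activation="softmax")(feat)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```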

## attention

Motivation:
    In the basic seq2seq model the encoder compresses the input sequence into a single context vector ([b, dim]); during decoding,
    this fixed context vector, which encodes the input as a whole, is fed to the decoder as the initial_state.
    Consider machine translation: the word currently being translated is not strongly related to every element of the input sequence; in most cases it is only related to the corresponding source word.

Implementation: score how well the j-th word of the input sentence matches the i-th word of the output sentence, and use a re-weighted context vector at every step (see the sketch after this list)
    * for each decoding step
    * s is the decoder output, [1, dim], the current word
    * h is the encoder output, [N, dim], all input word vectors
    * $e_{ij} = a(s_{i-1}, h_j)$
    * $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$
    * $c_i = \sum_j \alpha_{ij} h_j$
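
A numpy sketch of one decoding step matching the three formulas above; Wa, Ua, va are the parameters of an assumed additive alignment model a(.):

```python
import numpy as np

def attention_step(s_prev, H, Wa, Ua, va):
    # s_prev: previous decoder state [dim]; H: encoder outputs [N, dim]
    e = np.tanh(s_prev @ Wa + H @ Ua) @ va            # e_ij = a(s_{i-1}, h_j), shape [N]
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                              # softmax over input positions
    c = alpha @ H                                     # context vector c_i, shape [dim]
    return c, alpha
```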


example: character-level english to french

The attention used here is a learnable attention, similar to an SE block: compute a linear mapping value of each embedding against the other embeddings, then apply a softmax.

## keras MultiHeadAttention layer

tf 2.4.1; standalone keras presumably needs to be 2.3 or above
https://github.com/keras-team/keras/blob/70d7d07bd186b929d81f7a8ceafff5d78d8bd701/keras/layers/multi_head_attention.py

given sequence length N, batch size B, key dim d, num_heads m, value_dim dv:

step1: projects `query`, `key` and `value`, 
* each is a list of tensors of length `num_attention_heads`
* each tensor [B, N, d]
* trainable variables Wq, Wk, Wv (and biases)

step2: compute attention
* dot(Q, K)
* scaled by sqrt(d), [B, N, N] per head
* softmax over the key axis to obtain attention probabilities, [B, N, N] per head
* dropout; in our implementation the dropout is placed after the MSA layer instead, and since it drops whole feature dimensions either place works
* reweight the content vectors V, [B, N, m*dv]

step3: final dense
* concat the multi-heads along the feature axis, [B, N, m*dv]
* linear projection back to the query/model dim, [B, N, d] (see the usage sketch below)
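
A quick usage sketch of the built-in layer (self-attention, so query = key = value); the num_heads / key_dim / shapes below are placeholders:

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64, dropout=0.1)
x = tf.random.normal((2, 16, 512))                                   # (B, N, d_model)
out, scores = mha(query=x, value=x, key=x, return_attention_scores=True)
print(out.shape)      # (2, 16, 512): the final dense projects back to the query dim
print(scores.shape)   # (2, 8, 16, 16): per-head attention probabilities
```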

## transformer: attention is all you need

The attention here is the transformer attention, built on Q, K, V,
which can be seen as characterizing different facets of a token embedding: https://zhuanlan.zhihu.com/p/158952064
* Q: query, the vector a word uses to query the others
* K: key, the vector by which a word gets queried
* V: value, the vector carrying the word's content

multi-head self-attention:
* dq = dk = dv = dmodel/h = 64
* h = 8

encoder:
N=6: the encoder is a stack of 6 identical attention blocks
each attention block contains: MSA, FF, LN, residual connections, dropout
    MSA: MultiHeadAttention + add&norm
    FF: dense + relu/gelu + dense
input: input embedding + positional embedding
self-attention: single input, x = q = k = v (see the sketch below)
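
A sketch of one such encoder block in the original post-norm order (sublayer -> dropout -> add -> LN); the dims are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, d_model=512, num_heads=8, d_ff=2048, drop=0.1):
    # MSA sub-layer + add&norm
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)(x, x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(drop)(attn))
    # FF sub-layer: dense + relu/gelu + dense, then add&norm
    ff = layers.Dense(d_ff, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(drop)(ff))

tokens = layers.Input(shape=(16, 512))       # input embedding + positional embedding
x = tokens
for _ in range(6):                           # N=6 identical blocks
    x = encoder_block(x)
encoder = tf.keras.Model(tokens, x)
```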

## vision transformer (ViT)

official repo: https://github.com/google-research/vision_transformer
third-party repo: https://github.com/lucidrains/vit-pytorch

task: supervised classification

inputs: split the image into non-overlapping 16x16 patches, flatten them, project them down with a learnable linear layer, prepend the cls token, then add the PE
    * image patch sequence & trainable linear projection
    * PE: trainable 1d embedding
    * x0: trainable prepended 1d embedding (the cls token)
    * [x0, patch_embeddings, ] + PEs
    in the implementation this is a single conv layer whose kernel size and stride both equal the patch size, linearly mapping each 3-channel patch to an emb-dim vector (see the sketch below)
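
A sketch of that input pipeline as a custom layer: one conv with kernel = stride = patch size, a prepended cls token, and a learnable 1d PE; patch_size=16 / emb_dim=768 are placeholder values:

```python
import tensorflow as tf
from tensorflow.keras import layers

class ViTEmbedding(layers.Layer):
    def __init__(self, patch_size=16, emb_dim=768, **kwargs):
        super().__init__(**kwargs)
        self.patch_size, self.emb_dim = patch_size, emb_dim
        # kernel size = stride = patch size: one linear projection per patch
        self.proj = layers.Conv2D(emb_dim, kernel_size=patch_size, strides=patch_size)

    def build(self, input_shape):
        n = (input_shape[1] // self.patch_size) * (input_shape[2] // self.patch_size)
        self.cls_token = self.add_weight("cls_token", shape=(1, 1, self.emb_dim),
                                         initializer="zeros")
        self.pos_emb = self.add_weight("pos_emb", shape=(1, n + 1, self.emb_dim),
                                       initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))

    def call(self, images):
        x = self.proj(images)                                       # (b, H/P, W/P, emb_dim)
        x = tf.reshape(x, (tf.shape(x)[0], -1, self.emb_dim))       # (b, num_patches, emb_dim)
        cls = tf.repeat(self.cls_token, tf.shape(x)[0], axis=0)     # prepend x0
        return tf.concat([cls, x], axis=1) + self.pos_emb           # [x0, patch_embeddings, ] + PEs

print(ViTEmbedding()(tf.random.uniform((2, 224, 224, 3))).shape)    # (2, 197, 768)
```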

model: transformer encoder
    * patch_size
    * hidden_size: kept constant throughout
    * MSA layer: no mask, the simplest version

MLP head: 


GeLU:
    Gaussian error linear unit: x * P(X <= x), where X ~ N(0, 1)
    if approx:
        y = 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))
    else:
        y = 0.5 * x * (1 + erf(x / sqrt(2)))
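
Both forms written out with tf ops:

```python
import math
import tensorflow as tf

def gelu_exact(x):
    return 0.5 * x * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))

def gelu_tanh_approx(x):
    return 0.5 * x * (1.0 + tf.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))
```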

LN:
    https://www.geek-book.com/src/docs/keras/keras/keras.io/api/layers/normalization_layers/layer_normalization/index.html
    in the trainable case, given inputs [b, (hwd), c] and normalizing over the non-batch axes, the parameter count is 2*(hwd)*c, so it is more natural to use it on 1D (sequence) data


Model subclassing: class ***(keras.Model)
    * layers defined in __init__ cannot be reused
    * when layers are defined in bulk as a layer list, every layer must be declared under the self namespace
    * checkpoints can only be saved with save_weights, not save_model, because get_config() and serialization are not supported (see the sketch below)
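
A minimal subclassing sketch illustrating these points (a layer list held under self so the weights are tracked, and save_weights for checkpoints):

```python
import tensorflow as tf
from tensorflow.keras import layers

class TinyEncoder(tf.keras.Model):
    def __init__(self, depth=4, dim=128, **kwargs):
        super().__init__(**kwargs)
        # one fresh layer instance per position, all held in the self namespace
        self.blocks = [layers.Dense(dim, activation="relu", name="block%d" % i)
                       for i in range(depth)]
        self.head = layers.Dense(10)

    def call(self, x, training=False):
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

model = TinyEncoder()
model(tf.zeros((1, 32)))          # build the weights
model.save_weights("ckpt")        # TF-format checkpoint; get_config()/full-model saving is not supported here
```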


training details:
* cosine learning rate
* Adam + L2 reg: momentum=0.9, wd=1e-5

Main drawbacks:
* the model is huge: large batch sizes, TPU-scale training
* the amount of training data must be huge, otherwise accuracy suffers; training from scratch on your own dataset is basically infeasible

Improvements built on ViT include: DeiT, T2T-ViT, LV-ViT

## LV-ViT

official repo: https://github.com/zihangJiang/TokenLabeling

patch embedding
4-layer conv, kernel size [7,3,3,8], stride [2,1,1,8], filters 64
[conv-bn-relu]-[conv-bn-relu]-[conv-bn-relu]-[conv-bias]
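
A sketch of that conv stem ([conv-bn-relu] x3 + a final biased conv, kernels [7,3,3,8], strides [2,1,1,8]); the output embedding dim is a placeholder:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_patch_embedding(x, emb_dim=384):
    for k, s in [(7, 2), (3, 1), (3, 1)]:
        x = layers.Conv2D(64, k, strides=s, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # final projection with bias: one token per 8x8 region of the downsampled map
    return layers.Conv2D(emb_dim, 8, strides=8, use_bias=True)(x)

tokens = conv_patch_embedding(layers.Input(shape=(224, 224, 3)))    # (b, 14, 14, emb_dim) before flattening
```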

re-labeling
run another model over the training set to produce a K-dim dense score map
when training our model, after random cropping, recompute the label from the cropped score map

token labeling
based on the dense score map from re-labeling, we can further give every token its own K-dim label
each token's label and prediction give an independent CE term: the auxiliary token labeling loss

mixtoken
applied to the token grid in CutMix style (a crop box) rather than as noisy drops
the crop box width/height follow a beta distribution (small values are most likely, which keeps the overall label mix beta-distributed)
token labels are per token, so MixToken does not affect learning each token's label; the source code therefore restores the cropped patches before computing the token loss, so the token gt labels need no conversion
MixToken essentially still augments the original image, so it only affects the cls token's prediction; the cls token target has to be recomputed from the random mask

loss
the out embedding of the cls_token, [b, D], goes through an MLP prediction head to predict the overall class probabilities
the out embeddings of the other tokens, [b, N, D], go through a shared MLP prediction head to predict each token's class; average over all tokens
then take the weighted sum: cls_loss + 0.5*token_loss (see the sketch below)
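
A sketch of that objective; the shapes and the dense soft-label format are assumptions:

```python
import tensorflow as tf

cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

def lvvit_loss(cls_logits, cls_label, token_logits, token_labels, beta=0.5):
    # cls_logits [b, K]; token_logits [b, N, K]; token_labels [b, N, K] dense soft labels
    cls_loss = cce(cls_label, cls_logits)            # CE on the overall class prediction
    token_loss = cce(token_labels, token_logits)     # mean CE over all b*N tokens
    return cls_loss + beta * token_loss
```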

encoder block
* stochastic depth (drop path): randomly drop the residual branch per sample
* residual_scale: downscaling the residual branch helps, scale=2

training details
* lr: linearly scaled with batch size, 1e-3*batch_size/1024, 5 warmup epochs + cosine decay
* AdamW: weight_decay=5e-2
* batch_size: 1024
* dropout = 0.
* dropconnect = .1
* randAug, mixup

## swin

official repo: https://github.com/microsoft/Swin-Transformer
keras version: https://github.com/keras-team/keras-io/blob/master/examples/vision/swin_transformers.py

related papers:
original Swin: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin for object detection: End-to-End Semi-Supervised Object Detection with Soft Teacher
Swin for segmentation: Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

swin family:
swin-T: (224,224), C=96, num_layers=[2,2,6,2], num_heads=[3,6,12,24]
swin-S: (224,224), C=96, num_layers=[2,2,18,2], num_heads=[3,6,12,24]
swin-B: (224,224) / (384,384), C=128, num_layers=[2,2,18,2], num_heads=[4,8,16,32]
swin-L: (224,224) / (384,384), C=192, num_layers=[2,2,18,2], num_heads=[6,12,24,48]

what's new in swin: 
* hierarchical: typical ViTs are columnar, so the resolution never changes during the forward pass, which is computationally unfriendly in the shallow layers and hard to plug into FPN and downstream dense-prediction tasks
* window attention: a window is one level above a patch; attention is decomposed into window-based global attention and within-window local attention, which cuts the computation

positional embeddings:
* no pos: so far only Google's MLP-Mixer is known to drop the PE, claiming the position is learned implicitly
* abs pos: what most ViTs do; a set of fixed 1D/2D values computed from the input size
* rel pos: what this paper does; no abs PE, but a bias is added to the QK^T logits inside the MSA softmax

classification head:
* swin has no cls token; stage 4 finally outputs token embeddings of shape [b, H/32*W/32, 8C]
* average over all token embeddings, similar to GAP, [b, 8C]
* then feed a linear classifier, [b, n_classes]

patch merging:
* first fold the spatial dims into the feature dim
* then a linear projection
* keeps more information than pooling (see the sketch below)
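
A sketch of patch merging with plain tf ops: fold each 2x2 neighbourhood into the channel dim (4C), then linearly project to 2C:

```python
import tensorflow as tf
from tensorflow.keras import layers

def patch_merging(x, C):
    # x: (b, H, W, C) -> concat the 4 interleaved sub-grids -> (b, H/2, W/2, 4C)
    x = tf.concat([x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :],
                   x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]], axis=-1)
    return layers.Dense(2 * C, use_bias=False)(x)          # linear projection 4C -> 2C

merged = patch_merging(layers.Input(shape=(56, 56, 96)), C=96)     # (b, 28, 28, 2C)
```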

relative position index: 
* describes the relative position between any two points inside a window: [wh, wh], where each wh indexes a point on the window map
* the initial relative distances are measured on the h and w axes separately, shifted into the ranges [0, 2h-2] and [0, 2w-2]
* the 2-dim coords can be merged into 1-dim, the same way two digits are merged into a two-digit number
* shared among windows
* constant (not learned)

relative position bias: 
* stores a position bias for every possible relative offset: [2h-1, 2w-1, n_heads]
* truncated normal distribution: used to initialize the table
* every relative distance stored in the relative position index looks up a set of biases in the relative position bias table: [wh, wh, n_heads]
* learnable (see the sketch below)
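
A numpy sketch of both objects for an (h, w) window: the constant [wh, wh] index, and the bias table (flattened to (2h-1)*(2w-1) rows in code) that it indexes into:

```python
import numpy as np

def relative_position_index(h, w):
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))   # (2, h, w)
    coords = coords.reshape(2, -1)                                  # (2, wh)
    rel = coords[:, :, None] - coords[:, None, :]                   # (2, wh, wh), values in [-(h-1), h-1]
    rel[0] += h - 1                                                 # shift to [0, 2h-2]
    rel[1] += w - 1                                                 # shift to [0, 2w-2]
    return rel[0] * (2 * w - 1) + rel[1]                            # merge the 2-dim coords into 1-dim

n_heads = 3                                                         # placeholder
index = relative_position_index(7, 7)                               # constant, shared among windows, (49, 49)
bias_table = np.random.normal(0, 0.02, ((2 * 7 - 1) * (2 * 7 - 1), n_heads))   # learnable, trunc-normal init
bias = bias_table[index]                                            # (49, 49, n_heads), added to QK^T before softmax
```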

window attention:
* split the feature map into non-overlapping windows, each containing M*M patches
* do self-attention inside each window, with parameters shared across windows: window-based local attention
* window_size=7: the feature map size must be divisible by 7, otherwise pool first
* shifted-window: 
    without shifted windows the receptive field would only double at each stage (from patch merging) and would not grow at all within a stage
    given window_size=M: do not start the window partition from the top-left corner, but shift it by M//2 along both h and w
    equivalent to rolling the feature map by that offset and then doing the normal partition (see the sketch below)
    tf.manip.roll / torch.roll
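
A sketch of the cyclic shift with tf.roll (tf.manip.roll in tf1.x):

```python
import tensorflow as tf

M = 7                                                             # window size
x = tf.random.normal((1, 56, 56, 96))
shifted = tf.roll(x, shift=[-(M // 2), -(M // 2)], axis=[1, 2])   # shift h and w by M//2
# ... window partition + window attention on `shifted` ...
restored = tf.roll(shifted, shift=[M // 2, M // 2], axis=[1, 2])  # roll back afterwards
```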
