
AttentiveMLP

Official codebase for the ECML-PKDD 2023 paper "Attentive Multi-Layer Perceptron for Non-autoregressive Generation".

Figure: the computation process of the two variants of AMLP.

Highlights

  • We propose AMLP, an efficient attention mechanism that can be used for both self-attention and cross-attention in non-autoregressive (NAR) models.
  • AMLP is a strong attention mechanism even compared to softmax attention, while maintaining linear complexity (a minimal sketch of this idea follows below).

Our study delivers promising efficiency gains by speeding up the computation of NAR generation.
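To make the linear-complexity claim concrete, here is a minimal PyTorch sketch of landmark-style attention in the spirit of AMLP: the context is first summarized into $c$ learned landmark slots, and queries then attend over those summaries, so the cost grows linearly with sequence length. The class and variable names are ours for illustration only, and the exact MLP/gating structure of AMLP differs; see amlp.py and the paper for the actual formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkAttentionSketch(nn.Module):
    """Illustrative landmark attention; NOT the exact AMLP module from amlp.py."""

    def __init__(self, d_model: int, num_landmarks: int = 16):
        super().__init__()
        # c learned landmark vectors (the --landmarks / $c$ value)
        self.landmarks = nn.Parameter(torch.randn(num_landmarks, d_model))
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query: (B, n_q, d); context: (B, n_k, d)
        q = self.q_proj(query)
        k = self.k_proj(context)
        v = self.v_proj(context)
        lm = self.landmarks.unsqueeze(0)                              # (1, c, d)
        scale = q.shape[-1] ** 0.5

        # Stage 1: summarize the context into c landmark slots, O(n_k * c).
        gather = F.softmax(lm @ k.transpose(1, 2) / scale, dim=-1)    # (B, c, n_k)
        summary = gather @ v                                          # (B, c, d)

        # Stage 2: each query attends over the c summaries, O(n_q * c).
        scatter = F.softmax(q @ lm.transpose(1, 2) / scale, dim=-1)   # (B, n_q, c)
        return scatter @ summary                                      # (B, n_q, d)

For self-attention, query and context are the same tensor; for cross-attention in the NAR decoder, query comes from the decoder and context from the encoder.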

Setup & Datasets

The code is based on PyTorch, fairseq, and CAB.

  • For text-to-speech, super resolution, and long-sequence time-series forecasting, we follow CAB for the main experiments; please follow its instructions to set up the environment and datasets.
  • For machine translation, we build on the open-source CMLMC codebase; please follow its setup scripts.

Training

  • For text-to-speech, super resolution, and long-sequence time-series forecasting, please follow CAB and replace the softmax attention with our AMLPCov in amlp.py (see the sketch at the end of this section).
  • For machine translation, please use the following script:
DATA_DIR="DISTILLED_DATA_DIR"
SAVE_DIR="CHECKPOINT_SAVE_DIR"
CKPT_NAME="CKPT_NAME"
ATTN_TYPE="amlpquery"
VISIBLE_GPUS="0,1,2,3"
# train
CUDA_VISIBLE_DEVICES=$VISIBLE_GPUS python train.py $DATA_DIR \
   --arch cmlm_transformer_wmt_en_de \
   -s de \
   -t en \
   --optimizer adam \
   --adam-betas '(0.9,0.98)' \
   --criterion nat_loss \
   --task translation_lev \
   --label-smoothing 0.1 \
   --noise random_mask \
   --lr-scheduler inverse_sqrt \
   --warmup-init-lr '1e-07' \
   --lr 0.0005 \
   --warmup-updates 40000 \
   --dropout 0.2 \
   --weight-decay 0.01 \
   --decoder-learned-pos \
   --encoder-learned-pos \
   --apply-bert-init \
   --share-all-embeddings \
   --max-tokens 16384 \
   --max-update 300000 \
   --fixed-validation-seed 7 \
   --log-interval 100 \
   --fp16 \
   --keep-last-epochs 20 \
   --eval-bleu --eval-bleu-args '{"iter_decode_max_iter": 0, "iter_decode_with_beam": 1}' \
   --eval-tokenized-bleu --eval-bleu-remove-bpe \
   --landmarks 16 \
   --insertCausalSelfAttn \
   --amlp-activation 'softmax' \
   --encoder-self-attention-type 'mha' \
   --decoder-cross-attention-type $ATTN_TYPE \
   --decoder-self-attention-type $ATTN_TYPE \
   --no-scale-embedding \
   --concatPE \
   --add_ema 0.99 \
   --selfcorrection 0 \
   --replacefactor 0.3 \
   --save-dir $SAVE_DIR/$CKPT_NAME

# test
python InferenceWMT_valid.py $CKPT_NAME 132 151 $SAVE_DIR/BLEU/  $DATA_DIR

Here we use two distilled datasets: WMT14 EN-DE and WMT14 DE-EN.

Explanation of special arguments:

  • --landmarks: the $c$ value in AMLP.
  • --amlp-activation: the activation function of AMLP. Defaults to softmax.
  • --decoder-{self, cross}-attention-type: the type of attention used in the experiments. We do not use amlpcov here because it lags behind amlpquery in cross-attention, which makes it less suitable for generation tasks.

The number of GPU devices can be changed, but the effective batch size should stay at roughly 64k tokens per update, computed as update_freq * GPU_devices * max_tokens, where --update-freq defaults to 1. For example, the script above uses 1 * 4 * 16384 = 65536 tokens per update.
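For the CAB-based tasks in the first Training bullet, the swap from softmax attention to AMLPCov might look like the sketch below. This is only an illustration: the AMLPCov constructor arguments, the swap_in_amlpcov helper, and the assumption that the baseline uses nn.MultiheadAttention are ours, not this repository's API; consult amlp.py and the CAB code for the real interfaces.

import torch.nn as nn

# Assumes amlp.py is on the Python path; the constructor arguments used below
# are illustrative guesses, not the actual AMLPCov signature -- check amlp.py.
from amlp import AMLPCov

def swap_in_amlpcov(model: nn.Module, embed_dim: int, num_landmarks: int = 16) -> None:
    # Recursively replace every nn.MultiheadAttention (the softmax attention in
    # many baseline models) with an AMLPCov module of matching width.
    for name, child in model.named_children():
        if isinstance(child, nn.MultiheadAttention):
            setattr(model, name, AMLPCov(embed_dim, num_landmarks=num_landmarks))
        else:
            swap_in_amlpcov(child, embed_dim, num_landmarks)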
