# ML - Transformer

| Paper | Conference | Remarks |
| --- | --- | --- |
| Attention Is All You Need | NIPS 2017 | 1. Propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. 2. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. (See the scaled dot-product attention sketch below the table.) |
| Training Tips for the Transformer Model | PBML 2018 | 1. Describe experiments in neural machine translation using the Tensor2Tensor framework and the Transformer. 2. Examine some of the critical parameters that affect the final translation quality, memory usage, training stability and training time, concluding each experiment with a set of recommendations for fellow researchers. |
| Lipschitz Constrained Parameter Initialization for Deep Transformers | ACL 2020 | 1. Present a parameter initialization method that applies a Lipschitz constraint to the initialization of Transformer parameters, which effectively ensures training convergence. |
| KERMIT - Complementing Transformer Architectures with Encoders of Explicit Syntactic Interpretations | EMNLP 2020 | 1. Propose KERMIT (Kernel-inspired Encoder with Recursive Mechanism for Interpretable Trees) to embed symbolic syntactic parse trees into artificial neural networks and to visualize how syntax is used in inference. 2. Show that KERMIT can boost the performance of Transformer architectures by effectively embedding human-coded universal syntactic representations in neural networks. |
| Highway Transformer - Self-Gating Enhanced Self-Attentive Networks | ACL 2020 | 1. Introduce a gated component, Self-Dependency Units (SDU), that incorporates LSTM-styled gating units to replenish internal semantic importance within the multi-dimensional latent space of individual representations. 2. Show that SDU leads to a clear improvement in convergence speed with gradient descent algorithms. |
| HAT - Hardware-Aware Transformers for Efficient Natural Language Processing | ACL 2020 | 1. Propose to design Hardware-Aware Transformers (HAT) with neural architecture search. 2. Show that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). |
| Funnel-Transformer - Filtering out Sequential Redundancy for Efficient Language Processing | NeurIPS 2020 | 1. Propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. 2. Show that, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. |
| FastFormers - Highly Efficient Transformer Models for Natural Language Understanding | SustaiNLP Workshop 2020 | 1. Present FastFormers, a set of recipes (e.g., knowledge distillation, structured pruning and numerical optimization) to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. |
| Fast Transformers with Clustered Attention | NeurIPS 2020 | 1. Propose clustered attention, which, instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. 2. Propose to use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products for those keys. 3. Show that the model consistently outperforms vanilla transformers for a given computational budget. (See the clustered-attention sketch below the table.) |
| ETC - Encoding Long and Structured Inputs in Transformers | EMNLP 2020 | 1. To improve efficiency, introduce a novel global-local attention mechanism between global tokens and regular input tokens, and show that combining global-local attention with relative position encodings and a “Contrastive Predictive Coding” (CPC) pre-training objective allows ETC to encode structured inputs. 2. Present state-of-the-art results on four natural language datasets requiring long and/or structured inputs. |
| Do Transformers Need Deep Long-Range Memory | ACL 2020 | 1. Perform a set of interventions to show that comparable performance can be obtained with 6x fewer long-range memories, and that better performance can be obtained by limiting the range of attention in lower layers of the network. |
| DeLighT - Very Deep and Light-weight Transformer | Arxiv 2020 | 1. Introduce a very deep and light-weight transformer, DeLighT, that delivers similar or better performance than transformer-based models with significantly fewer parameters. 2. Experiments on machine translation and language modeling tasks show that DeLighT matches the performance of baseline Transformers with significantly fewer parameters. |
| Deep Transformers with Latent Depth | NeurIPS 2020 | 1. Present a probabilistic framework to automatically learn which layer(s) to use in Transformers by learning the posterior distributions of layer selection. 2. Propose a novel method to train one shared Transformer network for multilingual machine translation with different layer-selection posteriors for each language pair, which alleviates the vanishing gradient issue and enables stable training of deep Transformers (e.g., 100 layers). 3. Show that the approach outperforms existing methods for training deeper Transformers on machine translation datasets. |
| Data Movement Is All You Need - A Case Study on Optimizing Transformers | Arxiv 2020 | 1. Present a recipe for globally optimizing data movement in transformers, which can reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training BERT. |
| CrossTransformers - spatially-aware few-shot transfer | NeurIPS 2020 | 1. Illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains. 2. Propose two methods to address this problem. |
| Big Bird - Transformers for Longer Sequences | NeurIPS 2020 | 1. Propose BigBird, a sparse attention mechanism that reduces the quadratic dependency of full attention on sequence length to linear. 2. Show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full-attention model. 3. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. (See the sparse-attention-mask sketch below the table.) |
| Attention is Not Only a Weight - Analyzing Transformers with Vector Norms | EMNLP 2020 | 1. Show that attention weights alone are only one of the two factors that determine the output of attention, and propose a norm-based analysis that incorporates the second factor, the norm of the transformed input vectors. 2. Findings: (i) contrary to previous studies, BERT pays poor attention to special tokens, and (ii) reasonable word alignment can be extracted from attention mechanisms of Transformer. (See the norm-based analysis sketch below the table.) |
| Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping | NeurIPS 2020 | 1. Propose a method based on progressive layer dropping that speeds up the training of Transformer-based language models, not at the cost of excessive hardware resources but through architectural changes and training techniques that boost efficiency. 2. Extensive experiments on BERT show that the proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline while reaching similar accuracy on downstream tasks. |
| A Bilingual Generative Transformer for Semantic Sentence Embedding | EMNLP 2020 | 1. Propose a deep latent variable model that attempts to perform source separation on parallel sentences, isolating what they have in common in a latent semantic vector and explaining what is left over with language-specific latent vectors. 2. The proposed approach substantially outperforms the state-of-the-art on a standard suite of unsupervised semantic similarity evaluations. |
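
The core operation behind the "Attention Is All You Need" row is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal NumPy sketch of that formula; the function name, shapes, and the toy usage are illustrative, not taken from any reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns an array of shape (n_queries, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# toy usage with random inputs
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
out = scaled_dot_product_attention(Q, K, V)       # shape (4, 16)
```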
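
The "Fast Transformers with Clustered Attention" row describes the first idea in enough detail for a toy sketch: cluster the queries, attend once per centroid, and reuse each centroid's output for every query in its cluster. The sketch below uses a few plain Lloyd (k-means) iterations for clustering; the paper's actual clustering scheme and its improved top-k variant (item 2 in the summary) are not reproduced here, and all names and shapes are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """A few Lloyd iterations; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k)
        assign = dists.argmin(-1)
        for c in range(k):
            members = X[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return assign, centroids

def clustered_attention(Q, K, V, n_clusters=4):
    """Approximate softmax(QK^T / sqrt(d_k)) V by attending once per query cluster."""
    d_k = Q.shape[-1]
    assign, centroids = kmeans(Q, n_clusters)
    scores = centroids @ K.T / np.sqrt(d_k)        # (n_clusters, n_keys)
    scores -= scores.max(-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    centroid_out = w @ V                           # one attention output per centroid
    return centroid_out[assign]                    # broadcast back to the queries

# toy usage: 32 queries approximated with 4 centroid attentions
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(32, 8)), rng.normal(size=(64, 8)), rng.normal(size=(64, 16))
approx = clustered_attention(Q, K, V)              # shape (32, 16)
```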
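
The BigBird row only states that the sparse attention reduces the quadratic dependency to linear; the pattern described in the paper combines sliding-window, global, and random attention. The toy construction below materializes a dense boolean mask just to visualize that pattern, which is not how the paper implements it (the paper uses a blocked formulation); parameter names and defaults are illustrative.

```python
import numpy as np

def bigbird_style_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Boolean attention mask combining window, global, and random attention.

    With fixed window/global/random budgets, each row keeps O(1) entries,
    so the total number of attended pairs grows linearly in n.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    # sliding window: each token attends to its neighbours
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
    # global tokens attend to everything and are attended to by everything
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # each token additionally attends to a few random tokens
    for i in range(n):
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask

m = bigbird_style_mask(16)
print(m.sum(), "of", m.size, "entries kept")  # far fewer than n*n as n grows
```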
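
The norm-based analysis row can be made concrete with a small computation: the contribution of token j to token i's attention output is measured as the attention weight scaled by the norm of the transformed (value) vector, i.e. α(i, j) · ||f(x_j)||, rather than the weight alone. The single-head, no-output-projection setup and all variable names below are simplifications for illustration, not the paper's code.

```python
import numpy as np

def norm_based_contributions(X, W_Q, W_K, W_V):
    """Return (alpha, alpha * ||f(x_j)||) where f(x_j) = x_j @ W_V.

    alpha is the usual attention-weight map; the second array is the
    norm-weighted map that the summary above refers to (single head,
    no output projection, for brevity).
    """
    d_k = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(-1, keepdims=True)          # attention weights
    value_norms = np.linalg.norm(V, axis=-1)       # ||f(x_j)|| per token
    return alpha, alpha * value_norms[None, :]

# toy usage: compare the two maps for 10 random token vectors
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 16))
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) for _ in range(3))
weights, norm_weighted = norm_based_contributions(X, W_Q, W_K, W_V)
```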

Back to index