# NLP - Pretrained Language Model

| Paper | Conference | Remarks |
| --- | --- | --- |
| Experience Grounds Language | EMNLP 2020 | 1. Posit that the present success of representation learning approaches trained on large, text-only corpora requires the parallel tradition of research on the broader physical and social context of language to address the deeper questions of communication. |
| DynaBERT - Dynamic BERT with Adaptive Width and Depth | NeurIPS 2020 | 1. Propose a novel dynamic BERT model (DynaBERT), which can flexibly adjust model size and latency by selecting an adaptive width and depth. 2. Show that the dynamic BERT (or RoBERTa) at its largest size performs comparably to BERT-base (or RoBERTa-base), while at smaller widths and depths it consistently outperforms existing BERT compression methods. A minimal width-slicing sketch follows the table. |
| Distilling Knowledge Learned in BERT for Text Generation | ACL 2020 | 1. Present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable fine-tuning of BERT on target generation tasks. 2. By leveraging BERT's bidirectional nature, distilling the knowledge learned in BERT encourages auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. 3. Show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization. A sketch of the distillation loss follows the table. |
| DeText - A Deep Text Ranking Framework with BERT | arXiv 2020 | 1. Investigate how to build an efficient BERT-based ranking model for industry use cases. 2. Extend it to a general ranking framework, DeText, which is open-sourced and can be applied to various ranking productions. 3. Offline and online experiments with DeText on three real-world search systems show significant improvements over state-of-the-art approaches. |
| DeBERTa: Decoding-enhanced BERT with Disentangled Attention | arXiv 2020 | 1. Propose a new model architecture, DeBERTa (Decoding-enhanced BERT with disentangled attention), that improves on BERT and RoBERTa using two novel techniques: a disentangled attention mechanism, and an enhanced mask decoder that incorporates absolute positions in the decoding layer to predict the masked tokens during pre-training. 2. Show that these two techniques significantly improve the efficiency of model pre-training and the performance on both natural language understanding (NLU) and natural language generation (NLG) tasks. |
| ConvBERT - Improving BERT with Span-based Dynamic Convolution | NeurIPS 2020 | 1. Propose a novel span-based dynamic convolution to replace some of the self-attention heads and directly model local dependencies, improving the efficiency of Transformers. 2. Experiments show that ConvBERT significantly outperforms BERT and its variants on various downstream tasks, with lower training cost and fewer model parameters. |
| Contextual Embeddings - When Are They Worth It | ACL 2020 | 1. Study the settings in which deep contextual embeddings (e.g., BERT) give large performance improvements relative to classic pretrained embeddings (e.g., GloVe) and an even simpler baseline, random word embeddings, focusing on the impact of training set size and the linguistic properties of the task. 2. Find that both of these simpler baselines can match contextual embeddings on industry-scale data, and often perform within 5 to 10% accuracy (absolute) on benchmark tasks. 3. Identify properties of data for which contextual embeddings give particularly large gains: language containing complex structure, ambiguous word usage, and words unseen in training. |
| CogLTX - Applying BERT to Long Texts | NeurIPS 2020 | 1. Founded on the cognitive theory stemming from Baddeley, propose the CogLTX framework, which identifies key sentences by training a judge model, concatenates them for reasoning, and enables multi-step reasoning via rehearsal and decay. 2. CogLTX outperforms or matches SOTA models on NewsQA, HotpotQA, and multi-class and multi-label long-text classification tasks, with memory overhead independent of the text length. |
| Calibration of Pre-trained Transformers | EMNLP 2020 | 1. Analyze the calibration of BERT and RoBERTa across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. 2. Show that (1) when used out-of-the-box, pre-trained models are calibrated in-domain and, compared to baselines, their out-of-domain calibration error can be as much as 3.5x lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain. A temperature-scaling sketch follows the table. |
| BERT-EMD - Many-to-Many Layer Mapping for BERT Compression with Earth Mover’s Distance | EMNLP 2020 | 1. Propose a novel BERT distillation method based on a many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer. 2. Leverage Earth Mover's Distance (EMD) to compute the minimum cumulative cost that must be paid to transfer knowledge from the teacher network to the student network. 3. Achieve competitive performance on GLUE compared to strong competitors in terms of both accuracy and model compression. An EMD-based layer-mapping sketch follows the table. |
| Ad-hoc Document Retrieval using Weak-Supervision with BERT and GPT2 | EMNLP 2020 | 1. Describe a weakly-supervised method for training deep learning models for the task of ad-hoc document retrieval. 2. Present an end-to-end retrieval system that starts with traditional information retrieval methods, followed by two deep learning re-rankers. 3. Show that the method outperforms state-of-the-art methods, without the need for the expensive process of manually labeling data. |
| Active Learning for BERT - An Empirical Study | EMNLP 2020 | 1. Present a large-scale empirical study on active learning (AL) techniques for BERT-based classification, addressing a diverse set of AL strategies and datasets. 2. Demonstrate that AL can boost BERT performance, especially in the most realistic scenario in which the initial set of labeled examples is created using keyword-based queries, resulting in a biased sample of the minority class. |
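The DynaBERT entry describes adjusting model width and depth at inference time. The sketch below is a minimal, hypothetical illustration of width slicing on a single Transformer feed-forward block in PyTorch; it is not the authors' implementation, and `slice_ffn` and `width_mult` are names introduced here for illustration. DynaBERT additionally rewires attention heads and neurons by importance before slicing, which this sketch omits.

```python
import torch
import torch.nn as nn

def slice_ffn(ffn_in: nn.Linear, ffn_out: nn.Linear, width_mult: float):
    """Keep only the first `width_mult` fraction of intermediate FFN neurons.

    Assumes neurons were already sorted by importance (as DynaBERT does),
    so truncating the leading slice keeps the most useful ones.
    """
    keep = max(1, int(ffn_in.out_features * width_mult))

    small_in = nn.Linear(ffn_in.in_features, keep)
    small_in.weight.data = ffn_in.weight.data[:keep, :].clone()
    small_in.bias.data = ffn_in.bias.data[:keep].clone()

    small_out = nn.Linear(keep, ffn_out.out_features)
    small_out.weight.data = ffn_out.weight.data[:, :keep].clone()
    small_out.bias.data = ffn_out.bias.data.clone()
    return small_in, small_out

# Example: a BERT-base-sized FFN (768 -> 3072 -> 768) sliced to half width.
ffn_in, ffn_out = nn.Linear(768, 3072), nn.Linear(3072, 768)
small_in, small_out = slice_ffn(ffn_in, ffn_out, width_mult=0.5)
x = torch.randn(2, 16, 768)                 # (batch, seq_len, hidden)
y = small_out(torch.relu(small_in(x)))      # same output shape, roughly half the FFN compute
print(y.shape)                              # torch.Size([2, 16, 768])
```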
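The C-MLM entry distills BERT's bidirectional predictions into an autoregressive Seq2Seq student. Below is a minimal sketch of the core distillation loss, assuming per-position soft targets from a fine-tuned conditional masked LM teacher and logits from the Seq2Seq decoder are already available; the tensor names and the mixing weight `alpha` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, target_ids, alpha=0.5):
    """Mix the usual cross-entropy with a KL term toward the teacher's
    soft per-position distributions (the global, sequence-level signal).

    student_logits: (batch, seq_len, vocab)  decoder outputs
    teacher_probs:  (batch, seq_len, vocab)  soft targets from the C-MLM teacher
    target_ids:     (batch, seq_len)         gold token ids
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), target_ids.view(-1))
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  teacher_probs, reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kd

# Toy example with random tensors standing in for real model outputs.
B, T, V = 2, 8, 100
loss = distillation_loss(torch.randn(B, T, V),
                         torch.softmax(torch.randn(B, T, V), dim=-1),
                         torch.randint(0, V, (B, T)))
print(loss.item())
```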
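The calibration entry reports that temperature scaling further reduces in-domain calibration error. The sketch below shows standard post-hoc temperature scaling: a single scalar T is fit on held-out validation logits by minimizing negative log-likelihood, then used to soften test-time probabilities. The variable names and toy data are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Fit a single temperature T on validation logits by minimizing NLL."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Toy example with random "model outputs" standing in for BERT logits.
val_logits = torch.randn(512, 3) * 4
val_labels = torch.randint(0, 3, (512,))
T = fit_temperature(val_logits, val_labels)
calibrated_probs = torch.softmax(val_logits / T, dim=-1)   # softened posteriors
print(f"fitted temperature: {T:.2f}")
```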
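The BERT-EMD entry uses Earth Mover's Distance to match student layers to teacher layers many-to-many. The sketch below computes such a soft mapping with the POT (Python Optimal Transport) package, using mean-pooled hidden states and a pairwise MSE cost; the uniform layer weights, the pooling choice, and the assumption that teacher and student share a hidden size are simplifications made here for illustration (BERT-EMD learns the layer weights instead).

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def layer_transport_plan(teacher_states, student_states):
    """Return a transport plan weighting how much each student layer
    should learn from each teacher layer, plus the resulting EMD loss.

    teacher_states: (n_teacher_layers, hidden)  e.g. mean-pooled layer outputs
    student_states: (n_student_layers, hidden)
    """
    n_t, n_s = len(teacher_states), len(student_states)
    # Pairwise MSE cost between every (teacher layer, student layer) pair.
    cost = ((teacher_states[:, None, :] - student_states[None, :, :]) ** 2).mean(-1)
    # Uniform mass on both sides; BERT-EMD learns these weights instead.
    w_t, w_s = np.full(n_t, 1.0 / n_t), np.full(n_s, 1.0 / n_s)
    plan = ot.emd(w_t, w_s, cost)          # (n_teacher_layers, n_student_layers)
    emd_loss = (plan * cost).sum()         # minimum cumulative transfer cost
    return plan, emd_loss

# Toy example: a 12-layer teacher and a 4-layer student with hidden size 768.
plan, emd_loss = layer_transport_plan(np.random.randn(12, 768),
                                      np.random.randn(4, 768))
print(plan.shape, float(emd_loss))
```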

Back to index