
# Competition: Quora question pairs

https://www.kaggle.com/c/quora-question-pairs

Main solution

Key points:

  • In real life it is impractical to classify all question pairs with a classifier as heavy as BERT; instead, I would use BERT to produce good encodings for the questions.
  • The train dataset has selection bias, and the private test set seems to have the same issue. I prefer not to exploit these leakage features to improve the score, but such a train dataset leads to an overfitting problem that I have not resolved yet. (figures: leakage, q1/q2)
  • To make the BERT <CLS> encodings more suitable for the final task, I fine-tune them with metric learning on triplets. With a good triplet generator, this procedure also helps with the selection bias problem.
  • Token embeddings are also important for training a good model. To keep the information from the non-<CLS> tokens, I use a classifier head with an extra input: the sum of all token embeddings attended to the other question.
  • I didn't use any ensembles; it was more interesting for me to experiment with embedding learning using a single heavy model.

Requirements

  • check requirements.txt and install missing packages
  • download a pre-trained BERT model, place it in this folder, and specify the chosen model in config.yml (path_to_pretrained_model). For example, for the medium uncased model that would be:
  • wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
  • unzip uncased_L-12_H-768_A-12.zip
  • place your CSV files in the input folder (path_to_data in models/bert_finetuning/config.yml)
  • specify the batch size in models/bert_finetuning/config.yml and models/bert_finetuning/config_triplets.yml
  • install apex or change apex_mixed_precision to False
  • debug with toy=True; get a real submission with toy=False (an illustrative config sketch follows this list)
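
For reference, a hypothetical sketch of what these options might look like in models/bert_finetuning/config.yml. Only path_to_pretrained_model, path_to_data, apex_mixed_precision and toy are named in this README; the batch-size key name and all values below are assumptions, so check the actual files:

```yaml
# models/bert_finetuning/config.yml -- illustrative values only
path_to_pretrained_model: uncased_L-12_H-768_A-12   # unzipped BERT checkpoint
path_to_data: input                                 # folder with train.csv / test.csv
batch_size: 16                                      # lower it if GPU memory is tight
apex_mixed_precision: True                          # set to False if apex is not installed
toy: True                                           # True = quick debug run, False = real submission
```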

Run the whole training process and get a submission file

python submission.py

Phase 1: Metric Learning

Create a triplet dataset from question pairs

notebook

  • split the train pair dataset with stratification on the train.csv data (validation part = 0.1)
  • build positive and negative connection graphs on the train set
  • collect buckets of duplicated questions
  • detect all negative connections from each bucket to other buckets
  • generate triplets: for each pair of questions in a bucket, take 3 negative examples (sketched below)
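
A minimal sketch of this procedure, assuming networkx for the connection graphs and pandas/scikit-learn for the split; the notebook may differ in details such as sampling and column handling:

```python
import random
import networkx as nx
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.read_csv("input/train.csv")
# stratified split on the duplicate label, 10% held out for validation
train_part, valid_part = train_test_split(
    train, test_size=0.1, stratify=train["is_duplicate"], random_state=42
)

# positive graph: edges between duplicated questions; negative graph: non-duplicates
pos_g, neg_g = nx.Graph(), nx.Graph()
for row in train_part.itertuples():
    g = pos_g if row.is_duplicate == 1 else neg_g
    g.add_edge(row.question1, row.question2)

triplets = []
# buckets = connected components of the positive graph (groups of duplicated questions)
for bucket in nx.connected_components(pos_g):
    bucket = list(bucket)
    # negatives: questions connected to this bucket in the negative graph
    negatives = {n for q in bucket if q in neg_g for n in neg_g.neighbors(q)}
    negatives = list(negatives - set(bucket))
    if not negatives:
        continue
    # for each pair of questions inside the bucket, sample 3 negative examples
    for i, anchor in enumerate(bucket):
        for positive in bucket[i + 1:]:
            for negative in random.sample(negatives, min(3, len(negatives))):
                triplets.append((anchor, positive, negative))
```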

Train metric learning

notebook

  • encode the anchor, positive and negative questions from each triplet with BERT separately
  • train with Triplet Loss on the 3 encoded <CLS> tokens (see the sketch below)
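
A minimal sketch of this step using HuggingFace transformers and PyTorch's TripletMarginLoss; the repo loads a downloaded BERT checkpoint and has its own training loop, so the model name and margin below are assumptions:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").to(device)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def encode_cls(questions):
    """Encode a batch of questions and return their <CLS> embeddings."""
    batch = tokenizer(questions, padding=True, truncation=True,
                      max_length=64, return_tensors="pt").to(device)
    return encoder(**batch).last_hidden_state[:, 0]  # <CLS> is the first token

def training_step(anchors, positives, negatives):
    # anchor, positive and negative questions are encoded by BERT separately
    loss = triplet_loss(encode_cls(anchors), encode_cls(positives), encode_cls(negatives))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```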


Phase 2: Pair classifier

notebook

  • split the train pair dataset in the same way as for metric learning to reduce data leakage
  • load the metric-learned BERT encoder
  • encode the left and right questions with BERT separately
  • pool all encoded tokens from one question attended to the encoded <CLS> token of the other question
  • concat the attended embedding with the <CLS> embedding for each question
  • take the elementwise product of the two question embeddings
  • make a binary classification with a 2-layer classification head (sketched below)
  • freeze the BERT layers for the first epoch; use different learning rates for the head and the BERT layers
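
A minimal PyTorch sketch of the head described above, assuming 768-dimensional BERT outputs and interpreting "attended to" as dot-product attention against the other question's <CLS> token; names and layer sizes are illustrative, not the repo's exact code:

```python
import torch
from torch import nn

class PairHead(nn.Module):
    """Binary head on top of two separately encoded questions."""
    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, 1),
        )

    @staticmethod
    def attend(tokens, other_cls, mask):
        # pool all tokens of one question, attended to the other question's <CLS>
        scores = (tokens @ other_cls.unsqueeze(-1)).squeeze(-1)   # (batch, seq)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)     # (batch, seq, 1)
        return (weights * tokens).sum(dim=1)                      # (batch, hidden)

    def forward(self, tok1, mask1, tok2, mask2):
        cls1, cls2 = tok1[:, 0], tok2[:, 0]
        # concat the attended embedding with the <CLS> embedding for each question
        q1 = torch.cat([self.attend(tok1, cls2, mask1), cls1], dim=-1)
        q2 = torch.cat([self.attend(tok2, cls1, mask2), cls2], dim=-1)
        # elementwise product of the two question embeddings -> 2-layer head
        return self.classifier(q1 * q2).squeeze(-1)               # logits
```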


Metrics

Kaggle link: https://www.kaggle.com/mfside

Conclusions

Main unresolved problem: overfitting.

The BERT model learns well, but after many epochs overfitting can be observed on both the validation and test sets.

Ways to resolve:

  • detailed data analysis: study the influence of selection bias and explore ways to select representative validation datasets
  • make metric learning more efficient so the model generalizes better (hard to train, but could be optimized with N-Pair Loss)
  • change the sampling from the train dataset for pair classification (class imbalance vs. the test class imbalance); a simple weighted loss already gave better results in the last experiment (see the sketch after this list)
  • hyperparameter tuning and reduced model capacity (DistilBERT, fewer linear layers, smaller hidden states); I did not have enough time for these experiments, and they are hard to run without good GPUs
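
For the weighted-loss idea above, a minimal sketch with PyTorch's BCEWithLogitsLoss; the weight value is an assumption, not the one used in the experiments:

```python
import torch
from torch import nn

# pos_weight scales the loss on positive (duplicate) examples; it would be tuned
# to match the expected test class balance -- the value below is illustrative
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.5]))

logits = torch.randn(8)                     # stand-in for classifier head outputs
labels = torch.randint(0, 2, (8,)).float()  # 1 = duplicate, 0 = not duplicate
loss = criterion(logits, labels)
```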

Make experiments clearer

  • measure the influence of each phase and each part of the architecture, and keep the best variant

Other methods

BERT finetuning on pairs of questions joined by a separator

notebook

Model on Google Drive

Submission file

Private = 0.33768 Public = 0.33507
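
For this variant, a minimal sketch of feeding a question pair joined by the separator token to a standard BERT sequence-pair classifier. Using HuggingFace transformers here is an assumption; the notebook may construct its inputs differently:

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# the tokenizer produces: [CLS] question1 tokens [SEP] question2 tokens [SEP]
batch = tokenizer("How do I learn Python?",
                  "What is the best way to learn Python?",
                  truncation=True, return_tensors="pt")
logits = model(**batch).logits   # fine-tune with cross-entropy on is_duplicate
```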

ULMFiT

notebook

Vocab on Google Drive

Model on Google Drive

Submission file

Private = 0.36800 Public = 0.36714
