
# Competition: Quora question pairs

https://www.kaggle.com/c/quora-question-pairs

Main solution

Key points:

  • In real life it is impractical to classify all question pairs with a classifier as heavy as BERT; instead, I would use BERT to produce good encodings for the questions.
  • The train dataset has selection bias, and the private test set seems to have the same issue. I prefer not to exploit these leakage features to improve the score, but such a train dataset leads to an overfitting problem that I have not resolved yet. (figures: leakage, q1/q2)
  • To make the BERT <CLS> encodings more suitable for the final task, I fine-tune them with metric learning on triplets. With a good triplet generator, this procedure also helps with the selection bias problem.
  • Token embeddings are also important for training a good model. To keep the information from the non-<CLS> tokens, I use a classifier head with an extra input: the sum of all token embeddings attended to the other question.
  • I didn't use any ensembles; it was more interesting for me to experiment with embedding learning using a single heavy model.

Requirements

  • check requirements.txt and install missing packages
  • download a pre-trained BERT model, place it in this folder, and specify the chosen model in config.yml (path_to_pretrained_model). For example, for the medium uncased model that would be:
  • wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
  • unzip uncased_L-12_H-768_A-12.zip
  • place your CSV files in the input folder (path_to_data in models/bert_finetuning/config.yml)
  • specify the batch size in models/bert_finetuning/config.yml and models/bert_finetuning/config_triplets.yml
  • install apex or change apex_mixed_precision to False
  • debug with toy=True; get a real submission with toy=False (an illustrative config sketch follows this list)
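
For reference, a hypothetical sketch of what these options might look like in models/bert_finetuning/config.yml. Only path_to_pretrained_model, path_to_data, apex_mixed_precision and toy are named in this README; the batch-size key name and all values below are assumptions, so check the actual files:

```yaml
# models/bert_finetuning/config.yml -- illustrative values only
path_to_pretrained_model: uncased_L-12_H-768_A-12   # unzipped BERT checkpoint
path_to_data: input                                 # folder with train.csv / test.csv
batch_size: 16                                      # lower it if GPU memory is tight
apex_mixed_precision: True                          # set to False if apex is not installed
toy: True                                           # True = quick debug run, False = real submission
```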

Run the whole training process and get a submission file

python submission.py

Phase 1: Metric Learning

Create a triplet dataset from question pairs

notebook

  • split the train pair dataset with stratification on the train.csv data (validation part = 0.1)
  • build positive and negative connection graphs on the train set
  • collect buckets of duplicated questions
  • detect all negative connections from each bucket to other buckets
  • generate triplets: for each pair of questions in a bucket, take 3 negative examples (sketched below)
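
A minimal sketch of this procedure, assuming networkx for the connection graphs and pandas/scikit-learn for the split; the notebook may differ in details such as sampling and column handling:

```python
import random
import networkx as nx
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.read_csv("input/train.csv")
# stratified split on the duplicate label, 10% held out for validation
train_part, valid_part = train_test_split(
    train, test_size=0.1, stratify=train["is_duplicate"], random_state=42
)

# positive graph: edges between duplicated questions; negative graph: non-duplicates
pos_g, neg_g = nx.Graph(), nx.Graph()
for row in train_part.itertuples():
    g = pos_g if row.is_duplicate == 1 else neg_g
    g.add_edge(row.question1, row.question2)

triplets = []
# buckets = connected components of the positive graph (groups of duplicated questions)
for bucket in nx.connected_components(pos_g):
    bucket = list(bucket)
    # negatives: questions connected to this bucket in the negative graph
    negatives = {n for q in bucket if q in neg_g for n in neg_g.neighbors(q)}
    negatives = list(negatives - set(bucket))
    if not negatives:
        continue
    # for each pair of questions inside the bucket, sample 3 negative examples
    for i, anchor in enumerate(bucket):
        for positive in bucket[i + 1:]:
            for negative in random.sample(negatives, min(3, len(negatives))):
                triplets.append((anchor, positive, negative))
```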

Train metric learning

notebook

  • encode the anchor, positive and negative questions from each triplet with BERT separately
  • train with Triplet Loss on the 3 encoded <CLS> tokens (see the sketch below)
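
A minimal sketch of this step using HuggingFace transformers and PyTorch's TripletMarginLoss; the repo loads a downloaded BERT checkpoint and has its own training loop, so the model name and margin below are assumptions:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").to(device)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def encode_cls(questions):
    """Encode a batch of questions and return their <CLS> embeddings."""
    batch = tokenizer(questions, padding=True, truncation=True,
                      max_length=64, return_tensors="pt").to(device)
    return encoder(**batch).last_hidden_state[:, 0]  # <CLS> is the first token

def training_step(anchors, positives, negatives):
    # anchor, positive and negative questions are encoded by BERT separately
    loss = triplet_loss(encode_cls(anchors), encode_cls(positives), encode_cls(negatives))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```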


Phase 2: Pair classifier

notebook

  • split the train pair dataset in the same way as for metric learning to reduce data leakage
  • load the metric-learned BERT encoder
  • encode the left and right questions with BERT separately
  • pool all encoded tokens from one question attended to the encoded <CLS> token of the other question
  • concat the attended embedding with the <CLS> embedding for each question
  • take the elementwise product of the two question embeddings
  • make a binary classification with a 2-layer classification head (sketched below)
  • freeze the BERT layers for the first epoch; use different learning rates for the head and the BERT layers
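
A minimal PyTorch sketch of the head described above, assuming 768-dimensional BERT outputs and interpreting "attended to" as dot-product attention against the other question's <CLS> token; names and layer sizes are illustrative, not the repo's exact code:

```python
import torch
from torch import nn

class PairHead(nn.Module):
    """Binary head on top of two separately encoded questions."""
    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, 1),
        )

    @staticmethod
    def attend(tokens, other_cls, mask):
        # pool all tokens of one question, attended to the other question's <CLS>
        scores = (tokens @ other_cls.unsqueeze(-1)).squeeze(-1)   # (batch, seq)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)     # (batch, seq, 1)
        return (weights * tokens).sum(dim=1)                      # (batch, hidden)

    def forward(self, tok1, mask1, tok2, mask2):
        cls1, cls2 = tok1[:, 0], tok2[:, 0]
        # concat the attended embedding with the <CLS> embedding for each question
        q1 = torch.cat([self.attend(tok1, cls2, mask1), cls1], dim=-1)
        q2 = torch.cat([self.attend(tok2, cls1, mask2), cls2], dim=-1)
        # elementwise product of the two question embeddings -> 2-layer head
        return self.classifier(q1 * q2).squeeze(-1)               # logits
```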


Metrics

Kaggle link: https://www.kaggle.com/mfside

Conclusions

Main unresolved problem: overfitting.

The BERT model learns well, but after many epochs overfitting can be observed on both the validation and test sets.

Ways to resolve:

  • detailed data analysis: study the influence of selection bias and explore ways to select representative validation datasets
  • make metric learning more efficient so the model generalizes better (hard to train, but could be optimized with N-Pair Loss)
  • change the sampling from the train dataset for pair classification (class imbalance vs. the test class imbalance); a simple weighted loss already gave better results in the last experiment (see the sketch after this list)
  • hyperparameter tuning and reduced model capacity (DistilBERT, fewer linear layers, smaller hidden states); I did not have enough time for these experiments, and they are hard to run without good GPUs
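
For the weighted-loss idea above, a minimal sketch with PyTorch's BCEWithLogitsLoss; the weight value is an assumption, not the one used in the experiments:

```python
import torch
from torch import nn

# pos_weight scales the loss on positive (duplicate) examples; it would be tuned
# to match the expected test class balance -- the value below is illustrative
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.5]))

logits = torch.randn(8)                     # stand-in for classifier head outputs
labels = torch.randint(0, 2, (8,)).float()  # 1 = duplicate, 0 = not duplicate
loss = criterion(logits, labels)
```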

Make experiments clearer

  • measure the influence of each phase and each part of the architecture, and keep the best variant

Other methods

BERT finetuning on pairs of questions joined by a separator

notebook

Model on Google Drive

Submission file

Private = 0.33768 Public = 0.33507
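
For this variant, a minimal sketch of feeding a question pair joined by the separator token to a standard BERT sequence-pair classifier. Using HuggingFace transformers here is an assumption; the notebook may construct its inputs differently:

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# the tokenizer produces: [CLS] question1 tokens [SEP] question2 tokens [SEP]
batch = tokenizer("How do I learn Python?",
                  "What is the best way to learn Python?",
                  truncation=True, return_tensors="pt")
logits = model(**batch).logits   # fine-tune with cross-entropy on is_duplicate
```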

ULMFiT

notebook

Vocab on Google Drive

Model on Google Drive

Submission file

Private = 0.36800 Public = 0.36714
