
idanmoradarthas/Quora-Questions-Pairs-App

Quora-Questions-Pairs-App

This research is based on the tutorial BERT Fine-Tuning Tutorial with PyTorch.

The training-bert folder contains a Jupyter notebook. In it I show how I fine-tuned the base uncased BERT model to solve the classification problem of detecting duplicate questions from the Quora website.
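As a rough illustration of what such a fine-tuning step can look like with the Hugging Face library, here is a minimal sketch. It is not the notebook's code: the checkpoint id `bert-base-uncased`, the learning rate, `max_length`, the two example pairs, and the helper names `build_inputs`, `train_step` and `main` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Assumed Hugging Face id of the base uncased checkpoint.
MODEL_NAME = "bert-base-uncased"


def build_inputs(tokenizer, pairs: Sequence[Tuple[str, str]], max_length: int = 128):
    """Tokenize question pairs so each example becomes [CLS] q1 [SEP] q2 [SEP]."""
    first: List[str] = [q1 for q1, _ in pairs]
    second: List[str] = [q2 for _, q2 in pairs]
    # Passing two lists makes the tokenizer encode them as sentence pairs.
    return tokenizer(
        first,
        second,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )


def train_step(model, optimizer, batch, labels) -> float:
    """Run one gradient update and return the loss value."""
    model.train()
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # the model computes the loss itself
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()


def main() -> None:
    # Imported here so the helpers above can be read without the heavy dependencies.
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
    model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed learning rate

    # Made-up pairs for illustration: 1 = duplicate, 0 = not duplicate.
    pairs = [
        ("How do I learn Python?", "What is the best way to learn Python?"),
        ("What is BERT?", "Who founded Quora?"),
    ]
    labels = torch.tensor([1, 0])
    loss = train_step(model, optimizer, build_inputs(tokenizer, pairs), labels)
    print(f"loss after one step: {loss:.4f}")
```

Calling `main()` runs a single update on two examples. It needs `torch` and `transformers` installed and downloads the pretrained weights on first use. A real run would loop over the whole training set for a few epochs.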

Introduction

In this research I'd like to use BERT with the Hugging Face PyTorch library to fine-tune a model that performs as well as possible at question-pair classification. The app is built using Streamlit.

First, let's talk about the model and the dataset:

BERT

Bidirectional Encoder Representations from Transformers (BERT) was pre-trained and released by Google in late 2018 (see original model code here) for Natural Language Processing (NLP) tasks. BERT was originally created by Jacob Devlin and his colleagues, and it was pre-trained on two corpora: BookCorpus and English Wikipedia.

BERT base consists of 12 Transformer encoder layers (24 for BERT large). If you stack Transformer decoder layers instead, you get a GPT-style model, which generates sentences.
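To make the difference between the 12-layer and 24-layer variants concrete, the following sketch estimates the parameter count of the bare encoder from its dimensions. The hidden sizes, feed-forward sizes, vocabulary size and position limit are the values published for the uncased checkpoints and are not taken from this repository, so treat them as assumptions. The count covers embeddings, encoder layers and the pooler, and excludes any task-specific classification head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSize:
    name: str
    num_layers: int
    hidden_size: int
    intermediate_size: int      # width of the feed-forward sublayer
    vocab_size: int = 30522     # assumed: uncased WordPiece vocabulary
    max_positions: int = 512
    type_vocab_size: int = 2    # segment A / segment B


BERT_BASE = BertSize("base", num_layers=12, hidden_size=768, intermediate_size=3072)
BERT_LARGE = BertSize("large", num_layers=24, hidden_size=1024, intermediate_size=4096)


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(hidden: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * hidden


def encoder_layer_params(size: BertSize) -> int:
    h, ff = size.hidden_size, size.intermediate_size
    attention = 4 * linear_params(h, h)  # query, key, value, output projections
    feed_forward = linear_params(h, ff) + linear_params(ff, h)
    return attention + feed_forward + 2 * layer_norm_params(h)


def total_params(size: BertSize) -> int:
    h = size.hidden_size
    embeddings = (size.vocab_size + size.max_positions + size.type_vocab_size) * h
    embeddings += layer_norm_params(h)
    pooler = linear_params(h, h)
    return embeddings + size.num_layers * encoder_layer_params(size) + pooler


for size in (BERT_BASE, BERT_LARGE):
    millions = total_params(size) / 1e6
    print(f"BERT {size.name}: {size.num_layers} layers, about {millions:.0f}M parameters")
```

Most of the parameters sit in the stacked encoder layers, which is why doubling the depth and widening the hidden size roughly triples the model size.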

You can find more information in these videos:

Transformer Neural Networks - EXPLAINED! (Attention is all you need)

BERT Neural Network - EXPLAINED!

Quora Question Pairs Dataset

Quora is a question-and-answer website where questions are asked, answered, followed, and edited by Internet users, either factually or in the form of opinions. Quora was co-founded by former Facebook employees Adam D'Angelo and Charlie Cheever in June 2009. The website was made available to the public for the first time on June 21, 2010. Today the website is available in many languages.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question.

The goal is to predict which of the provided question pairs contain two questions with the same meaning. The ground truth is the set of labels supplied by human experts. The dataset itself can be downloaded from Kaggle: here.
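For orientation, here is a small sketch of reading such labeled pairs with the standard library. It assumes the Kaggle training file uses the columns `id`, `qid1`, `qid2`, `question1`, `question2` and `is_duplicate`. The two rows in `SAMPLE` are made up for this example and are not taken from the dataset, and `read_pairs` and `duplicate_ratio` are hypothetical helper names.

```python
import csv
import io
from typing import List, NamedTuple, TextIO


class QuestionPair(NamedTuple):
    question1: str
    question2: str
    is_duplicate: bool


def read_pairs(handle: TextIO) -> List[QuestionPair]:
    """Parse rows with question1, question2 and is_duplicate columns."""
    reader = csv.DictReader(handle)
    return [
        QuestionPair(row["question1"], row["question2"], row["is_duplicate"] == "1")
        for row in reader
    ]


def duplicate_ratio(pairs: List[QuestionPair]) -> float:
    """Fraction of pairs labeled as duplicates (0.0 for an empty list)."""
    if not pairs:
        return 0.0
    return sum(pair.is_duplicate for pair in pairs) / len(pairs)


# Made-up rows in the assumed column layout, for illustration only.
SAMPLE = """id,qid1,qid2,question1,question2,is_duplicate
0,1,2,"How do I learn Python?","What is the best way to learn Python?",1
1,3,4,"What is BERT?","Who founded Quora?",0
"""

pairs = read_pairs(io.StringIO(SAMPLE))
print(f"{len(pairs)} pairs, duplicate ratio {duplicate_ratio(pairs):.2f}")
```

With the real file, the same function can be called as `read_pairs(open("train.csv", newline="", encoding="utf-8"))`, where the file name is an assumption about how the download was saved.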

Application

How to use it?

See the following video:

Instructions video

Install

Clone the repo:

git clone https://github.com/idanmoradarthas/Quora-Questions-Pairs-App.git
cd Quora-Questions-Pairs-App

Go to the training folder, install the requirements, and run the notebook to create the model:

cd training-bert
pip install -r requirements.txt
jupyter notebook

Install the requirements in the main folder:

cd ..
pip install -r requirements.txt

Run Streamlit:

streamlit run app.py