Skip to content

Anil-45/Question_Pair_Similarity

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

48 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Question Pair Similarity

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Credits: Kaggle

To learn more about this project, check out my blog

Table of Contents

General Information

Identify which questions asked on Quora are duplicates of the questions that have already been asked. This could be useful to instantly provide answers. We are tasked with predicting whether a pair of questions are duplicates or not.

Constraints

  • The cost of a mis-classification can be very high.
  • No strict latency concerns.
  • Interpretability is partially important.

Metrics

  • Log loss
  • Confusion Matrix

Data Overview

Data has 6 columns and a total of 4,04,287 entries.

  • id - Id
  • qid1 - Id corresponding to question1
  • qid2 - Id corresponding to question2
  • question1 - Question 1
  • question2 - Question 2
  • is_duplicate - Determines whether a pair is duplicate or not

Libraries

  • Application Framework - flask, wsgiref
  • Data processing and ML - numpy, pandas, matplotlib, sklearn, xgboost, seaborn, spacy, nltk, contractions, fuzzywuzzy, distance, optuna
  • General operations - os, pickle, re

Screenshots

screenshot1 screenshot1 screenshot1

Setup

Clone this repo using

git clone https://github.com/Anil-45/Question_Pair_Similarity.git

Install the required modules using

pip install -r requirements.txt

Usage

Run the following command to start the application

python app.py

Access the application

For training yourself, download the data and place it in data folder

Find the optimal parameters by running tune.py

python tune.py

Train the model from optimal parameters

python train.py

For predictions, place the test.csv in data folder

python predict.py

Room for improvement

  • Add spell correction

Contact

Created by @Anil_Reddy

References