Text-Similarity_Phrase-Matching

This notebook focuses on applying some classical and modern Text Similarity measures to US Patent Matching Dataset and check how well classical models including Jaro Winkler, Jaccard score as well as modern day Word2Vec and USE(Universal Sentence Encoder) can detect similarity between textual data. The project was done in several steps as follows:

Exploratory Data Analysis and Basic Preprocessing
- In this part I have focused on conducting initial analaysis of the text columns in hand with various descriptive stats.
- Visualisations including barplots and boxplots have been used to better demonstrate the stats.
- Along with that, basic preprocessing of dataframe including removing duplicates and renaming of cols. was done here.
Text preprocessing
- In this part of the notebook, the text was cleaned up and prepared for the next part.
- Unneccessary chars., punctuations and other things which do not add any semantic meaning to the text were removed from the dataset.
Applying Text similarity measures
- Once the data was ready, 5 algos. which included some classical and some modern day algos. were used to find the simialrity scores between the columns.
- Classical Algos. - Some classical text similarity were applied these included Jaccard Score, Jaro Winkler Method and Levenshtein distance.
- Modern day algorithms - including Word2Vec and pretrained Universal Sentence Encdoing were used to convert the text into embeddings and then find the cosine distance between the vectors.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
README.md		README.md
phrase-matching.ipynb		phrase-matching.ipynb
requirements.txt		requirements.txt
test.csv		test.csv
train.csv (1).zip		train.csv (1).zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

phrase-matching.ipynb

phrase-matching.ipynb

requirements.txt

requirements.txt

test.csv

test.csv

train.csv (1).zip

train.csv (1).zip

Repository files navigation

Text-Similarity_Phrase-Matching

About

Releases

Packages

Languages

akulez/Text-Similarity_Phrase-Matching

Folders and files

Latest commit

History

Repository files navigation

Text-Similarity_Phrase-Matching

About

Topics

Resources

Stars

Watchers

Forks

Languages