Skip to content

TSunny007/Document-Similarity

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Document-Similarity

Using Jaccard-Similarity and MinHashing to determine similarity between two text documents

Semantic similarity forms a central component in many NLP systems, from lexical semantics, to part of speech tagging, to social media analysis. Recent years have seen a renewed interest in developing new similarity techniques. The increased interest has led to hundreds of techniques for measuring semantic similarity.

You can render the notebook online Here

This Notebook explores one of the simplest techniques of investigating document similarity, using Jaccard Similarity and expands onto MinHashing to quickly differentiate between documents.

MinHashing is a technique for quickly estimating how similar two sets are. It was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents. Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.

About

Using Jaccard-Similarity and Minhashing to determine similarity between two text documents

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published