Document-Similarity

Using Jaccard-Similarity and MinHashing to determine similarity between two text documents

Semantic similarity forms a central component in many NLP systems, from lexical semantics, to part of speech tagging, to social media analysis. Recent years have seen a renewed interest in developing new similarity techniques. The increased interest has led to hundreds of techniques for measuring semantic similarity.

You can render the notebook online Here

This Notebook explores one of the simplest techniques of investigating document similarity, using Jaccard Similarity and expands onto MinHashing to quickly differentiate between documents.

MinHashing is a technique for quickly estimating how similar two sets are. It was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents. Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
documents		documents
Document Similarity.ipynb		Document Similarity.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

documents

documents

Document Similarity.ipynb

Document Similarity.ipynb

README.md

README.md

Repository files navigation

Document-Similarity

About

Releases

Packages

Languages

TSunny007/Document-Similarity

Folders and files

Latest commit

History

Repository files navigation

Document-Similarity

About

Topics

Resources

Stars

Watchers

Forks

Languages