
PyMinHash

MinHashing is an efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements minhashing for Pandas dataframes. See the instructions below or the example notebook to get started.
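For readers new to the technique, the idea can be sketched in plain Python. This is a minimal illustration of generic MinHash, not PyMinHash's actual implementation: the helper names (`jaccard`, `minhash_signature`, `estimate_jaccard`), the use of salted BLAKE2b as the hash family, and the choice of 256 hash functions are all assumptions made for the example.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: set, n_hashes: int = 256) -> list:
    """For each hash function, keep the smallest hash over all tokens.

    Assumes `tokens` is non-empty (min() of an empty set raises ValueError).
    """
    return [min(_hash(t, seed) for t in tokens) for seed in range(n_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = {"the", "quick", "brown", "fox"}
b = {"the", "quick", "red", "fox"}

exact = jaccard(a, b)  # 3 shared tokens out of 5 distinct tokens
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.2f} estimate={approx:.2f}")
```

The estimate gets closer to the exact value as the number of hash functions grows. Because signatures have a fixed length regardless of record size, they can be compared or bucketed cheaply, which is where the efficiency comes from.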

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Using PyPI

pip install pyminhash

Using conda

conda install -c conda-forge pyminhash

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply record matching to the column 'name' of your Pandas dataframe df as follows:

from pyminhash import MinHash

myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')

This will return the row pairs from df that have non-zero Jaccard similarity.
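To make "row pairs with non-zero Jaccard similarity" concrete, here is a brute-force sketch in plain Python that compares every pair of records directly. It is an illustration only, not what PyMinHash does internally: the function names are hypothetical, a plain list stands in for the dataframe column, and splitting on whitespace after lowercasing is an assumed tokenisation that may differ from the library's.

```python
from itertools import combinations


def word_set(text: str) -> set:
    """Assumed tokenisation: lowercase, whitespace-separated words."""
    return set(text.lower().split())


def similar_pairs(names: list) -> list:
    """Return (i, j, similarity) for every row pair with non-zero Jaccard similarity."""
    token_sets = [word_set(n) for n in names]
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        union = token_sets[i] | token_sets[j]
        if not union:
            continue  # both records are empty; similarity is undefined, skip
        sim = len(token_sets[i] & token_sets[j]) / len(union)
        if sim > 0:
            pairs.append((i, j, sim))
    return pairs


# A plain list standing in for df['name']
names = ["john smith", "jon smith", "mary jones"]
result = similar_pairs(names)
print(result)
```

Here only rows 0 and 1 are paired, since they share the word "smith". This exhaustive comparison grows quadratically with the number of rows, which is the cost that minhashing is designed to avoid on larger datasets.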