Skip to content

Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.

License

Notifications You must be signed in to change notification settings

Marcnuth/deduplication

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

deduplication

PyPI - Downloads

Remove duplicate documents via popular algorithms such as SimHash, SpotSig, Shingling, etc.

Install

Run following commands:

# install current library
pip install deduplication

# install required pretrained NLP models 
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm

Example

SimHash

from deduplication import simhash

hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)

L-SimHash

from deduplication import lsimhash

hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')

Citation

SimHash

Sadowski C, Levin G. 
Simhash: Hash-based similarity detection[J]. 
Technical report, Google, 2007.