fuzzysearch

Fuzzy search: Find parts of long text or data, allowing for some changes/typos.

Easy, fast, and just works!

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

- Two simple functions to use: one for in-memory data and one for files
  - Fastest search algorithm is chosen automatically
- Levenshtein Distance metric with configurable parameters
  - Separately configure the max. allowed distance, substitutions, deletions and/or insertions
- Advanced algorithms with optional C and Cython optimizations
- Properly handles Unicode; special optimizations for binary data
- Simple installation:
  - pip install fuzzysearch just works
  - pure-Python fallbacks for compiled modules
  - only one dependency (attrs)
- Extensively tested
- Free software: MIT license

For more info, see the documentation.

Installation

fuzzysearch supports Python versions 2.7 and 3.5+, as well as PyPy 2.7 and 3.6.

$ pip install fuzzysearch

This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.

Usage

Just call find_near_matches() with the sub-sequence you're looking for, the sequence to search, and the matching parameters:

>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use find_near_matches_in_file() similarly:

>>> from fuzzysearch import find_near_matches_in_file
>>> with open('data_file', 'rb') as f:
...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

Examples

fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for "heavier", domain-specific tools like BioPython:

>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

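BioPython sequences are also supported. Note that the Bio.Alphabet module used below was removed in Biopython 1.78, so this example requires an older Biopython release: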

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> sequence = Seq('''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG''', IUPAC.unambiguous_dna)
>>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

Matching Criteria

The search function supports four possible match criteria, which may be supplied in any combination:

- maximum Levenshtein distance (max_l_dist)
- maximum number of substitutions (max_substitutions)
- maximum number of deletions (max_deletions); "delete" = skip a character in the sub-sequence
- maximum number of insertions (max_insertions); "insert" = skip a character in the sequence

Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist, all three of the other criteria, or both.

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

# this will not match since max_deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]

# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1

# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2, matched="PATERRN")]
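
To make the dist values in these examples concrete, here is a minimal pure-Python sketch of the classic Levenshtein distance between two whole strings. It is not fuzzysearch's implementation (which searches for near matches inside a longer sequence and uses faster algorithms); the levenshtein helper is hypothetical and exists only to illustrate the metric.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between sequences a and b."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion: skip a character of a
                current[j - 1] + 1,            # insertion: skip a character of b
                previous[j - 1] + (ca != cb),  # substitution, or a free match
            ))
        previous = current
    return previous[-1]


print(levenshtein('PATTERN', 'PATERN'))   # 1: one deletion
print(levenshtein('PATTERN', 'PAT-ERN'))  # 1: one substitution
print(levenshtein('PATTERN', 'PATERRN'))  # 2: two edits are needed
```

With a as the sub-sequence and b as the matched text, the three cases in the min() call correspond to the deletion, insertion and substitution criteria listed above.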

When to Use Other Tools

Use case: Search through a list of strings for almost-exactly matching strings. For example, searching through a list of names for possible slight variations of a certain name.

Suggestion: Consider using fuzzywuzzy.
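
As a rough illustration of this kind of whole-string matching, the sketch below uses difflib from the Python standard library rather than fuzzywuzzy, so it runs without extra dependencies. The names and the 0.8 cutoff are made up for the example, and difflib's similarity ratio is not the Levenshtein distance.

```python
import difflib

# Hypothetical list of names to search through.
names = ['John Smith', 'Jane Smyth', 'Bob Jones']

# Return the candidates whose similarity ratio to the query is at least 0.8,
# best match first.
close = difflib.get_close_matches('Jon Smith', names, n=3, cutoff=0.8)
print(close)
```

Here whole strings are compared against each other, whereas fuzzysearch looks for a sub-sequence inside a longer sequence.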
