Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy Matching #25

Open
bkj opened this issue Jan 20, 2016 · 5 comments
Open

Fuzzy Matching #25

bkj opened this issue Jan 20, 2016 · 5 comments
Labels

Comments

@bkj
Copy link

bkj commented Jan 20, 2016

I see that this data structure supports prefix lookups -- does it also support fuzzy lookups (i.e. all records within Levenshtein distance). If that's not supported in this package / this data structure, do you know of any other packages that would let me do in-memory fuzzy searching?

~ Ben

@superbobry
Copy link
Member

In theory you can do fuzzy matching with tries via branching search, but a more efficient way might be to use a Levenshtein automata. I've never used it in Python, so I can't recommend a specific package, but there're some, e.g. leven.

@bkj
Copy link
Author

bkj commented Jan 23, 2016

Ah thanks. I figured that Levenshtein automata would work for this, but I couldn't find a python implementation. The one that you linked computes Levenshtein distance but according to the docs, Levenshtein automata are still to be implemented.

There is a node module (https://www.npmjs.com/package/node-levenshtein-automata) that implements Levenshtein automata, but I haven't had a chance to take a look at it yet.

I may look into implementing a fuzzy (branching) search in marisa-trie in the near future. I imagine there might be some interest in this, since I haven't been able to find much about in-memory fuzzy searching (esp. in Python).

~ Ben

@superbobry
Copy link
Member

JFYI there's another blog post with Python code for building and using the automata.

I may look into implementing a fuzzy (branching) search in marisa-trie in the near future. I imagine there might be some interest in this, since I haven't been able to find much about in-memory fuzzy searching (esp. in Python).

Feel free to submit a PR. I'm not sure if anyone uses tries for insertions/deletions (as in Leveshtein distance), though, so if you come up with a simpler mismatch-only solution, it would be useful as well.

@kmike
Copy link
Member

kmike commented Jan 24, 2016

Hey,

I think it won't be possible to use marisa-trie efficiently for that, as it lacks an ability to do step-wise walking (see https://code.google.com/p/marisa-trie/issues/detail?id=9 and #20). https://github.com/kmike/DAWG or https://github.com/kmike/DAWG-Python may be a better fit.

@fgregg
Copy link

fgregg commented Sep 18, 2017

Here's a fast Levenstein search. It uses ternary search trees. https://github.com/mattandahalfew/Levenshtein_search

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants