Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use faster implementations of edit distances #378

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

maxbachmann
Copy link

This replaces the usage of jellyfish in the edit distance implementation with rapidfuzz, which provides significantly faster implementations of these metrics. In addition it provides normalized versions of these metrics. The only special handling required here is that rapidfuzz considers two empty strings as similar, while the implementation in textacy considers them not similar.

@bdewilde
Copy link
Collaborator

Hi @maxbachmann, thanks for submitting a PR. I've heard of rapidfuzz but hadn't ever poked around in it! :)

I see that your changes make the textacy code slightly simpler, but to me that's not a compelling reason to switch from an otherwise satisfying dependency. (In fact, I switched from fuzzywuzzy to jellyfish years ago, partially for license reasons.) You claim that your implementations of these distance metrics are faster -- which could indeed be a good reason to switch! Would you be willing to benchmark times on texts of varying lengths, comparing the two implementations?

@maxbachmann
Copy link
Author

Sure I have a script for this laying around. It uses random ascii strings of varying lengths (the findings are the same for different unicode ranges).

Levenshtein:
levenshtein

JaroWinkler:
jarowinkler

Hamming:
hamming

I expected everything except for the extreme performance difference in the hamming distance. This appears to be a performance bug in the new rust backend, since the old C implementation performed quite a bit better:
hamming_old

I reported this to the jellyfish maintainer as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants