Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accurate benchmarks? #36

Open
ekreutz opened this issue Jun 19, 2019 · 3 comments
Open

Accurate benchmarks? #36

ekreutz opened this issue Jun 19, 2019 · 3 comments

Comments

@ekreutz
Copy link

ekreutz commented Jun 19, 2019

Hey, I ran a quick benchmark of my own on, using:

  • macOS 10.14.5 (Mojave)
  • Python 3.7.3
  • python-Levenshtein 0.12.0 (pypi)
  • editdistance 0.5.3 (pypi)

In my tests python-Levenshtein is about 10x faster. Perhaps it's the macOS binaries? Or maybe your tests are outdated?

import editdistance
from Levenshtein._levenshtein import *
from timeit import timeit

s00 = "akjsdjkahsdhjkashd"
s01 = "akjsdjkahsdhj"
s10 = 'xyzz'
s11 = 'xab'
s20 = 'aaaaaaaaaaaaaaaaaaaaaaa'
s21 = 'i'

def a1():
    editdistance.eval(s00, s01)
    editdistance.eval(s10, s11)
    editdistance.eval(s20, s21)
def a2():
    distance(s00, s01)
    distance(s10, s11)
    distance(s20, s21)

print("editdistance")
print(timeit(a1, number=100000))

print("\npython-Levenshtein")
print(timeit(a2, number=100000))

Prints:

editdistance
0.330241583

python-Levenshtein
0.03681695899999998
@desialex
Copy link

I just compared those two in a real-life application and editdistance is about 30% faster.

@maxbachmann
Copy link

maxbachmann commented Apr 11, 2021

At least in my benchmarks this is largely dependent on the length of the input strings. Here is a comparision for different libraries using different string lengths. Both edlib and editdistance appear to have a lot of overhead for short strings.

Only python-Levenshtein uses a quadratic time implementation, while all others use Myers/Hyyrös bitparallel implementation.

@dzieciou
Copy link

dzieciou commented Jul 25, 2021

@maxbachmann Great chart. It shows the choice of implementation really depends on the application.

  • One application would be finding the closest single words, e.g., in spelling correction.
  • The other one would be measuring edit distance for long string sequences, e.g., in comparing the layout of documents as in "Comparison and Classification of Documents Based on Layout Similarity". I regret I haven't done a similar evaluation as yours at the time I was implementing document clustering...

For the latter rapidfuzz seems like a good choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants