
Feature: Convert edit distance to ratio/similarity #28

Open
odannyc opened this issue Feb 1, 2019 · 4 comments

odannyc commented Feb 1, 2019

I would like a way to get the similarity of the 2 strings instead of just the distance.
For example with python-Levenshtein I can do:

>>> Levenshtein.ratio("ab", "abc")
0.8
@maxbachmann

It is worth noting that Levenshtein.ratio does not return the normalized Levenshtein distance, which is calculated as:

1 - lev_dist / max(len1, len2)

but returns the normalized InDel distance (similar to the Levenshtein distance, but substitutions are not allowed), which is normalized as:

1 - InDel_dist / (len1 + len2)

I provide a similar metric in my library RapidFuzz: https://maxbachmann.github.io/RapidFuzz/string_metric.html#normalized-levenshtein
However, with editdistance it should be simple to calculate the normalized version yourself, as sketched below.
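
For reference, a minimal sketch of doing that on top of editdistance's eval() (my own illustration, not part of either library), with the "ab"/"abc" example worked out for both normalizations:

import editdistance

def normalized_levenshtein_similarity(a, b):
    # 1 - lev_dist / max(len1, len2), in the range [0, 1]
    if not a and not b:
        return 1.0
    return 1.0 - editdistance.eval(a, b) / max(len(a), len(b))

# "ab" vs "abc":
#   Levenshtein distance = 1  ->  1 - 1/3 ≈ 0.667  (normalized Levenshtein)
#   InDel distance       = 1  ->  1 - 1/5 = 0.8    (what Levenshtein.ratio returns)
print(normalized_levenshtein_similarity("ab", "abc"))  # 0.666...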


bertsky commented May 11, 2021

the normalized Levenshtein distance, which is calculated as:

1 - lev_dist / max(len1, len2)

[...]

However, with editdistance it should be simple to calculate the normalized version yourself.

I think this is also not correct, though. The denominator should be the length of the alignment path (which is slightly longer than the longest sequence). See this comment

@maxbachmann

I think this is also not correct, though. The denominator should be the length of the alignment path (which is slightly longer than the longest sequence).

I am a complete noob on these OCR topics. I simply needed a fast implementation and found none, so I built my own ;)
Do you happen to have a reference or explanation why this is needed?

E.g. the WER is commonly implemented as

WER = lev_dist / len(reference) = (S+D+I)/(S+D+C)
WAcc = 1 - WER

This has the disadvantage that WER might be over 1. The implementation I mentioned is a modification that uses the length of the longer string/list to make sure the error rate is always between 0 and 1. It also appeared to be a simple way to map the Levenshtein distance using arbitrary weights (here the normalization I use in RapidFuzz for a range 0-100):
[image: RapidFuzz similarity normalization formula]

Dividing the edit distance by the maximum possible edit distance made sense to me.
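
As a rough illustration of the difference (my own sketch, not any library's implementation): reference-length WER can exceed 1, while dividing by the longer sequence keeps the rate in [0, 1].

import editdistance

def wer(reference, hypothesis):
    # classic WER: edit distance divided by the reference length
    return editdistance.eval(reference, hypothesis) / len(reference)

def bounded_error_rate(reference, hypothesis):
    # edit distance divided by the length of the longer sequence
    return editdistance.eval(reference, hypothesis) / max(len(reference), len(hypothesis))

ref = "the cat".split()
hyp = "the big black cat sat".split()
print(wer(ref, hyp))                 # 1.5 -> more errors than reference words
print(bounded_error_rate(ref, hyp))  # 0.6 -> always between 0 and 1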


bertsky commented May 11, 2021

I am a complete noob on these OCR topics. I simply needed a fast implementation and found none, so I built my own ;)

Looks promising, will have a look.

Do you happen to have a reference or explanation why this is needed?

Not sure about the right reference (most papers do not even treat this detail), but I found this: https://www.yorku.ca/mack/nordichi2002-shortpaper.html

Equation 3 is recommended because it always yields the same error rate as that obtained by a character-by-character analysis of the optimal alignments, as just described. In general, Equation 3 yields a slightly lower error rate than Equation 1 since len(alignment) >= max(|A|, |B|).

E.g. the WER is commonly implemented as

WER = lev_dist / len(reference) = (S+D+I)/(S+D+C)
WAcc = 1 - WER

This has the disadvantage that WER might be over 1

Yes, exactly. Same for CER. Unfortunately, even fairly standard tools (used a lot for published results) make that mistake (see here). Others get it right, though.

The implementation I mentioned is a modification to use the length of the longer string/list to make sure the error rate is always between 0 and 1.
Dividing the edit distance by the maximum possible edit distance made sense to me.

Yes, using the maximum distance is already much better than using the reference string length. My point was that even max-dist is biased: the actual alignment path can be longer. (For example, aligning abcde with bcdef would yield [(a,·) (b,b) (c,c) (d,d) (e,e) (·,f)] which has length 6, not 5.)
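
To make the alignment-length denominator concrete, here is a small self-contained sketch (my own illustration, not taken from any of the tools mentioned): it runs the standard Levenshtein DP, backtraces one optimal alignment, and divides the distance by the alignment length S+D+I+C.

def alignment_error_rate(a, b):
    m, n = len(a), len(b)
    # standard Levenshtein DP table
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    # backtrace one optimal alignment and count its length
    i, j, length = m, n, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1):
            i, j = i - 1, j - 1   # match or substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1                # deletion
        else:
            j -= 1                # insertion
        length += 1
    return d[m][n] / length if length else 0.0

# abcde vs bcdef: distance 2, one optimal alignment has length 6 -> 2/6 ≈ 0.33,
# whereas dividing by max(len) = 5 would give 0.4.
print(alignment_error_rate("abcde", "bcdef"))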
