
Feature: Convert edit distance to ratio/similarity #28

Open
odannyc opened this issue Feb 1, 2019 · 4 comments

odannyc commented Feb 1, 2019

I would like a way to get the similarity of the 2 strings instead of just the distance.
For example with python-Levenshtein I can do:

>>> Levenshtein.ratio("ab", "abc")
0.8
@maxbachmann

It is worth noting that Levenshtein.ratio does not return the normalized Levenshtein distance, which is calculated as:

1 - lev_dist / max(len1, len2)

but returns the normalized InDel distance (similar to the Levenshtein distance, but substitutions are not allowed), which is normalized as:

1 - InDel_dist / (len1 + len2)

I provide a similar metric in my library RapidFuzz: https://maxbachmann.github.io/RapidFuzz/string_metric.html#normalized-levenshtein
However, with editdistance it should be simple to calculate the normalized version yourself, as sketched below.
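
For reference, a minimal sketch of doing that on top of editdistance's eval() (my own illustration, not part of either library), with the "ab"/"abc" example worked out for both normalizations:

import editdistance

def normalized_levenshtein_similarity(a, b):
    # 1 - lev_dist / max(len1, len2), in the range [0, 1]
    if not a and not b:
        return 1.0
    return 1.0 - editdistance.eval(a, b) / max(len(a), len(b))

# "ab" vs "abc":
#   Levenshtein distance = 1  ->  1 - 1/3 ≈ 0.667  (normalized Levenshtein)
#   InDel distance       = 1  ->  1 - 1/5 = 0.8    (what Levenshtein.ratio returns)
print(normalized_levenshtein_similarity("ab", "abc"))  # 0.666...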


bertsky commented May 11, 2021

the normalized Levenshtein distance, which is calculated as:

1 - lev_dist / max(len1, len2)

[...]

However, with editdistance it should be simple to calculate the normalized version yourself.

I think this is also not correct, though. The denominator should be the length of the alignment path (which is slightly longer than the longest sequence). See this comment

@maxbachmann

I think this is also not correct, though. The denominator should be the length of the alignment path (which is slightly longer than the longest sequence).

I am a complete noob on these OCR topics. I simply needed a fast implementation and found none, so I built my own ;)
Do you happen to have a reference or explanation why this is needed?

E.g. the WER is commonly implemented as

WER = lev_dist / len(reference) = (S+D+I)/(S+D+C)
WAcc = 1 - WER

This has the disadvantage that WER might be over 1. The implementation I mentioned is a modification that uses the length of the longer string/list to make sure the error rate is always between 0 and 1. It also appeared to be a simple way to map the Levenshtein distance using arbitrary weights (here the normalization I use in RapidFuzz for a range 0-100):
[image: RapidFuzz similarity normalization formula]

Dividing the edit distance by the maximum possible edit distance made sense to me.
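
As a rough illustration of the difference (my own sketch, not any library's implementation): reference-length WER can exceed 1, while dividing by the longer sequence keeps the rate in [0, 1].

import editdistance

def wer(reference, hypothesis):
    # classic WER: edit distance divided by the reference length
    return editdistance.eval(reference, hypothesis) / len(reference)

def bounded_error_rate(reference, hypothesis):
    # edit distance divided by the length of the longer sequence
    return editdistance.eval(reference, hypothesis) / max(len(reference), len(hypothesis))

ref = "the cat".split()
hyp = "the big black cat sat".split()
print(wer(ref, hyp))                 # 1.5 -> more errors than reference words
print(bounded_error_rate(ref, hyp))  # 0.6 -> always between 0 and 1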


bertsky commented May 11, 2021

I am a complete noob on these OCR topics. I simply needed a fast implementation and found none, so I built my own ;)

Looks promising, will have a look.

Do you happen to have a reference or explanation why this is needed?

Not sure about the right reference (most papers do not even treat this detail), but I found this: https://www.yorku.ca/mack/nordichi2002-shortpaper.html

Equation 3 is recommended because it always yields the same error rate as that obtained by a character-by-character analysis of the optimal alignments, as just described. In general, Equation 3 yields a slightly lower error rate than Equation 1 since len(alignment) >= max(|A|, |B|).

E.g. the WER is commonly implemented as

WER = lev_dist / len(reference) = (S+D+I)/(S+D+C)
WAcc = 1 - WER

This has the disadvantage that WER might be over 1

Yes, exactly. Same for CER. Unfortunately, even fairly standard tools (used a lot for published results) make that mistake (see here). Others get it right, though.

The implementation I mentioned is a modification to use the length of the longer string/list to make sure the error rate is always between 0 and 1.
Dividing the edit distance by the maximum possible edit distance made sense to me.

Yes, using the maximum distance is already much better than using the reference string length. My point was that even max-dist is biased: the actual alignment path can be longer. (For example, aligning abcde with bcdef would yield [(a,·) (b,b) (c,c) (d,d) (e,e) (·,f)] which has length 6, not 5.)
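
To make the alignment-length denominator concrete, here is a small self-contained sketch (my own illustration, not taken from any of the tools mentioned): it runs the standard Levenshtein DP, backtraces one optimal alignment, and divides the distance by the alignment length S+D+I+C.

def alignment_error_rate(a, b):
    m, n = len(a), len(b)
    # standard Levenshtein DP table
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    # backtrace one optimal alignment and count its length
    i, j, length = m, n, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1):
            i, j = i - 1, j - 1   # match or substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1                # deletion
        else:
            j -= 1                # insertion
        length += 1
    return d[m][n] / length if length else 0.0

# abcde vs bcdef: distance 2, one optimal alignment has length 6 -> 2/6 ≈ 0.33,
# whereas dividing by max(len) = 5 would give 0.4.
print(alignment_error_rate("abcde", "bcdef"))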
