
Jaro Winkler distance added #2044

Merged: 36 commits, merged into nltk:develop on Jul 10, 2018

Conversation

@53X (Contributor) commented Jun 12, 2018:

Jaro-Winkler distance has been added to nltk.metrics, and PEP-8 issues have been taken care of.

I have not used any built-in functions from nltk for this. This PR introduces a new string comparison metric, written as a completely separate function named jaro_winkler_distance().
It fixes this issue.
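For orientation, a minimal usage sketch (the function name and module are taken from this PR's description; the exact import path in the merged code may differ):

from nltk.metrics.distance import jaro_winkler_distance  # assumed location

# Distance = 1 - similarity, so near-identical strings give a value close to 0.
print(jaro_winkler_distance('MARTHA', 'MARHTA'))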

@53X changed the title from "Jaro Winkler distance added #2006 Issue fix attempt" to "Jaro Winkler distance added" on Jun 12, 2018
@alvations self-assigned this on Jun 12, 2018
@53X (Contributor Author) commented Jun 13, 2018:

@alvations, please give your thoughts and comments on this one.

    p = type float, weight parameter with a standard value of 0.1
    l_cnt = type int, common prefix at the start of the string (max val = 4)
    jw_sim = type float, jaro winkler similarity between s1 and s2
    jw_dist = type float, it is equal to 1 - jw_sim """
Contributor:
The docstring has to be a lot clearer.

  • What are s1 and s2? I assume they're strings, but the other distance metrics in nltk.metrics.distance actually accept any sequence type. Jaro-Winkler seems generic enough to cover any sequence.
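For reference, the existing metric in the same module already behaves this way; e.g. edit_distance (whose signature appears in a diff below) accepts token lists:

from nltk.metrics.distance import edit_distance

# Two substitutions turn one token sequence into the other.
print(edit_distance(["the", "quick", "brown", "fox"],
                    ["the", "brown", "quick", "fox"]))  # 2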

Contributor:

Also the variable names in this function need to be clearer.

Since the variables are scoped within the function, you can afford to have more explicit variable names, e.g.

  • mtch -> matches
  • t_cnt -> transposition_counts

Contributor Author:

s1 and s2 are the two strings between which we want to calculate the JW distance. I am going to include that in the docstring. Thanks for pointing it out.

    jw_sim = type float, jaro winkler similarity between s1 and s2
    jw_dist = type float, it is equal to 1 - jw_sim """
    s1 = s1.lower()
    s2 = s2.lower()
Contributor:

As commented above, if the input is a sequence, this would throw an AttributeError.
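For example, a list input fails immediately on the .lower() call:

>>> ['a', 'b'].lower()
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'lower'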

@alvations (Contributor) commented Jun 13, 2018:

If .lower() is the intention, then adding it as a kwarg in the function arguments would be better, i.e.

def jaro_winkler_distance(s1, s2, p=0.1, lowercase=True):
    # some code ...

    if lowercase:
        s1, s2 = s1.lower(), s2.lower()

    # other code ...

    if(abs(pos1 - pos2) <= dist_bound):
        mtch = mtch + 1
        if(pos1 != pos2):
            t_cnt = t_cnt + 1
Contributor:

Actually, this looks very Levenshtein-like.

Contributor:

Out of curiosity, is there a simpler way to do this?

Contributor Author:

Well, the above code actually counts the number of "matched" characters and the number of transpositions. Levenshtein also calculates transpositions.

Contributor Author:

Simpler way? I don't know... I am gonna have to check it out. BTW, this works too.


        if(l_cnt == 4):
            break
    jw_sim = j_sim + (l_cnt*p*(1-j_sim))
    jw_dist = 1 - jw_sim
Contributor:

Just return 1 - jw_sim here.

Contributor Author:

Sure... it is actually redundant.

            break
        if(l_cnt == 4):
            break
    jw_sim = j_sim + (l_cnt*p*(1-j_sim))
Contributor:

It would make sense to put this formula in the function docstring and then explain the related variables.

Contributor Author:

Yeah, sure.

@@ -143,7 +147,7 @@ def masi_distance(label1, label2):
     return 1 - len_intersection / len_union * m


-def interval_distance(label1,label2):
+def interval_distance(label1, label2):
Contributor:

👍

@@ -66,7 +68,7 @@ def edit_distance(s1, s2, substitution_cost=1, transpositions=False):
     been done in other orders, but at least three steps are needed.

     Allows specifying the cost of substitution edits (e.g., "a" -> "b"),
-    because sometimes it makes sense to assign greater penalties to substitutions.
+    because often it makes sense to assign bigger penalties to substitutions.
Contributor:

Is this the right assumption, that it's often and not sometimes?

Also, "greater penalties" sounds better than "bigger penalties"; the latter sounds presidential ;P

Contributor Author:

I am changing it back to 'sometimes' and 'greater'. Thanks for the heads up.

@@ -34,7 +35,8 @@ def _edit_dist_init(len1, len2):
     return lev


-def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False):
+def _edit_dist_step(lev, i, j, s1, s2,
+                    substitution_cost=1, transpositions=False):
Contributor:

No reason to break this unless it's over the 80-char limit. Even so, the break should be one per kwarg, so:

def _edit_dist_step(lev, i, j, s1, s2,
                    substitution_cost=1,
                    transpositions=False):

Contributor Author:

It is over 80 chars, so I decided to break it this way.

Contributor:

Break the other keyword into a second line too, that'll be more consistent. Thanks!

Contributor Author:

cool

    else:
        break
    if(l_cnt == 4):
        break
Contributor:

These condition checks don't exactly make use of numpy. I think there are better ways to do the matching.

Contributor:

Actually this looks like RIBES' https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L233. I might be wrong.

Contributor Author:

Well, I actually used numpy for np.floor(). Also, I went through RIBES, and the find_increasing_sequences(worder) function actually returns monotonic sequences given a worder list. But l_cnt here is the length of the common prefix at the start of the string, up to a maximum of four characters, as per https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
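A small sketch of that prefix count (the function name is mine; l_cnt and max_l follow the PR's naming):

def common_prefix_length(s1, s2, max_l=4):
    # Count matching characters from the start of both strings,
    # capped at max_l (4 in the Jaro-Winkler definition).
    l_cnt = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or l_cnt == max_l:
            break
        l_cnt += 1
    return l_cnt

print(common_prefix_length('SHACKLEFORD', 'SHACKELFORD'))  # 4 (capped, though 'SHACK' + 'L'/'E' share 5 matching chars)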

@alvations (Contributor):

I have to read more on the metric to really understand the algorithm implemented. But I've left some suggestions on the coding style and documentation issues that need to be improved. Hope that helps.

print("Jaro-Winkler_distance:", jaro_winkler_distance('JONES', 'JohnSon'))
print("Jaro-Winkler_similarity:",
1-jaro_winkler_distance('JONES', 'JohnSon'))

Contributor:

One more request here =)

Could you reuse the edit_distance_examples and move this bit of demo code into the loop at L240?

Contributor Author:

Yeah, sure. I was initially afraid of breaking the code, so I avoided doing that.

@53X (Contributor Author) commented Jun 13, 2018:

@alvations, I am gonna make the edits real quick.

@53X (Contributor Author) commented Jun 13, 2018:

@alvations, I made the changes. Please take a look.

@alvations (Contributor) commented Jun 14, 2018:

After reading the formula and other implementations, it looks like the Winkler part is merely a function of the Jaro similarity. I'll make some changes and push.

@53X if you don't mind, I'll make some adaptations to your code and push so that it is better documented and suits Pythonic conventions.


Changes made

  • Decouple the jaro distance from the jaro-winkler function.

  • After some consideration, imposing lowercasing on the comparison should be a user decision; we shouldn't encourage or enforce .lower().

  • Store the lengths of the strings, since they are used several times; no point re-calculating them =)

  • Variable name changes:

    • match -> matches, since it's counting
    • transposition_count -> transpositions; since it's just counting and nothing other than integers is stored, it's okay to drop the _count. Also there's no need to store half the value; just store the full integer and halve it when returning the Jaro similarity.
    • dist_bound -> match_bound; it's a little strange to call it a distance bound because it's not the upper bound of the metric itself but of the matches.
  • x = x + 1 -> x += 1

@alvations (Contributor) commented Jun 14, 2018:

@53X, checking your algorithm, you have made certain assumptions with:

    # Iterate through sequences, check for matches and compute transpositions.
    for ch1 in s1:     # Iterate through each character.
        if ch1 in s2:  # Check whether the character occurs in s2.
            pos1 = s1.index(ch1)
            pos2 = s2.index(ch1)
            if(abs(pos1-pos2) <= dist_bound):
                match = match + 1
                if(pos1 != pos2):
                    transposition_count = transposition_count + 1

When you use .index(), it always returns the first occurrence, so if a character repeats and its occurrences don't fall within the bounding dist_bound, it's not going to work correctly, right?

I don't think you can get both matches and transpositions in one iteration. You might have to iterate through the two sequences with bounding windows in a more dynamic-programming-like way, as in https://rosettacode.org/wiki/Jaro_distance#Python
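A tiny made-up example of the .index() pitfall:

s1, s2 = "aab", "baa"
# .index() always returns the first occurrence, so both 'a' characters
# in s1 are located at position 0 and the repeat is never matched
# against a different position in s2.
print(s1.index('a'), s2.index('a'))  # 0 1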

@alvations (Contributor) commented Jun 14, 2018:

@53X I'm done with the edits. I hope the changes made are okay with you. I think it's best to be more explicit, especially with some of these variable names, and clearer in terms of what each loop/condition is doing.

Going back to the algorithm, I think you might need to run some tests on whether the loop here is really the same as the typical dynamic programming code from Rosetta Code; it looks different, given the .index() assumptions.

    # Iterate through sequences, check for matches and compute transpositions.
    for ch1 in s1:     # Iterate through each character.
        if ch1 in s2:  # Check whether the character occurs in s2.
            pos1 = s1.index(ch1)
            pos2 = s2.index(ch1)
            if(abs(pos1-pos2) <= dist_bound):
                match = match + 1
                if(pos1 != pos2):
                    transposition_count = transposition_count + 1

Looks like there's some difference in https://www.kaggle.com/alvations/jaro-winkler

Oddly, this implementation https://github.com/nap/jaro-winkler-distance is also different:

>>> from pyjarowinkler.distance import get_jaro_distance
>>> string_distance_examples = [
...     ("rain", "shine"), ("abcdef", "acbdef"), ("language", "lnaguaeg"),
...     ("language", "lnaugage"), ("language", "lngauage")]
>>> 
>>> for s1, s2 in string_distance_examples:
...    get_jaro_distance(s1, s2, winkler=False)
... 
0.6333333333333333
0.9444444444444445
0.9166666666666666
0.9166666666666666
0.9583333333333334

@alvations (Contributor) commented Jun 14, 2018:

With all the implementations giving different values, I suggest we follow this paper's output https://www.census.gov/srd/papers/pdf/rr93-8.pdf (page 35, Table 5) and aim our implementation at that:

Table 5: Comparison of String Comparators Rescaled between 0 and 1

String 1  String 2  Winkler  Jaro   D-L
billy     billy     1.000    1.000  1.000
billy     bill      0.967    0.933  0.800
billy     blily     0.947    0.933  0.600
massie    massey    0.944    0.889  0.600
yvette    yevett    0.911    0.889  0.600
billy     bolly     0.893    0.867  0.600
dwayne    duane     0.858    0.822  0.400
dixon     dickson   0.853    0.791  0.200
billy     susan     0.000    0.000  0.000

Another table of results from https://www.census.gov/srd/papers/pdf/rr94-5.pdf

Table 2.1: Comparison of String Comparators Using Last Names, First Names, and Street Names

String 1      String 2      Jaro   Wink   McLa   Lynch
SHACKLEFORD   SHACKELFORD   0.970  0.982  0.982  0.989
DUNNINGHAM    CUNNIGHAM     0.896  0.896  0.896  0.931
NICHLESON     NICHULSON     0.926  0.956  0.969  0.977
JONES         JOHNSON       0.790  0.832  0.860  0.874
MASSEY        MASSIE        0.889  0.933  0.953  0.953
ABROMS        ABRAMS        0.889  0.922  0.946  0.952
HARDIN        MARTINEZ      0.722  0.722  0.722  0.774
ITMAN         SMITH         0.467  0.467  0.507  0.507
JERALDINE     GERALDINE     0.926  0.926  0.948  0.966
MARHTA        MARTHA        0.944  0.961  0.961  0.971
MICHELLE      MICHAEL       0.869  0.921  0.938  0.944
JULIES        JULIUS        0.889  0.933  0.953  0.953
TANYA         TONYA         0.867  0.880  0.916  0.933
DWAYNE        DUANE         0.822  0.840  0.873  0.896
SEAN          SUSAN         0.783  0.805  0.845  0.845
JON           JOHN          0.917  0.933  0.933  0.933
JON           JAN           0.000  0.000  0.860  0.860
BROOKHAVEN    BRROKHAVEN    0.933  0.947  0.947  0.964
BROOK HALLOW  BROOK HLLW    0.944  0.967  0.967  0.977
DECATUR       DECATIR       0.905  0.943  0.960  0.965
FITZRUREITER  FITZENREITER  0.856  0.913  0.923  0.945
HIGBEE        HIGHEE        0.889  0.922  0.922  0.932
HIGBEE        HIGVEE        0.889  0.922  0.946  0.952
LACURA        LOCURA        0.889  0.900  0.930  0.947
IOWA          IONA          0.833  0.867  0.867  0.867
1ST           IST           0.000  0.000  0.844  0.844

@53X (Contributor Author) commented Jun 15, 2018:

@alvations, please take a look at the (JON, JAN) and (1ST, IST) examples from your comment (second table). I think there is a typo in the paper: there is no way (JON, JAN) and (1ST, IST) can be completely dissimilar as suggested, since we have matching characters in both cases. Also, I have experimented with the value of the max_l parameter in the Jaro-Winkler formula; keeping its value at 4 gives us the best results (i.e., completely replicating the results reported in the paper). For those two ambiguously reported values, my results are pretty much consistent with this online tool. I have also made a few more changes. Please review them and give your feedback.

@53X (Contributor Author) commented Jun 30, 2018:

@alvations, I have made sure that all the test cases pass in the doctest. Please have a look at the latest commits and provide your feedback.

@alvations (Contributor) commented Jul 2, 2018:

@53X what is the p_val? Is it decimal places? If so, it shouldn't be in the similarity function but in the doctest.

Sorry about the previous comment; I saw the commits while commuting and misunderstood. There's no need to call it p_values; users might confuse it with the p-value normally used in statistics.

Why is there a need for a different p for different tests? Shouldn't they all use 0.1 like in the paper?

You shouldn't be hacking the p such that the values pass in the doctest =(

@53X is it because the default value of p isn't 0.1 in the paper but a variable?

@alvations (Contributor) commented Jul 3, 2018:

Let's debug this one instance at a time, in a way that lets us copy and paste the code.

Here's the stripped down code without the docstring:

def jaro_similarity(s1, s2):
    # First, store the length of the strings
    # because they will be re-used several times.
    len_s1, len_s2 = len(s1), len(s2)

    # The upper bound of the distance for being a matched character.
    match_bound = max(len_s1, len_s2) // 2 - 1

    # Initialize the counts for matches and transpositions.
    matches = 0  # no.of matched characters in s1 and s2
    transpositions = 0  # no. of transpositions between s1 and s2
    flagged_1 = []  # positions in s1 which are matches to some character in s2
    flagged_2 = []  # positions in s2 which are matches to some character in s1

    # Iterate through sequences, check for matches and compute transpositions.
    for i in range(len_s1):     # Iterate through each character.
        upperbound = min(i+match_bound, len_s2-1)
        lowerbound = max(0, i-match_bound)
        for j in range(lowerbound, upperbound+1):
            if s1[i] == s2[j] and j not in flagged_2:
                matches += 1
                flagged_1.append(i)
                flagged_2.append(j)
                break
    flagged_2.sort()
    for i, j in zip(flagged_1, flagged_2):
        if s1[i] != s2[j]:
            transpositions += 1

    if matches == 0:
        return 0
    else:
        return 1/3 * (matches/len_s1 +
                      matches/len_s2 +
                      (matches-transpositions//2)/matches
                      )


def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
    # Compute the Jaro similarity
    jaro_sim = jaro_similarity(s1, s2)

    # Initialize the upper bound for the no. of prefixes.
    # if user did not pre-define the upperbound, use smaller among length of s1
    # and length of s2

    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break
    # Return the similarity value as described in docstring.
    return jaro_sim + (l_ * p * (1 - jaro_sim))

Then we run the tests:

winkler_examples = [("billy", "billy"), ("billy", "bill"), ("billy", "blily"),
                    ("massie", "massey"), ("yvette", "yevett"), ("billy", "bolly"), ("dwayne", "duane"),
                    ("dixon", "dickson"), ("billy", "susan")]

winkler_scores = [1.000, 0.967, 0.947, 0.944, 0.911, 0.893, 0.858, 0.853, 0.000]
jaro_scores =    [1.000, 0.933, 0.933, 0.889, 0.889, 0.867, 0.822, 0.790, 0.000]

max_l = 4  # the prefix-length cap; needed by the loop below


for (s1, s2), jscore, wscore in zip(winkler_examples, jaro_scores, winkler_scores):
    if round(jaro_similarity(s1, s2), 3) != jscore:
        print('jaro', s1, s2, jscore, round(jaro_similarity(s1, s2), 3))
        
    #if round(jaro_winkler_similarity(s1, s2, p=p_val), 3) != wscore:
    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break

    print('jaro_winkler', s1, s2, jscore, wscore, jaro_winkler_similarity(s1, s2), l_)

[out]:

jaro_winkler billy billy 1.0 1.0 1.0 4
jaro_winkler billy bill 0.933 0.967 0.96 4
jaro_winkler billy blily 0.933 0.947 0.94 1
jaro_winkler massie massey 0.889 0.944 0.9333333333333333 4
jaro_winkler yvette yevett 0.889 0.911 0.8999999999999999 1
jaro_winkler billy bolly 0.867 0.893 0.88 1
jaro_winkler dwayne duane 0.822 0.858 0.84 1
jaro_winkler dixon dickson 0.79 0.853 0.8323809523809523 2
jaro_winkler billy susan 0.0 0.0 0.0 0

So our Jaro similarity matches; we must have done something different in the Winkler rescaling.

I suspect it's because of how we're computing the prefix matches. If we take a look at the p_val hacks you put in the previous commits, they have specific denominators, i.e. 0.125 = 1/8, 0.20 = 1/5. So the first thing to check is the l_ value, instead of putting up with a "variable" constant.

Let's take a look at:

s1 = "billy"
s2 = "bill"

j = jaro_similarity(s1, s2)
w = jaro_winkler_similarity(s1, s2)
p = 0.1 # Keep this constant

l_ = 4
print(j + (l_*0.1*(1-j)))
l_ = 5 
print(j + (l_*0.1*(1-j))) # Now it matches the Table 5 values

[out]:

0.96
0.9666666666666666

Similarly:

s1 = "dixon"
s2 = "dickson"

j = jaro_similarity(s1, s2)
w = jaro_winkler_similarity(s1, s2)
p = 0.1 # Keep this constant

l_ = 2
print(j + (l_*0.1*(1-j)))
l_ = 3 
print(j + (l_*0.1*(1-j))) # Now it matches the Table 5 values

[out]:

0.8323809523809523
0.8533333333333333

It's possible that max_l default is not 4.

@alvations (Contributor):

I might be wrong about the l_ and p. Looking at the formula:

jaro_winkler_sim = jaro_sim + ( l * p * (1 - jaro_sim) )

It boils down to:

y = x + m * (1-x) 

And to ensure that y ranges between [0, 1], it looks like we need the condition 0.0 <= m <= 1.0.
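A quick rearrangement shows why: y = x + m * (1 - x) = (1 - m) * x + m, i.e. a convex combination of x and 1 whenever 0.0 <= m <= 1.0, so y stays within [x, 1] and hence within [0, 1]. Once m > 1, the coefficient on x turns negative and y can overshoot 1.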

That means that l and p are not exactly independent of each other in the Jaro-Winkler formula. The default value of p=0.1 and setting the max of l at 4 constrain the scaling factor of the (1-x) term, so it worked; but when l*p is more than 1.0, we get:

>>> s1 = "billy"
>>> s2 = "bill"
>>> jaro_winkler_similarity(s1, s2, p=0.3, max_l=4) # Output > 1.0
1.0133333333333334

Similarly, there's no real need for max_l=4; it can be higher as long as its product with the p factor is less than or equal to 1.0, e.g.

>>> s1 = "billy"
>>> s2 = "bill"
>>> jaro_winkler_similarity(s1, s2, p=0.05, max_l=7)
0.9466666666666665

After looking at the Winkler (1990) paper, it looks like the parameters in the experiments were ambiguously set. So I think the p variable hacks you've put in are okay. Let me try to clean it up a little and add an assert.

Cleaner loop to check the `l` prefix size, and added a warning for unexpected output
@alvations (Contributor):

@53X there are still 3 examples that don't match up with the p_factors changes:

  • BROOK HALLOW and BROOK HLLW
  • DECATUR and DECATIR
  • SHACKLEFORD and SHACKELFORD

         l_ += 1
     else:
         break
-    if l_ == max_l:
-        break
+    if l_ > max_l:
+        break
Contributor Author:

The loop should terminate and stop counting when l_ has reached its maximum bound (max_l). Thus this would be the better option, as l_ shouldn't go beyond max_l:

    if l_ == max_l:
        break

Contributor Author:

@alvations, please review this piece of code.

@53X (Contributor Author) commented Jul 3, 2018:

The formula for the jaro_winkler_similarity goes as follows:
jaro_winkler_similarity = jaro_sim + l_ * p * (1 - jaro_sim)
Thus we can deduce that l_ * p <= 1 is necessary (not max_l * p <= 1) for jaro_winkler_similarity <= 1 to hold.

Quoting from the comments in the code of this PR (see the jaro_winkler_similarity function in the code):

l_ is the length of the common prefix at the start of the string. This implementation provides an upper bound for the l_ value. A common value of this upper bound is 4.

max_l is this upper bound, which is set to its common value, 4, by default. It is worth noting that l_ stays well below an upper bound of max_l = 100 (say) in the case of (TANYA, TONYA), where p = 0.1 and l_ = 1.

I have put in a separate new doctest case that demonstrates the same: jaro_winkler_similarity('TANYA', 'TONYA', p=0.1, max_l=4) and jaro_winkler_similarity('TANYA', 'TONYA', p=0.1, max_l=100) both output 0.880.

Thus 0 <= jaro_winkler_similarity <= 1 depends not on max_l * p <= 1 but on l_ * p <= 1.

One more reason why max_l = 4 by default is that we need l_ * p <= 1 for the similarity value to lie between 0 and 1, and the maximum value that p can take is 0.25. So when p = 0.25, maintaining l_ * p <= 1 makes 4 the limiting value of l_.


if not 0 <= max_l * p <= 1:
    warnings.warn(str("The product of `max_l * p` doesn't fall between [0,1]."
                      "Jaro-Winkler similarity will not be between 0 and 1.")
                  )

This piece of code that generates the warning is logical, as there might be cases where l_ becomes equal to max_l such that l_ * p > 1. In light of this, we might change the warning to the following:

if not 0 <= max_l * p <= 1:
    warnings.warn(str("The product `l_ * p` might not fall between [0,1]."
                      "Jaro-Winkler similarity might not be between 0 and 1.")
                  )

Also, instead of this code that calculates the prefix length l_:

for s1_i, s2_i in zip(s1, s2):
    if s1_i == s2_i:
        l_ += 1
    else:
        break
    if l_ > max_l:
        break

we should use this one:

for s1_i, s2_i in zip(s1, s2):
    if s1_i == s2_i:
        l_ += 1
    else:
        break
    if l_ == max_l:
        break

This is because the loop should terminate as soon as l_ reaches its upper bound (max_l), if it reaches it at all.
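A small standalone check of the off-by-one (the example string is made up):

s1 = s2 = "abcdefgh"
max_l, l_ = 4, 0
for s1_i, s2_i in zip(s1, s2):
    if s1_i == s2_i:
        l_ += 1
    else:
        break
    if l_ == max_l:  # with `if l_ > max_l:` the loop would only stop at l_ == 5
        break
print(l_)  # 4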

@alvations (Contributor):

@53X Thanks for the explanation on the if l_ == max_l:!

Regarding the warning: max_l is a user-changeable parameter, while l_ is the internal, variable count of the longest prefix length. Since max_l subsumes l_, checking max_l * p <= 1 effectively also checks l_ * p <= 1, and the warning should be about something under the user's control.

@alvations (Contributor) commented Jul 3, 2018:

@53X just 3 small comments pending in the code review and some light changes, and the PR will look good to merge!

Sorry that the process took a while because of the back and forth regarding the algorithm and the regression tests.

warnings.warn(str("The product of `max_l * p` doesn't fall between [0,1]."
"Jaro-Winkler similarity will not be between 0 and 1.")
)
warnings.warn(str("The product `l_ * p` might not fall between [0,1]."
Contributor:

I think it's okay to use max_l here. See comment in PR.

@@ -260,80 +260,90 @@ def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
     American Statistical Association: 354-359.
     such that:

-    jaro_winkler_sim = jaro_sim + ( l * p * (1 - jaro_sim) )
+    jaro_winkler_sim = jaro_sim + ( l_ * p * (1 - jaro_sim) )
Contributor:

I think it's cleaner to use l here to be clearer when communicating the formula.

Contributor:

I think it's okay to use l in the code too instead of l_ since the function is locally scoped, no global variable.

@alvations (Contributor):

@53X Thank you for the updates! Now it LGTM.

IMO, you've contributed the best-documented metric in nltk.metrics.distance, and it comes with a doctest too =)

Kudos for the amount of work done to document and validate the implementation!

@53X (Contributor Author) commented Jul 4, 2018:

@alvations, @stevenbird, can this PR be merged now and the corresponding issue be closed?

@stevenbird (Member):

Thanks for an excellent contribution @53X.
Thanks for your careful reviewing work @alvations.

@stevenbird merged commit ecced11 into nltk:develop on Jul 10, 2018
@alvations added a commit referencing this pull request on Aug 15, 2018:
This reverts commit ecced11, reversing changes made to ab07520.