Jaro Winkler distance added #2044
Conversation
@alvations, please give your thoughts and comments on this one.
nltk/metrics/distance.py
Outdated
p = type float, weight parameter with a standard value of 0.1
l_cnt = type int, common prefix at the start of the string (max val = 4)
jw_sim = type float, jaro winkler similarity between s1 and s2
jw_dist = type float, it is equal to 1 - jw_sim """
The docstring has to be a lot clearer.
- What are s1 and s2? I assume it's taking strings, but actually the other distance metrics in nltk.metrics.distance accept any sequence type. Jaro-Winkler seems to be generic enough to cover any sequence.
Also, variable names in this function need to be clearer. Since the variables are scoped within the function, you can afford more explicit names, e.g. mtch -> matches, t_cnt -> transposition_counts.
s1 and s2 are the two strings between which we want to calculate the JW distance. I am going to include that in the docstring. Thanks for pointing it out.
nltk/metrics/distance.py
Outdated
jw_sim = type float, jaro winkler similarity between s1 and s2
jw_dist = type float, it is equal to 1 - jw_sim """
s1 = s1.lower()
s2 = s2.lower()
As commented above, if the input is a sequence, this would throw an AttributeError.
If .lower() is the intention, then adding it as a kwarg in the function arguments would be better, i.e.

def jaro_winkler_distance(s1, s2, p=0.1, lowercase=True):
    # some code ...
    if lowercase:
        s1, s2 = s1.lower(), s2.lower()
    # other code ...
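The AttributeError mentioned above is easy to reproduce: str.lower() only exists on strings, so calling it on any other sequence type fails. A minimal standalone illustration (plain Python, not the PR's code):

```python
s = ["J", "O", "N", "E", "S"]  # a list, i.e. a non-string sequence
try:
    s.lower()  # lists have no .lower() method
except AttributeError as err:
    print("AttributeError:", err)
```

This is why guarding the lowercasing behind a keyword argument (or checking the input type) keeps the function usable for generic sequences.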
nltk/metrics/distance.py
Outdated
if(abs(pos1 - pos2) <= dist_bound):
    mtch = mtch + 1
    if(pos1 != pos2):
        t_cnt = t_cnt + 1
Actually, this looks very Levenshtein-like.
Out of curiosity, is there a simpler way to do this?
Well, the above code actually counts the no. of "matched" characters and the no. of transpositions. Levenshtein also calculates transpositions.
Simpler way? I don't know ... I will have to check it out. BTW, this works too.
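For reference, the dist_bound used in the snippet above is the standard Jaro "match window", derived from the string lengths. A minimal sketch of just that bound (not the PR's code):

```python
# The Jaro match bound: two equal characters only count as a match
# when their positions differ by at most max(len1, len2) // 2 - 1.
s1, s2 = "martha", "marhta"
dist_bound = max(len(s1), len(s2)) // 2 - 1
print(dist_bound)  # 2 for a pair of 6-character strings
```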
nltk/metrics/distance.py
Outdated
if(l_cnt == 4):
    break
jw_sim = j_sim + (l_cnt*p*(1-j_sim))
jw_dist = 1 - jw_sim
Just return 1 - jw_sim here.
Sure .... it is actually redundant
nltk/metrics/distance.py
Outdated
    break
if(l_cnt == 4):
    break
jw_sim = j_sim + (l_cnt*p*(1-j_sim))
It would make sense to put this formula in the function docstring and then explain the related variables.
yeah sure
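The formula being discussed can be checked with illustrative numbers (the values below are assumptions for the sake of a worked example, not the PR's test data):

```python
# Winkler rescaling: jw_sim = j_sim + (l_cnt * p * (1 - j_sim))
j_sim = 0.9   # assumed Jaro similarity
l_cnt = 3     # assumed common-prefix length, capped at 4
p = 0.1       # standard weight parameter
jw_sim = j_sim + (l_cnt * p * (1 - j_sim))
print(round(jw_sim, 3))  # 0.93
```

The boost term l_cnt * p * (1 - j_sim) only closes part of the gap between j_sim and 1, which is why the result stays in [0, 1] as long as l_cnt * p <= 1.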
@@ -143,7 +147,7 @@ def masi_distance(label1, label2):
     return 1 - len_intersection / len_union * m


-def interval_distance(label1,label2):
+def interval_distance(label1, label2):
👍
nltk/metrics/distance.py
Outdated
@@ -66,7 +68,7 @@ def edit_distance(s1, s2, substitution_cost=1, transpositions=False):
     been done in other orders, but at least three steps are needed.

     Allows specifying the cost of substitution edits (e.g., "a" -> "b"),
-    because sometimes it makes sense to assign greater penalties to substitutions.
+    because often it makes sense to assign bigger penalties to substitutions.
Is it the right assumption that it's often and not sometimes? Also, greater penalties sounds better than bigger penalties; the latter sounds presidential ;P
I am changing it back to 'sometimes' and 'greater'. Thanks for the heads-up.
nltk/metrics/distance.py
Outdated
@@ -34,7 +35,8 @@ def _edit_dist_init(len1, len2):
     return lev


-def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False):
+def _edit_dist_step(lev, i, j, s1, s2,
+                    substitution_cost=1, transpositions=False):
No reason to break this unless it's over the 80-char limit. Even so, the break should be one kwarg per line, so:

def _edit_dist_step(lev, i, j, s1, s2,
                    substitution_cost=1,
                    transpositions=False):
It is over 80 chars, so I decided to break it this way.
Break the other keyword into a second line too, that'll be more consistent. Thanks!
cool
nltk/metrics/distance.py
Outdated
else:
    break
if(l_cnt == 4):
    break
These condition checks don't exactly make use of numpy. I think there are better ways to do the match.
Actually, this looks like RIBES' https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L233. I might be wrong.
Well, I actually used numpy for the np.floor(). Also, I went through RIBES, and the find_increasing_sequences(worder) function actually returns monotonic sequences given a worder list. But l_cnt here is the length of the common prefix at the start of the string, up to a maximum of four characters. It is as per this: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

I have to read more on the metric to really understand the algorithm implemented. But I've left some suggestions on the coding style and documentation issues that need to be improved. Hope that helps.
nltk/metrics/distance.py
Outdated
print("Jaro-Winkler_distance:", jaro_winkler_distance('JONES', 'JohnSon'))
print("Jaro-Winkler_similarity:",
      1 - jaro_winkler_distance('JONES', 'JohnSon'))
One more request here =)
Could you reuse the edit_distance_examples and move this bit of demo code into the loop at L240?
Yeah, sure. I was afraid of breaking the code initially, so I avoided doing that.

@alvations, I am going to make the edits real quick.

@alvations, I made the changes. Please take a look.
After reading the formula and other implementations, it looks like the Winkler part is merely a function on the Jaro similarity. I'll make some changes and push. @53X, if you don't mind, I'll do some adaptations to your code and push so that it is better documented and suits the Pythonic conventions.

Changes made
@53X checking your algorithm, you have made certain assumptions with:

# Iterate through sequences, check for matches and compute transpositions.
for ch1 in s1:  # Iterate through each character.
    if ch1 in s2:  # Check whether ch1 occurs in s2.
        pos1 = s1.index(ch1)
        pos2 = s2.index(ch1)
        if(abs(pos1-pos2) <= dist_bound):
            match = match + 1
            if(pos1 != pos2):
                transposition_count = transposition_count+1

When you use the .index() lookup, it always returns the position of the first occurrence of the character. I don't think you can get both matches and transpositions in 1 iteration. You might have to iterate through and be a little more like dynamic programming when you iterate through 2 sequences with the bounding boxes, like in https://rosettacode.org/wiki/Jaro_distance#Python
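The first-occurrence behaviour of .index() that makes the loop above unreliable can be shown in isolation (a standalone illustration, not the PR's code):

```python
# str.index always returns the position of the FIRST occurrence,
# so repeated characters all map back to the same index.
s = "banana"
positions = [s.index(ch) for ch in s]
print(positions)  # [0, 1, 2, 1, 2, 1] -- the later 'a's and 'n' collapse onto the first
```

Any string with repeated characters will therefore over- or under-count matches and transpositions when positions are looked up this way.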
@53X I'm done with the edits. I hope the changes made are okay with you. I think it's best to be more explicit, esp. with some of these variable names, and clearer in terms of what each loop / condition is doing.

Going back to the algorithm, I think you might need to do some tests on whether the loop here is really the same as the typical dynamic programming code from Rosetta code; it looks different given the:

# Iterate through sequences, check for matches and compute transpositions.
for ch1 in s1:  # Iterate through each character.
    if ch1 in s2:  # Check whether ch1 occurs in s2.
        pos1 = s1.index(ch1)
        pos2 = s2.index(ch1)
        if(abs(pos1-pos2) <= dist_bound):
            match = match + 1
            if(pos1 != pos2):
                transposition_count = transposition_count+1

Looks like there's some difference in https://www.kaggle.com/alvations/jaro-winkler

Oddly, this implementation https://github.com/nap/jaro-winkler-distance is also different:
With all the implementations giving different values, I suggest we follow this paper's output https://www.census.gov/srd/papers/pdf/rr93-8.pdf (page 35, Table 5) and aim our implementation towards that:

Table 5. Comparison of String Comparators Rescaled between 0 and 1

Another table of results from https://www.census.gov/srd/papers/pdf/rr94-5.pdf:

Table 2.1. Comparison of String Comparators Using Last Names, First Names, and Street Names
@alvations, please take a look at the (JON, JAN) and (1ST, IST) examples from your comment (second table). I think that there is a typo in the paper. There is no way (JON, JAN) and (1ST, IST) are completely dissimilar as suggested in the paper, since we have matching characters for both of them. Also, I have experimented with the value of the max_l parameter in the Jaro-Winkler formula. Keeping its value at 4 gives us the best results (i.e. completely replicating the results reported in the paper). Also, for these two ambiguously reported values in the paper, my results are pretty much consistent with this online tool. I have also made a few more changes. Please review them and give your feedback.
@alvations, I have made sure that all the test cases pass in the doc-test. Please have a look at the latest commits and provide your feedback.
@53X Sorry about the previous comment, saw the commits while commuting and misunderstood.
There's no need to call it
Why is there a need for different
You shouldn't be hacking the

@53X is it because the default value of p isn't 0.1 in the paper but a variable?
Let's debug this one instance at a time in a way that we can copy and paste the code. Here's the stripped-down code without the docstring:

def jaro_similarity(s1, s2):
    # First, store the length of the strings
    # because they will be re-used several times.
    len_s1, len_s2 = len(s1), len(s2)

    # The upper bound of the distance for being a matched character.
    match_bound = max(len_s1, len_s2) // 2 - 1

    # Initialize the counts for matches and transpositions.
    matches = 0  # no. of matched characters in s1 and s2
    transpositions = 0  # no. of transpositions between s1 and s2
    flagged_1 = []  # positions in s1 which are matches to some character in s2
    flagged_2 = []  # positions in s2 which are matches to some character in s1

    # Iterate through sequences, check for matches and compute transpositions.
    for i in range(len_s1):  # Iterate through each character.
        upperbound = min(i + match_bound, len_s2 - 1)
        lowerbound = max(0, i - match_bound)
        for j in range(lowerbound, upperbound + 1):
            if s1[i] == s2[j] and j not in flagged_2:
                matches += 1
                flagged_1.append(i)
                flagged_2.append(j)
                break
    flagged_2.sort()
    for i, j in zip(flagged_1, flagged_2):
        if s1[i] != s2[j]:
            transpositions += 1

    if matches == 0:
        return 0
    else:
        return 1/3 * (matches/len_s1 +
                      matches/len_s2 +
                      (matches - transpositions//2) / matches)


def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
    # Compute the Jaro similarity.
    jaro_sim = jaro_similarity(s1, s2)

    # Initialize the upper bound for the no. of prefixes.
    # If the user did not pre-define the upper bound, use the smaller
    # among the length of s1 and the length of s2.

    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break

    # Return the similarity value as described in the docstring.
    return jaro_sim + (l_ * p * (1 - jaro_sim))

Then we run the tests:

winkler_examples = [("billy", "billy"), ("billy", "bill"), ("billy", "blily"),
                    ("massie", "massey"), ("yvette", "yevett"), ("billy", "bolly"),
                    ("dwayne", "duane"), ("dixon", "dickson"), ("billy", "susan")]

winkler_scores = [1.000, 0.967, 0.947, 0.944, 0.911, 0.893, 0.858, 0.853, 0.000]
jaro_scores = [1.000, 0.933, 0.933, 0.889, 0.889, 0.867, 0.822, 0.790, 0.000]

for (s1, s2), jscore, wscore in zip(winkler_examples, jaro_scores, winkler_scores):
    if round(jaro_similarity(s1, s2), 3) != jscore:
        print('jaro', s1, s2, jscore, round(jaro_similarity(s1, s2), 3))
    #if round(jaro_winkler_similarity(s1, s2, p=p_val), 3) != wscore:
    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break
    print('jaro_winkler', s1, s2, jscore, wscore, jaro_winkler_similarity(s1, s2), l_)

[out]:
So our Jaro similarity matches; we must have done something different in the Winkler rescaling. I suspect it's because of how we're computing the prefix matches. Let's take a look at:

[out]:
Similarly:

s1 = "dixon"
s2 = "dickson"
j = jaro_similarity(s1, s2)
w = jaro_winkler_similarity(s1, s2)
p = 0.1  # Keep this constant

l_ = 2
print(j + (l_*0.1*(1-j)))

l_ = 3
print(j + (l_*0.1*(1-j)))  # Now it matches the Table 5 values

[out]:

It's possible that
I might be wrong about the
It boils down to:
And to assure that y ranges between That means that
Similarly there's no real need for
After looking at the Winkler (1990) paper, it looks like the parameters in the experiments were ambiguously set. So I think the p variable hacks you've put in are okay. Let me try to clean it up a little and add an assert.
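The bound being discussed works out as follows: since jaro_winkler_sim = jaro_sim + l * p * (1 - jaro_sim) and l is at most max_l, the score stays within [0, 1] exactly when max_l * p <= 1. A sketch of how such a check could look (the function name is mine, not the committed code):

```python
def winkler_params_ok(p, max_l):
    # The boost term is at most max_l * p * (1 - jaro_sim); since
    # jaro_sim <= 1, the score stays in [0, 1] iff max_l * p <= 1.
    return max_l * p <= 1

print(winkler_params_ok(0.1, 4))   # True: the default parameters are safe
print(winkler_params_ok(0.25, 5))  # False: a boost factor of 1.25 can push the score past 1
```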
Cleaner loop to check `l` prefix size and added warning for unexpected output
@53X there are still 3 examples that don't match up with the
nltk/metrics/distance.py
Outdated
     l_ += 1
 else:
     break
-if l_ == max_l:
-    break
+if l_ > max_l:
The loop should terminate and stop counting when l_ has reached its maximum bound (max_l). Thus this would be the better option, as l_ shouldn't go beyond max_l:

if l_ == max_l:
    break
@alvations, please review this piece of code.

The formula for jaro_winkler_similarity goes as follows (quoting from the comments in the code of this PR):

I have put a new doc-test case separately that explains the same and outputs the same value. One more reason as to why:

This piece of code that generates the warning is logical, as there might be cases where

Also, instead of this code that calculates the prefix length, we should use this one:

This is because the loop should terminate as soon as l_ reaches its upper bound (max_l).
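A self-contained version of the capped prefix count described above (the function name is mine; the logic follows the discussion, breaking as soon as the prefix length hits max_l):

```python
def common_prefix_len(s1, s2, max_l=4):
    # Count leading characters that agree, stopping once max_l is reached.
    l_ = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or l_ == max_l:
            break
        l_ += 1
    return l_

print(common_prefix_len("dixon", "dickson"))  # 2 -> common prefix "di"
print(common_prefix_len("billy", "billy"))    # 4 -> capped at max_l
```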
@53X Thanks for the explanation on the
Regarding the warning, the
@53X just 3 small comments pending in the code review and some light changes, and the PR will look good to merge! Sorry that the process took a while because of the back and forth regarding the algorithm and the regression tests.
nltk/metrics/distance.py
Outdated
-warnings.warn(str("The product of `max_l * p` doesn't fall between [0,1]."
-                  "Jaro-Winkler similarity will not be between 0 and 1.")
-              )
+warnings.warn(str("The product `l_ * p` might not fall between [0,1]."
I think it's okay to use max_l here. See comment in PR.
nltk/metrics/distance.py
Outdated
@@ -260,80 +260,90 @@ def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
     American Statistical Association: 354-359.
     such that:

-    jaro_winkler_sim = jaro_sim + ( l * p * (1 - jaro_sim) )
+    jaro_winkler_sim = jaro_sim + ( l_ * p * (1 - jaro_sim) )
I think it's cleaner to use l here to be clearer when communicating the formula.
I think it's okay to use l in the code too instead of l_, since the function is locally scoped; there is no global variable.
@53X Thank you for the updates! Now it LGTM. IMO, you've contributed the best-documented metric in the
Kudos to the amount of work done to document and validate the implementation!
@alvations, @stevenbird can this PR be merged now and the corresponding issue be closed?

Thanks for an excellent contribution @53X.
Jaro-Winkler distance has been added to nltk.metrics. Also, PEP-8 issues have been taken care of.
I have not used any inbuilt functions from nltk for making this. This PR introduces a new string comparison metric. The metric has been written as a completely separate function named jaro_winkler_distance().
It fixes this issue