Jaro Winkler distance added #2044
Conversation
@alvations, please give your thoughts and comments on this one.
nltk/metrics/distance.py
Outdated
p = type float, weight parameter with a standard value of 0.1
l_cnt = type int, common prefix at the start of the string (max val = 4)
jw_sim = type float, jaro winkler similarity between s1 and s2
jw_dist = type float, it is equal to 1 - jw_sim """
The docstring has to be a lot clearer.
- What are s1 and s2? I assume it's taking strings, but actually the other distance metrics in nltk.metrics.distance accept any sequence type. Jaro-Winkler seems to be generic enough to cover any sequence.
Also, variable names in this function need to be clearer. Since the variables are scoped within the function, you can afford more explicit names, e.g. mtch -> matches, t_cnt -> transposition_counts.
s1 and s2 are the two strings between which we want to calculate the JW distance. I am going to include that in the docstring. Thanks for pointing it out.
nltk/metrics/distance.py
Outdated
jw_sim = type float, jaro winkler similarity between s1 and s2
jw_dist = type float, it is equal to 1 - jw_sim """
s1 = s1.lower()
s2 = s2.lower()
As commented above, if the input is a sequence, this would throw an AttributeError.
If .lower() is the intention, then adding it as a kwarg in the function arguments would be better, i.e.

def jaro_winkler_distance(s1, s2, p=0.1, lowercase=True):
    # some code ...
    if lowercase:
        s1, s2 = s1.lower(), s2.lower()
    # other code ...
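The AttributeError mentioned above is easy to reproduce: str.lower() only exists on strings, so calling it on any other sequence type fails. A minimal standalone illustration (plain Python, not the PR's code):

```python
s = ["J", "O", "N", "E", "S"]  # a list, i.e. a non-string sequence
try:
    s.lower()  # lists have no .lower() method
except AttributeError as err:
    print("AttributeError:", err)
```

This is why guarding the lowercasing behind a keyword argument (or checking the input type) keeps the function usable for generic sequences.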
nltk/metrics/distance.py
Outdated
if(abs(pos1 - pos2) <= dist_bound):
    mtch = mtch + 1
    if(pos1 != pos2):
        t_cnt = t_cnt + 1
Actually, this looks very Levenshtein-like.
Out of curiosity, is there a simpler way to do this?
Well, the above code actually counts the no. of "matched" characters and the no. of transpositions. Levenshtein also calculates transpositions.
Simpler way? I don't know ... I will have to check it out. BTW, this works too.
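For reference, the dist_bound used in the snippet above is the standard Jaro "match window", derived from the string lengths. A minimal sketch of just that bound (not the PR's code):

```python
# The Jaro match bound: two equal characters only count as a match
# when their positions differ by at most max(len1, len2) // 2 - 1.
s1, s2 = "martha", "marhta"
dist_bound = max(len(s1), len(s2)) // 2 - 1
print(dist_bound)  # 2 for a pair of 6-character strings
```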
nltk/metrics/distance.py
Outdated
if(l_cnt == 4):
    break
jw_sim = j_sim + (l_cnt*p*(1-j_sim))
jw_dist = 1 - jw_sim
Just return 1 - jw_sim here.
Sure .... it is actually redundant
nltk/metrics/distance.py
Outdated
    break
if(l_cnt == 4):
    break
jw_sim = j_sim + (l_cnt*p*(1-j_sim))
It would make sense to put this formula in the function docstring and then explain the related variables.
yeah sure
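The formula being discussed can be checked with illustrative numbers (the values below are assumptions for the sake of a worked example, not the PR's test data):

```python
# Winkler rescaling: jw_sim = j_sim + (l_cnt * p * (1 - j_sim))
j_sim = 0.9   # assumed Jaro similarity
l_cnt = 3     # assumed common-prefix length, capped at 4
p = 0.1       # standard weight parameter
jw_sim = j_sim + (l_cnt * p * (1 - j_sim))
print(round(jw_sim, 3))  # 0.93
```

The boost term l_cnt * p * (1 - j_sim) only closes part of the gap between j_sim and 1, which is why the result stays in [0, 1] as long as l_cnt * p <= 1.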
@@ -143,7 +147,7 @@ def masi_distance(label1, label2):
     return 1 - len_intersection / len_union * m


-def interval_distance(label1,label2):
+def interval_distance(label1, label2):
👍
nltk/metrics/distance.py
Outdated
@@ -66,7 +68,7 @@ def edit_distance(s1, s2, substitution_cost=1, transpositions=False):
     been done in other orders, but at least three steps are needed.

     Allows specifying the cost of substitution edits (e.g., "a" -> "b"),
-    because sometimes it makes sense to assign greater penalties to substitutions.
+    because often it makes sense to assign bigger penalties to substitutions.
Is it the right assumption that it's often and not sometimes? Also, greater penalties sounds better than bigger penalties; the latter sounds presidential ;P
I am changing it back to 'sometimes' and 'greater'. Thanks for the heads-up.
nltk/metrics/distance.py
Outdated
@@ -34,7 +35,8 @@ def _edit_dist_init(len1, len2):
     return lev


-def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False):
+def _edit_dist_step(lev, i, j, s1, s2,
+                    substitution_cost=1, transpositions=False):
No reason to break this unless it's over the 80-char limit. Even so, the break should be one kwarg per line, so:

def _edit_dist_step(lev, i, j, s1, s2,
                    substitution_cost=1,
                    transpositions=False):
It is over 80 chars, so I decided to break it this way.
Break the other keyword into a second line too, that'll be more consistent. Thanks!
cool
nltk/metrics/distance.py
Outdated
else:
    break
if(l_cnt == 4):
    break
These condition checks don't exactly make use of numpy. I think there are better ways to do the match.
Actually, this looks like RIBES' https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L233. I might be wrong.
Well, I actually used numpy for the np.floor(). Also, I went through RIBES, and the find_increasing_sequences(worder) function actually returns monotonic sequences given a worder list. But l_cnt here is the length of the common prefix at the start of the string, up to a maximum of four characters. It is as per this: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

I have to read more on the metric to really understand the algorithm implemented. But I've left some suggestions on the coding style and documentation issues that need to be improved. Hope that helps.
nltk/metrics/distance.py
Outdated
print("Jaro-Winkler_distance:", jaro_winkler_distance('JONES', 'JohnSon'))
print("Jaro-Winkler_similarity:",
      1 - jaro_winkler_distance('JONES', 'JohnSon'))
One more request here =)
Could you reuse the edit_distance_examples and move this bit of demo code into the loop at L240?
Yeah, sure. I was afraid of breaking the code initially, so I avoided doing that.

@alvations, I am going to make the edits real quick.

@alvations, I made the changes. Please take a look.
After reading the formula and other implementations, it looks like the Winkler part is merely a function on the Jaro similarity. I'll make some changes and push. @53X, if you don't mind, I'll do some adaptations to your code and push so that it is better documented and suits the Pythonic conventions.

Changes made
@53X checking your algorithm, you have made certain assumptions with:

# Iterate through sequences, check for matches and compute transpositions.
for ch1 in s1:  # Iterate through each character.
    if ch1 in s2:  # Check whether ch1 occurs in s2.
        pos1 = s1.index(ch1)
        pos2 = s2.index(ch1)
        if(abs(pos1-pos2) <= dist_bound):
            match = match + 1
            if(pos1 != pos2):
                transposition_count = transposition_count+1

When you use the .index() lookup, it always returns the position of the first occurrence of the character. I don't think you can get both matches and transpositions in 1 iteration. You might have to iterate through and be a little more like dynamic programming when you iterate through 2 sequences with the bounding boxes, like in https://rosettacode.org/wiki/Jaro_distance#Python
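The first-occurrence behaviour of .index() that makes the loop above unreliable can be shown in isolation (a standalone illustration, not the PR's code):

```python
# str.index always returns the position of the FIRST occurrence,
# so repeated characters all map back to the same index.
s = "banana"
positions = [s.index(ch) for ch in s]
print(positions)  # [0, 1, 2, 1, 2, 1] -- the later 'a's and 'n' collapse onto the first
```

Any string with repeated characters will therefore over- or under-count matches and transpositions when positions are looked up this way.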
@53X I'm done with the edits. I hope the changes made are okay with you. I think it's best to be more explicit, esp. with some of these variable names, and clearer in terms of what each loop / condition is doing.

Going back to the algorithm, I think you might need to do some tests on whether the loop here is really the same as the typical dynamic programming code from Rosetta code; it looks different given the:

# Iterate through sequences, check for matches and compute transpositions.
for ch1 in s1:  # Iterate through each character.
    if ch1 in s2:  # Check whether ch1 occurs in s2.
        pos1 = s1.index(ch1)
        pos2 = s2.index(ch1)
        if(abs(pos1-pos2) <= dist_bound):
            match = match + 1
            if(pos1 != pos2):
                transposition_count = transposition_count+1

Looks like there's some difference in https://www.kaggle.com/alvations/jaro-winkler

Oddly, this implementation https://github.com/nap/jaro-winkler-distance is also different:
With all the implementations giving different values, I suggest we follow this paper's output https://www.census.gov/srd/papers/pdf/rr93-8.pdf (page 35, Table 5) and aim our implementation towards that:

Table 5. Comparison of String Comparators Rescaled between 0 and 1

Another table of results from https://www.census.gov/srd/papers/pdf/rr94-5.pdf:

Table 2.1. Comparison of String Comparators Using Last Names, First Names, and Street Names
@alvations, please take a look at the (JON, JAN) and (1ST, IST) examples from your comment (second table). I think that there is a typo in the paper. There is no way (JON, JAN) and (1ST, IST) are completely dissimilar as suggested in the paper, since we have matching characters for both of them. Also, I have experimented with the value of the max_l parameter in the Jaro-Winkler formula. Keeping its value at 4 gives us the best results (i.e. completely replicating the results reported in the paper). Also, for these two ambiguously reported values in the paper, my results are pretty much consistent with this online tool. I have also made a few more changes. Please review them and give your feedback.
@alvations, I have made sure that all the test cases pass in the doc-test. Please have a look at the latest commits and provide your feedback.
@53X Sorry about the previous comment, saw the commits while commuting and misunderstood.
There's no need to call it
Why is there a need for different
You shouldn't be hacking the

@53X is it because the default value of p isn't 0.1 in the paper but a variable?
Let's debug this one instance at a time in a way that we can copy and paste the code. Here's the stripped-down code without the docstring:

def jaro_similarity(s1, s2):
    # First, store the length of the strings
    # because they will be re-used several times.
    len_s1, len_s2 = len(s1), len(s2)

    # The upper bound of the distance for being a matched character.
    match_bound = max(len_s1, len_s2) // 2 - 1

    # Initialize the counts for matches and transpositions.
    matches = 0  # no. of matched characters in s1 and s2
    transpositions = 0  # no. of transpositions between s1 and s2
    flagged_1 = []  # positions in s1 which are matches to some character in s2
    flagged_2 = []  # positions in s2 which are matches to some character in s1

    # Iterate through sequences, check for matches and compute transpositions.
    for i in range(len_s1):  # Iterate through each character.
        upperbound = min(i + match_bound, len_s2 - 1)
        lowerbound = max(0, i - match_bound)
        for j in range(lowerbound, upperbound + 1):
            if s1[i] == s2[j] and j not in flagged_2:
                matches += 1
                flagged_1.append(i)
                flagged_2.append(j)
                break
    flagged_2.sort()
    for i, j in zip(flagged_1, flagged_2):
        if s1[i] != s2[j]:
            transpositions += 1

    if matches == 0:
        return 0
    else:
        return 1/3 * (matches/len_s1 +
                      matches/len_s2 +
                      (matches - transpositions//2) / matches)


def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
    # Compute the Jaro similarity.
    jaro_sim = jaro_similarity(s1, s2)

    # Initialize the upper bound for the no. of prefixes.
    # If the user did not pre-define the upper bound, use the smaller
    # among the length of s1 and the length of s2.

    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break

    # Return the similarity value as described in the docstring.
    return jaro_sim + (l_ * p * (1 - jaro_sim))

Then we run the tests:

winkler_examples = [("billy", "billy"), ("billy", "bill"), ("billy", "blily"),
                    ("massie", "massey"), ("yvette", "yevett"), ("billy", "bolly"),
                    ("dwayne", "duane"), ("dixon", "dickson"), ("billy", "susan")]

winkler_scores = [1.000, 0.967, 0.947, 0.944, 0.911, 0.893, 0.858, 0.853, 0.000]
jaro_scores = [1.000, 0.933, 0.933, 0.889, 0.889, 0.867, 0.822, 0.790, 0.000]

for (s1, s2), jscore, wscore in zip(winkler_examples, jaro_scores, winkler_scores):
    if round(jaro_similarity(s1, s2), 3) != jscore:
        print('jaro', s1, s2, jscore, round(jaro_similarity(s1, s2), 3))
    #if round(jaro_winkler_similarity(s1, s2, p=p_val), 3) != wscore:
    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break
    print('jaro_winkler', s1, s2, jscore, wscore, jaro_winkler_similarity(s1, s2), l_)

[out]:
So our Jaro similarity matches; we must have done something different in the Winkler rescaling. I suspect it's because of how we're computing the prefix matches. Let's take a look at:

[out]:
Similarly:

s1 = "dixon"
s2 = "dickson"
j = jaro_similarity(s1, s2)
w = jaro_winkler_similarity(s1, s2)
p = 0.1  # Keep this constant

l_ = 2
print(j + (l_*0.1*(1-j)))

l_ = 3
print(j + (l_*0.1*(1-j)))  # Now it matches the Table 5 values

[out]:

It's possible that
I might be wrong about the
It boils down to:
And to assure that y ranges between That means that
Similarly there's no real need for
After looking at the Winkler (1990) paper, it looks like the parameters in the experiments were ambiguously set. So I think the p variable hacks you've put in are okay. Let me try to clean it up a little and add an assert.
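The bound being discussed works out as follows: since jaro_winkler_sim = jaro_sim + l * p * (1 - jaro_sim) and l is at most max_l, the score stays within [0, 1] exactly when max_l * p <= 1. A sketch of how such a check could look (the function name is mine, not the committed code):

```python
def winkler_params_ok(p, max_l):
    # The boost term is at most max_l * p * (1 - jaro_sim); since
    # jaro_sim <= 1, the score stays in [0, 1] iff max_l * p <= 1.
    return max_l * p <= 1

print(winkler_params_ok(0.1, 4))   # True: the default parameters are safe
print(winkler_params_ok(0.25, 5))  # False: a boost factor of 1.25 can push the score past 1
```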
Cleaner loop to check `l` prefix size and added warning for unexpected output
@53X there are still 3 examples that don't match up with the
nltk/metrics/distance.py
Outdated
     l_ += 1
 else:
     break
-if l_ == max_l:
-    break
+if l_ > max_l:
The loop should terminate and stop counting when l_ has reached its maximum bound (max_l). Thus this would be the better option, as l_ shouldn't go beyond max_l:

if l_ == max_l:
    break
@alvations, please review this piece of code.

The formula for jaro_winkler_similarity goes as follows (quoting from the comments in the code of this PR):

I have put a new doc-test case separately that explains the same and outputs the same value. One more reason as to why:

This piece of code that generates the warning is logical, as there might be cases where

Also, instead of this code that calculates the prefix length, we should use this one:

This is because the loop should terminate as soon as l_ reaches its upper bound (max_l).
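A self-contained version of the capped prefix count described above (the function name is mine; the logic follows the discussion, breaking as soon as the prefix length hits max_l):

```python
def common_prefix_len(s1, s2, max_l=4):
    # Count leading characters that agree, stopping once max_l is reached.
    l_ = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or l_ == max_l:
            break
        l_ += 1
    return l_

print(common_prefix_len("dixon", "dickson"))  # 2 -> common prefix "di"
print(common_prefix_len("billy", "billy"))    # 4 -> capped at max_l
```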
@53X Thanks for the explanation on the
Regarding the warning, the
@53X just 3 small comments pending in the code review and some light changes, and the PR will look good to merge! Sorry that the process took a while because of the back and forth regarding the algorithm and the regression tests.
nltk/metrics/distance.py
Outdated
-warnings.warn(str("The product of `max_l * p` doesn't fall between [0,1]."
-                  "Jaro-Winkler similarity will not be between 0 and 1.")
-              )
+warnings.warn(str("The product `l_ * p` might not fall between [0,1]."
I think it's okay to use max_l here. See comment in PR.
nltk/metrics/distance.py
Outdated
@@ -260,80 +260,90 @@ def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
     American Statistical Association: 354-359.
     such that:

-    jaro_winkler_sim = jaro_sim + ( l * p * (1 - jaro_sim) )
+    jaro_winkler_sim = jaro_sim + ( l_ * p * (1 - jaro_sim) )
I think it's cleaner to use l here to be clearer when communicating the formula.
I think it's okay to use l in the code too instead of l_, since the function is locally scoped; there is no global variable.
@53X Thank you for the updates! Now it LGTM. IMO, you've contributed the best-documented metric in the
Kudos to the amount of work done to document and validate the implementation!
@alvations, @stevenbird can this PR be merged now and the corresponding issue be closed?

Thanks for an excellent contribution @53X.
Jaro-Winkler distance has been added to nltk.metrics. Also, PEP-8 issues have been taken care of.
I have not used any inbuilt functions from nltk for making this. This PR introduces a new string comparison metric. The metric has been written as a completely separate function named jaro_winkler_distance().
It fixes this issue