Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function find_near_matches is alternating results for the same situation #40

Open
RagnarCris opened this issue Mar 22, 2022 · 5 comments
Labels

Comments

@RagnarCris
Copy link

RagnarCris commented Mar 22, 2022

Hi, my name is Cristiano and I'm using the library to determine how far someone has read, by comparing a transcript to a text to be read.
And strangely, for the same parameters and strings, the function find_near_matches returned two different results, in 5 to 6 times that i runned the script.

I'm using the version 0.7.3 of fuzzysearch.
I'm leaving an example of a script in which this case is happening. If you want, just run around 7 times to see the results changing at least two times.

Thanks in advance!

Example:

from fuzzysearch import find_near_matches

def fuzzy_extract(qs, ls, threshold):
    for match in find_near_matches(qs, ls, max_l_dist=10):
        match = ls[match.start:match.end]
        index = ls.find(match)
        yield (match, index)

ref = "bruna ficava perto da casa de fátima e brincava de bola perto da lagoa"


ret = list(fuzzy_extract("d dê néká m",ref,0))
print("fuzzy_search: "+str(ret))

P.S.: The text in the script is in Portuguese.

The output that i get by running 7 times this script is:
output_strange_fuzz

@taleinat
Copy link
Owner

Hi Cristiano, thanks for this report!

That is indeed very strange. I have been able to reproduce this locally using your script; many thanks for that!

I'll look into this.

@taleinat taleinat added the bug label Jul 14, 2022
@jcsilva
Copy link

jcsilva commented Mar 29, 2023

Hi @taleinat!

I'm facing a similar issue here. Do you have any suggestion on this?

@taleinat
Copy link
Owner

taleinat commented Mar 29, 2023

Hi,

Unfortunately I haven't had a chance to pin this down.

I will mention, though, that the original question is searching for a pattern of length 11 ("d dê néká m") with max_l_dist=10 (the threshold argument seems to be unused!). The results seem like an edge-case, and do seem to indicate a bug, but I also think that @RagnarCris was not running the fuzzy search they actually intended to run, and that's the main reason the results were not as expected.

@jcsilva, I suggest opening a separate issue with the relevant details, and I'll try to see what can be done.

@taleinat
Copy link
Owner

@RagnarCris, apologies for not having followed up your issue.

Is this still relevant for you?

@jcsilva
Copy link

jcsilva commented Apr 5, 2023

Hi @taleinat ,

sorry for my late response. To provide a bit more of context, I'm trying to replace @RagnarCris on this topic, as he is now working in other subject here. So, I can open a new issue, but it would be basically the same described here.

As you mentioned, the threshold is not used in the example provided. Actually it is just a small snippet of a major function ... this example could be reduced to:

rom fuzzysearch import find_near_matches

def fuzzy_extract(qs, ls):
    for match in find_near_matches(qs, ls, max_l_dist=10):
        match = ls[match.start:match.end]
        index = ls.find(match)
        yield (match, index)

ref = "bruna ficava perto da casa de fátima e brincava de bola perto da lagoa"


ret = list(fuzzy_extract("d dê néká m",ref))
print("fuzzy_search: "+str(ret))

Using this example, we would like to find a pattern of length 11 with max_l_dis=10. Actually we need to use this "weird" configuration, where pattern length and max_l_dist are close because the text and the pattern usually are a bit different (the text comes from one source of information and the pattern from another one, and they produce data with different qualities).

The issue we didn't understand is that the index return by find_near_matches varies between consecutive executions even using the same input. We thought the algorithm would be deterministic, right? But we observed those unexpected results and we would like to ask you help to try to understand it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants