
new 'iter_long' match misses after the recent fix #185

Open
vegarden opened this issue Feb 14, 2023 · 7 comments · May be fixed by #187
Comments

@vegarden

vegarden commented Feb 14, 2023

After fix #174, which is discussed in #133, I found that the code below

source_text = 'zzabcabdzz'
A = ahocorasick.Automaton()
for k in ['ab', 'abcabd']:
    A.add_word(k, k)

A.make_automaton()
 
list(A.iter_long(source_text))

matches [(6, 'ab')], but in version 1.4.4 it matches [(7, 'abcabd')].
The behavior before the fix seems to be correct in this situation.
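For reference, the expected behavior can be expressed with a brute-force pure-Python helper (a hypothetical sketch, not part of pyahocorasick, assuming iter_long is meant to implement leftmost-longest semantics):

```python
def iter_long_ref(patterns, text):
    """Brute-force reference for leftmost-longest matching (hypothetical
    helper): at each position, emit the longest pattern starting there,
    then resume right after its end."""
    out, i = [], 0
    while i < len(text):
        best = max((p for p in patterns if text.startswith(p, i)),
                   key=len, default=None)
        if best is None:
            i += 1
        else:
            out.append((i + len(best) - 1, best))  # (end_index, value) format
            i += len(best)
    return out

print(iter_long_ref(['ab', 'abcabd'], 'zzabcabdzz'))  # [(7, 'abcabd')]
```

Under these semantics the longer pattern 'abcabd' wins at position 2, matching the 1.4.4 result.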

@doctorink

This happens for me as well. 1.4.4 worked fine, 2.0.0 misses the longest match!

@robinchm
Contributor

This is indeed a bug introduced by a previous fix for iter_long missing legitimate short matches.
I attempted a fix in #186 and added more test cases, but I'm crossing my fingers that it doesn't introduce more bugs. Can someone help review the PR? @pombredanne
Meanwhile, it would be nice if anyone could test my branch and report successes or bugs.

The idea is that before failing over to the suffix node, we check whether a longer match could still lie ahead of the current node. If a longer match is possible, we give up the suffix failover.
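In pure-Python pseudocode, the check looks roughly like this (a hypothetical sketch; `goto` is an assumed char-to-child transition map, and the real implementation lives in the C trie):

```python
def should_fail_over(node, next_char):
    """Sketch of the check: before following the suffix (fail) link, see
    whether the current trie node can still be extended by the next input
    character toward a longer pattern. `node.goto` is a hypothetical
    char -> child mapping, not pyahocorasick's actual C structure."""
    # if next_char extends the current path, a longer match may still be
    # ahead, so skip the failover and keep descending instead
    return next_char not in node.goto
```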

@robinchm
Contributor

Ehh, it does not fully work, since looking ahead just one step is not enough. You would need to look ahead over all possible longer matches before deciding whether to fail over, which is not acceptable efficiency-wise.
Give me some time to think it over; I hope fully resolving this bug does not require an overhaul.
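A concrete pair of inputs shows why one step is not enough: with patterns 'ab' and 'abcabd', the two texts below are identical for three characters past the 'ab' match yet require different answers, so the necessary look-ahead depth is bounded only by the length of the longest pattern.

```python
patterns = ['ab', 'abcabd']
t1 = 'zzabcabdzz'  # 'abcabd' completes: only (7, 'abcabd') should be reported
t2 = 'zzabcabezz'  # 'abcabd' dies at its last char: must fall back to 'ab'

# after matching 'ab' at indices 2..3, both texts continue with 'cab';
# a single look-ahead step (or even three) cannot tell them apart
assert t1[2:7] == t2[2:7] == 'abcab'

# deciding requires scanning up to len('abcabd') - len('ab') = 4 extra chars
assert len('abcabd') - len('ab') == 4
```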

@robinchm robinchm linked a pull request May 31, 2023 that will close this issue
@robinchm
Contributor

robinchm commented May 31, 2023

#187 comes up with another fix; hopefully it works this time. Feedback welcome.
Note that in the case of two partially overlapping strings it cannot return the longest match, only the first match.
@pombredanne, would you mind reviewing?
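For intuition, the deferred-reporting approach can be sketched in pure Python (a hypothetical sketch with assumed names; the actual fix in #187 is in the C code): remember the deepest match on the current root-to-node path, report it only when the path dead-ends, and restart right after the reported match. This sketch rescans after a failed extension, so it trades the linear-time guarantee for simplicity.

```python
from collections import deque

class Node:
    def __init__(self):
        self.goto = {}    # char -> child Node
        self.fail = None  # suffix (failure) link
        self.word = None  # pattern ending exactly at this node

def build(patterns):
    """Build a trie with breadth-first failure links (textbook Aho-Corasick)."""
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            node = node.goto.setdefault(ch, Node())
        node.word = p
    queue = deque(root.goto.values())
    for child in root.goto.values():
        child.fail = root
    while queue:
        node = queue.popleft()
        for ch, child in node.goto.items():
            f = node.fail
            while f is not None and ch not in f.goto:
                f = f.fail
            child.fail = f.goto[ch] if f is not None and ch in f.goto else root
            queue.append(child)
    return root

def iter_long_sketch(root, text):
    node, i, pending = root, 0, None  # pending = deepest match on this path
    while i < len(text):
        ch = text[i]
        if ch in node.goto:
            node = node.goto[ch]
            if node.word is not None:
                pending = (i, node.word)  # remember, keep trying to extend
            i += 1
        elif pending is not None:
            yield pending                 # path dead-ended: report longest seen
            i = pending[0] + 1            # restart right after the match
            node, pending = root, None
        elif node is not root:
            node = node.fail or root      # ordinary suffix failover
        else:
            i += 1
    if pending is not None:
        yield pending
```

On the issue's example, `list(iter_long_sketch(build(['ab', 'abcabd']), 'zzabcabdzz'))` gives `[(7, 'abcabd')]`. The overlap limitation shows as `list(iter_long_sketch(build(['ab', 'bcdef']), 'abcdef'))` giving `[(1, 'ab')]`: reporting 'ab' first consumes the 'b' that 'bcdef' needs, so the longer overlapping match is never found.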

@djstrong

Has anyone tested this fix?

@AyanSinhaMahapatra
Contributor

AyanSinhaMahapatra commented Mar 21, 2024

Just tested #187; it looks like it's working for me on that branch:

import ahocorasick

source_text = b'zzabcabdzz'
A = ahocorasick.Automaton()
for k in [b'ab', b'abcabd']:
    A.add_word(k, k)

A.make_automaton()
 
list(A.iter_long(source_text))

Returns [(7, b'abcabd')]

But I get this test failure with make valgrind:

___________________ MemoryUsageDoesNotGrow.test_memory_usage_does_not_grow ___________________

self = <test_issue_9.MemoryUsageDoesNotGrow testMethod=test_memory_usage_does_not_grow>

    def test_memory_usage_does_not_grow(self):

        ac = build_automaton()

        here = Path(__file__).parent
        with open(here.parent / 'README.rst') as f:
            data = f.read()[:1024 * 2]
            data = conv(data)

        before = get_memory_usage()

        for _ in range(1000):
            for start in range(0, len(data) - 20):
                ac.iter(data, start)

        after = get_memory_usage()
>       assert before == after
E       assert 300244.0 == 300964.0

tests/test_issue_9.py:64: AssertionError
================================== short test summary info ===================================
FAILED tests/test_issue_9.py::MemoryUsageDoesNotGrow::test_memory_usage_does_not_grow - assert 300244.0 == 300964.0

Tests are passing otherwise: https://github.com/AyanSinhaMahapatra/pyahocorasick/actions/runs/8371293380/job/22920079498

@pombredanne
Collaborator

@robinchm your fix in #187 works, but it seems to cause a memory leak.
