
new 'iter_long' match misses after the recent fix #185

Open
vegarden opened this issue Feb 14, 2023 · 7 comments · May be fixed by #187
Comments

@vegarden

vegarden commented Feb 14, 2023

After fix #174, which is discussed in #133, I found that the code below

source_text = 'zzabcabdzz'
A = ahocorasick.Automaton()
for k in ['ab', 'abcabd']:
    A.add_word(k, k)

A.make_automaton()
 
list(A.iter_long(source_text))

matches [(6, 'ab')], but in version 1.4.4 it matches [(7, 'abcabd')].
The behavior before the fix seems to be correct in this situation.
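For reference, the expected behavior can be expressed with a brute-force pure-Python helper (a hypothetical sketch, not part of pyahocorasick, assuming iter_long is meant to implement leftmost-longest semantics):

```python
def iter_long_ref(patterns, text):
    """Brute-force reference for leftmost-longest matching (hypothetical
    helper): at each position, emit the longest pattern starting there,
    then resume right after its end."""
    out, i = [], 0
    while i < len(text):
        best = max((p for p in patterns if text.startswith(p, i)),
                   key=len, default=None)
        if best is None:
            i += 1
        else:
            out.append((i + len(best) - 1, best))  # (end_index, value) format
            i += len(best)
    return out

print(iter_long_ref(['ab', 'abcabd'], 'zzabcabdzz'))  # [(7, 'abcabd')]
```

Under these semantics the longer pattern 'abcabd' wins at position 2, matching the 1.4.4 result.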

@doctorink

This happens for me as well. 1.4.4 worked fine, 2.0.0 misses the longest match!

@robinchm
Contributor

This is indeed a bug introduced by a previous fix for iter_long missing legitimate short matches.
I attempted a fix in #186 and added more test cases, but I'm crossing my fingers that it doesn't introduce more bugs. Can someone help review the PR? @pombredanne
Meanwhile, it would be nice if anyone could test my branch and report successes or bugs.

The idea is that before failing over to the suffix node, we check whether a longer match could still lie ahead of the current node. If a longer match is possible, we give up the suffix failover.
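In pure-Python pseudocode, the check looks roughly like this (a hypothetical sketch; `goto` is an assumed char-to-child transition map, and the real implementation lives in the C trie):

```python
def should_fail_over(node, next_char):
    """Sketch of the check: before following the suffix (fail) link, see
    whether the current trie node can still be extended by the next input
    character toward a longer pattern. `node.goto` is a hypothetical
    char -> child mapping, not pyahocorasick's actual C structure."""
    # if next_char extends the current path, a longer match may still be
    # ahead, so skip the failover and keep descending instead
    return next_char not in node.goto
```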

@robinchm
Contributor

Ehh, it does not fully work, since looking ahead just one step is not enough. You would need to look ahead over all possible longer matches before deciding whether to fail over, which is not acceptable efficiency-wise.
Give me some time to think it over; I hope fully resolving this bug does not require an overhaul.
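A concrete pair of inputs shows why one step is not enough: with patterns 'ab' and 'abcabd', the two texts below are identical for three characters past the 'ab' match yet require different answers, so the necessary look-ahead depth is bounded only by the length of the longest pattern.

```python
patterns = ['ab', 'abcabd']
t1 = 'zzabcabdzz'  # 'abcabd' completes: only (7, 'abcabd') should be reported
t2 = 'zzabcabezz'  # 'abcabd' dies at its last char: must fall back to 'ab'

# after matching 'ab' at indices 2..3, both texts continue with 'cab';
# a single look-ahead step (or even three) cannot tell them apart
assert t1[2:7] == t2[2:7] == 'abcab'

# deciding requires scanning up to len('abcabd') - len('ab') = 4 extra chars
assert len('abcabd') - len('ab') == 4
```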

@robinchm robinchm linked a pull request May 31, 2023 that will close this issue
@robinchm
Contributor

robinchm commented May 31, 2023

#187 comes up with another fix; hopefully it works this time. Feedback welcome.
Note that in the case of two partially overlapping strings it cannot return the longest match, only the first match.
@pombredanne, would you mind reviewing?
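For intuition, the deferred-reporting approach can be sketched in pure Python (a hypothetical sketch with assumed names; the actual fix in #187 is in the C code): remember the deepest match on the current root-to-node path, report it only when the path dead-ends, and restart right after the reported match. This sketch rescans after a failed extension, so it trades the linear-time guarantee for simplicity.

```python
from collections import deque

class Node:
    def __init__(self):
        self.goto = {}    # char -> child Node
        self.fail = None  # suffix (failure) link
        self.word = None  # pattern ending exactly at this node

def build(patterns):
    """Build a trie with breadth-first failure links (textbook Aho-Corasick)."""
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            node = node.goto.setdefault(ch, Node())
        node.word = p
    queue = deque(root.goto.values())
    for child in root.goto.values():
        child.fail = root
    while queue:
        node = queue.popleft()
        for ch, child in node.goto.items():
            f = node.fail
            while f is not None and ch not in f.goto:
                f = f.fail
            child.fail = f.goto[ch] if f is not None and ch in f.goto else root
            queue.append(child)
    return root

def iter_long_sketch(root, text):
    node, i, pending = root, 0, None  # pending = deepest match on this path
    while i < len(text):
        ch = text[i]
        if ch in node.goto:
            node = node.goto[ch]
            if node.word is not None:
                pending = (i, node.word)  # remember, keep trying to extend
            i += 1
        elif pending is not None:
            yield pending                 # path dead-ended: report longest seen
            i = pending[0] + 1            # restart right after the match
            node, pending = root, None
        elif node is not root:
            node = node.fail or root      # ordinary suffix failover
        else:
            i += 1
    if pending is not None:
        yield pending
```

On the issue's example, `list(iter_long_sketch(build(['ab', 'abcabd']), 'zzabcabdzz'))` gives `[(7, 'abcabd')]`. The overlap limitation shows as `list(iter_long_sketch(build(['ab', 'bcdef']), 'abcdef'))` giving `[(1, 'ab')]`: reporting 'ab' first consumes the 'b' that 'bcdef' needs, so the longer overlapping match is never found.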

@djstrong

Has anyone tested this fix?

@AyanSinhaMahapatra
Contributor

AyanSinhaMahapatra commented Mar 21, 2024

Just tested #187; it looks like it's working for me on that branch:

import ahocorasick

source_text = b'zzabcabdzz'
A = ahocorasick.Automaton()
for k in [b'ab', b'abcabd']:
    A.add_word(k, k)

A.make_automaton()
 
list(A.iter_long(source_text))

Returns [(7, b'abcabd')]

But I get this test failure with make valgrind:

___________________ MemoryUsageDoesNotGrow.test_memory_usage_does_not_grow ___________________

self = <test_issue_9.MemoryUsageDoesNotGrow testMethod=test_memory_usage_does_not_grow>

    def test_memory_usage_does_not_grow(self):

        ac = build_automaton()

        here = Path(__file__).parent
        with open(here.parent / 'README.rst') as f:
            data = f.read()[:1024 * 2]
            data = conv(data)

        before = get_memory_usage()

        for _ in range(1000):
            for start in range(0, len(data) - 20):
                ac.iter(data, start)

        after = get_memory_usage()
>       assert before == after
E       assert 300244.0 == 300964.0

tests/test_issue_9.py:64: AssertionError
================================== short test summary info ===================================
FAILED tests/test_issue_9.py::MemoryUsageDoesNotGrow::test_memory_usage_does_not_grow - assert 300244.0 == 300964.0

Tests are passing otherwise: https://github.com/AyanSinhaMahapatra/pyahocorasick/actions/runs/8371293380/job/22920079498

@pombredanne
Collaborator

@robinchm your fix in #187 works, but it seems to cause a memory leak.
