Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

\K does not appear to work with --multiline #2528

Open
jtojnar opened this issue Jun 6, 2023 · 5 comments
Open

\K does not appear to work with --multiline #2528

jtojnar opened this issue Jun 6, 2023 · 5 comments
Labels
bug A bug.

Comments

@jtojnar
Copy link

jtojnar commented Jun 6, 2023

What version of ripgrep are you using?

ripgrep 13.0.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

NixOS package.

What operating system are you using ripgrep on?

NixOS unstable

Describe your bug.

When I use PCRE 2 regex with \K in multiline mode, it will print the whole line instead only the part after \K.

What are the steps to reproduce the behaviour?

Run

printf "foo\nbar" | rg --multiline --pcre2 'foo\nb\K(a)'

What is the actual behaviour?

$ printf "foo\nbar" | rg --debug --multiline --pcre2 'foo\nb\K(a)'
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 1 basenames, 0 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
bar

What is the expected behavior?

It should print just ar.

@jtojnar
Copy link
Author

jtojnar commented Jun 6, 2023

If I add --only-matching , it will even preserve the part after the expression (ar instead of just a):

$ printf "foo\nbar" | rg --multiline --pcre2 --only-matching 'foo\nb\K(a)'
bar

--only-matching does work if I remove \K but then it will also include the extra context (foo\nb):

$ printf "foo\nbar" | rg --multiline --pcre2 --only-matching 'foo\nb(a)'
foo
ba

In some cases it could be worked around with negative look-behind but not always – in my full code I am getting:

“lookbehind assertion is not fixed length”

@jtojnar
Copy link
Author

jtojnar commented Jun 6, 2023

Looks like \K also breaks another workaround using --replace:

$ printf "foo\nbar" | rg --multiline --pcre2 'foo\nb\K(a)' --replace '$1'
bar

Fortunately, here I can just drop \K and use --replace in addition --only-matching:

$ printf "foo\nbar" | rg --multiline --pcre2 --only-matching 'foo\nb(a)' --replace '$1'
a

@BurntSushi
Copy link
Owner

BurntSushi commented Jun 6, 2023

So at a high level, for this specific case, the -r/--replace flag is really what you ought to be using here IMO. And in your example, you don't even need PCRE2:

$ printf "foo\nbar" | rg -Uor '$1' 'foo\nb(a)'
a

Otherwise, \K really messes with things. Your very first example is correct and intended behavior:

$ printf "foo\nbar" | rg --multiline --pcre2 'foo\nb\K(a)'
bar

Unless you give ripgrep a flag to tell it to do otherwise, it always prints matching lines. bar is a matching line, so the whole line is printed. \K does and shouldn't have any impact on that.

But the output of --only-matching is wrong:

$ printf "foo\nbar" | rg --multiline --pcre2 --only-matching 'foo\nb\K(a)'
bar

That really should just print a.

The problem here is that \K breaks an invariant that ripgrep assumes to be true about regexes: that the regex engine can be re-run (roughly) starting at the beginning of a match and reproduce the matches found. But because of look-behind (which is basically what \K is), you actually need to make the contents (potentially all of them) before the match available for the regex engine to see, otherwise the regex may not match because of look-around. This has led to kludges like this in ripgrep:

pub fn find_iter_at_in_context<M, F>(
searcher: &Searcher,
matcher: M,
mut bytes: &[u8],
range: std::ops::Range<usize>,
mut matched: F,
) -> io::Result<()>
where
M: Matcher,
F: FnMut(Match) -> bool,
{
// This strange dance is to account for the possibility of look-ahead in
// the regex. The problem here is that mat.bytes() doesn't include the
// lines beyond the match boundaries in mulit-line mode, which means that
// when we try to rediscover the full set of matches here, the regex may no
// longer match if it required some look-ahead beyond the matching lines.
//
// PCRE2 (and the grep-matcher interfaces) has no way of specifying an end
// bound of the search. So we kludge it and let the regex engine search the
// rest of the buffer... But to avoid things getting too crazy, we cap the
// buffer.
//
// If it weren't for multi-line mode, then none of this would be needed.
// Alternatively, if we refactored the grep interfaces to pass along the
// full set of matches (if available) from the searcher, then that might
// also help here. But that winds up paying an upfront unavoidable cost for
// the case where matches don't need to be counted. So then you'd have to
// introduce a way to pass along matches conditionally, only when needed.
// Yikes.
//
// Maybe the bigger picture thing here is that the searcher should be
// responsible for finding matches when necessary, and the printer
// shouldn't be involved in this business in the first place. Sigh. Live
// and learn. Abstraction boundaries are hard.
let is_multi_line = searcher.multi_line_with_matcher(&matcher);
if is_multi_line {
if bytes[range.end..].len() >= MAX_LOOK_AHEAD {
bytes = &bytes[..range.end + MAX_LOOK_AHEAD];
}
} else {
// When searching a single line, we should remove the line terminator.
// Otherwise, it's possible for the regex (via look-around) to observe
// the line terminator and not match because of it.
let mut m = Match::new(0, range.end);
trim_line_terminator(searcher, bytes, &mut m);
bytes = &bytes[..m.end()];
}
matcher
.find_iter_at(bytes, range.start, |m| {
if m.start() >= range.end {
return false;
}
matched(m)
})
.map_err(io::Error::error_message)
}

At a more fundamental level, I fucked up the abstraction boundary between printing and searching, which makes this hard to fix properly. That abstraction boundary basically needs to go through a re-think.

That means I'll classify this as a bug, but don't hold your breath for it getting fixed any time soon.

The bug with --replace not working correctly with \K is "just" a downstream effect of this.

“lookbehind assertion is not fixed length”

This is a PCRE2 limitation. Nothing I can do about that.

@BurntSushi BurntSushi added the bug A bug. label Jun 6, 2023
@jtojnar
Copy link
Author

jtojnar commented Jun 6, 2023

Otherwise, \K really messes with things. Your very first example is correct and intended behavior:

$ printf "foo\nbar" | rg --multiline --pcre2 'foo\nb\K(a)'
bar

You are right. The issue here is actually that the match is not highlighted.

I started with --only-matching but then noticed that it does not also do any match highlighting without --only-matching and forgot to mention that and did not update the expectations accordingly. Normally, the matched part would be highlighted in bold red.

Thanks for the informative response. Will use --only-matching --replace without \K since it works okay.

@BurntSushi
Copy link
Owner

Ah yeah, the lack of highlights are indeed caused by the same underlying problem plaguing --only-matching and --replace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

No branches or pull requests

2 participants