support searching across multiple lines #176

Closed
isobit opened this issue Oct 13, 2016 · 103 comments · Fixed by #1017
Labels
enhancement An enhancement to the functionality of the software. libripgrep An issue related to modularizing ripgrep into libraries.
Milestone

Comments

@isobit

isobit commented Oct 13, 2016

Say for example I'm trying to find instances of click that reside in a listeners block, like so:

listeners: {
    foo: ...
    click: ....
}

According to the Rust regex docs, I should be able to do: rg '(?s)listeners.+click', but this doesn't seem to work. Does ripgrep not support multiline regex?

@BurntSushi
Owner

Does ripgrep not support multiline regex?

Correct. Not even the s flag will help, because ripgrep explicitly instructs the regex automaton to never match \n. Like grep, ripgrep is a line oriented search tool.

ripgrep can perform a search in two different ways. One of them reads a chunk of bytes at a time and searches it. The other memory maps the file and searches that all at once. The former has a number of advantages, including being faster when searching a large number of small files in parallel and being able to search streams in constant memory. The latter has the advantage of being faster for single files (sometimes) and much simpler to implement.

The former only works because search is line oriented. A multiline regex can technically match, say, 2GB of data, which is completely incompatible with searching small chunks at a time.
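To make the conflict concrete, here is a minimal sketch. It uses Python's `re` module as a stand-in for ripgrep's regex engine, which is an assumption made purely for illustration, and `search_by_line` and `search_whole` are hypothetical helper names. It runs the pattern from the original question one line at a time and then over the whole buffer.

```python
import re

TEXT = "listeners: {\n    foo: ...\n    click: ....\n}\n"
PATTERN = re.compile(r"(?s)listeners.+click")

def search_by_line(text):
    """Line-oriented search: every line is matched on its own."""
    return [line for line in text.splitlines() if PATTERN.search(line)]

def search_whole(text):
    """Whole-buffer search: a match may span any number of lines."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(search_by_line(TEXT))  # no single line contains both words
print(search_whole(TEXT))    # the match crosses two line breaks
```

The line-oriented variant never sees `listeners` and `click` in the same haystack, so the `s` flag alone cannot help it.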

The latter could be made to work with multiline search, but memory maps can't search stdin for example. So a multiline search on stdin would have to block and read all of stdin into memory before searching. (There exists a way around even this, but it requires changing the regex engine to be capable of incremental search, which is an even bigger change, but theoretically possible.)

Multiline searching therefore comes with significant implementation complexity and, IMO, is a pretty niche use case. I can also imagine it having a pretty big impact on the printing code. That complexity alone is a good reason why it may never be in ripgrep proper, but perhaps once #162 is done, others can take a crack at it.

This is a good example of a feature that The Silver Searcher has that ripgrep may either never have or won't have for a long time.

@isobit
Author

isobit commented Oct 13, 2016

Gotcha, thanks for the explanation. I really like ripgrep as a tool; I was just hoping to use it for this case too 😉.

@BurntSushi
Owner

@joshglendenning Yeah, I admit, it would be nice, and if it were easy, I'd have no problems with it. While I do consider it niche, I have no doubts that it would be quite useful!

Once I split out most of the pieces of ripgrep to library form, perhaps there will be interest in building other tools for more niche use cases! I will keep this case in mind as I do that though.

@maxbrunsfeld

This is a really cool tool, but I might suggest including this as a caveat in the README, alongside the comparisons to ag, since ag does support multi-line patterns.

@BurntSushi
Owner

@maxbrunsfeld I've been meaning to add an "anti pitch" section to the README like the one in my blog post. That's now done. Thanks for the reminder!

@BurntSushi
Owner

BurntSushi commented Mar 17, 2017

I'm going to re-open this, because it's one of the most highly requested features.

Nothing has changed about the problems I outlined above. However, multiline search needn't be the default. If we provide it as a flag, then we can do what we need to do to support multiline search only when that flag is provided. The critical thing that multiline search needs is a complete sequence of bytes in memory to search. Memory maps can provide this, but failing that, we would need to read the entire file into memory before starting a search.
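A rough sketch of "memory map if possible, otherwise read everything" might look like the following. This is Python rather than ripgrep's Rust, and `load_contiguous` and `multiline_search` are hypothetical names used only for illustration. It is not how ripgrep implements the idea.

```python
import mmap
import re

def load_contiguous(f):
    """Return a bytes-like object holding all of binary file f.

    Tries a read-only memory map first and falls back to reading the
    whole stream onto the heap (pipes, in-memory streams, empty files).
    """
    try:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except (OSError, ValueError):
        # No usable file descriptor, or the file cannot be mapped.
        return f.read()

def multiline_search(f, pattern):
    """Search the complete contents of f; return the first match or None."""
    data = load_contiguous(f)
    m = re.search(pattern, data)
    return m.group(0) if m else None
```

The fallback path is the one that blocks until EOF when `f` is a stream such as stdin, and its memory use grows with the size of the input.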

Other than using heap space proportional to the size of the file being searched, the fundamental issue with this flag arises when it's used in conjunction with searching stdin. Namely, ripgrep will need to block until EOF is read on stdin before a search can even start. Alternatively, multiline search simply wouldn't be allowed on stdin. The Silver Searcher in fact does this silently when searching stdin:

/* TODO: this will only match single lines. multi-line regexes silently don't match */
void search_stream(FILE *stream, const char *path) {
    // ...
}

I don't like the "silent" idea, but stopping ripgrep with an error is certainly something I'd be open to. Neither seems like a good choice to me, but I don't think it should block this feature altogether.

N.B. This is a significant feature and it would have to be part of the libripgrep effort.

@BurntSushi
Owner

The other thing I forgot to mention is that multiline search will negate inner literal optimizations. Normal prefix and, in special cases, suffix, literal optimizations will still be performed as part of the regex engine. (I've long thought about making inner literal optimizations work on arbitrary strings, but it's hard.)

@BurntSushi BurntSushi added the libripgrep An issue related to modularizing ripgrep into libraries. label Mar 17, 2017
@BurntSushi BurntSushi added this to the libripgrep milestone Mar 17, 2017
@gulshan

gulshan commented Mar 17, 2017

A naive question/suggestion. Assuming single lines are being loaded for search now, can that be changed to n lines, n set to 10 or 20 or something like that? While a line gets in, another gets out of the load in FIFO fashion? This will not be technically correct for all cases, but may be enough for most cases.

@d-akara

d-akara commented Mar 17, 2017

How significant are the trade-offs to the user experience?
If doing multiline is more expensive, I'm fine with that as long as single line performance is not impacted.

Would you actually need a special flag to ripgrep? or can you reliably determine from the expression itself?

@BurntSushi
Owner

Great questions! Keep 'em coming.

Assuming single lines are being loaded for search now

They are not. If they were, ripgrep would be very slow. The reasons for this are a bit subtle, but basically, "it's faster to search a huge chunk than it is to break it into little pieces and then search each piece." "Huge chunk" in this case might be the size of some internal buffer, perhaps, 8KB.
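The "search the chunk, then recover the line" idea can be sketched roughly as follows. This is a simplified Python illustration, not ripgrep's actual code. The helper name `matching_lines` is hypothetical, and the sketch assumes the pattern itself can never match `\n`.

```python
import re

def matching_lines(buf, pattern):
    """Search one big buffer and expand each match to its enclosing line."""
    regex = re.compile(pattern)
    lines = []
    pos = 0
    while pos <= len(buf):
        m = regex.search(buf, pos)
        if m is None:
            break
        # Line start: just after the previous newline (0 if there is none).
        start = buf.rfind(b"\n", 0, m.start()) + 1
        # Line end: the next newline, or the end of the buffer.
        end = buf.find(b"\n", m.end())
        if end == -1:
            end = len(buf)
        lines.append(buf[start:end])
        pos = end + 1  # resume the search after this line
    return lines
```

Line boundaries are only computed around actual matches, so lines that do not match are never split out individually.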

If you're curious about how a fast grep tool works in more detail, check out this section in my blog post on ripgrep: http://blog.burntsushi.net/ripgrep/#anatomy-of-a-grep

can that be changed to n lines, n set to 10 or 20 or something like that? While a line gets in, another gets out of the load in FIFO fashion? This will not be technically correct for all cases, but may be enough for most cases.

If you have a regex like a\s+b, then it's not possible to determine the length of the match up front. You have three choices:

  1. You use a regex engine that supports incremental search. (This is somewhat at odds with performance if "incremental" means "byte at a time." So for something like this, you'd need an incremental engine that can process chunks at a time.) ripgrep's regex engine doesn't support this.
  2. You feed the regex engine every byte you got. (The Plan.)
  3. You arbitrarily cap the size of the match. This will invariably get things wrong and there's no way to escape.

I still actually strongly believe that multiline search is a very niche feature, but it is one that can be quite useful when the situation calls for it. (A text editor is perhaps one such situation, but ripgrep is first and foremost a command line tool where multiline search feels a lot less common.) Therefore, taking approach (3) doesn't seem worth it. In the common case, memory maps will work just fine and your OS will manage the memory for you. It's only the corner cases that are sub-optimal: when memory maps can't be used (e.g., on virtual files or stdin).
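To illustrate why approach (3) gets things wrong, here is a small sketch of the suggested FIFO window of n lines. It is written in Python for illustration only, and `windowed_search` is a hypothetical helper, not anything ripgrep provides.

```python
import re
from collections import deque

def windowed_search(lines, pattern, n):
    """Search a sliding window of at most n lines (the capped approach)."""
    regex = re.compile(pattern)
    window = deque(maxlen=n)  # the oldest line falls out, FIFO style
    for line in lines:
        window.append(line)
        if regex.search("".join(window)):
            return True
    return False

# 'a' and 'b' are separated by four blank lines, so the match spans 6 lines.
text = "a\n" + "\n" * 4 + "b\n"
lines = text.splitlines(keepends=True)

print(windowed_search(lines, r"a\s+b", 3))  # window too small for this match
print(windowed_search(lines, r"a\s+b", 6))  # window just large enough
```

Whatever cap is chosen, a pattern like `a\s+b` can always be fed an input whose match is one line longer.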

How significant are the trade-offs to the user experience? If doing multiline is more expensive, I'm fine with that as long as single line performance is not impacted.

If --multiline is behind a flag, then I'm pretty confident that the standard UX of ripgrep won't be impacted. Including performance.

Would you actually need a special flag to ripgrep? or can you reliably determine from the expression itself?

A flag is 100% necessary. A regex like a\s+b shouldn't match across multiple lines by default, because that's what we've all come to expect from line oriented searchers. But it is totally plausible that you might want it to. That's when you'd pass a flag.

@d-akara

d-akara commented Mar 17, 2017

I still actually strongly believe that multiline search is a very niche feature, but it is one that can be quite useful when the situation calls for it.

I would agree use is actually niche, but desire to use is not.

  1. It is a bit non-intuitive how to properly write a multiline expression, especially if the engine doesn't support dotAll matching for . and even more so if you want to constrain the match to a range like the next N lines.
  2. Due to 1, many use incomplete results although not always knowingly. Most coding languages can have line breaks almost anywhere.

I would say that if you are searching for 2 terms and completeness is important, then multiline would often be your default. However, writing an expression to find termA followed by termB within 5 or fewer lines is not something that rolls off the fingertips of someone who only occasionally uses regular expressions. I think many would find such expressions useful and would use them if they were more intuitive to write.

@BurntSushi
Owner

BurntSushi commented Mar 17, 2017

@dakaraphi Good points. I'd like to use your comment to constrain this feature, namely, that multiline search is the ability to apply a regex whose matches may span an arbitrary number of lines.

With that said:

  1. It would be plausible to make . match \n by default if multiline mode is enabled.
  2. The use case of "where do A and B co-occur within N lines of each other" is definitely something I agree can be useful. It's possible to some extent to do this with a regex, e.g., A([^\n]*\n){0,5}[^\n]*B|B([^\n]*\n){0,5}[^\n]*A, but that is a little painful. Extending this to three terms would probably be horrifying.

I think (2) is something that's enabled by multiline search, although, today, you can do something similar with contexts: rg B -C5 | rg A -C5 for example works to some extent. Regardless, it might be wiser to categorize this into a separate feature whose UX can be more thoughtfully designed. Others have requested similarish things, as in #346 and #360. sift is a tool that has support for this kind of matching, so we may be able to crib ideas from them.
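As a sketch of how the co-occurrence regex from (2) could be generated rather than typed by hand, consider the following. It is written in Python for illustration. `within_n_lines` is a hypothetical helper, and it uses a non-capturing group where the regex above uses a capturing one.

```python
import re

def within_n_lines(a, b, n):
    """Build a regex matching literals a and b at most n line breaks apart,
    in either order."""
    a, b = re.escape(a), re.escape(b)
    # Up to n complete line remainders, then the rest of the final line.
    gap = r"(?:[^\n]*\n){0,%d}[^\n]*" % n
    return re.compile(f"{a}{gap}{b}|{b}{gap}{a}")

rx = within_n_lines("A", "B", 5)
print(bool(rx.search("A\n1\n2\nB")))       # 3 line breaks apart
print(bool(rx.search("A" + "\n" * 7 + "B")))  # 7 line breaks apart
```

Extending this to three terms would mean enumerating every ordering, which is where the approach starts to become painful.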

With all that said, we must be careful not to get too far away from what ripgrep is supposed to be good at doing: searching lines. :-) I say this because there has to be a point at which "write code for your specialized search" becomes a valid thing to say. The key is figuring out where that point is.

@d-akara

d-akara commented Mar 17, 2017

multiline search is the ability to apply a regex whose matches may span an arbitrary number of lines

Just to make sure I understand the intention, could you restate that in terms of what ripgrep would not do that other regex engines might do when searching across multiple lines?

@BurntSushi
Owner

@dakaraphi Sorry, the intention of me saying that was to push UX concerns like "how do I find co-occurring terms, A and B, within a fixed number of lines" out of multiline support. i.e., I don't think that particular UX should be addressed as part of standard multiline support, but should instead be considered as a separate feature (that may or may not happen). :-)

I don't think there's anything ripgrep would do differently in terms of UX with respect to The Silver Searcher, other than 1) not doing it by default and 2) probably not doing silent things.

@BurntSushi
Owner

Are there other tools that support multiline search besides The Silver Searcher?

@d-akara

d-akara commented Mar 17, 2017

I'm not sure about command line tools. Prior to using VS Code I was using Brackets which supported multiline file search. I believe other editors like Sublime, Notepad++ etc also support multiline.

@d-akara

d-akara commented Mar 17, 2017

I don't think that particular UX should be addressed as part of standard multiline support, but should instead be considered as a separate feature (that may or may not happen). :-)

ok right. Yes I'm not sure if that really should be part of something like ripgrep or not. For example, I've been thinking about maybe writing an extension for VS Code, like a regex helper, that would take common patterns or templates, let you plug in the values for such use cases, and generate the regex.

@BurntSushi
Owner

@dakaraphi Great! I think we're on the same page now. :-) Thanks for poking!

paldepind added a commit to paldepind/ripgrep that referenced this issue Mar 23, 2017
After [this comment](BurntSushi#176 (comment)) it seems like the statement about never supporting multiline search should be removed.
@BurntSushi BurntSushi added the enhancement An enhancement to the functionality of the software. label Apr 9, 2017
@rshpeley

@dakaraphi directed me here from Microsoft/vscode #13155

It looks like one of the most common requests for searching across multiple lines is related to text editors. At the moment, my needs are very simple. If I can get a match across multiple files in a project for a multiline selection -- even if it's fully literal -- I could work with it. For most text editors, the menu option to search across multiple lines is separate from a simple search, and so a ripgrep flag, as @BurntSushi suggested, would naturally fit this use case.

I'm still making it through @BurntSushi's anatomy of a grep link, but it appears to me that a multiline search for text editors mostly requires a literal search with some multiple literals (white space, line endings) and therefore the search won't even make it to the regex engine for these cases.

Isn't the multiple line selection just a contiguous sequence of bytes (in the fully literal case) to be matched in a buffer? Or am I missing something related to optimisation here?
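For the fully literal case, the idea in the question can be sketched as a plain byte search over a buffer that holds the whole file. This is Python for illustration only, with `find_literal` as a hypothetical helper. It says nothing about how ripgrep's literal optimizations actually work.

```python
def find_literal(buf, needle):
    """Yield (line_number, byte_offset) for each occurrence of needle in buf.

    needle may contain newlines; line numbers are 1-based and refer to the
    line on which the occurrence starts.
    """
    pos = buf.find(needle)
    while pos != -1:
        yield buf.count(b"\n", 0, pos) + 1, pos
        pos = buf.find(needle, pos + 1)

buf = b"x\nlisteners: {\n    click\n}\n"
print(list(find_literal(buf, b"{\n    click")))
```

This only works when the entire selection is available in one contiguous buffer, which is the same requirement discussed earlier in the thread.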

I'm sure people will come up with cases where a regex in a multiline search/replace would be mighty handy, but I think support for the simpler multiple literal multiline case would be a good start to give some text editors (such as vscode and atom) missing functionality.

btw, a most excellent ripgrep article @BurntSushi!

@priyadarshan

Multi-line searching would be a boon to many. See for example this use case.

@mateon1

mateon1 commented Aug 17, 2018

Some nits:

This flag causes '.' to match new lines

Should be newlines for consistency reasons

requires that each file it searches appear as if it exists contiguously in

appears, but perhaps this section could be worded differently.
Maybe: ... ripgrep requires that the searched file is laid out/mapped/allocated contiguously in memory
I'm unsure which wording is the best (I prefer laid out, but maybe that's not appropriate for documentation), but all three sound better to me than the existing version.

Specifically, if the --multiline flag is provided by the regex
cannot match over multiple lines

s/by/but/

@waldyrious

waldyrious commented Aug 17, 2018

@BurntSushi I'm glad you agree with the suggestions! The reworded sentence is indeed much clearer, after fixing the typo pointed out by @mateon1.

Here's the diff of that sentence, for future reference/convenience:

-That is, even if you use the --multiline flag but your regex cannot
-match over multiple lines, then ripgrep won't consume unnecessary resources.
+Specifically, if the --multiline flag is provided but the regex
+cannot match over multiple lines, then ripgrep won't read each file into memory
+before searching it.

Now that I re-read that, I'm not sure "cannot match" is the best choice of words, since it can be read as either a neutral statement or an imperative enforcement. (Not sure I'm being clear myself; let me know if I should rephrase!)

I suppose you're referring to the case where the regex does not contain any patterns that would match newlines, or it contains . without the dotall flag being activated. Is that correct?

@BurntSushi
Owner

@mateon1 Thanks! I took your advice, and chose "laid out."

@waldyrious

I suppose you're referring to the case where the regex does not contain any patterns that would match newlines, or it contains . without the dotall flag being activated. Is that correct?

Yes. Whether dotall is enabled or not is mostly orthogonal; what matters is whether a \n exists in any of the possible matches of a regex. Enabling dotall and uttering . is one way to achieve that, but a literal \n, \s, \p{any} and so on also achieve that.
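A quick way to see that dotall is only one of several routes to matching `\n` is to probe a few patterns against a lone newline. The sketch below uses Python's `re` as a stand-in, and `can_match_newline_alone` is a hypothetical helper. It is only a probe, not the analysis of all possible matches described above. `\p{any}` is left out because Python's `re` has no `\p{...}` classes.

```python
import re

def can_match_newline_alone(pattern):
    """Probe: does the pattern match a string consisting of a single "\n"?"""
    return re.fullmatch(pattern, "\n") is not None

for p in [r"\n", r"\s", r"(?s).", r".", r"[^a]"]:
    print(p, can_match_newline_alone(p))
```

Note that a negated class such as `[^a]` also admits `\n`, even though nothing in it mentions newlines.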

It is possible I should just remove this part of the docs. I'm not sure. I put it there as a way of saying that even if you enable multiline mode but don't make use of it, you generally won't pay (much) for it. But maybe that's not that important.

@waldyrious

I think it wouldn't be a problem if it were removed, but it is useful information so I'd have a slight preference to keep it.

IMO changing that sentence to something like this:

"Specifically, if the --multiline flag is provided, but the regex does not contain patterns that would match \n characters, then ripgrep will automatically avoid reading each file into memory before searching it."

...would make it sufficiently unambiguous.

@BurntSushi
Owner

@waldyrious I like it. Much better. Thanks! :)

BurntSushi added a commit that referenced this issue Aug 19, 2018
This commit updates the CHANGELOG to reflect all the work done to make
libripgrep a reality.

* Closes #162 (libripgrep)
* Closes #176 (multiline search)
* Closes #188 (opt-in PCRE2 support)
* Closes #244 (JSON output)
* Closes #416 (Windows CRLF support)
* Closes #917 (trim prefix whitespace)
* Closes #993 (add --null-data flag)
* Closes #997 (--passthru works with --replace)

* Fixes #2 (memory maps and context handling work)
* Fixes #200 (ripgrep stops when pipe is closed)
* Fixes #389 (more intuitive `-w/--word-regexp`)
* Fixes #643 (detection of stdin on Windows is better)
* Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird)
* Fixes #764 (coalesce color escapes)
* Fixes #922 (memory maps failing is no big deal)
* Fixes #937 (color escapes no longer used for empty matches)
* Fixes #940 (--passthru does not impact exit status)
* Fixes #1013 (show runtime CPU features in --version output)
BurntSushi added a commit that referenced this issue Aug 20, 2018
@myfairsyer

myfairsyer commented Aug 23, 2018

Will \n only match \n / 0x0A, or any common single line break (\r?\n), or, if you take the classic MacOS and BBC into account, ((\n\r?)|(\r\n?))?

(I do know that both styles exist among regex engines but couldn't tell which is which)

Sorry if there is an answer to that somewhere.

@BurntSushi
Owner

BurntSushi commented Aug 23, 2018

\n only matches \n.

Current master has a --crlf option that causes $ to match \r\n line breaks in addition to \n.

I'm not aware of any regex engines that permit a literal \n to match \r\n. Some regex engines certainly allow for a looser definition of what "line terminator" actually means when necessary, e.g., when matching the ^ or $ anchors. If you know of a regex engine that permits a literal \n to match \r\n then I'd like to have a link to that so I can investigate!

@roblourens
Contributor

roblourens commented Aug 23, 2018

VS Code matches \r\n on \n when ctrl+f searching in a single file. It's useful in an editor but I wouldn't use that as inspiration for ripgrep.

@BurntSushi
Owner

BurntSushi commented Aug 23, 2018 via email

@roblourens
Contributor

No, it's just something vscode does.

@myfairsyer

If you know of a regex engine that permits a literal \n to match \r\n then I'd like to have a link to that so I can investigate!

@BurntSushi Most probably I only encountered it in text editors like VSCode.

it's useful in an editor but I wouldn't use that as inspiration for ripgrep.

@roblourens Would you mind elaborating?
And does that mean that VSCode will behave differently inside an editor and when searching across files?

@roblourens
Contributor

roblourens commented Aug 24, 2018

Personally I don't prefer "magic" like that, but yeah I'll have to see whether we can rewrite \n to \r?\n so that search across files works the same as search inside files.
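A naive version of such a rewrite could look like the sketch below. It is written in Python for illustration, `loosen_newlines` is a hypothetical name, and this is not what VS Code does. It walks backslash escapes pairwise so that an escaped backslash followed by a plain `n` is left alone, but it does not understand character classes, so something like `[^\n]` would be rewritten incorrectly.

```python
import re

def loosen_newlines(pattern):
    r"""Rewrite each \n escape in a regex source string to \r?\n.

    Character classes are NOT handled; this is a naive sketch only.
    """
    def repl(m):
        # m.group(1) is the character that follows the backslash.
        return r"\r?\n" if m.group(1) == "n" else m.group(0)
    return re.sub(r"\\(.)", repl, pattern, flags=re.DOTALL)

p = loosen_newlines(r"a\nb")
print(p)
print(bool(re.search(p, "a\r\nb")), bool(re.search(p, "a\nb")))
```

A robust rewrite would have to operate on the parsed regex rather than on its source text.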

@myfairsyer

it's useful in an editor but I wouldn't use that as inspiration for ripgrep.

@roblourens Would you mind elaborating?

Personally I don't prefer "magic" like that

@roblourens
I was rather driving at the distinction between text editor and ripgrep.
I couldn't quite follow.
Is it because you consider ripgrep, as a command line tool, to have a more advanced audience that demands more control and less magic than a graphical text editor?

@BurntSushi
I don't want to derail or hijack this thread with irrelevant discussions.
You said you'd like to know more and investigate and found VSCode's behavior interesting.
If you no longer do, tell me.

@wmww

wmww commented Oct 31, 2018

Currently, if you try to make a multiline search without the -U/--multiline option, ripgrep fails with the error: the literal '"\n"' is not allowed in a regex. Would it make sense to mention the existence of a multiline enabling option here?

@BurntSushi
Owner

@wmww That should already be done on master. See: #1055

Also, please file new issues for new requests.

@unphased

unphased commented Nov 25, 2020

Hi, I'm curious if there is a way to make multiline dot non-greedy? I tried (?s).*? and .*? under --multiline-dotall and neither worked. It seems with multiline mode, the .*? fails to become non-greedy. Is there an underlying reason for this?

[^>] and such work under multiline mode, though, which is the less general way to do sort of non-greedy stuff.
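For reference, the three variants being compared behave as follows in a generic regex engine. The sketch uses Python's `re`, so it illustrates the intended semantics only and is not a test of ripgrep itself.

```python
import re

text = "<a>\n<b>"

greedy = re.search(r"(?s)<.*>", text).group(0)    # runs to the last '>'
lazy = re.search(r"(?s)<.*?>", text).group(0)     # stops at the first '>'
negated = re.search(r"<[^>]*>", text).group(0)    # also stops at the first '>'

print(repr(greedy), repr(lazy), repr(negated))
```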

@BurntSushi
Owner

Please file a new issue. And please don't use phrases like "does not work" without actually showing what you mean by it. Please fill out the complete bug template.

@unphased

You're right, sorry for trying to resurrect and derail an old issue. I did more testing on this and I think I neglected to consider something in my test. It all seems to work as expected.

@amosbird

amosbird commented Dec 1, 2021

Do we have an option to limit the max number of lines each match can have?

@BurntSushi
Owner

No. And I don't see any obvious way to implement that either. You can usually build such limits into your regex instead.

@amosbird

amosbird commented Dec 1, 2021

No. And I don't see any obvious way to implement that either. You can usually build such limits into your regex instead.

Can we build a regex to match the following? foo.*[at most three new lines]bar.*[at most three new lines]baz.* (all together at most three new lines)

@BurntSushi
Owner

@amosbird Sure? Why not? foo(.*\n?){0,3}bar(.*\n?){0,3}, or something like that anyway.
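A small check of how that pattern bounds the gap between foo and bar is sketched below. It uses Python's `re` as a stand-in, so it shows the regex semantics only; with ripgrep itself the pattern would additionally need the -U/--multiline flag mentioned earlier in the thread.

```python
import re

# The pattern exactly as suggested above.
BOUNDED = re.compile(r"foo(.*\n?){0,3}bar(.*\n?){0,3}")

near = "foo\nline 1\nline 2\nbar"  # 3 line breaks between foo and bar
far = "foo\n1\n2\n3\n4\nbar"       # 5 line breaks between foo and bar

print(BOUNDED.search(near) is not None)
print(BOUNDED.search(far) is not None)
```

Each `(.*\n?)` repetition can consume at most one line break, so `{0,3}` caps the gap at three.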

In the future, I'd really prefer you open new tickets for support questions. Bumping old issues doesn't make these discussions easy to find. There is even a Q&A forum designed for this purpose. Please use it.

@amosbird

amosbird commented Dec 1, 2021

In the future, I'd really prefer you open new tickets for support questions. Bumping old issues doesn't make these discussions easy to find. There is even a Q&A forum designed for this purpose. Please use it.

Sure. Will continue the discussion in the Q&A forum.
