libripgrep #162

BurntSushi · 2016-10-11T00:42:59Z

This is a tracking issue for the concept known as "libripgrep."

In fact, "libripgrep" probably won't ever be a thing unto itself. Instead, there will be several independent crates that together make up the thing one might think of as "libripgrep." Most of this work is already done. Here's a list:

memchr - Fast single byte search.
walkdir - Recursive directory iterator.
utf8-ranges - Generate utf8 automata.
regex-syntax - A regex parser (including literal extraction helpers).
regex - The regex engine itself.
globset - Regex based glob matching, including glob-set matching.
grep - Blazing fast line-by-line search. This is where all of the inner literal optimizations happen. (INCOMPLETE.)
ignore - Recursively traverse a directory while respecting ignore files and file types.
parallel directory iterator - Iterate over files in parallel, applying ignore rules. (This is now done and part of the ignore crate.)

grep is incomplete because it exposes only the most basic line-by-line
search. For example, it doesn't handle inverted matching, context handling,
line counts or anything other than "which line matches this regex." Still,
there is considerable regex-specific work in here around literal matching and
other optimizations. The rest of the aforementioned work is done in ripgrep
proper. Ideally, all of that should get moved out into the grep crate. This
is predominantly a design task, because searching is inextricably tied to
printing, which means the API for all of these features is quite complex.

gitignore is a reasonably light wrapper around the globset crate. The hope
is that a future hgignore crate would be too. Nevertheless, the semantics are
tricky enough that it's worth it to put it into its own crate.

A parallel directory iterator is something I want because so much time is spent
in handling all of the ignore rules. If we can apply parallelism to that, I
kind of expect that we'll get a nice win. Whether a parallel directory iterator
can be design out-of-tree or not still isn't clear. In particular, it's not
clear how coupled this needs to be with respecting ignore rules.

In any case, I don't really have the bandwidth to do the design work required to manage contributions from others toward this goal quite yet. I kind of think they're big enough that I would like to at least do the initial work on them.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-10-11T00:47:29Z

With that said, if someone wanted to get started on a sed replacement, it'd be faster to fork ripgrep and try to start from there by ripping out pieces you no longer need and adding what you do need. I'd still recommend waiting for more progress on "libripgrep" though. :-)

BurntSushi · 2016-10-12T00:30:48Z

I'm currently working on moving ignore pattern handling out into a separate crate, with an eye toward designing an abstraction that is amenable to parallel directory iteration.

BurntSushi · 2016-10-30T02:38:29Z

Update on this: The ignore crate is now a thing. We don't have a parallel iterator yet, but I designed a persistent data structure that should be amenable to it. Other than that, the only other big thing left in ripgrep core is the search code.

BurntSushi · 2016-11-04T00:44:59Z

I finally have an implementation of a parallel recursive directory iterator and it is a beautiful sight to see. On the Chromium repository, it can actually scan the entire repo while respecting ignores faster than GNU find can print every path(!).

No data races. No unsafe. Rust is awesome.

BurntSushi · 2016-11-05T01:46:03Z

Utterly massive improvements. On 20GB of various git checkouts:

[andrew@Cheetah clones] du -csh
20G     .
20G     total
[andrew@Cheetah clones] time ag Openbox | wc -l
5

real    0m4.931s
user    0m5.863s
sys     0m12.790s
[andrew@Cheetah clones] time rg-master -n Openbox | wc -l
5

real    0m2.791s
user    0m19.297s
sys     0m2.760s
[andrew@Cheetah clones] time rg -n Openbox | wc -l
5

real    0m0.903s
user    0m8.203s
sys     0m4.880s

We even benefit from the parallel iterator when we don't respect .gitignore:

[andrew@Cheetah clones] time rg-master -nu Openbox | wc -l
5

real    0m0.865s
user    0m3.480s
sys     0m3.093s
[andrew@Cheetah clones] time rg -nu Openbox | wc -l
5

real    0m0.631s
user    0m4.123s
sys     0m5.057s

And this is why:

[andrew@Cheetah clones] time rg-master --files | wc -l
631778

real    0m2.522s
user    0m2.043s
sys     0m0.563s
[andrew@Cheetah clones] time rg --files | wc -l
631778

real    0m0.552s
user    0m5.157s
sys     0m1.083s
[andrew@Cheetah clones] time rg-master -u --files | wc -l
611552

real    0m0.797s
user    0m0.397s
sys     0m0.460s
[andrew@Cheetah clones] time rg -u --files | wc -l
611552

real    0m0.377s
user    0m1.447s
sys     0m0.903s

acornejo · 2017-01-20T17:00:10Z

As a potential use of the hypothetical grep crate, wondering if I should wait or fork-and-ripout as you had suggested.

I do need to keep track of line numbers and inverted matching, I am hoping the naive way of doing this is good enough (or perhaps ripgrep employs a sophisticated datastructure for these?)

BurntSushi · 2017-01-20T18:12:56Z

Note that there is already a grep crate, but I intend to do a near complete overhaul of it when I move the (rest of the) search code out of ripgrep into that crate.

I do need to keep track of line numbers and inverted matching, I am hoping the naive way of doing this is good enough (or perhaps ripgrep employs a sophisticated datastructure for these?)

Not really. These things aren't really stored anywhere, they're just computed as they are needed to be printed for performance reasons.

There is a lot of complexity in ripgrep's search_stream module for handling inverted matches, line numbers, contexts and a menagerie of other features. My plan to expose this in the grep crate is to define a trait with a visitor like pattern that the searcher calls. There can then be implementations of that trait that, say, print the results directly to an output or there could be "sink" implementations that collect the results as structured data in memory.

@acornejo What problem are you trying to solve? Without telling me that, I don't really know how to give you any advice.

acornejo · 2017-01-20T18:20:26Z

Let me try to describe the problem I want solved (and btw, if you know of a utility that already does this, please let me know since if I can avoid writing it it will be a big plus ;)

At work I often have to examine log files which are millions of lines long, and usually my workflow looks like this:

cat somelog | grep module
# read and examine output
cat somelog | grep module2
# read and examine output
cat somelog | grep module2 | grep -v feature1
# read and examine output
cat somelog | grep module2 | grep -v feature1 | grep -v feature2
# readn and examine output
cat somelog | grep module2 | grep -v feature1 | grep -v feature2 | grep keyword
# etc...

This is a lot more tedious/slower than it could be. Hence, I now want to write an interactive ncurses program, lets call it loggrep

Then I would do:

loggrep somelog

There I am envisioning to be able to interactively add more greps to the pipeline, remove them from the pipeline, and see how the results are being updated interactively as I type. I also want to support a few features which are not grep related (something like, cut, sort, and uniq, which are often in my command pipeline too).

BurntSushi · 2017-01-20T18:27:10Z

@acornejo I see. That definitely seems like something that could be built on top of what I think the grep crate will look like.

My feeling though is that you might get better mileage out of reusing existing command line tools. It sounds like your key problem is that you're rerunning your entire pipeline for each invocation. But if you saved your results to an intermediate file (which I imagine you'd have to do anyway in your loggrep tool) and continued your pipeline on that, then that seems like it could work well. Maybe there's a simple wrapper shell script you could write instead?

acornejo · 2017-01-20T18:40:32Z

yea, i have a nasty bash script that speeds up the aforementioned workflow, but this is something that I use frequently enough that I feel that having something more ergonomic would pay off.

I guess I'll take a look at the current grep crate when I start this and take it from there.

theduke · 2017-09-18T09:58:35Z

@BurntSushi so there is no plan to provide an easy to use ripgrep library?

I think that could be quite useful!

extern crate ripgrep;

use ripgrep::{run, Options};
...

run("some/path", Options{...}, |result| {
  // Do something with a match...
})?;

BurntSushi · 2017-09-18T10:19:31Z

@theduke Could you explain why you think this doesn't qualify as a plan?

theduke · 2017-09-18T10:33:46Z

Maybe I understood the second sentence in your root comment on this issue:

In fact, "libripgrep" probably won't ever be a thing unto itself.

I interpreted that as "there won't be an actual library, just all the functionality provided by different crates that you have to tie together yourself.

BurntSushi · 2017-09-18T10:44:30Z

@theduke Yes, that's right. You'll have to tie it together. There's currently no plan to move all of ripgrep into a library. I'm not saying it won't happen or that it's not useful, it's just not my goal right now.

libripgrep is not any one library, but rather, a collection of libraries that roughly separate the following key distinct phases in a grep implementation: 1. Pattern matching (e.g., by a regex engine). 2. Searching a file using a pattern matcher. 3. Printing results. Ultimately, both (1) and (3) are defined by de-coupled interfaces, of which there may be multiple implementations. Namely, (1) is satisfied by the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by the `Sink` trait in the `grep2` crate. The searcher (2) ties everything together and finds results using a matcher and reports those results using a `Sink` implementation. Closes #162

This commit updates the CHANGELOG to reflect all the work done to make libripgrep a reality. * Closes #162 (libripgrep) * Closes #176 (multiline search) * Closes #188 (opt-in PCRE2 support) * Closes #244 (JSON output) * Closes #416 (Windows CRLF support) * Closes #917 (trim prefix whitespace) * Closes #993 (add --null-data flag) * Closes #997 (--passthru works with --replace) * Fixes #2 (memory maps and context handling work) * Fixes #200 (ripgrep stops when pipe is closed) * Fixes #389 (more intuitive `-w/--word-regexp`) * Fixes #643 (detection of stdin on Windows is better) * Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird) * Fixes #764 (coalesce color escapes) * Fixes #922 (memory maps failing is no big deal) * Fixes #937 (color escapes no longer used for empty matches) * Fixes #940 (--passthru does not impact exit status) * Fixes #1013 (show runtime CPU features in --version output)

libripgrep is not any one library, but rather, a collection of libraries that roughly separate the following key distinct phases in a grep implementation: 1. Pattern matching (e.g., by a regex engine). 2. Searching a file using a pattern matcher. 3. Printing results. Ultimately, both (1) and (3) are defined by de-coupled interfaces, of which there may be multiple implementations. Namely, (1) is satisfied by the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by the `Sink` trait in the `grep2` crate. The searcher (2) ties everything together and finds results using a matcher and reports those results using a `Sink` implementation. Closes #162

This commit updates the CHANGELOG to reflect all the work done to make libripgrep a reality. * Closes #162 (libripgrep) * Closes #176 (multiline search) * Closes #188 (opt-in PCRE2 support) * Closes #244 (JSON output) * Closes #416 (Windows CRLF support) * Closes #917 (trim prefix whitespace) * Closes #993 (add --null-data flag) * Closes #997 (--passthru works with --replace) * Fixes #2 (memory maps and context handling work) * Fixes #200 (ripgrep stops when pipe is closed) * Fixes #389 (more intuitive `-w/--word-regexp`) * Fixes #643 (detection of stdin on Windows is better) * Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird) * Fixes #764 (coalesce color escapes) * Fixes #922 (memory maps failing is no big deal) * Fixes #937 (color escapes no longer used for empty matches) * Fixes #940 (--passthru does not impact exit status) * Fixes #1013 (show runtime CPU features in --version output)

BurntSushi added the enhancement An enhancement to the functionality of the software. label Oct 11, 2016

BurntSushi self-assigned this Oct 11, 2016

BurntSushi mentioned this issue Oct 11, 2016

Add support for replace in files #74

Closed

BurntSushi mentioned this issue Oct 13, 2016

support searching across multiple lines #176

Closed

kaushalmodi mentioned this issue Oct 14, 2016

Ability to write the replacements #179

Closed

BurntSushi mentioned this issue Oct 28, 2016

add support for reading patterns from a file #7

Closed

BurntSushi mentioned this issue Nov 6, 2016

add option for structured output #32

Closed

BurntSushi mentioned this issue Nov 22, 2016

Limit line lengths #245

Closed

BurntSushi modified the milestone: libripgrep Jan 10, 2017

BurntSushi added the libripgrep An issue related to modularizing ripgrep into libraries. label Jan 11, 2017

BurntSushi mentioned this issue Feb 19, 2017

Explore using a native third-party search tool such as ripgrep or Silver Searcher microsoft/vscode#19983

Closed

BurntSushi mentioned this issue Aug 19, 2018

libripgrep: PCRE2 support, multiline search, JSON output and more #1017

Merged

BurntSushi closed this as completed in #1017 Aug 20, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libripgrep #162

libripgrep #162

BurntSushi commented Oct 11, 2016 •

edited

BurntSushi commented Oct 11, 2016 •

edited

BurntSushi commented Oct 12, 2016

BurntSushi commented Oct 30, 2016

BurntSushi commented Nov 4, 2016

BurntSushi commented Nov 5, 2016

acornejo commented Jan 20, 2017 •

edited

BurntSushi commented Jan 20, 2017

acornejo commented Jan 20, 2017

BurntSushi commented Jan 20, 2017

acornejo commented Jan 20, 2017

theduke commented Sep 18, 2017 •

edited

BurntSushi commented Sep 18, 2017

theduke commented Sep 18, 2017

BurntSushi commented Sep 18, 2017

libripgrep #162

libripgrep #162

Comments

BurntSushi commented Oct 11, 2016 • edited

BurntSushi commented Oct 11, 2016 • edited

BurntSushi commented Oct 12, 2016

BurntSushi commented Oct 30, 2016

BurntSushi commented Nov 4, 2016

BurntSushi commented Nov 5, 2016

acornejo commented Jan 20, 2017 • edited

BurntSushi commented Jan 20, 2017

acornejo commented Jan 20, 2017

BurntSushi commented Jan 20, 2017

acornejo commented Jan 20, 2017

theduke commented Sep 18, 2017 • edited

BurntSushi commented Sep 18, 2017

theduke commented Sep 18, 2017

BurntSushi commented Sep 18, 2017

BurntSushi commented Oct 11, 2016 •

edited

BurntSushi commented Oct 11, 2016 •

edited

acornejo commented Jan 20, 2017 •

edited

theduke commented Sep 18, 2017 •

edited