Use PCRE for `builtins.match` and `builtins.split` #7336

yorickvP · 2022-11-23T15:06:44Z

Fixes #2147, fixes #4758
See also #3826.

This is not technically fully compatible, but I can't find a builtins.match invocation in the wild that doesn't work, since programs could only rely on quite a limited subset anyways.
Performance is similar because no one really used a lot of regexes, but this should theoretically be faster.

Additionally adds support for named captures, so you can do

nix-repl> builtins.match "(?<date>(?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d))" "2020 - 10 - 10"
{ date = "2020 - 10 - 10"; day = "10"; month = "10"; year = "2020"; }

nixos-discourse · 2022-11-25T09:50:46Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-40/23480/1

oxalica · 2022-11-25T11:53:33Z

src/libexpr/primops/regex.cc

+      Returns a list composed of non matched strings interleaved with the
+      lists of the [extended POSIX regular
+      expression](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04)
+      *regex* matches of *str*. Each item in the lists of matched


PCRE is not POSIX ERE, so the documentation should point to https://www.pcre.org/original/doc/html/pcrepattern.html instead.

Also worth mentioning that PCRE supports arbitrary look-around, which can cause (maybe accidental) exponential run time. Maybe we should limit the input to a reasonable subset.
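
To make the exponential run time concern concrete: in a backtracking engine the textbook trigger is a nested quantifier such as `(a+)+b`. The block below is only a sketch, a hand-written backtracking matcher for that single pattern (it does not call PCRE, and `match_groups` / `steps_to_reject` are made-up names). It counts how many group starts are tried before an input made only of `a`s is rejected.

```cpp
#include <cstddef>
#include <string>

// Hand-written backtracking matcher for the pattern (a+)+b, anchored at both
// ends. `steps` counts how many times the start of an a+ group is tried.
bool match_groups(const std::string& s, std::size_t pos, unsigned long long& steps) {
    ++steps;
    // Try every possible length for the next a+ group.
    for (std::size_t end = pos; end < s.size() && s[end] == 'a'; ++end) {
        std::size_t next = end + 1;
        // After the group, either the final 'b' closes the match...
        if (next + 1 == s.size() && s[next] == 'b') return true;
        // ...or another a+ group starts at `next`.
        if (match_groups(s, next, steps)) return true;
    }
    return false;
}

// Number of steps needed to reject n copies of 'a' (there is no 'b' to find).
unsigned long long steps_to_reject(std::size_t n) {
    unsigned long long steps = 0;
    match_groups(std::string(n, 'a'), 0, steps);
    return steps;
}
```

In this sketch every extra `a` doubles the step count, because each way of splitting the run of `a`s into groups is tried before giving up. A real engine may apply match limits or optimizations that cut this short, so treat it as an illustration of the failure mode rather than a measurement of PCRE.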

tfc · 2022-11-25T13:44:36Z

src/libexpr/primops/regex.cc

+        int errorcode;
+        PCRE2_SIZE erroffset;
+
+        code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr);


Suggested change

-        code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr);
+        code = pcre2_compile(static_cast<const unsigned char*>(re.data()), re.length(), 0, &errorcode, &erroffset, nullptr);
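
A caveat on this suggestion: `const char*` and `const unsigned char*` are unrelated pointer types in C++, so a `static_cast` between them is rejected by the compiler and a named cast here would have to be `reinterpret_cast`. The sketch below shows that with a hypothetical stand-in function (`max_byte` is not a PCRE2 function; it only takes unsigned bytes the way `pcre2_compile` takes a `PCRE2_SPTR` in the 8-bit library).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a C API that takes unsigned bytes.
unsigned max_byte(const unsigned char* data, std::size_t len) {
    unsigned best = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (data[i] > best) best = data[i];
    return best;
}

unsigned max_byte_of(std::string_view re) {
    // static_cast<const unsigned char*>(re.data()) is ill-formed, because
    // const char* and const unsigned char* are unrelated pointer types.
    const auto* bytes = reinterpret_cast<const unsigned char*>(re.data());
    return max_byte(bytes, re.size());
}
```

Accessing `char` data through `unsigned char*` is permitted by the aliasing rules, so the `reinterpret_cast` is safe in this direction.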

tfc · 2022-11-25T13:46:02Z

src/libexpr/primops/regex.cc

+            name_table.reserve(namecount);
+            for (size_t i = 0; i < namecount; i++) {
+                int n = tabptr[0] << 8 | tabptr[1];
+                name_table.emplace_back((const char*)(tabptr+2), n);


Suggested change

-                name_table.emplace_back((const char*)(tabptr+2), n);
+                name_table.emplace_back(static_cast<const char*>(tabptr+2), n);

tfc · 2022-11-25T13:51:36Z

src/libexpr/primops/regex.cc

+{
+    friend class MatchData;
+protected:
+    pcre2_code* code;


if you made this a std::unique_ptr, then you wouldn't need to manually call pcre2_code_free in the destructor. In addition to that, the constructor function could not leak the successfully compiled pcre2 resource in case of exceptions.

relevant snippets:

// member declaration
using pcre_ptr = std::unique_ptr<pcre2_code, void(*)(pcre2_code*)>;
pcre_ptr code;

// constructor:
Regex(std::string_view re) {
    ...
    code = pcre_ptr(pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr), pcre2_code_free);
}
...
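
A compilable sketch of the same idea, under stated assumptions: `fake_code`, `fake_compile` and `fake_code_free` are hypothetical stand-ins for the PCRE2 type and functions so that the example builds without the library, and the `fail_later` flag only exists to simulate an exception thrown after the resource was acquired. Since a `std::unique_ptr` with a function-pointer deleter cannot be default-constructed, the sketch initializes it in the member initializer list rather than assigning it in the constructor body.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins for a C-style resource API such as
// pcre2_compile / pcre2_code_free; these are not the real PCRE2 functions.
struct fake_code { int id; };
static int live_objects = 0;

fake_code* fake_compile() { ++live_objects; return new fake_code{1}; }
void fake_code_free(fake_code* c) { --live_objects; delete c; }

class Regex {
    using code_ptr = std::unique_ptr<fake_code, void (*)(fake_code*)>;
    code_ptr code;  // freed automatically, no hand-written destructor needed

public:
    explicit Regex(bool fail_later)
        : code(fake_compile(), fake_code_free)
    {
        // If anything throws after `code` is initialized, the already
        // constructed member is destroyed and the resource is released.
        if (fail_later)
            throw std::runtime_error("later setup step failed");
    }
};
```

With a raw pointer member and a manual free in the destructor, the throwing path would leak, because the destructor of a partially constructed object never runs.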

tfc · 2022-11-25T13:55:01Z

src/libexpr/primops/regex.cc

+
+    void compile()
+    {
+        assert(pcre2_jit_compile(code, PCRE2_JIT_COMPLETE) == 0);


i suggest putting the call to pcre2_jit_compile outside the assert and store its result in a variable. then, assert on the variable's value.

The reason i suggest that is that asserts are meant to be optimized out of production builds, but that would then remove the compile call...
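
A small sketch of the problem, assuming a release build that defines `NDEBUG`. `jit_compile_stub` is a hypothetical stand-in for `pcre2_jit_compile` and only counts how often it is called; `NDEBUG` is defined by hand here to simulate such a build.

```cpp
// Stand-in for a call whose side effect must not be lost,
// e.g. pcre2_jit_compile(); returns 0 on success.
static int calls = 0;
int jit_compile_stub() { ++calls; return 0; }

// Simulate a release build: with NDEBUG defined, assert(expr) expands to
// nothing, so expr is never evaluated.
#define NDEBUG
#include <cassert>

void compile_inside_assert() {
    assert(jit_compile_stub() == 0);  // the call disappears under NDEBUG
}

void compile_outside_assert() {
    int rc = jit_compile_stub();      // always runs
    assert(rc == 0);                  // only the check disappears
    (void)rc;                         // silence the unused-variable warning
}

// Restore the normal assert behaviour for any code that follows.
#undef NDEBUG
#include <cassert>
```

Calling `compile_inside_assert()` leaves `calls` unchanged, while `compile_outside_assert()` increments it, which is the difference the comment above is about. Whether Nix release builds actually define `NDEBUG` is not stated in this thread.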

tfc · 2022-11-25T13:55:36Z

src/libexpr/primops/regex.cc

+    MatchData(MatchData&&) = delete;
+    ~MatchData()
+    {
+        pcre2_match_data_free(match);


std::unique_ptr with a custom deleter as above would be great.

tfc · 2022-11-25T13:56:59Z

src/libexpr/primops/regex.cc

+    MatchData(Regex& re) noexcept
+        : re(re)
+    {
+        match = pcre2_match_data_create_from_pattern(re.code, NULL);
+        size_ = pcre2_get_ovector_count(match);
+        ovector = pcre2_get_ovector_pointer(match);
+    };


Suggested change

-    MatchData(Regex& re) noexcept
-        : re(re)
-    {
-        match = pcre2_match_data_create_from_pattern(re.code, NULL);
-        size_ = pcre2_get_ovector_count(match);
-        ovector = pcre2_get_ovector_pointer(match);
-    };
+    MatchData(Regex& re) noexcept
+        : re{re}
+        , match{pcre2_match_data_create_from_pattern(re.code, NULL)}
+        , size_{pcre2_get_ovector_count(match)}
+        , ovector {pcre2_get_ovector_pointer(match)}
+    {
+    };

plus the reordering comment from above
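
One thing to keep in mind with the initializer-list form: C++ initializes members in the order they are declared in the class, not in the order they appear in the list. The suggested list therefore relies on `match` being declared before `size_` and `ovector` (the declaration order in `MatchData` is not visible in this excerpt). A minimal sketch with a made-up `Dependent` struct:

```cpp
// Members are initialized in declaration order: `base` first, then `derived`.
// The initializer list below is written in the same order, so `derived` can
// safely read `base`.
struct Dependent {
    int base;
    int derived;

    explicit Dependent(int b)
        : base{b}
        , derived{base * 2}  // valid only because `base` is declared first
    {
    }
};
```

If `derived` were declared before `base`, the same initializer list would read `base` before it is initialized; most compilers report a mismatch between list order and declaration order with `-Wreorder`.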

tfc · 2022-11-25T14:09:53Z

src/libexpr/primops/regex.cc

+        v.mkAttrs(bindings);
+    } else {
+        // the first match is the whole string
+        const size_t len = match.size() - 1;


it feels like asserting that the size is > 0 would increase the safety of this function

tfc · 2022-11-25T14:12:46Z

src/libexpr/primops/regex.cc

+            if (!match[i+1].has_value())
+                (v.listElems()[i] = state.allocValue())->mkNull();
+            else
+                (v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]);


Suggested change

-            if (!match[i+1].has_value())
-                (v.listElems()[i] = state.allocValue())->mkNull();
-            else
-                (v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]);
+            auto *val = state.allocValue();
+            v.listElems()[i] = val;
+            if (!match[i+1].has_value())
+                val->mkNull();
+            else
+                val->mkString(*match[i + 1]);

tfc · 2022-11-25T14:13:37Z

src/libexpr/primops/regex.cc

+
+void prim_match(EvalState & state, const PosIdx pos, Value * * args, Value & v)
+{
+    auto re = state.forceStringNoCtx(*args[0], pos);


can we add some asserts here?

tfc · 2022-11-25T14:15:15Z

src/libexpr/primops/regex.cc

+   non-matching parts interleaved by the lists of the matching groups. */
+void prim_split(EvalState & state, const PosIdx pos, Value * * args, Value & v)
+{
+    auto re = state.forceStringNoCtx(*args[0], pos);


same here, asserts would improve debuggability if we ever end up with code that inserts nullptrs

tfc · 2022-11-25T14:19:36Z

src/libexpr/primops/regex.cc

+            result.push_back(prefix);
+
+            // Add a list for matched substrings.
+            auto elem = state.allocValue();


the code is a bit inconsistent in choosing the type for allocValue calls. It's either Value* or auto and it would be nice to choose one variant and stick to it.

edolstra · 2022-11-28T13:46:40Z

Looks good to me. The main objection is that this makes the Nix language specification depend on "whatever PCRE does". That's not a real problem for Nix, but might be a problem for people who want to reimplement it without having a dependency on a C library.

thufschmitt

I like this overall 👍 left a few inline comments, but nothing too big.

Since this is a potential breaking change, I would like to see it feature-gated for at least a release cycle. Maybe on by default, but with an easy off switch in case it's hurting people. That would require keeping the old code around of course, but that'd just be temporary.

thufschmitt · 2022-11-28T10:43:20Z

src/libexpr/eval.hh

@@ -76,6 +76,7 @@ void initGC();
 struct RegexCache;

 std::shared_ptr<RegexCache> makeRegexCache();
+size_t regexCacheSize(std::shared_ptr<RegexCache> cache);


Could this be replaced by a RegexCache::size() method?

thufschmitt · 2022-11-28T10:46:26Z

src/libexpr/primops/regex.cc

+
+class MatchData;
+
+class Regex


Don't let yourself be blocked on that, but it would be nicer to move the pure "PCRE abstraction layer" part under libutil

thufschmitt · 2022-11-28T10:52:20Z

src/libexpr/primops/regex.cc

+
+    } catch (RegexError & e) {
+        state.debugThrowLastTrace(EvalError({
+            .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),


Suggested change

-            .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
+            .msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()),

thufschmitt · 2022-11-28T10:52:34Z

src/libexpr/primops/regex.cc

+
+    } catch (RegexError & e) {
+        state.debugThrowLastTrace(EvalError({
+            .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),


Suggested change

-            .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
+            .msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()),

thufschmitt · 2022-11-28T20:13:30Z

src/libexpr/primops/regex.cc

+        state.debugThrowLastTrace(EvalError({
+            .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
+            .errPos = state.positions[pos]
+        }));


Actually the above doesn't look so great, maybe rather something like

Suggested change

-        state.debugThrowLastTrace(EvalError({
-            .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
-            .errPos = state.positions[pos]
-        }));
+        e.addTrace(state.positions[pos], "while evaluating regex '%s'", re);
+        state.debugThrowLastTrace(e);

alyssais · 2022-11-29T07:53:35Z

Is there no portable library we could use that implements POSIX ERE? That would avoid compatibility problems between Nix versions, avoid the footguns of backtracking etc., and be something more reasonable to find multiple implementations of for Nix reimpls.

oxalica · 2022-11-29T09:29:34Z

> Is there no portable library we could use that implements POSIX ERE? That would avoid compatibility problems between Nix versions, avoid the footguns of backtracking etc., and be something more reasonable to find multiple implementations of for Nix reimpls.

boost also supports regex in many flavors, including POSIX ERE. Could it be a choice?

POSIX ERE is a relatively small and easy-to-implement standard, so other languages usually also provide it or have a library for it.

edolstra · 2022-11-29T10:29:07Z

boost::regex is essentially the same as std::regex, except that it has a dependency on icu4c which makes it very bloated.

oxalica · 2022-11-29T10:37:52Z

> boost::regex is essentially the same as std::regex, except that it has a dependency on icu4c which makes it very bloated.

Technically it should not, if we don't use the Unicode part. The issue is that currently boost is dynamically linked and has --with-icu by default...

Meanwhile, std::regex is known (at least to me) to be fragile and broken across different toolchains/platforms.

tfc · 2022-11-29T11:38:26Z

> While std::regex is known (at least for me) to be fragile and broken at different toolchain/platforms.

std::regex is also known to be really slow. I was involved in a project where, based on significant performance differences found when benchmarking it, we changed to Google's re2 library. (Not suggesting that it should be re2, but std::regex is simply slow.)

edolstra · 2022-11-30T12:52:56Z

> The issue is that currently boost is dynamically linked and have --with-icu by default...

If we can leave out icu support using an override on boost, I'm fine with that.

Ericson2314 · 2022-12-02T12:28:02Z

One option is that we support both regexes during a migration period.

nixos-discourse · 2022-12-05T09:43:43Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2022-12-02-nix-team-meeting-minutes-13/23731/1

alyssais · 2022-12-06T20:37:36Z

Another option for a portable POSIX ERE implementation might be musl's? It's permissively licensed, and only a few thousand lines total, so it would be reasonable for Nix to just bundle it, IMO.

fricklerhandwerk · 2022-12-08T08:21:28Z

Discussed in the Nix team meeting on 2022-12-05:

- decision: @yorickvP please try building boost without icu4c. The requirement is full backwards compatibility and minimal closure size.
  - (@alyssais that was before your comment. @edolstra anything speaking against musl?)

Complete discussion

- @edolstra: if we ran tests against public flakes, we could use that to detect breakages on regexes
- @Ericson2314: could have both implementations
  - @edolstra: strongly against having multiple implementations. This would be a bad architectural idea in terms of maintainability and elegance
  - @thufschmitt: could have both and a policy that one of them will be turned off at some point in the future
- (more argument over reverting Nix versions and compatibility policy)
- decision: ask author to try building boost without icu4c
  - in that case, we get around both constraints: compatibility and closure

nixos-discourse · 2022-12-08T08:25:40Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2022-12-05-nix-team-meeting-minutes-14/23836/1

yorickvP · 2022-12-08T14:08:09Z

boost builds without icu (it's mostly used for Boost.Locale afaict), but it requires an ugly overrideAttrs or a nixpkgs patch

(boost.override { icu = null; }).overrideAttrs (o: {
  configureFlags = (lib.take 4 o.configureFlags) ++ [ "--without-icu" ];
})

oxalica · 2022-12-08T19:06:53Z

> boost builds without icu (it's mostly used for Boost.Locale afaict), but it requires an ugly overrideAttrs or a nixpkgs patch

I created #205166 to simplify this.

roberth · 2022-12-13T13:00:11Z

While I do think fixing the stack behavior is a higher priority than extending the language with PCRE-specific features, adding PCRE may be feasible.

I believe PCRE was designed to process all EREs correctly, but I haven't found definitive evidence of this. Finding this would greatly help, so if anyone knows better where to look...

To some degree, any change in behavior could be a breaking change. (cue spacebar heating) However, I'd be willing to assume that nobody is validating ERE regexes by trying to use them in builtins.match regex "", in large part because tryEval can't catch a bad regex anyway. This should make the "boolean" whether the extended functionality is present unobservable from within the language.

So we have two possible approaches: either "prove" that PCRE processes EREs correctly, or use a good ERE library instead.

thufschmitt · 2022-12-13T13:40:17Z

@roberth afaik, PCRE is mostly a superset of POSIX EREs, but not strictly.
An example of an (obscure) ERE construct that is not a PCRE one is character equivalents (equivalence classes):

$ echo e | grep --extended-regexp '[[=e=]]'
e
$ echo e | grep --perl-regexp '[[=e=]]'
grep: POSIX collating elements are not supported

Now the question is whether this really matters, and I'm not really convinced it does given how niche it is

roberth · 2022-12-13T14:10:23Z

> An example of an (obscure) ERE construct that is not a PCRE one is character equivalents (equivalence classes):

Ooh, that's locale dependent. I didn't manage to exploit that as a potential impurity (but maybe I did it wrong). Maybe it's just not implemented in GNU libstdc++, but it may be implemented in another library.

What helps in this case is that libpcre prints an error, and doesn't continue with a garbage regex. If anyone does rely on regexes with collating elements, they'll be able to detect the problem in time, and work around it by rewriting the regexes they apply to their (presumably foreign) input. If character equivalents are already truly broken in Nix (which they should be, as they're impure), this is hardly a regression.
Similarly, collating sequences must not be supported; also locale dependent.

May have to dig through this...
https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816#feature-comparison
[:alpha:] looked like a potential deal breaker, but seems ok in grep.
So the bracket classes (bottom of page) seem kind of ok. Haven't looked at the rest yet.

SuperSandro2000 · 2023-02-08T10:44:41Z

Just my 2 cents on this:
boost is notorious for massively increasing compile times, and for the 50k lines of C++ code Nix has, it takes very long to compile.
Not sure if buying more into boost is a good idea considering compile time.

yorickvP · 2023-04-04T07:45:30Z

Superseded by #7762

yorickvP added 3 commits November 23, 2022 15:54

primops: use PCRE for regexp matching

5107b26

priomps: split regex into separate file

bd88fd1

prim_match: support named captures

74bc0b2

oxalica suggested changes Nov 25, 2022

tfc reviewed Nov 25, 2022

tfc suggested changes Nov 25, 2022

thufschmitt requested changes Nov 28, 2022

roberth added the bug label Dec 2, 2022

oxalica mentioned this pull request Dec 8, 2022

boost: add configurable enableIcu flag NixOS/nixpkgs#205166

Merged

13 tasks

yorickvP mentioned this pull request Feb 6, 2023

Switch from std::regex to boost::regex #7762

Merged

7 tasks

yorickvP closed this Apr 4, 2023
