Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use PCRE for builtins.match and builtins.split #7336

Closed
wants to merge 3 commits into from

Conversation

yorickvP
Copy link
Contributor

Fixes #2147, fixes #4758
See also #3826.

This is not technically fully compatible, but I can't find a builtins.match invocation in the wild that doesn't work, since programs could only rely on quite a limited subset anyways.
Performance is similar because no one really used a lot of regexes, but this should theoretically be faster.

Additionally adds support for named captures, so you can do

nix-repl> builtins.match "(?<date>(?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d))" "2020 - 10 - 10"
{ date = "2020 - 10 - 10"; day = "10"; month = "10"; year = "2020"; }

nix-repl> builtins.match "(?<date>(?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d))" "2020 - 10 - 10"
{ date = "2020 - 10 - 10"; day = "10"; month = "10"; year = "2020"; }
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-40/23480/1

Comment on lines +327 to +330
Returns a list composed of non matched strings interleaved with the
lists of the [extended POSIX regular
expression](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04)
*regex* matches of *str*. Each item in the lists of matched
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PCRE is not POSIX ERE.

Also worth mentioning that PCRE supports arbitrary look-around, which can cause (maybe accidental) exponential run time. Maybe we should limit the input to a reasonable subset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int errorcode;
PCRE2_SIZE erroffset;

code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr);
code = pcre2_compile(static_cast<const unsigned char*>(re.data()), re.length(), 0, &errorcode, &erroffset, nullptr);

name_table.reserve(namecount);
for (size_t i = 0; i < namecount; i++) {
int n = tabptr[0] << 8 | tabptr[1];
name_table.emplace_back((const char*)(tabptr+2), n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name_table.emplace_back((const char*)(tabptr+2), n);
name_table.emplace_back(static_cast<const char*>(tabptr+2), n);

{
friend class MatchData;
protected:
pcre2_code* code;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you made this a std::unique_ptr, then you wouldn't need to manually call pcre2_code_free in the destructor. In addition to that, the constructor function could not leak the successfully compiled pcre2 resource in case of exceptions.

relevant snippets:

  // member declaration
  using pcre_ptr = std::unique_ptr<Bar, void(*)(pcre2_code*)>;
  pcre_ptr code;

  // constructor:
  Regex(std::string_view re) { 
    ...
    code = pcre_ptr(pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr), pcre2_code_free);
  } 
  ...


void compile()
{
assert(pcre2_jit_compile(code, PCRE2_JIT_COMPLETE) == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suggest putting the call to pcre2_jit_compile outside the assert and store its result in a variable. then, assert on the variable's value.

The reason i suggest that is that asserts are meant to be optimized out of production builds, but that would then remove the compile call...

MatchData(MatchData&&) = delete;
~MatchData()
{
pcre2_match_data_free(match);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::unique_ptr with a custom destructor as above would be great.

Comment on lines +87 to +93
MatchData(Regex& re) noexcept
: re(re)
{
match = pcre2_match_data_create_from_pattern(re.code, NULL);
size_ = pcre2_get_ovector_count(match);
ovector = pcre2_get_ovector_pointer(match);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MatchData(Regex& re) noexcept
: re(re)
{
match = pcre2_match_data_create_from_pattern(re.code, NULL);
size_ = pcre2_get_ovector_count(match);
ovector = pcre2_get_ovector_pointer(match);
};
MatchData(Regex& re) noexcept
: re{re}
, match{pcre2_match_data_create_from_pattern(re.code, NULL)}
, size_{pcre2_get_ovector_count(match)}
, ovector {pcre2_get_ovector_pointer(match)}
{
};

plus the reordering comment from above

v.mkAttrs(bindings);
} else {
// the first match is the whole string
const size_t len = match.size() - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels like asserting that the size is > 0 would increase the safety of this function

Comment on lines +189 to +192
if (!match[i+1].has_value())
(v.listElems()[i] = state.allocValue())->mkNull();
else
(v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!match[i+1].has_value())
(v.listElems()[i] = state.allocValue())->mkNull();
else
(v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]);
auto *val = state.allocValue()
v.listElems()[i] = val;
if (!match[i+1].has_value())
val->mkNull();
else
val->mkString(*match[i + 1]);


void prim_match(EvalState & state, const PosIdx pos, Value * * args, Value & v)
{
auto re = state.forceStringNoCtx(*args[0], pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some asserts here?

non-matching parts interleaved by the lists of the matching groups. */
void prim_split(EvalState & state, const PosIdx pos, Value * * args, Value & v)
{
auto re = state.forceStringNoCtx(*args[0], pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, asserts would improve debuggability if we ever end up with code that inserts nullptrs

result.push_back(prefix);

// Add a list for matched substrings.
auto elem = state.allocValue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code is a bit inconsistent in choosing the type for allocValue calls. It's either Value* or auto and it would be nice to choose one variant and stick to it.

@edolstra
Copy link
Member

Looks good to me. The main objection is that this makes the Nix language specification depend on "whatever PCRE does". That's not a real problem for Nix, but might be a problem for people who want to reimplement it without having a dependency on a C library.

Copy link
Member

@thufschmitt thufschmitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this overall 👍 left a few inline comments, but nothing too big.

Since this is a potential breaking change, I would like to see it feature-gated for at least a release cycle. Maybe on by default, but with an easy off switch in case it's hurting people. That would require keeping the old code around of course, but that'd just be temporary.

@@ -76,6 +76,7 @@ void initGC();
struct RegexCache;

std::shared_ptr<RegexCache> makeRegexCache();
size_t regexCacheSize(std::shared_ptr<RegexCache> cache);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be replaced by a RegexCache::size() method?


class MatchData;

class Regex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't let yourself be blocked on that, but it would be nicer to move the pure "PCRE abstraction layer" part under libutil


} catch (RegexError & e) {
state.debugThrowLastTrace(EvalError({
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
.msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()),


} catch (RegexError & e) {
state.debugThrowLastTrace(EvalError({
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
.msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()),

Comment on lines +217 to +220
state.debugThrowLastTrace(EvalError({
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
.errPos = state.positions[pos]
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the above doesn't look so great, maybe rather something like

Suggested change
state.debugThrowLastTrace(EvalError({
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()),
.errPos = state.positions[pos]
}));
e.addTrace(state.positions[pos], "while evaluating regex '%s'", re);
state.debugThrowLastTrace(e);

@alyssais
Copy link
Member

Is there no portable library we could use that implements POSIX ERE? That would avoid compatibility problems between Nix versions, avoid the footguns of backtracking etc., and be something more reasonable to find multiple implementations of for Nix reimpls.

@oxalica
Copy link
Contributor

oxalica commented Nov 29, 2022

Is there no portable library we could use that implements POSIX ERE? That would avoid compatibility problems between Nix versions, avoid the footguns of backtracking etc., and be something more reasonable to find multiple implementations of for Nix reimpls.

boost also supports regex in many flavor, including POSIX ERE. Could it be a choice?

POSIX ERE is a relatively small and easy-to-implement standard, so that other languages usually also provide or have library of it.

@edolstra
Copy link
Member

boost::regex is essentially the same as std::regex, except that it has a dependency on icu4c which makes it very bloated.

@oxalica
Copy link
Contributor

oxalica commented Nov 29, 2022

boost::regex is essentially the same as std::regex, except that it has a dependency on icu4c which makes it very bloated.

Technically it should not, if we don't use the Unicode part. The issue is that currently boost is dynamically linked and have --with-icu by default...

While std::regex is known (at least for me) to be fragile and broken at different toolchain/platforms.

@tfc
Copy link
Contributor

tfc commented Nov 29, 2022

While std::regex is known (at least for me) to be fragile and broken at different toolchain/platforms.

std::regex is also known to be really slow. I was involved in a project where we, based on significant differences in performance after benchmarking it changed to google's re2 library. (not suggesting that it should be re2, but std::regex is simply slow)

@edolstra
Copy link
Member

The issue is that currently boost is dynamically linked and have --with-icu by default...

If we can leave out icu support using an override on boost, I'm fine with that.

@Ericson2314
Copy link
Member

One option is that we support both regexes during a migration period.

@roberth roberth added the bug label Dec 2, 2022
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2022-12-02-nix-team-meeting-minutes-13/23731/1

@alyssais
Copy link
Member

alyssais commented Dec 6, 2022

Another option for a portable POSIX ERE implementation might be musl's? It's permissively licensed, and only a few thousand lines total, so it would be reasonable for Nix to just bundle it, IMO.

@fricklerhandwerk
Copy link
Contributor

Discussed in the Nix team meeting on 2022-12-05:

  • decision: @yorickvP please try building boost without icu4c. the requirement is full backwards compatibility and minimal closure size.
Complete discussion
  • @edolstra: if we did ran tests against public flakes, we could use that to detect breakages on regexes
  • @Ericson2314: could have both implementations
    • @edolstra: strongly against having multiple implementations. this would be a bad architectural idea in terms of maintainability and elegance
    • @thufschmitt: could have both and a policy that one of them will be turned off at some point in the future
  • (more argument over reverting Nix versions and compatibility policy)
  • decision: ask author to try building boost without icu4c
    • in that case, we get around both constraints: compatibility and closure

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2022-12-05-nix-team-meeting-minutes-14/23836/1

@yorickvP
Copy link
Contributor Author

yorickvP commented Dec 8, 2022

boost builds without icu (it's mostly used for Boost.Locale afaict), but it requires an ugly overrideAttrs or a nixpkgs patch

(boost.override { icu = null; }).overrideAttrs (o: {
  configureFlags = (lib.take 4 o.configureFlags) ++ [ "--without-icu" ];
})

@oxalica
Copy link
Contributor

oxalica commented Dec 8, 2022

boost builds without icu (it's mostly used for Boost.Locale afaict), but it requires an ugly overrideAttrs or a nixpkgs patch

I created #205166 to simplify this.

@roberth
Copy link
Member

roberth commented Dec 13, 2022

While I do think fixing the stack behavior is a higher priority than extending the language with PCRE-specific features, adding PCRE may be feasible.

I believe PCRE was designed to process all EREs correctly, but I haven't found definitive evidence of this. Finding this would greatly help, so if anyone knows better where to look...

To some degree, any change in behavior could be a breaking change. (cue spacebar heating) However, I'd be willing to assume that nobody is validating ERE regexes by trying to use them in builtins.match regex "", in large part because tryEval can't catch a bad regex anyway. This should make the "boolean" whether the extended functionality is present unobservable from within the language.

So we have two possible approaches: either "prove" that PCRE processes EREs correctly, or use a good ERE library instead.

@thufschmitt
Copy link
Member

@roberth afaik, PCRE is mostly a superset of Posix EREs, but not strictly.
An example of an (obscure) ERE regex construct that is not PCRE one is character equivalents:

$ echo e | grep --extendes-regexp '[[=e=]]'
e
$ echo e | grep --perl-regexp '[[=e=]]'
grep: POSIX collating elements are not supported

Now the question is whether this really matters, and I'm not really convinced it does given how niche it is

@roberth
Copy link
Member

roberth commented Dec 13, 2022

An example of an (obscure) ERE regex construct that is not PCRE one is character equivalents:

Ooh, that's locale dependent. I didn't manage to exploit that as a potential impurity (but maybe I did it wrong). Maybe it's just not implemented in GNU stdc++, but it may be implemented in a library that does.

What helps in this case is that libpcre prints an error, and doesn't continue with a garbage regex. If anyone does rely on regexes with collating elements, they'll be able to detect the problem in time, and work around it by applying regexes to their, presumably foreign input, regexes. If character equivalents are already truly broken in Nix (which they should be, as it's impure), this is hardly a regression.
Similarly, collating sequences must not be supported; also locale dependent.

May have to dig through this...
https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816#feature-comparison
[:alpha:] looked like a potential deal breaker, but seems ok in grep.
So the bracket classes (bottom of page) seem kind of ok. Haven't looked at the rest yet.

@SuperSandro2000
Copy link
Member

Just my 2 cents on this:
boost is notorious for massively increasing compile times and for the 50k lines c++ code nix has, it takes very long to compile.
Not sure if buying more into boost is a good considering compile time.

@yorickvP
Copy link
Contributor Author

yorickvP commented Apr 4, 2023

Superseded by #7762

@yorickvP yorickvP closed this Apr 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project