New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use PCRE for builtins.match
and builtins.split
#7336
Conversation
nix-repl> builtins.match "(?<date>(?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d))" "2020 - 10 - 10" { date = "2020 - 10 - 10"; day = "10"; month = "10"; year = "2020"; }
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/tweag-nix-dev-update-40/23480/1 |
Returns a list composed of non matched strings interleaved with the | ||
lists of the [extended POSIX regular | ||
expression](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04) | ||
*regex* matches of *str*. Each item in the lists of matched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PCRE is not POSIX ERE.
Also worth mentioning that PCRE supports arbitrary look-around, which can cause (maybe accidental) exponential run time. Maybe we should limit the input to a reasonable subset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it should point to https://www.pcre.org/original/doc/html/pcrepattern.html.
int errorcode; | ||
PCRE2_SIZE erroffset; | ||
|
||
code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr); | |
code = pcre2_compile(static_cast<const unsigned char*>(re.data()), re.length(), 0, &errorcode, &erroffset, nullptr); |
name_table.reserve(namecount); | ||
for (size_t i = 0; i < namecount; i++) { | ||
int n = tabptr[0] << 8 | tabptr[1]; | ||
name_table.emplace_back((const char*)(tabptr+2), n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name_table.emplace_back((const char*)(tabptr+2), n); | |
name_table.emplace_back(static_cast<const char*>(tabptr+2), n); |
{ | ||
friend class MatchData; | ||
protected: | ||
pcre2_code* code; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you made this a std::unique_ptr
, then you wouldn't need to manually call pcre2_code_free
in the destructor. In addition to that, the constructor function could not leak the successfully compiled pcre2 resource in case of exceptions.
relevant snippets:
// member declaration
using pcre_ptr = std::unique_ptr<Bar, void(*)(pcre2_code*)>;
pcre_ptr code;
// constructor:
Regex(std::string_view re) {
...
code = pcre_ptr(pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr), pcre2_code_free);
}
...
|
||
void compile() | ||
{ | ||
assert(pcre2_jit_compile(code, PCRE2_JIT_COMPLETE) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i suggest putting the call to pcre2_jit_compile
outside the assert
and store its result in a variable. then, assert on the variable's value.
The reason i suggest that is that asserts are meant to be optimized out of production builds, but that would then remove the compile call...
MatchData(MatchData&&) = delete; | ||
~MatchData() | ||
{ | ||
pcre2_match_data_free(match); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::unique_ptr
with a custom destructor as above would be great.
MatchData(Regex& re) noexcept | ||
: re(re) | ||
{ | ||
match = pcre2_match_data_create_from_pattern(re.code, NULL); | ||
size_ = pcre2_get_ovector_count(match); | ||
ovector = pcre2_get_ovector_pointer(match); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MatchData(Regex& re) noexcept | |
: re(re) | |
{ | |
match = pcre2_match_data_create_from_pattern(re.code, NULL); | |
size_ = pcre2_get_ovector_count(match); | |
ovector = pcre2_get_ovector_pointer(match); | |
}; | |
MatchData(Regex& re) noexcept | |
: re{re} | |
, match{pcre2_match_data_create_from_pattern(re.code, NULL)} | |
, size_{pcre2_get_ovector_count(match)} | |
, ovector {pcre2_get_ovector_pointer(match)} | |
{ | |
}; |
plus the reordering comment from above
v.mkAttrs(bindings); | ||
} else { | ||
// the first match is the whole string | ||
const size_t len = match.size() - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels like asserting that the size is > 0
would increase the safety of this function
if (!match[i+1].has_value()) | ||
(v.listElems()[i] = state.allocValue())->mkNull(); | ||
else | ||
(v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!match[i+1].has_value()) | |
(v.listElems()[i] = state.allocValue())->mkNull(); | |
else | |
(v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]); | |
auto *val = state.allocValue() | |
v.listElems()[i] = val; | |
if (!match[i+1].has_value()) | |
val->mkNull(); | |
else | |
val->mkString(*match[i + 1]); |
|
||
void prim_match(EvalState & state, const PosIdx pos, Value * * args, Value & v) | ||
{ | ||
auto re = state.forceStringNoCtx(*args[0], pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add some asserts here?
non-matching parts interleaved by the lists of the matching groups. */ | ||
void prim_split(EvalState & state, const PosIdx pos, Value * * args, Value & v) | ||
{ | ||
auto re = state.forceStringNoCtx(*args[0], pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, asserts would improve debuggability if we ever end up with code that inserts nullptrs
result.push_back(prefix); | ||
|
||
// Add a list for matched substrings. | ||
auto elem = state.allocValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code is a bit inconsistent in choosing the type for allocValue
calls. It's either Value*
or auto
and it would be nice to choose one variant and stick to it.
Looks good to me. The main objection is that this makes the Nix language specification depend on "whatever PCRE does". That's not a real problem for Nix, but might be a problem for people who want to reimplement it without having a dependency on a C library. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this overall 👍 left a few inline comments, but nothing too big.
Since this is a potential breaking change, I would like to see it feature-gated for at least a release cycle. Maybe on by default, but with an easy off switch in case it's hurting people. That would require keeping the old code around of course, but that'd just be temporary.
@@ -76,6 +76,7 @@ void initGC(); | |||
struct RegexCache; | |||
|
|||
std::shared_ptr<RegexCache> makeRegexCache(); | |||
size_t regexCacheSize(std::shared_ptr<RegexCache> cache); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be replaced by a RegexCache::size()
method?
|
||
class MatchData; | ||
|
||
class Regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't let yourself be blocked on that, but it would be nicer to move the pure "PCRE abstraction layer" part under libutil
|
||
} catch (RegexError & e) { | ||
state.debugThrowLastTrace(EvalError({ | ||
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | |
.msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()), |
|
||
} catch (RegexError & e) { | ||
state.debugThrowLastTrace(EvalError({ | ||
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | |
.msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()), |
state.debugThrowLastTrace(EvalError({ | ||
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | ||
.errPos = state.positions[pos] | ||
})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the above doesn't look so great, maybe rather something like
state.debugThrowLastTrace(EvalError({ | |
.msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | |
.errPos = state.positions[pos] | |
})); | |
e.addTrace(state.positions[pos], "while evaluating regex '%s'", re); | |
state.debugThrowLastTrace(e); |
Is there no portable library we could use that implements POSIX ERE? That would avoid compatibility problems between Nix versions, avoid the footguns of backtracking etc., and be something more reasonable to find multiple implementations of for Nix reimpls. |
POSIX ERE is a relatively small and easy-to-implement standard, so that other languages usually also provide or have library of it. |
|
Technically it should not, if we don't use the Unicode part. The issue is that currently While |
|
If we can leave out icu support using an override on boost, I'm fine with that. |
One option is that we support both regexes during a migration period. |
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2022-12-02-nix-team-meeting-minutes-13/23731/1 |
Another option for a portable POSIX ERE implementation might be musl's? It's permissively licensed, and only a few thousand lines total, so it would be reasonable for Nix to just bundle it, IMO. |
Discussed in the Nix team meeting on 2022-12-05:
Complete discussion
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2022-12-05-nix-team-meeting-minutes-14/23836/1 |
boost builds without icu (it's mostly used for Boost.Locale afaict), but it requires an ugly (boost.override { icu = null; }).overrideAttrs (o: {
configureFlags = (lib.take 4 o.configureFlags) ++ [ "--without-icu" ];
}) |
I created #205166 to simplify this. |
While I do think fixing the stack behavior is a higher priority than extending the language with PCRE-specific features, adding PCRE may be feasible. I believe PCRE was designed to process all EREs correctly, but I haven't found definitive evidence of this. Finding this would greatly help, so if anyone knows better where to look... To some degree, any change in behavior could be a breaking change. (cue spacebar heating) However, I'd be willing to assume that nobody is validating ERE regexes by trying to use them in builtins.match regex "", in large part because tryEval can't catch a bad regex anyway. This should make the "boolean" whether the extended functionality is present unobservable from within the language. So we have two possible approaches: either "prove" that PCRE processes EREs correctly, or use a good ERE library instead. |
@roberth afaik, PCRE is mostly a superset of Posix EREs, but not strictly. $ echo e | grep --extendes-regexp '[[=e=]]'
e
$ echo e | grep --perl-regexp '[[=e=]]'
grep: POSIX collating elements are not supported Now the question is whether this really matters, and I'm not really convinced it does given how niche it is |
Ooh, that's locale dependent. I didn't manage to exploit that as a potential impurity (but maybe I did it wrong). Maybe it's just not implemented in GNU stdc++, but it may be implemented in a library that does. What helps in this case is that libpcre prints an error, and doesn't continue with a garbage regex. If anyone does rely on regexes with collating elements, they'll be able to detect the problem in time, and work around it by applying regexes to their, presumably foreign input, regexes. If character equivalents are already truly broken in Nix (which they should be, as it's impure), this is hardly a regression. May have to dig through this... |
Just my 2 cents on this: |
Superseded by #7762 |
Fixes #2147, fixes #4758
See also #3826.
This is not technically fully compatible, but I can't find a builtins.match invocation in the wild that doesn't work, since programs could only rely on quite a limited subset anyways.
Performance is similar because no one really used a lot of regexes, but this should theoretically be faster.
Additionally adds support for named captures, so you can do