Support for PCRE2_SUBSTITUTE_EXTENDED #14

Open
Sharparam opened this issue May 7, 2019 · 7 comments

@Sharparam

Something I've always missed in the built-in regex engine of .NET is the ability to do case transformations in the replacement string. PCRE2 supports this by supplying an extra option (docs, Ctrl+F for PCRE2_SUBSTITUTE_EXTENDED).

It would be great if it was possible to supply this "substitute extended" option in some way in PCRE.NET.

@ltrzesniewski
Owner

The main issue here is that PCRE.NET implemented string replacement before PCRE had this feature, and subsequently kept it. It doesn't use PCRE's substitution feature at all. The whole replacement feature would need to be rewritten for this, and it would probably be slower because of memory management issues that would be introduced by this approach.

But PCRE.NET provides replacement with a callback, just like .NET regex does, so you can provide a lambda which does the case transformation.

I mean something like this:

PcreRegex.Replace("...", "...", m => m.Value.ToLower());

@Sharparam
Author

Sharparam commented May 7, 2019

Yeah, when you just want to convert the entire match to a single case, it's easy enough. The tricky part is with more complex replacement strings, like:

input: "bob and alice sent messages to each other"
pattern: "(?<first>\w+) and (?<second>\w+) (.+)"
replace: "\U${first}\E and \U${second}\E ${3}"

I guess one would have to write some kind of in-between parser to turn the replacement string into a custom Func<PcreMatch, string>?

On an unrelated note: It seems the $n shorthand for referring to a capture group doesn't work? Using $3 in place of ${3} above just outputs $3 verbatim.

Edit: For context, this is to support cases where the replacement string is not built in code, but provided from an outside source. So it's not trivial to just write a lambda func directly.
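
A minimal sketch of such an in-between parser, for illustration only. `CaseTemplate` and `Compile` are hypothetical names, not part of PCRE.NET, and the syntax handled here (`${name}`, `${n}`, `\U`, `\L`, `\E`) is only a small subset that does not claim to match PCRE2's extended substitution rules. To stay self-contained, the compiled delegate takes a group lookup function instead of a PcreMatch, and it assumes invariant-culture case mapping.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of PCRE.NET.
public static class CaseTemplate
{
    private enum Mode { None, Upper, Lower }

    // Parses the template once and returns a delegate that renders it.
    // The delegate's argument maps a group name or number (as text) to its value.
    public static Func<Func<string, string>, string> Compile(string template)
    {
        var parts = new List<(bool IsGroup, string Text, Mode Mode)>();
        var literal = new StringBuilder();
        var mode = Mode.None;

        // Stores pending literal text together with the case mode active so far.
        void Flush()
        {
            if (literal.Length == 0) return;
            parts.Add((false, literal.ToString(), mode));
            literal.Clear();
        }

        for (var i = 0; i < template.Length; i++)
        {
            var c = template[i];
            if (c == '\\' && i + 1 < template.Length && "ULE".IndexOf(template[i + 1]) >= 0)
            {
                Flush(); // text before the switch keeps the previous mode
                mode = template[i + 1] switch
                {
                    'U' => Mode.Upper,
                    'L' => Mode.Lower,
                    _ => Mode.None // \E ends the current case mode
                };
                i++;
            }
            else if (c == '$' && i + 1 < template.Length && template[i + 1] == '{')
            {
                var close = template.IndexOf('}', i + 2);
                if (close < 0) { literal.Append(c); continue; } // unterminated: keep as text
                Flush();
                parts.Add((true, template.Substring(i + 2, close - i - 2), mode));
                i = close;
            }
            else
            {
                literal.Append(c);
            }
        }
        Flush();

        return lookup =>
        {
            var sb = new StringBuilder();
            foreach (var p in parts)
            {
                var text = p.IsGroup ? lookup(p.Text) : p.Text;
                sb.Append(p.Mode switch
                {
                    Mode.Upper => text.ToUpperInvariant(),
                    Mode.Lower => text.ToLowerInvariant(),
                    _ => text
                });
            }
            return sb.ToString();
        };
    }
}

public static class CaseTemplateDemo
{
    public static void Main()
    {
        var groups = new Dictionary<string, string>
        {
            ["first"] = "bob",
            ["second"] = "alice",
            ["3"] = "sent messages to each other"
        };
        var render = CaseTemplate.Compile(@"\U${first}\E and \U${second}\E ${3}");
        Console.WriteLine(render(key => groups[key]));
        // prints "BOB and ALICE sent messages to each other"
    }
}
```

With PCRE.NET, the lookup could presumably be wired to the match inside the callback overload, along the lines of `m => render(key => int.TryParse(key, out var n) ? m[n].Value : m[key].Value)`. That wiring is untested here and assumes the indexers behave as in the examples in this thread.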

@ltrzesniewski
Owner

> The tricky part is with more complex replacement strings

It's not that tricky:

PcreRegex.Replace("...", "...", m => $"{m["first"].Value.ToUpper()} and {m["second"].Value.ToUpper()} {m[3]}");

> I guess one would have to write some kind of in-between parser to turn the replacement string into a custom Func<PcreMatch, string>?

That's exactly what PCRE.NET does actually. If you really need this feature quickly, maybe you could use this code as a starting point to build something that matches your needs.

> For context, this is to support cases where the replacement string is not built in code, but provided from an outside source. So it's not trivial to just write a lambda func directly.

OK, that makes sense.

> On an unrelated note: It seems the $n shorthand for referring to a capture group doesn't work? Using $3 in place of ${3} above just outputs $3 verbatim.

That's weird, it's supposed to work. I'll check that.

Anyway, another reason I didn't implement this is that I didn't want to have two incompatible replacement pattern syntaxes. I'm not sure what would be best here, I need to think about it.

@boolbag

boolbag commented May 7, 2019

Hi guys,

hope your week started great.

As someone who has been watching pcre-net for years and is currently writing a regex replacement grammar and maintains a number of PCRE pages, I feel compelled to comment on this interesting question.

  • writing a replacement grammar is not hard. The problem is knowing when to stop. Case conversion is just the beginning of an infinite number of ways one might want to manipulate capture groups.
  • even though writing a replacement grammar is "not hard", the hard part is making a grammar that would be an exact mirror of the one supported by PCRE2. For instance, there is no guarantee that PCRE2 understands \U in the same way as .NET understands ToUpper() (think of everything that can go wrong with Unicode).
  • for these reasons if this repo ever explores an extended replacement grammar, my feeling is that it shouldn't be a "homegrown" flavor to fit both shoes, but a parallel track (alternative functions) that directly plug into the PCRE2 API.
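
To make the Unicode concern concrete on the .NET side only, here is a small sketch showing that the result of upper-casing in .NET depends on which culture is used. `CaseMappingDemo` and `Upper` are hypothetical names for illustration. The sketch says nothing about what PCRE2 itself does for `\U`, and it assumes a runtime with culture data available (not invariant-globalization mode).

```csharp
using System;
using System.Globalization;

// Hypothetical demo class; only standard .NET APIs are used.
public static class CaseMappingDemo
{
    // Upper-cases text using the rules of the given culture.
    public static string Upper(string text, CultureInfo culture) =>
        text.ToUpper(culture);

    public static void Main()
    {
        var invariant = CultureInfo.InvariantCulture;
        var turkish = CultureInfo.GetCultureInfo("tr-TR");

        // The same input can map to different characters per culture:
        // Turkish rules distinguish dotted and dotless i.
        Console.WriteLine(Upper("i", invariant));
        Console.WriteLine(Upper("i", turkish));

        // Parameterless ToUpper() uses the current culture, so its output
        // can vary with the machine or thread the code runs on.
        Console.WriteLine("i".ToUpper());
    }
}
```

With culture data available, the invariant call yields "I" while the tr-TR call yields "İ" (U+0130), so a homegrown `\U` built on ToUpper() would need to pick a culture explicitly, and there is no guarantee that the choice lines up with PCRE2's behavior.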

Wishing you both a great day.

@ltrzesniewski
Owner

> if this repo ever explores an extended replacement grammar, my feeling is that it shouldn't be a "homegrown" flavor to fit both shoes, but a parallel track (alternative functions) that directly plug into the PCRE2 API

Agreed. I only mirrored .NET's replacement pattern syntax and never intended to replicate PCRE2's. If I do anything about this, I'll integrate PCRE2's feature directly.

@boolbag

boolbag commented May 7, 2019

By the way thank you for maintaining this project all this time, Lucas! A huge number of people are indebted to you. :)

@ltrzesniewski
Owner

> On an unrelated note: It seems the $n shorthand for referring to a capture group doesn't work? Using $3 in place of ${3} above just outputs $3 verbatim.

I fixed this in v0.10.1. There was a bug when $n was the last element of the replacement pattern.
