Unicode ligature pairs like "fi" and "ss" in a lookbehind, plus -i flag, throws a "Variable length lookbehind not implemented" error #336

Open
elias6 opened this issue Mar 31, 2021 · 12 comments

@elias6

elias6 commented Mar 31, 2021

I am using ack 3.5.0.

If I run echo 'BROWNFOX' | ack -i '(?<!fire)fox' from my shell, I get this output:

ack: Invalid regex '(?i)(?<!fire)fox':
  Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at /usr/local/bin/ack line 602.

But strangely, if I run echo 'BROWNFOX' | ack -i '(?<!ice)fox', I get BROWNFOX as I would expect.

It seems like I only get the error if the lookbehind begins with a lowercase or uppercase f, and has at least one character after it. I do not get the error if I don't use -i.

@petdance
Collaborator

I think something in Perl is getting confused in the regex parser, and this is not an ack-specific problem. Here are some tests I've tried.

$ perl -E'$x = qr/(?<!ice)fox/'
$ perl -E'$x = qr/(?<!fire)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?<!fire)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?i)(?<!big)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?i)(?<!fre)fox/'
$ perl -E'$x = qr/(?i)(?<!dog)fox/'
$ perl -E'$x = qr/(?i)(?<!dig)fox/'
$ perl -E'$x = qr/(?i)(?<!fig)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fig)fox/ at -e line 1.

@petdance
Collaborator

petdance commented Mar 31, 2021

It looks like the problem is that fi with /i is seen as variable length, as discussed here: https://stackoverflow.com/questions/50356241/variable-length-lookbehind-not-implemented-but-it-isnt-variable-length

Thanks to @wolfsage for pointing me to the StackOverflow answer.

@petdance
Collaborator

So it looks like the fix is that ack needs to add /aa on the regexes it makes. This will stop it from matching ligatures like it did in the past, but I'm OK with that.

@n1vux
Contributor

n1vux commented Mar 31, 2021

Interestingly, this error comes and goes with the version of Perl:
perlbrew exec perl -e 'print 1 if q(BROWNFOX) =~ /(?<!fire)fox/i'

  • works fine on Perl 5.6 through 5.16.3
  • fails to compile on 5.17.11 through 5.29.5
  • works with a warning on 5.30.0, where variable-length lookbehind is now experimental

Perl 5.30 gives Variable length lookbehind is experimental in regex; marked by <-...

(With -E this fails on Perl 5.6 through 5.8.x, of course. Adding /aa works on 5.16+; I presume it also works on 5.14, where it was added, but I don't have that in my Perlbrew farm. Of course, /aa fails on 5.6 through 5.12.)

@n1vux
Contributor

n1vux commented Mar 31, 2021

So for Perl 5.12 or earlier, we do nothing;
for 5.14+, we insert /aa
(should we determine which 5.13.x release it was introduced in, just to be right?)

@n1vux
Contributor

n1vux commented Mar 31, 2021

This /aa fix may well break the Unicode wide-character workarounds I'd offered folks in the past?

An end-user workaround that @elias6 can use immediately for this edge case is to wrap the regex on the command line in (?aa:...) or prefix it with (?aa).

@n1vux
Contributor

n1vux commented Mar 31, 2021

compare #222 #153 #262 #258 to see offered workarounds and conflicting feature requests
... and whole "Unicode" tag in Issues https://github.com/beyondgrep/ack3/issues?q=is%3Aissue+utf+label%3Aunicode

@elias6
Author

elias6 commented Mar 31, 2021

Hmm... maybe it does have something to do with ligatures. I get the error when I run ack -i '(?<!ff)', ack -i '(?<!fi)', and ack -i '(?<!fl)', but not ack -i '(?<!fx)'.

This is what I see when I run ack --version:

ack v3.5.0 (standalone version)
Running under Perl v5.18.4 at /usr/bin/perl

@elias6
Author

elias6 commented Mar 31, 2021

@n1vux thanks for offering your workaround. I think my use case is complex enough that it is not worth figuring out how to use it. I have been just doing ack -i '(?<!.ire)fox' and manually picking out the strings I'm looking for.

@Grinnz

Grinnz commented Mar 31, 2021

If text is intended to be matched as ASCII bytes only then applying the aa modifier universally on Perl 5.14+ may be warranted. For example, the byte 0xA0 read into a Perl string without decoding will be interpreted as the unicode character U+00A0 NO-BREAK SPACE when matching with unicode rules, and so \s may match it. But this byte only represents this character if the file happened to be encoded in ISO-8859-1 because that happens to correspond to the unicode mapping. If the file is not being decoded from bytes into characters, \s should not match unicode space characters, even those within the range of possible bytes, and the a/aa modifier achieves this.

On the other hand, if there are instances where the file contents get decoded before matching against the regex, and thus unicode matching is expected to work, the a/aa modifier would disable that ability.

@petdance
Collaborator

Bill, thanks for pointing out the other Unicode-related tickets. It may be that "Can't we just add /aa all over?" is opening a bigger can of worms.

@petdance added the bug label Mar 31, 2021
@petdance changed the title from 'Strange "Variable length lookbehind not implemented" error' to 'Unicode ligature pairs like "fi" and "ss" in a lookbehind, plus -i flag, throws a "Variable length lookbehind not implemented" error' Mar 31, 2021
@n1vux
Contributor

n1vux commented Apr 12, 2021

While we stand s(t)olidly on the assumption that ack is for source code, and that any natural-language use is "off label", Perl and other languages permit Unicode (typically UTF-8) in source files, including in identifiers and not just in strings and comments. So we really do need to support Unicode, at a minimum for adequately scanning Unicode::Tussle's POD and source 😄 (ref to the tcgrep website ticket above).

Adding an --(no-)ascii flag to ack (which can be set on or off in .ackrc and reversed on the command line), so that the user can decide whether they want flat ASCII or Unicode, may be useful and even necessary. (This flag would also be the opposite of a --unicode flag that selected UTF-8 vs. UTF-16/32 and byte order, if we ever expand to support such messes?)

This issue and the 4 that i mentioned up thread should ALL be tagged with the unicode label here?
