Nested regex captures #77

Open
pkoppstein opened this issue May 19, 2023 · 4 comments
Labels
help wanted Extra attention is needed

Comments

@pkoppstein

jq and gojq agree:

$ jq -R 'scan("^(([^:]+): *(.*))?")' <<< '((a):(b))'
[
  "((a):(b))",
  "((a)",
  "(b))"
]

$ gojq -R 'scan("^(([^:]+): *(.*))?")' <<< '((a):(b))'
[
  "((a):(b))",
  "((a)",
  "(b))"
]

But:

$ jaq -R 'scan("^(([^:]+): *(.*))?")' <<< '((a):(b))'
"((a):(b))"
@01mf02 01mf02 added the "help wanted" label Jun 26, 2023
@01mf02
Owner

01mf02 commented Jun 26, 2023

This is a known restriction of the regex library that jaq uses. As suggested in the linked post, jaq could use Rust bindings to oniguruma, the regex library behind jq. This would increase compatibility with jq, but it would require figuring out how linking to a C library affects the build process of jaq, for example whether jaq could still be built for WASM. The impact on performance should also be measured. And of course, the current regex routines of jaq would have to be ported to oniguruma. If someone is interested in doing all this, I would consider merging a corresponding PR.

@01mf02 01mf02 changed the title from "scan/1" to "Overlapping regex captures" Jun 26, 2023
@pkoppstein
Author

pkoppstein commented Jun 26, 2023

For reference here’s the link to onig, the relevant Rust crate:

https://crates.io/crates/onig

@pkoppstein pkoppstein changed the title from "Overlapping regex captures" to "Nested regex captures" Jun 29, 2023
@kklingenberg
Contributor

FYI, there's another crate based on regex that provides the look-around feature needed for this: https://docs.rs/fancy-regex/latest/fancy_regex/. It hasn't reached 1.0 as of this comment.

@kklingenberg
Contributor

kklingenberg commented Feb 17, 2024

There's something that doesn't quite add up for me. I can see that jaq is missing the look-around feature in regexes, but I don't think that's related to the behaviour mismatch in the given examples.

The manual defines scan as (emphasis added):

Emit a stream of the non-overlapping substrings of the input that match the regex [...]

I believe the issue raised by @pkoppstein is actually about the definition of scan, which jaq seems to follow (and jq and gojq maybe don't). This is further supported by swapping scan for match, where jaq agrees with jq:

$ jq -R 'match("^(([^:]+): *(.*))?")' <<< '((a):(b))'
{
  "offset": 0,
  "length": 9,
  "string": "((a):(b))",
  "captures": [
    {
      "offset": 0,
      "length": 9,
      "string": "((a):(b))",
      "name": null
    },
    {
      "offset": 0,
      "length": 4,
      "string": "((a)",
      "name": null
    },
    {
      "offset": 5,
      "length": 4,
      "string": "(b))",
      "name": null
    }
  ]
}
$ jaq -R 'match("^(([^:]+): *(.*))?")' <<< '((a):(b))'
{
  "offset": 0,
  "length": 9,
  "string": "((a):(b))",
  "captures": [
    {
      "offset": 0,
      "length": 9,
      "string": "((a):(b))"
    },
    {
      "offset": 0,
      "length": 4,
      "string": "((a)"
    },
    {
      "offset": 5,
      "length": 4,
      "string": "(b))"
    }
  ]
}

I interpret this as jq actually finding a single match in the input. The issue, then, is that jaq's scan yields that single match, whereas jq's and gojq's versions yield something else (the captures?).
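
To make the two readings concrete, here is a minimal Python sketch. It uses Python's `re` module purely as a stand-in and is not the implementation of jq, gojq, or jaq; the function names `scan_whole_match` and `scan_captures` are hypothetical. The first function models a scan that emits only the matched substring, which is consistent with the jaq output above. The second models a scan that emits the list of capture groups whenever the regex has any, which is consistent with the jq and gojq output above.

```python
import re

PATTERN = r"^(([^:]+): *(.*))?"
TEXT = "((a):(b))"


def scan_whole_match(pattern, text):
    """Hypothetical model: emit only the matched substring of each match."""
    for m in re.finditer(pattern, text):
        yield m.group(0)


def scan_captures(pattern, text):
    """Hypothetical model: emit the capture groups as a list when the
    pattern defines any, otherwise fall back to the matched substring."""
    for m in re.finditer(pattern, text):
        if m.re.groups > 0:
            yield list(m.groups())
        else:
            yield m.group(0)


if __name__ == "__main__":
    print(list(scan_whole_match(PATTERN, TEXT)))
    print(list(scan_captures(PATTERN, TEXT)))
```

Under this reading there is only one match in both cases; the nested groups simply show up as three captures of that one match, so no look-around support is involved.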

EDIT: I see now that jaq's scan is not really scanning the whole input unless given the g flag:

$ jaq -R 'scan("a")' <<< 'aaa'
"a"
$ jaq -R 'scan("a"; "g")' <<< 'aaa'
"a"
"a"
"a"

So that's another difference from jq, which seems to have the g flag on by default for scan.
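
The effect of the flag can be sketched the same way. This is again a Python model built on `re` under stated assumptions, not code from jaq or jq: the hypothetical `scan` below stops after the first match unless the flags string contains "g", which reproduces the two jaq transcripts above. A scan that behaved as if "g" were always set would return all three matches in both calls.

```python
import re


def scan(pattern, text, flags=""):
    """Hypothetical model of a scan whose "g" flag is off by default."""
    matches = re.finditer(pattern, text)
    if "g" not in flags:
        # Without "g", only the first match (if any) is emitted.
        first = next(matches, None)
        return [] if first is None else [first.group(0)]
    # With "g", every non-overlapping match is emitted.
    return [m.group(0) for m in matches]


if __name__ == "__main__":
    print(scan("a", "aaa"))
    print(scan("a", "aaa", "g"))
```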
