Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiplication and division in Proposal A #42

Open
glyn opened this issue Jun 2, 2020 · 31 comments
Open

Multiplication and division in Proposal A #42

glyn opened this issue Jun 2, 2020 · 31 comments
Labels
Proposal A Anything discussing the advance of the Proposal A for a formal specification of JSONPath

Comments

@glyn
Copy link
Collaborator

glyn commented Jun 2, 2020

Proposal A lists multiplication and division as "to dos". I'm nervous that this is the start of scripting and once these operations are introduced, the proposal may be drawn into implementing some not particularly well scoped subset of JavaScript.

Can we instead agree that scripting is out of scope and drop multiplication and division? If not, how/where do we draw the line?

@jchesterpivotal
Copy link

jchesterpivotal commented Jun 2, 2020

I agree that multiplication and division seem a bit out of place and open an enticing door. It's already hard enough to verify that something isn't Turing-complete for safety purposes. Any operation that adds choice or change makes it more likely that unintended completeness will be achieved.

@glyn
Copy link
Collaborator Author

glyn commented Jun 2, 2020

@jchesterpivotal Perhaps that particular horse has already bolted since JSONPath filters include logical expressions. See accidentally Turing complete.

@jchesterpivotal
Copy link

Even now Turing's ghost taunts us. It's enough to drive a man to Church.

@cburgmer
Copy link
Owner

cburgmer commented Jun 3, 2020

Full agree here.

The TODO part I think points to the fact that Proposal A does return a valid response (an empty list) instead of NOT_SUPPORTED. The problem is that @.key/10 is parsed as a valid dot notation, which I think it shouldn't.

Hi @jchesterpivotal, thanks for chiming in!

@cburgmer
Copy link
Owner

cburgmer commented Jun 3, 2020

Thanks @glyn, please keep those issues coming!

cburgmer added a commit that referenced this issue Jun 3, 2020
…ons in filter expressions. '-' is actually part of the path. #42
@glyn
Copy link
Collaborator Author

glyn commented Jun 4, 2020

The problem is that @.key/10 is parsed as a valid dot notation, which I think it shouldn't.

Hmmm, I'm not convinced, for three reasons.

Firstly, the same would then have to apply to $.key/10 for consistency. Jayway and Goessner (online here), as well as yaml-jsonpath (currently online here) match $.key/10 in:

{"key/10": "x"}

Going back to @.key/10 the results aren't so consistent. yaml-jsonpath matches $[?(@.key/10)] in:

[{"key/10": "x"}]

whereas Jayway fails to parse the path with Could not find matching close for / when parsing regex in : $[?(@.key/10)] (!) and Goessner parses the path but fails to match it.

That said, the above could be handled using bracket child notation which presumably should accept arithmetic operators.

Secondly, it seems somewhat arbitrary to reject just the basic four arithmetic operators.

Thirdly, limiting the check to these four operators effectively acknowledges the existence of implementations which support these operators, but not of other implementations with a broader set of operators.

@cburgmer
Copy link
Owner

cburgmer commented Jun 4, 2020

Yes, I see the problem here. Maybe this one is best tackled by not looking at the consensus too strictly, but come up with a good reasoning to which characters are allowed, and which ones are not (and why, of course).

The initial Regex was trying to exclude characters that would interfere with the existing parsing rules, but then I felt being too generous was making it harder for future additions to the syntax.

@glyn
Copy link
Collaborator Author

glyn commented Jun 4, 2020

Yes, I see the problem here. Maybe this one is best tackled by not looking at the consensus too strictly, but come up with a good reasoning to which characters are allowed, and which ones are not (and why, of course).

+1

The initial Regex was trying to exclude characters that would interfere with the existing parsing rules, but then I felt being too generous was making it harder for future additions to the syntax.

I have precisely the same feeling and yet JSON is pretty permissive. Thankfully, we have bracket child notation as a fall-back if we end up being a bit too restrictive on the dot child front. I would have liked to keep the child name syntax uniform between bracket child and dot child, but I've come to realise this is unrealistic and that the bracket child notation was almost certainly introduced precisely to support a broader character set in child names.

As you observe, it's easier to relax restrictions in future than to tighten them up, so as long as we don't rule out a big chunk of current usages, we can probably be pretty restrictive.

How about starting with identifier syntax?

@remorhaz
Copy link
Collaborator

remorhaz commented Jun 4, 2020

We have a consensus in non-ASCII symbols, so we should probably start from supporting Unicode. And Unicode has a good standard for identifiers: https://unicode.org/reports/tr31/

@remorhaz
Copy link
Collaborator

remorhaz commented Jun 4, 2020

I have something like this in my implementation's lexer for a NAME token, for example:

/[\p{ID_Start}_][\p{ID_Continue}\$\-@]*/

@cburgmer
Copy link
Owner

cburgmer commented Jun 4, 2020

That's amazing @remorhaz
I still need to read through the entirety of this document, but my understanding is that we could rely solely on Unicode to define this for us?

I've applied this locally to the Proposal A's grammar without your modifications to the two classes:

DotChildName
  = $(char:.&{ return /\p{ID_Start}/u.test(char) })
    $((char:.&{ return /\p{ID_Continue}/u.test(char) })*)
    { return text(); }

This will fail some of the queries:

  • $.key-dash
  • $.$
  • $.2
  • $.-1
  • $.2
  • $[?(@.key-50==-100)]

@cburgmer
Copy link
Owner

cburgmer commented Jun 4, 2020

I've come to realise this is unrealistic and that the bracket child notation was almost certainly introduced precisely to support a broader character set in child names

+1

@remorhaz
Copy link
Collaborator

remorhaz commented Jun 8, 2020

is that we could rely solely on Unicode to define this for us?
I'd say, we can rely on it (and probably other Unicode classes) as a basis.

See EcmaScript identifier description for an example. They combine UnicodeLetter non-terminal (combined of several Unicode properties) with some selected ASCII chars to construct their own IdentifierStart. We can do the same using either ID_Start/ID_Continue properties or using another property set. In fact, the latter could be better because JsonPath "identifiers" are not exactly what identifiers are in most languages. My point is just using any Unicode properties we need if possible instead of inveniting our own character sets from scratch.

@glyn
Copy link
Collaborator Author

glyn commented Jun 8, 2020

Following on from the idea of only allowing some kind of identifier as a dotted child name:

This will fail some of the queries:

  • $.key-dash
  • $.$
  • $.2
  • $.-1
  • $[?(@.key-50==-100)]

The child names like key-dash and 2 look likely to crop up in the wild.

We might be able to allow leading digits so that $.2 is valid.

But I don't see how we can allow $.key-dash without also allowing $.key-50 (and therefore allowing @.key-50 too).

Also, I don't see how we can allow $.-1 without also allowing $.key-50.

Not sure about $ in dotted child names.

@cburgmer
Copy link
Owner

cburgmer commented Jun 8, 2020

We might be able to allow leading digits so that $.2 is valid.

Do note that in the current implementation of Proposal A, $.2 would only return a value for {"2": 42}, not for [0, 1, 42].

@remorhaz
Copy link
Collaborator

remorhaz commented Jun 8, 2020

Not sure about $ in dotted child names.

There's no consensus on it, though it's not difficult to implement and I'd rather vote for it.

@cburgmer cburgmer added the Proposal A Anything discussing the advance of the Proposal A for a formal specification of JSONPath label Jun 8, 2020
@cburgmer
Copy link
Owner

cburgmer commented Jun 8, 2020

I feel we should try to allow all common identifier patterns found across the programming languages that currently implement JSONPath (if that's possible without overlapping with syntax elements too much).
$ and - for example seem to be used a lot in some languages, it might make sense to go look for some more.

@remorhaz
Copy link
Collaborator

remorhaz commented Jun 9, 2020

Maybe it's worth trying to use /\p{ID_Continue}+/ as a basis? At least it will allow such identifiers as $.2. Since we don't really need to distinguish between numeric literals and identifiers here, we can avoid using ID_Start for first character of identifier, and probably we can use the same pattern for any character of identifier.

@glyn
Copy link
Collaborator Author

glyn commented Jun 9, 2020

Sounds reasonable if ID_Continue includes -. Does it?

@remorhaz
Copy link
Collaborator

remorhaz commented Jun 9, 2020

Sounds reasonable if ID_Continue includes -. Does it?

No, it doesn't. We should add any additional symbols explicitly to the symbol class, like /[\p{ID_Continue}\-]+/

@remorhaz
Copy link
Collaborator

remorhaz commented Jun 9, 2020

What's important to know about ID_* - these properties already exclude lots of symbols, that are technically letters, but are either narrow-specific (historic alphabets) either not a good choice for an ID (vertical scripts like Mongolian, etc.).

Unicode standard describes identifiers in the following way:

...identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores.

In JsonPath we probably don't need to restrict the first symbol, so we can just consider identifier as "non-empty sequence of letters, ideographs, digits, uderscores AND $, -, etc.". Unicode standard allows us to act this way:

...(implementation) shall declare that it uses a profile and define that profile with a precise specification of the characters that are added to or removed from the sets of code points defined by... properties.

@bhmj
Copy link

bhmj commented Aug 3, 2020

But I don't see how we can allow $.key-dash without also allowing $.key-50 (and therefore allowing @.key-50 too).

Maybe we should instead define a simple set of characters which are not allowed in an identifier, like $ @ + - * / & | ! < = > [ ] ( ) because they have a special meaning in jsonpath and thus any identifier containg those should be referred to using bracket notation? It's always easier to look though a short list of "forbidden" characters than to google a set of /\p{ID_Continue}+/ and try to figure out if your identifier matches it or not.

@glyn
Copy link
Collaborator Author

glyn commented Aug 3, 2020

But I don't see how we can allow $.key-dash without also allowing $.key-50 (and therefore allowing @.key-50 too).

Maybe we should instead define a simple set of characters which are not allowed in an identifier, like $ @ + - * / & | ! < = > [ ] ( ) because they have a special meaning in jsonpath and thus any identifier containg those should be referred to using bracket notation? It's always easier to look though a short list of "forbidden" characters than to google a set of /\p{ID_Continue}+/ and try to figure out if your identifier matches it or not.

That approach would allow letters, which according to a comment above:

are not a good choice for an ID (vertical scripts like Mongolian, etc.)

@remorhaz
Copy link
Collaborator

remorhaz commented Aug 3, 2020

But maybe that's not a great problem. The more I think, the more I agree that we should avoid Unicode properties in standard. I know what I'm talking about - I have implemented Unicode properties manually and the solution includes seeking ~600 ranges of symbols (which may require pre-building a b-tree index!)

Mongolian letters and cuneiform poorly fit in strings, too; yet JSON allows them and no one seems to suffer.

@bhmj
Copy link

bhmj commented Aug 3, 2020

This approach leads us to /\w+/ character set for dot notated identifiers and the rest is covered by brackets. Problem solved )

@glyn
Copy link
Collaborator Author

glyn commented Aug 3, 2020

Which definition of \w do you have in mind? Python's definition depends on the current locale, for example.

@bhmj
Copy link

bhmj commented Aug 3, 2020

The classic [a-zA-Z0-9_] as with re.ASCII parameter.

@glyn
Copy link
Collaborator Author

glyn commented Aug 3, 2020

I really like the idea of keeping this simple rather than introducing a "cottage industry" of identifier parsing. But I suspect that omitting - would break many existing queries. How about using the character set [a-zA-Z0-9_-] instead?

@bhmj
Copy link

bhmj commented Aug 3, 2020

Well, adding - to the character set will make it impossible to use arithmetic for filter expressions like $.foo[?(@.key-5 > 10)]. Are we ready to reject even simplest filter arithmetic at all? Alternatively we could oblige to use spaces to separate math operators and identifiers, but that is pretty tough and will bring A LOT of confusion: @.key-5 > 10 vs @.key - 5 > 10. So I suggest sticking with [a-zA-Z0-9_].

@glyn
Copy link
Collaborator Author

glyn commented Aug 3, 2020

I agree the alternative of using spaces to separate math operators and identifiers isn't desirable.

So there is a choice between allowing - in dotted child names and allowing arithmetic expressions in filters.

I guess Proposal A could go with [a-zA-Z0-9_] for now and adjust this later if necessary.

@cburgmer How would you like to play this? Should Proposal A try to lead the standard or follow it later?

@remorhaz
Copy link
Collaborator

remorhaz commented Aug 4, 2020

The classic [a-zA-Z0-9_] as with re.ASCII parameter.

This will break existing consensus on allowing non-ASCII symbols in dot-child.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Proposal A Anything discussing the advance of the Proposal A for a formal specification of JSONPath
Projects
None yet
Development

No branches or pull requests

5 participants