Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json_extract, json_extract_scalar, and json_size should support Jayway style JSON path syntax #7049

Open
kevinwilfong opened this issue Oct 13, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@kevinwilfong
Copy link
Contributor

Description

The functions json_extract, json_extract_scalar, and json_size in Presto use the Jayway library to parse JSON paths and handle extraction https://github.com/json-path/JsonPath

Velox's version of these functions uses a JSON path tokenizer based on Presto's JsonPathTokenizer. JsonPathTokenizer supports a much simpler syntax, but is likely faster, Presto uses it if possible and falls back to Jayway's if it can parse the path.

Jayway's parser is quite extensive, supporting various operators, aggregates, regexes, filters.

It would be great if the Velox JSON path parser could support Jayway's syntax.

https://github.com/facebookincubator/velox/blob/e2ee0cad24d5407146d2da08b68c6701ee86e9da/velox/functions/prestosql/json/JsonPathTokenizer.h

@mbasmanova
Copy link
Contributor

CC: @amitkdutta @zacw7 @aditi-pandit

@kevinwilfong
Copy link
Contributor Author

There was an issue opened in simdjson recently for supporting JSONPath, this would get us most of the way there, I think Jayway just extended this with aggregates that can be tacked on the end of the path.

simdjson/simdjson#2070

@lemire
Copy link

lemire commented Oct 13, 2023

We will prioritize this feature in simdjson.

@lemire
Copy link

lemire commented Dec 10, 2023

Update: This is more ambitious a proposal than I considered at first.

@mbasmanova
Copy link
Contributor

mbasmanova commented Apr 23, 2024

In our production workload, we are seeing the following JSONPaths that are currently not supported.

  1. "foo" - Jayway allows paths that do not start with a '$'. For these paths, it simply prepends the path with '$.' before compiling.
  2. "$.[0].foo" - While JSONPath allows either dot-notation or bracket-notation, Jayway allows a mix.
  3. "$..name" - deep scan.

I believe we can support (1) and (2) with simple changes to JsonPathTokenizer and support (3) using something like "Tree Walking and JSON Element Types" in https://github.com/simdjson/simdjson/blob/master/doc/basics.md#json-path

See prestodb/presto#22589 for additional context.

@mbasmanova
Copy link
Contributor

@rui-mo @PHILO-HE Folks, do you know if Spark also uses JayWay to implement json-extract functions?

@PHILO-HE
Copy link
Contributor

@rui-mo @PHILO-HE Folks, do you know if Spark also uses JayWay to implement json-extract functions?

Hi @mbasmanova, Spark's implementation is based on Jackson. Here are some findings.

  1. "foo" - Jayway allows paths that do not start with a '$'. For these paths, it simply prepends the path with '$.' before compiling.

Spark requires json path starts with "$".

  1. "$.[0].foo" - While JSONPath allows either dot-notation or bracket-notation, Jayway allows a mix.

Spark allows a mix also.

@mbasmanova
Copy link
Contributor

@PHILO-HE Thank you for clarifying.

mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…alar] Presto functions

Summary:
json_extract_scalar and json_extract Presto function allow paths like 'foo'
and 'foo.bar'. These paths are non-standard. They are enabled via use of Jayway
engine, which handles paths without leading '$' by prepending the path
with '$.'. Path 'foo' becomes '$.foo'. However, this logic has some unexpected
side effects. Path '.foo' becomes '$..foo', which uses deep scan operator '..'
and matches 'foo' keys anywhere in the document, rather then just at the top
level.

This change allows Velox to support paths like 'foo' and 'foo.bar[2].bar', but
not '.foo' or '[10]'. These paths are handled as if they contained leading '$.' characters.

See prestodb/presto#22589

Part of facebookincubator#7049

Differential Revision: D56465662
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…alar] Presto functions (facebookincubator#9584)

Summary:

json_extract_scalar and json_extract Presto function allow paths like 'foo'
and 'foo.bar'. These paths are non-standard. They are enabled via use of Jayway
engine, which handles paths without leading '$' by prepending the path
with '$.'. Path 'foo' becomes '$.foo'. However, this logic has some unexpected
side effects. Path '.foo' becomes '$..foo', which uses deep scan operator '..'
and matches 'foo' keys anywhere in the document, rather then just at the top
level.

This change allows Velox to support paths like 'foo' and 'foo.bar[2].bar', but
not '.foo' or '[10]'. These paths are handled as if they contained leading '$.' characters.

See prestodb/presto#22589

Part of facebookincubator#7049

Differential Revision: D56465662
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…alar] Presto functions (facebookincubator#9584)

Summary:

json_extract_scalar and json_extract Presto function allow paths like 'foo'
and 'foo.bar'. These paths are non-standard. They are enabled via use of Jayway
engine, which handles paths without leading '$' by prepending the path
with '$.'. Path 'foo' becomes '$.foo'. However, this logic has some unexpected
side effects. Path '.foo' becomes '$..foo', which uses deep scan operator '..'
and matches 'foo' keys anywhere in the document, rather then just at the top
level.

This change allows Velox to support paths like 'foo' and 'foo.bar[2].bar', but
not '.foo' or '[10]'. These paths are handled as if they contained leading '$.' characters.

See prestodb/presto#22589

Part of facebookincubator#7049

Differential Revision: D56465662
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…Presto

Summary:
json_extract_scalar and json_extract Presto function allow paths like '$.[0].
[1].[2].foo' that include "redundant" dots before opening brackets. These paths
are non-standard. They are enabled via use of Jayway engine. 

This change allows Velox to support these paths by ignoring dots that appear
before opening brackets.

See prestodb/presto#22589 Part of
facebookincubator#7049

Differential Revision: D56467690
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…alar] Presto functions (facebookincubator#9584)

Summary:

json_extract_scalar and json_extract Presto function allow paths like 'foo'
and 'foo.bar'. These paths are non-standard. They are enabled via use of Jayway
engine, which handles paths without leading '$' by prepending the path
with '$.'. Path 'foo' becomes '$.foo'. However, this logic has some unexpected
side effects. Path '.foo' becomes '$..foo', which uses deep scan operator '..'
and matches 'foo' keys anywhere in the document, rather then just at the top
level.

This change allows Velox to support paths like 'foo' and 'foo.bar[2].bar', but
not '.foo' or '[10]'. These paths are handled as if they contained leading '$.' characters.

See prestodb/presto#22589

Part of facebookincubator#7049

Differential Revision: D56465662
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…Presto (facebookincubator#9585)

Summary:

json_extract_scalar and json_extract Presto function allow paths like '$.[0].
[1].[2].foo' that include "redundant" dots before opening brackets. These paths
are non-standard. They are enabled via use of Jayway engine. 

This change allows Velox to support these paths by ignoring dots that appear
before opening brackets.

See prestodb/presto#22589 Part of
facebookincubator#7049

Reviewed By: xiaoxmeng

Differential Revision: D56467690
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…alar] Presto functions (facebookincubator#9584)

Summary:

json_extract_scalar and json_extract Presto function allow paths like 'foo'
and 'foo.bar'. These paths are non-standard. They are enabled via use of Jayway
engine, which handles paths without leading '$' by prepending the path
with '$.'. Path 'foo' becomes '$.foo'. However, this logic has some unexpected
side effects. Path '.foo' becomes '$..foo', which uses deep scan operator '..'
and matches 'foo' keys anywhere in the document, rather then just at the top
level.

This change allows Velox to support paths like 'foo' and 'foo.bar[2].bar', but
not '.foo' or '[10]'. These paths are handled as if they contained leading '$.' characters.

See prestodb/presto#22589

Part of facebookincubator#7049

Differential Revision: D56465662
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Apr 23, 2024
…Presto (facebookincubator#9585)

Summary:

json_extract_scalar and json_extract Presto function allow paths like '$.[0].
[1].[2].foo' that include "redundant" dots before opening brackets. These paths
are non-standard. They are enabled via use of Jayway engine. 

This change allows Velox to support these paths by ignoring dots that appear
before opening brackets.

See prestodb/presto#22589 Part of
facebookincubator#7049

Reviewed By: xiaoxmeng

Differential Revision: D56467690
facebook-github-bot pushed a commit that referenced this issue Apr 23, 2024
…alar] Presto functions (#9584)

Summary:
Pull Request resolved: #9584

json_extract_scalar and json_extract Presto function allow paths like 'foo'
and 'foo.bar'. These paths are non-standard. They are enabled via use of Jayway
engine, which handles paths without leading '$' by prepending the path
with '$.'. Path 'foo' becomes '$.foo'. However, this logic has some unexpected
side effects. Path '.foo' becomes '$..foo', which uses deep scan operator '..'
and matches 'foo' keys anywhere in the document, rather then just at the top
level.

This change allows Velox to support paths like 'foo' and 'foo.bar[2].bar', but
not '.foo' or '[10]'. These paths are handled as if they contained leading '$.' characters.

See prestodb/presto#22589

Part of #7049

Reviewed By: xiaoxmeng

Differential Revision: D56465662

fbshipit-source-id: c32bfe6c1b2587b9f996605a169abb934fd3d95c
facebook-github-bot pushed a commit that referenced this issue Apr 23, 2024
…Presto (#9585)

Summary:
Pull Request resolved: #9585

json_extract_scalar and json_extract Presto function allow paths like '$.[0].
[1].[2].foo' that include "redundant" dots before opening brackets. These paths
are non-standard. They are enabled via use of Jayway engine.

This change allows Velox to support these paths by ignoring dots that appear
before opening brackets.

See prestodb/presto#22589 Part of
#7049

Reviewed By: xiaoxmeng

Differential Revision: D56467690

fbshipit-source-id: 42866815ff186510679a0ab877e2796d324309c7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants