[FEA] Add some config options to the JSON tokenizer #2031

Open
3 tasks
revans2 opened this issue May 13, 2024 · 0 comments

Comments

@revans2
Collaborator

revans2 commented May 13, 2024

Is your feature request related to a problem? Please describe.

For us to be able to support from_json or the JSON input format/SCAN using the same tokenizer currently used by get_json_object, that tokenizer needs to support more configuration options, mostly because the defaults for from_json and for get_json_object are different.

This is to add in enough configuration that the default options for from_json will work.

  • allowNumericLeadingZeros: get_json_object has this off, but from_json has it enabled by default.
  • allowNonNumericNumbers: get_json_object has this off, but from_json has it enabled by default.
  • allowUnquotedControlChars: get_json_object has this on, but from_json has it off by default.
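
As a rough sketch of what the three options above could look like as a single configuration object: the struct name, field names, and the two factory functions below are hypothetical and are not taken from the existing tokenizer code. The default values are the ones stated in this issue.

```cpp
// Hypothetical option set for the shared JSON tokenizer.
// Field names mirror the Spark JSON option names listed in this issue.
struct json_tokenizer_options {
  bool allow_numeric_leading_zeros;   // allowNumericLeadingZeros
  bool allow_non_numeric_numbers;     // allowNonNumericNumbers
  bool allow_unquoted_control_chars;  // allowUnquotedControlChars
};

// Defaults for get_json_object, as described in this issue.
constexpr json_tokenizer_options get_json_object_defaults() {
  return {false, false, true};
}

// Defaults for from_json, as described in this issue.
constexpr json_tokenizer_options from_json_defaults() {
  return {true, true, false};
}
```

A caller would pick one of the two presets and override individual fields when a user sets an option explicitly.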

maxNestingDepth also needs to be handled, but we probably want to handle it very differently, so that work will be done as a separate issue.

The following options are not needed right now and might be added in follow-on work.

  • allowSingleQuotes: This is on by default for both, and we have not seen anyone disable it.
  • allowComments: This is off by default for both and until a customer asks for it I don't think we will try to support it.
  • allowUnquotedFieldNames: This is off by default in both and again until a customer asks for it we will not try to support it.
  • allowBackslashEscapingAnyCharacter: This is off by default in both and we will not support it until a customer asks for it.
  • maxNumLen: This is for newer versions of Spark and is a DoS fix that we don't need to worry about.
  • maxStringLen: Again, this is for newer versions of Spark and is a DoS fix that we don't need to worry about.

We need to be very careful as we do this work that we do not regress the performance of get_json_object. Adding more functionality will cause more registers to be used and might impact the occupancy, which is already bad.
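
One possible way to limit that risk, offered only as an illustration and not as something this issue proposes, is to pass the options as compile-time template parameters so that each caller gets its own instantiation containing only the checks it needs. The helper below is hypothetical and far simpler than the real tokenizer. Whether this approach actually helps register usage or occupancy would have to be measured.

```cpp
#include <cstddef>

// Hypothetical helper: checks whether a run of characters is an acceptable
// sequence of integer digits. The leading-zero rule is selected at compile time.
template <bool allow_leading_zeros>
constexpr bool valid_integer_digits(char const* s, std::size_t len) {
  if (len == 0) { return false; }
  for (std::size_t i = 0; i < len; ++i) {
    if (s[i] < '0' || s[i] > '9') { return false; }
  }
  // This branch is discarded entirely when allow_leading_zeros is true,
  // so that instantiation carries no leading-zero check.
  if constexpr (!allow_leading_zeros) {
    if (len > 1 && s[0] == '0') { return false; }
  }
  return true;
}
```

The trade-off is one compiled variant per option combination, which increases code size and compile time.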

NVIDIA/spark-rapids#10803
