[FEA] Write a from_json implementation using the json_parser from spark-rapids-jni #2035

Open
revans2 opened this issue May 13, 2024 · 0 comments

revans2 commented May 13, 2024

Is your feature request related to a problem? Please describe.
We have a version of from_json and json_scan that uses the CUDF tokenizer/parser, but we are seeing both performance and correctness issues. This is not because the CUDF code is wrong; it is that Spark parses/validates JSON differently, and trying to get all of the parts to match is not simple.

The get_json_object code is decently fast and accurate. So the goal is that once we have enough configuration options (#2031) and can support all of the nesting we need (#2033), we can figure out how to use that code to extract the data we need.

Please note that this task might be broken down into smaller tasks in order to fully support all of the features we need. When this is done, we want to fully support from_json on the GPU, with the default configs, for close to all of the data types.

For get_json_object, because it outputs only strings, it runs in two passes. The first pass parses the input data, validates it, calculates the size of each row's output string, and writes those sizes out. Next a scan is done to turn the sizes into offsets. Finally a second kernel is run, which parses the input data, validates it again, and writes the data out.
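To make that concrete, here is a minimal CPU-side sketch of the two-pass shape (illustration only, not the actual kernel code; `extract_path` is a hypothetical stand-in for the per-row parse/validate/extract step, and all names are made up for this example):

```python
from itertools import accumulate

def two_pass_strings(rows, extract_path):
    # Pass 1: parse/validate each row, but record only the size in bytes of its output.
    sizes = [len((extract_path(r) or "").encode()) for r in rows]
    # Scan: turn per-row sizes into offsets into a single chars buffer.
    offsets = [0] + list(accumulate(sizes))
    # Pass 2: parse/validate again and write each row's bytes at its offset.
    chars = bytearray(offsets[-1])
    for i, r in enumerate(rows):
        out = (extract_path(r) or "").encode()
        chars[offsets[i]:offsets[i] + len(out)] = out
    return offsets, bytes(chars)
```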

Things get extra complicated here because we have to deal with multiple different nesting levels. I still think we can do it all in two passes of the parsing kernel.

For example:

Let's say I have data that looks like

{"a": [[1, 20, 3], null, [40]]}
{"a": [null, [5, 60]]}
{"a": []}

and a schema like STRUCT<a: Array<Array<String>>>

Before we process the data at all, we are going to allocate a column of sizes for each nesting level and call a kernel with pointers to that data. These sizes are tracked per input row for each nesting level, not per output row.

So after the kernel runs and validates the input, we end up with the following:

a_level0, a_level1, a_level2
3,        4,        6
2,        2,        3
0,        0,        0

For the first row a_level0 has 3 in it, because there are 3 entries in the top-level array under a. a_level1 has 4 in it because there are a total of 4 entries in all of the arrays at the next level. a_level2 has 6 in it because there are a total of 6 bytes needed to encode the leaf node string values.
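As a sanity check, a small CPU sketch (hypothetical helper names, using Python's json module in place of the GPU parser) reproduces those per-row, per-level sizes for the example rows:

```python
import json

def level_sizes(row, depth=3):
    # sizes[0]: entries in the outer array of "a"
    # sizes[1]: total entries across the inner arrays
    # sizes[2]: total bytes needed for the leaf string values
    sizes = [0] * depth
    def walk(node, level):
        if node is None:
            return
        if level == depth - 1:           # leaf level: count bytes of the string form
            sizes[level] += len(str(node))
            return
        sizes[level] += len(node)        # list level: count child entries
        for child in node:
            walk(child, level + 1)
    walk(json.loads(row).get("a"), 0)
    return sizes

rows = ['{"a": [[1, 20, 3], null, [40]]}', '{"a": [null, [5, 60]]}', '{"a": []}']
print([level_sizes(r) for r in rows])    # [[3, 4, 6], [2, 2, 3], [0, 0, 0]]
```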

Once we have these values, we will do a scan on each of the levels to get offsets on a per-row basis:

a_level0, a_level1, a_level2
3,        4,        6
5,        6,        9
5,        6,        9
5,        6,        9
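In sketch form, the scan is just an inclusive prefix sum per level; appending the final total once more gives the rows+1 entries shown above (again just an illustration of the arithmetic, not the real scan code):

```python
from itertools import accumulate

# Per-row sizes for each level, taken from the first table above.
sizes_per_level = {"a_level0": [3, 2, 0], "a_level1": [4, 2, 0], "a_level2": [6, 3, 0]}
offsets_per_level = {}
for name, sizes in sizes_per_level.items():
    scanned = list(accumulate(sizes))              # running totals per input row
    offsets_per_level[name] = scanned + [scanned[-1]]
print(offsets_per_level)
# {'a_level0': [3, 5, 5, 5], 'a_level1': [4, 6, 6, 6], 'a_level2': [6, 9, 9, 9]}
```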

With this information we can now allocate all of the final output buffers and run the kernel one last time. That kernel would look at the starting offset for each level of nesting and keep track of where it is while it outputs the final offsets and data. a_level0 would be reused as the top-level offsets for a. The rest of the temporary offset columns would be thrown away.

a offsets_1: 3, 5, 5, 5
a offsets_2: 3, 3, 4, 4, 6, 6
a offsets_3: 1, 3, 4, 6, 7, 9, 9
a leaf_data: '1' '2' '0' '3' '4' '0' '5' '6' '0'
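Here is a CPU sketch of what that second pass produces for the example (again using Python's json module in place of the GPU parser, with made-up names; the real version would be a CUDA kernel writing at the precomputed per-row starting offsets):

```python
import json
from itertools import accumulate

rows = ['{"a": [[1, 20, 3], null, [40]]}', '{"a": [null, [5, 60]]}', '{"a": []}']

# Walk each row in document order, recording the length of every list at every
# level and the bytes of every leaf value.
row_lens, inner_lens, leaf_lens = [], [], []
leaf_chars = bytearray()
for row in rows:
    outer = json.loads(row).get("a") or []
    row_lens.append(len(outer))
    for inner in outer:
        inner = inner if inner is not None else []
        inner_lens.append(len(inner))
        for leaf in inner:
            s = str(leaf).encode()
            leaf_lens.append(len(s))
            leaf_chars += s

def to_offsets(lens):
    # Running totals with the final total repeated, matching the listing above.
    scanned = list(accumulate(lens)) or [0]
    return scanned + [scanned[-1]]

print(to_offsets(row_lens))    # [3, 5, 5, 5]           -> a offsets_1
print(to_offsets(inner_lens))  # [3, 3, 4, 4, 6, 6]     -> a offsets_2
print(to_offsets(leaf_lens))   # [1, 3, 4, 6, 7, 9, 9]  -> a offsets_3
print(leaf_chars.decode())     # "120340560"            -> a leaf_data
```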

Even though the algorithm does not appear too complex on the surface, we have the problem that we need a place to store all of the different offsets/counts for the different nesting levels. Happily, this is determined by the schema size/depth, so we can know it up front without running anything. We might need to start off with a hard-coded maximum size and fall back to the CPU if the schema needs more than that. Then we can have a follow-on issue to look at how to support more.
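The depth check itself could look something like the sketch below (the schema encoding and the maximum value here are made up for illustration; the real limit would be decided as part of this work):

```python
MAX_NESTING_DEPTH = 8   # hypothetical hard-coded limit, not a value from this issue

def nesting_depth(dtype):
    # In this sketch a nested type is a tuple like ("array", ("array", "string")).
    if isinstance(dtype, tuple):
        _, *children = dtype
        return 1 + max(nesting_depth(c) for c in children)
    return 0

schema = ("struct", ("array", ("array", "string")))
if nesting_depth(schema) > MAX_NESTING_DEPTH:
    raise NotImplementedError("too deeply nested: fall back to the CPU from_json")
```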

Be aware that we need to support Map<STRING,STRING> as a top-level data type for this. It would be great if we could also support top-level arrays, but that is not a requirement. For any data type that we cannot support out of the box with this, we need to file a follow-on issue for that support and make sure that we fall back to the CPU.
