Enable `get_token_stream` to include `LineEnd` tokens with optional parameter. #15605

SurajAralihalli · 2024-04-26T23:00:17Z

Description

This PR adds a `LineEndTokenOption` parameter to the `get_token_stream` and `process_token_stream` functions, which allows `LineEnd` tokens to be kept in the output. The original declaration of `get_token_stream` is retained to maintain backward compatibility.
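A minimal sketch of the overload pattern described above, written against plain standard C++ rather than the libcudf API. The `Token` enum, the enumerator names `Discard` and `Keep`, the `std::string` input, and the toy brace-matching tokenizer are all assumptions for illustration; only the names `get_token_stream` and `LineEndTokenOption` come from this PR.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily reduced token set; the real tokenizer emits many more kinds.
enum class Token { StructBegin, StructEnd, LineEnd };

// Enumerator names are assumed for this sketch.
enum class LineEndTokenOption { Discard, Keep };

// New overload: the caller decides whether LineEnd tokens appear in the output.
// This toy version only looks at braces and newlines and ignores string contents.
std::vector<Token> get_token_stream(std::string const& json_lines,
                                    LineEndTokenOption line_end_option)
{
  std::vector<Token> tokens;
  for (char c : json_lines) {
    if (c == '{') {
      tokens.push_back(Token::StructBegin);
    } else if (c == '}') {
      tokens.push_back(Token::StructEnd);
    } else if (c == '\n' && line_end_option == LineEndTokenOption::Keep) {
      tokens.push_back(Token::LineEnd);
    }
  }
  return tokens;
}

// Original declaration kept so existing callers compile unchanged;
// it forwards to the new overload and drops LineEnd tokens as before.
std::vector<Token> get_token_stream(std::string const& json_lines)
{
  return get_token_stream(json_lines, LineEndTokenOption::Discard);
}
```

With this shape, existing call sites keep their old behavior, and only callers that pass `LineEndTokenOption::Keep` see the extra tokens.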

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

cpp/include/cudf/io/detail/tokenize_json.hpp

karthikeyann

This adds another step in which we need to remove these LineEnd tokens before the tree algorithms run.

Do we need LineEnd tokens? If they are for finding the row number of each token, that can be calculated from StructBegin and StructEnd.
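One way the row-number calculation could look, as a serial sketch only. It assumes every row is a single top-level struct and that the stream contains no error tokens; the `Token` enum and the function name `row_of_each_token` are hypothetical. A row ends whenever a StructEnd brings the nesting depth back to zero.

```cpp
#include <vector>

// Hypothetical reduced token set for illustration.
enum class Token { StructBegin, StructEnd, FieldName, Value };

// Returns, for every token, the index of the row it belongs to.
// Assumes each row is exactly one top-level struct.
std::vector<int> row_of_each_token(std::vector<Token> const& tokens)
{
  std::vector<int> rows;
  rows.reserve(tokens.size());
  int depth = 0;  // current struct nesting depth
  int row   = 0;  // index of the row being read
  for (Token t : tokens) {
    rows.push_back(row);
    if (t == Token::StructBegin) {
      ++depth;
    } else if (t == Token::StructEnd) {
      --depth;
      if (depth == 0) { ++row; }  // a top-level struct just closed
    }
  }
  return rows;
}
```

On the GPU the same idea would presumably be expressed as a prefix scan over the depth changes rather than a loop, but that is not shown here.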

cpp/src/io/json/nested_json_gpu.cu

shrshi

One high-level comment -

cpp/src/io/json/nested_json_gpu.cu

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

karthikeyann · 2024-05-08T02:06:06Z

@shrshi and I discussed a profile of @revans2's prototype https://github.com/revans2/spark-rapids-jni/pull/new/get_json_obj_experiment.CUDF

A few outcomes of our meeting:

- There are 5 simulateDFA calls in total.
- A few smaller FSTs take longer than the get_stack_context FST. Will investigate further.
- Consider fusing recover_from_error with the PDA (json_to_tokens_fst FST).
- It is possible to eliminate the process_token_stream function itself if the code that post-processes tokens after get_token_stream can handle Error tokens. That new code will process tokens row-wise, so it will be easier to nullify an entire row.
  - @SurajAralihalli, can the new token post-processing code replace rows that contain Error tokens with null rows?
  - @shrshi suggested that an FST may not be required to implement process_token_stream anyway, since it is always a stream compaction. (Related PR: [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer #13344)
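The row-wise handling of Error tokens discussed above could be sketched as follows. This is a serial illustration in standard C++, not the libcudf implementation: the `Token` enum, the `NullRow` placeholder, and the function name `nullify_error_rows` are hypothetical, and the sketch assumes LineEnd tokens were kept in the stream so that row boundaries are visible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reduced token set; NullRow is an invented placeholder
// standing in for "this whole row becomes null".
enum class Token { StructBegin, StructEnd, Value, Error, LineEnd, NullRow };

// Copies tokens row by row. A row containing an Error token is collapsed
// into a single NullRow marker; LineEnd tokens are dropped from the output.
std::vector<Token> nullify_error_rows(std::vector<Token> const& tokens)
{
  std::vector<Token> out;
  out.reserve(tokens.size());
  std::size_t row_start = 0;      // position in `out` where the current row began
  bool row_has_error    = false;  // whether the current row contained an Error token

  auto close_row = [&]() {
    if (row_has_error) {
      out.resize(row_start);  // discard everything emitted for this row
      out.push_back(Token::NullRow);
    }
    row_start     = out.size();
    row_has_error = false;
  };

  for (Token t : tokens) {
    if (t == Token::LineEnd) {
      close_row();
      continue;
    }
    if (t == Token::Error) { row_has_error = true; }
    out.push_back(t);
  }
  close_row();  // handle a final row that has no trailing LineEnd
  return out;
}
```

Because each token is either kept or dropped based on a per-row flag, the operation is a filter over the token stream, which is consistent with the observation that it is a stream compaction rather than something that needs an FST.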

SurajAralihalli added 2 commits April 26, 2024 14:18

keep line end tokens

a71d160

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

add tests

03e9824

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code) Apr 26, 2024

SurajAralihalli added 2 commits April 26, 2024 16:30

remove_line_end_token fix

e2bd05e

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

missed test assert

f560780

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

SurajAralihalli marked this pull request as ready for review April 26, 2024 23:47

SurajAralihalli requested a review from a team as a code owner April 26, 2024 23:47

SurajAralihalli requested review from bdice and ttnghia April 26, 2024 23:47

GregoryKimball requested review from karthikeyann and shrshi April 29, 2024 18:03

ttnghia reviewed Apr 30, 2024

cpp/include/cudf/io/detail/tokenize_json.hpp

karthikeyann reviewed Apr 30, 2024

cpp/src/io/json/nested_json_gpu.cu

shrshi reviewed Apr 30, 2024

cpp/src/io/json/nested_json_gpu.cu

SurajAralihalli added 3 commits May 1, 2024 15:34

unify two functors

827c98a

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

unify two functors fix param name

b711b80

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

keep backwards compatibility

26112a5

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

SurajAralihalli added the Spark (Functionality that helps Spark RAPIDS), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels May 1, 2024

SurajAralihalli requested review from karthikeyann, shrshi and ttnghia May 1, 2024 23:50

karthikeyann mentioned this pull request May 8, 2024

profiling prototype code in tests karthikeyann/cudf#9

Open
