Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable get_token_stream to include LineEnd tokens with optional parameter. #15605

Open
wants to merge 7 commits into
base: branch-24.06
Choose a base branch
from

Conversation

SurajAralihalli
Copy link
Contributor

@SurajAralihalli SurajAralihalli commented Apr 26, 2024

Description

This PR adds parameter LineEndTokenOption to the get_token_stream and process_token_stream functions, enabling LineEnd tokens in the output. Also retained original declaration of get_token_stream to maintain backward compatibility.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 26, 2024
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@SurajAralihalli SurajAralihalli marked this pull request as ready for review April 26, 2024 23:47
@SurajAralihalli SurajAralihalli requested a review from a team as a code owner April 26, 2024 23:47
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds another step where we need to remove these LineEnd before tree algorithms.

Do we need LineEnd tokens? if this is for finding the row number of tokens, it's possible to calculate using StructBegin, StructEnd.

cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One high-level comment -

cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@SurajAralihalli SurajAralihalli added Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 1, 2024
@karthikeyann
Copy link
Contributor

I and @shrshi discussed about a profile of @revans2 's prototype https://github.com/revans2/spark-rapids-jni/pull/new/get_json_obj_experiment.CUDF
image

A few outcomes of our meeting:

  • 5 simulateDFA calls in total.
  • A few smaller FST take longer time than get_stack_context FST. will investigate further.
  • Consider fusing recover_from_error with PDA (json_to_tokens_fst FST).
  • It is possible to eliminate the process_token_stream function itself, if the post-processing code of tokens after get_token_stream, can handle Error tokens. That new code is going to process tokens row-wise, so it will be easier to nullify the entire row.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Status: In Progress
Status: No status
Development

Successfully merging this pull request may close these issues.

None yet

4 participants