Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output size assertions in unify_json_strings #1707

Open
jlowe opened this issue Jan 17, 2024 · 1 comment
Open

Output size assertions in unify_json_strings #1707

jlowe opened this issue Jan 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@jlowe
Copy link
Member

jlowe commented Jan 17, 2024

unify_json_strings in map_utils.cu has two places where it asserts on string column data sizes. The first is this one:

  auto const output_size =
    2l +                                            // two extra bracket characters '[' and ']'
    static_cast<int64_t>(chars_size) +
    static_cast<int64_t>(input.size() - 1) +        // append `,` character between input rows
    static_cast<int64_t>(input.null_count()) * 2l;  // replace null with "{}"
  CUDF_EXPECTS(output_size <= static_cast<int64_t>(std::numeric_limits<cudf::size_type>::max()),
               "The input json column is too large and causes overflow.");

Eventually cudf will support string columns with more than 2GB of character data, and this assert will fire when we do not want it to.

Later there's a sanity check assertion that used to be super cheap, because before the recent cudf strings column change, calculating the size of character data in a string column involved only accessing data already on the CPU. Now this assert requires an extra stream synchronization that could be avoided by simply computing it directly from the known output_size.

  auto const joined_input_size_bytes = joined_input_scv.chars_size(stream);
  CUDF_EXPECTS(joined_input_size_bytes + 2 == output_size, "Incorrect output size computation.");

We could consider removing this assertion if we never see issues with join_strings doing what it's supposed to do, and that would remove a stream synchronization required to access the string column's character data size.

@jlowe jlowe added bug Something isn't working ? - Needs Triage labels Jan 17, 2024
@sameerz
Copy link
Collaborator

sameerz commented Jan 23, 2024

Keeping this in the backlog until libcudf has long strings support per rapidsai/cudf#13733

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants