Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15947: fixed skipped_column error in Python #16164

Open
wants to merge 10 commits into
base: rel-3.46.0
Choose a base branch
from

Conversation

wendycwong
Copy link
Contributor

@wendycwong wendycwong commented Apr 18, 2024

issue: #15947

The problem here is when called with h2o.H2OFrame, we did not take into account of skipped columns when trying to figure out the final column counts.

Fixed the bug and added Python test from Seb.
Fixed the bug in R and added R test.

@wendycwong wendycwong added the do not merge For PRs that are not supposed to be merged label Apr 26, 2024
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this fix @wendycwong, but I must admit that I still don't understand why the old test is still in place.

@@ -126,11 +126,9 @@ def H2OFrame_from_H2OFrame():


def H2OFrame_skipped_columns_is_BUGGY():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can rename the test now that it's fixed :)

raise ValueError(
"length of col_names should be equal to the number of columns parsed: %d vs %d"
% (len(column_names), parse_column_len))
"length of col_names minus lenght of skipped_columns should equal the number of columns parsed: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: length

@@ -871,10 +871,10 @@ def parse_setup(raw_frames, destination_frame=None, header=0, separator=None, co
if column_names is not None:
if not isinstance(column_names, list): raise ValueError("col_names should be a list")
if (skipped_columns is not None) and len(skipped_columns)>0:
if (len(column_names)) != parse_column_len:
if ((len(column_names)-len(skipped_columns)) != parse_column_len) and (len(column_names) != parse_column_len):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why with the current behaviour, parse_column_len must equal one or the other? why do we not always enforce len(column_names)-len(skipped_columns) == parse_column_len for example?
Especially given that this test is applied only if len(skipped_columns)>0 and given that above we define:

    parse_column_len = len(j["column_types"]) if skipped_columns is None else (len(j["column_types"])-len(skipped_columns))

therefore I don't understand the case where we would have len(column_names) == parse_column_len, as checked in the original test.

Should we not only test :

if (len(column_names)-len(skipped_columns)) != parse_column_len:

and if not, doesn't it show a more profound inconsistency/bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge For PRs that are not supposed to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants