GH-15947: fixed skipped_column error in Python #16164

wendycwong · 2024-04-18T17:45:39Z

The problem here is that, when called through h2o.H2OFrame, we did not take skipped columns into account when computing the final column count.

Fixed the bug and added Python test from Seb.
Fixed the bug in R and added R test.


sebhrusen

thanks for this fix @wendycwong, but I must admit that I still don't understand why the old test is still in place.

sebhrusen · 2024-04-24T21:41:51Z

h2o-py/tests/testdir_apis/Data_Manipulation/pyunit_h2oH2OFrame.py

@@ -126,11 +126,9 @@ def H2OFrame_from_H2OFrame():


 def H2OFrame_skipped_columns_is_BUGGY():


you can rename the test now that it's fixed :)

sebhrusen · 2024-05-13T15:09:08Z

h2o-py/h2o/h2o.py

                raise ValueError(
-                    "length of col_names should be equal to the number of columns parsed: %d vs %d"
-                    % (len(column_names), parse_column_len))
+                    "length of col_names minus lenght of skipped_columns should equal the number of columns parsed: "


typo: length

sebhrusen · 2024-05-13T15:19:14Z

h2o-py/h2o/h2o.py

@@ -871,10 +871,10 @@ def parse_setup(raw_frames, destination_frame=None, header=0, separator=None, co
    if column_names is not None:
        if not isinstance(column_names, list): raise ValueError("col_names should be a list")
        if (skipped_columns is not None) and len(skipped_columns)>0:
-            if (len(column_names)) != parse_column_len:
+            if ((len(column_names)-len(skipped_columns)) != parse_column_len) and (len(column_names) != parse_column_len):


Can you explain why, with the current behaviour, parse_column_len must equal one or the other? Why do we not always enforce len(column_names)-len(skipped_columns) == parse_column_len, for example?
Especially given that this test is applied only if len(skipped_columns)>0, and given that above we define:

parse_column_len = len(j["column_types"]) if skipped_columns is None else (len(j["column_types"])-len(skipped_columns))

Therefore I don't understand the case where we would have len(column_names) == parse_column_len, as checked in the original test.

Should we not test only:

if (len(column_names)-len(skipped_columns)) != parse_column_len:

and if not, doesn't it show a more profound inconsistency/bug?
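To make the question above concrete, here is a minimal sketch of the two conditions in the patched check. It is not the actual h2o code: `check_column_names` and `total_columns` are hypothetical names, with `total_columns` standing in for `len(j["column_types"])`. Under that assumption, the first condition holds exactly when `column_names` lists every column in the file, and the second holds when it lists only the columns that remain after skipping.

```python
def check_column_names(column_names, skipped_columns, total_columns):
    """Hypothetical stand-in for the length check in parse_setup.

    total_columns plays the role of len(j["column_types"]).
    Returns which naming convention column_names appears to follow.
    """
    if not isinstance(column_names, list):
        raise ValueError("col_names should be a list")
    skipped = skipped_columns or []
    # Mirrors the definition quoted in the review comment.
    parse_column_len = total_columns - len(skipped)

    if len(column_names) - len(skipped) == parse_column_len:
        # Equivalent to len(column_names) == total_columns.
        return "names for all columns"
    if len(column_names) == parse_column_len:
        # Names were supplied only for the columns that are kept.
        return "names for kept columns only"
    raise ValueError(
        "length of col_names minus length of skipped_columns should equal "
        "the number of columns parsed: %d vs %d"
        % (len(column_names) - len(skipped), parse_column_len)
    )
```

Seen this way, the combined `and` condition in the diff accepts both conventions, while the stricter single check proposed in the comment would accept only the first one.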

wendycwong added 2 commits April 18, 2024 10:41

fixed skipped_column error in Python

b05f559

GH-15947: fixed column length discrepancies when skipped_columns are …

f1b5cba

…specified when calling h2o.H2OFrame.

wendycwong requested review from maurever and sebhrusen April 18, 2024 17:45

wendycwong added 8 commits April 18, 2024 16:00

GH-15947: fixed skipped columns for normal import_file path.

c25e378

GH-15947: added same skipped column capability when transforming R da…

ad34c93

…ta frames to h2O data frames.

Add conditions to check correct column length when skipped_column is …

dc79749

…specified or not.

Add comment to skipped_columns parameter.

b6ec166

add parameter to avoid R cmd test failure.

f7d377d

fix R cmd failure

bc20d2c

Clarify error message

4890d1d

add missing bracket.

1f17a8f

wendycwong added the do not merge For PRs that are not supposed to be merged label Apr 26, 2024

sebhrusen reviewed May 13, 2024
