Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python H2OFrame constructor 'skipped_columns' doesn't work #15947

Open
sebhrusen opened this issue Nov 22, 2023 · 3 comments · May be fixed by #16071
Open

Python H2OFrame constructor 'skipped_columns' doesn't work #15947

sebhrusen opened this issue Nov 22, 2023 · 3 comments · May be fixed by #16071
Assignees
Milestone

Comments

@sebhrusen
Copy link
Contributor

When constructing a H2OFrame in Python using the constructor, the skipped_columns parameter systematically causes a bug due to a mismatch during the call to parse_setup

In [1]: data = [
   ...:     [1, 4, "a",  1],
   ...:     [2, 5, "b",  0],
   ...:     [3, 6, "", 1],
   ...: ]

In [2]: h2o.H2OFrame(data, skipped_columns=[1, 2])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-38e7cbb577be> in <module>
----> 1 h2o.H2OFrame(data, skipped_columns=[2])

~/repos/h2o/h2o-3/h2o-py/h2o/frame.py in __init__(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, force_col_types)
    119             if python_obj is not None:
    120                 self._upload_python_object(python_obj, destination_frame, header, separator,
--> 121                                            column_names, column_types, na_strings, skipped_columns, force_col_types)
    122
    123     @staticmethod

~/repos/h2o/h2o-3/h2o-py/h2o/frame.py in _upload_python_object(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, force_col_types)
    160         tmp_file.close()  # close the streams
    161         self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings,
--> 162                            skipped_columns, force_col_types)
    163         os.remove(tmp_path)  # delete the tmp file
    164

~/repos/h2o/h2o-3/h2o-py/h2o/frame.py in _upload_parse(self, path, destination_frame, header, sep, column_names, column_types, na_strings, skipped_columns, force_col_types, quotechar, escapechar)
    466         rawkey = ret["destination_frame"]
    467         self._parse(rawkey, destination_frame, header, sep, column_names, column_types, na_strings, skipped_columns,
--> 468                     force_col_types, quotechar=quotechar, escapechar=escapechar)
    469         return self
    470

~/repos/h2o/h2o-3/h2o-py/h2o/frame.py in _parse(self, rawkey, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, force_col_types, custom_non_data_line_markers, partition_by, quotechar, escapechar)
    473                escapechar=None):
    474         setup = h2o.parse_setup(rawkey, destination_frame, header, separator, column_names, column_types, na_strings,
--> 475                                 skipped_columns, force_col_types, custom_non_data_line_markers, partition_by, quotechar, escapechar)
    476         return self._parse_raw(setup)
    477

~/repos/h2o/h2o-3/h2o-py/h2o/h2o.py in parse_setup(raw_frames, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, force_col_types, custom_non_data_line_markers, partition_by, quotechar, escapechar)
    874                 raise ValueError(
    875                     "length of col_names should be equal to the number of columns parsed: %d vs %d"
--> 876                     % (len(column_names), parse_column_len))
    877         else:
    878             if len(column_names) != len(j["column_types"]): raise ValueError(

ValueError: length of col_names should be equal to the number of columns parsed: 4 vs 2

Expected result

Parse progress: 
  C1    C2
   1     1
   2     0
   3     1
[3 rows x 2 columns]

Discovered when writing tests for #15898
See test in h2o-py/tests/testdir_apis/Data_Manipulation/pyunit_h2oH2OFrame.py

Note that h2o.parse_setup function in Py looks particularly meesy with a lot of logic done on client side when it could probably be done server-side:

  1. first setup call (j = api("POST /3/ParseSetup", data=kwargs)) passing only some of the params
  2. then fiddling with the result of the previous call + the other parameters not passed there to try to figure out the "real" parsing parameters.

This looks weird, why not let the ParseSetup figure out all the good params?

@Devanshusisodiya
Copy link

Devanshusisodiya commented Feb 13, 2024

hi @sebhrusen @wendycwong, i have fixed this bug and wanted to test it, how can i run the test suite?

Edit: figured it out
update: writing the new test

Devanshusisodiya added a commit to Devanshusisodiya/h2o-3 that referenced this issue Feb 13, 2024
Devanshusisodiya added a commit to Devanshusisodiya/h2o-3 that referenced this issue Feb 13, 2024
Devanshusisodiya added a commit to Devanshusisodiya/h2o-3 that referenced this issue Feb 13, 2024
Devanshusisodiya added a commit to Devanshusisodiya/h2o-3 that referenced this issue Feb 13, 2024
@Devanshusisodiya
Copy link

hi @wendycwong @sebhrusen, raised a pr #16071 for the fix of this issue.

please review and comment.

@Devanshusisodiya
Copy link

hi @tomasfryda @wendycwong, can you take a look at the pr #16071

wendycwong pushed a commit that referenced this issue Apr 18, 2024
@wendycwong wendycwong added this to the 3.46.0.2 milestone Apr 18, 2024
wendycwong pushed a commit that referenced this issue Apr 19, 2024
@valenad1 valenad1 modified the milestones: 3.46.0.2, 3.46.0.3 May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants