Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added warning , if column is multi categorical in h2o.cor() #12903 #15674

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Mohit1345
Copy link
Contributor

fixes #12903

It will return "NA" if multi categorical columns are passed in it.
Modified in both R and python h2o.cor() method.

Copy link
Contributor

@tomasfryda tomasfryda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Mohit1345. It will need a bit more changes before merging. Please don't be discouraged by the requested changes, h2o's code takes some time to get used to.

Comment on lines -3185 to -3187
if y is None:
y = self
if use is None: use = "complete.obs" if na_rm else "everything"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be deleted.

if y is None:
y = self
if use is None: use = "complete.obs" if na_rm else "everything"
y_categorical = any(self.types[col_name] == "enum" for col_name in y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect - y H2OFrame not a list of columns. Also you can use y.isfactor() to check if it's categorical - the output is a list of boolean values in the same order as are the columns.

y_categorical = any(self.types[col_name] == "enum" for col_name in y)

if y_categorical:
num_unique_levels = {col: len(self[col].levels()) for col in y}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of len(self[col].levels()) use self[col].nlevels()[0]. nlevels returns the just a list of cardinalities so there is less communication with the backend and lower memory use in the python client. Also since the y is an H2OFrame (see the assert on the line 3182) you can use something like dict(zip(y.columns, y.nlevels())) to get the same thing.


if multi_categorical:
import warnings
warnings.warn("NA")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the warning more informative.

For example:

for col, card in num_unique_levels.items():
    if card > 2:
        warnings.warn("Column {} contains {} levels. Only numerical and binary columns are supported.".format(col, card))


if ((x_categorical && length(unique(h2o.levels(x))) > 2) || (y_categorical && length(unique(h2o.levels(y))) > 2)) {
warning("NA")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the warning more informative.

@Mohit1345
Copy link
Contributor Author

Thank you @Mohit1345. It will need a bit more changes before merging. Please don't be discouraged by the requested changes, h2o's code takes some time to get used to.

Sure, will try to make changes

@Devanshusisodiya
Copy link

Devanshusisodiya commented Feb 11, 2024

Hi @tomasfryda @wendycwong , I have raised a PR for this issue, please take a look at it here

@Devanshusisodiya
Copy link

Hi @tomasfryda @wendycwong please review this PR #16070

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Warning if User Passes Categorical Columns to h2o.cor()
3 participants