Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added warning , if column is multi categorical in h2o.cor() #12903 #15674

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
13 changes: 10 additions & 3 deletions h2o-py/h2o/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3182,9 +3182,16 @@ def cor(self, y=None, na_rm=False, use=None, method="Pearson"):
assert_is_type(y, H2OFrame, None)
assert_is_type(na_rm, bool)
assert_is_type(use, None, "everything", "all.obs", "complete.obs")
if y is None:
y = self
if use is None: use = "complete.obs" if na_rm else "everything"
Comment on lines -3185 to -3187
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be deleted.

y_categorical = any(self.types[col_name] == "enum" for col_name in y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect - y H2OFrame not a list of columns. Also you can use y.isfactor() to check if it's categorical - the output is a list of boolean values in the same order as are the columns.


if y_categorical:
num_unique_levels = {col: len(self[col].levels()) for col in y}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of len(self[col].levels()) use self[col].nlevels()[0]. nlevels returns the just a list of cardinalities so there is less communication with the backend and lower memory use in the python client. Also since the y is an H2OFrame (see the assert on the line 3182) you can use something like dict(zip(y.columns, y.nlevels())) to get the same thing.

multi_categorical = any(num_levels > 2 for num_levels in num_unique_levels.values())

if multi_categorical:
import warnings
warnings.warn("NA")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the warning more informative.

For example:

for col, card in num_unique_levels.items():
    if card > 2:
        warnings.warn("Column {} contains {} levels. Only numerical and binary columns are supported.".format(col, card))


if self.nrow == 1 or (self.ncol == 1 and y.ncol == 1): return ExprNode("cor", self, y, use, method)._eager_scalar()
return H2OFrame._expr(expr=ExprNode("cor", self, y, use, method))._frame()

Expand Down
23 changes: 17 additions & 6 deletions h2o-r/h2o-package/R/frame.R
Original file line number Diff line number Diff line change
Expand Up @@ -2880,25 +2880,36 @@ var <- function(x, y = NULL, na.rm = FALSE, use) {
#' cor(prostate$AGE)
#' }
#' @export
h2o.cor <- function(x, y=NULL,na.rm = FALSE, use, method="Pearson"){
h2o.cor <- function(x, y = NULL, na.rm = FALSE, use, method = "Pearson") {
# Eager, mostly to match prior semantics but no real reason it need to be
if( is.null(y) ){
if (is.null(y)) {
y <- x
}
if(missing(use)) {
if (missing(use)) {
if (na.rm) use <- "complete.obs" else use <- "everything"
}

if (is.null(method) || is.na(method)) {
stop("Correlation method must be specified.")
}

# Check for categorical columns in x and y
x_categorical <- any(h2o.isfactor(x))
y_categorical <- any(h2o.isfactor(y))

if ((x_categorical && length(unique(h2o.levels(x))) > 2) || (y_categorical && length(unique(h2o.levels(y))) > 2)) {
warning("NA")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the warning more informative.

}


# Eager, mostly to match prior semantics but no real reason it need to be
expr <- .newExpr("cor",x,y,.quote(use), .quote(method))
if( (nrow(x)==1L || (ncol(x)==1L && ncol(y)==1L)) ) .eval.scalar(expr)
else .fetch.data(expr,ncol(x))
expr <- .newExpr("cor", x, y, .quote(use), .quote(method))
if ((nrow(x) == 1L || (ncol(x) == 1L && ncol(y) == 1L))) .eval.scalar(expr)
else .fetch.data(expr, ncol(x))
}



#'
#' Compute a pairwise distance measure between all rows of two numeric H2OFrames.
#'
Expand Down