New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] FEA Gower distance #16834
base: main
Are you sure you want to change the base?
[MRG] FEA Gower distance #16834
Conversation
I'm not sure why the CI fails, and I can't figure why the link to the @cmarmo do you happen to know how to fix the sphinx issues here? |
it's because it's not included in the API ref (classes.rst) i think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making a start.
cols = [] | ||
|
||
col_idx = _get_column_indices(X, cols) | ||
X_cat = _safe_indexing(X, col_idx, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care that this is (I think) always a copying operation even if col_idx is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't that be an issue with _safe_indexing
?
returns 1-S. | ||
""" | ||
def _nanmanhatan(x, y): | ||
return np.nansum(np.abs(x - y)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using this row-wise function call worthwhile relative to something more vectorised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kinda easy to do more vectorized when X and Y are of the same size, otherwise I'm not sure if it's worth the complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this implementation is significantly faster than the cosine distances for instance. So I don't think we should worry about the speed too much?
return np.sum(~_object_dtype_isnan(x) & ~_object_dtype_isnan(y)) | ||
|
||
def _nanhamming(x, y): | ||
return np.sum(x != y) - np.sum( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or _non_nans(x, y) - np.sum(x == y)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't seem to help the time at least when I test it.
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
I think this is up to be reviewed. It's at a good state IMO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iterating
if not hasattr(X, "shape"): | ||
X = check_array(X, dtype=np.object, force_all_finite=False) | ||
|
||
cols = categorical_features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what is gained by aliasing here.
|
||
def gower_distances(X, Y=None, categorical_features=None, scale=True, | ||
min_values=None, scale_factor=None): | ||
"""Compute the distances between the observations in X and Y, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP257: this should be a one-line summary
|
||
min_values : ndarray of shape (n_features,), default=None | ||
Per feature adjustment for minimum. Equivalent to | ||
``min_values - X.min(axis=0) * scale_factor`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it a function of itself?
and ``scale_factor`` have to be provided as well. | ||
|
||
min_values : ndarray of shape (n_features,), default=None | ||
Per feature adjustment for minimum. Equivalent to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "Equivalent to" apply to the case where min_values=None
? Use the words "if None"
if X is None or len(X) == 0: | ||
raise ValueError("X can not be None or empty") | ||
|
||
if scale: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it's worth running locals().update(_precompute_metric_params(X, Y, 'gower', **locals())
to reduce code duplication... and move to a more elegant solution eventually?
dtype=float, | ||
force_all_finite=False) | ||
if scale: | ||
scale_data = X_num if Y_num is X_num else np.vstack((X_num, Y_num)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the else
case should never be executed.
@@ -1744,6 +1952,17 @@ def pairwise_distances(X, Y=None, metric="euclidean", *, n_jobs=None, | |||
check_non_negative(X, whom=whom) | |||
return X | |||
elif metric in PAIRWISE_DISTANCE_FUNCTIONS: | |||
if metric == 'gower': | |||
""" | |||
# These convertions are necessary for matrices with string values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# These convertions are necessary for matrices with string values | |
# These conversions are necessary for matrices with string values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear why this code is commented out.
any word on this? |
@adrinjalali do you think you'll be available to help deliver this over September? |
@jnothman yeah I have this and sample props on my plate as major things I'm going to focus on. The new job started 1st of August, and I'm slowly finding my routine and settling down. So I'm optimistic that the answer is yes. |
@adrinjalali I'm happy to push this to approval if you're up to making the changes! :) |
@jnothman It'll be tough for me to touch this in the next 2 months, after that I'm happy to work on it. I'm happy to review if you push changes here. |
I'm curious what the status of this is? I've had to roll out my own (non-optimized) implementation of Gower's distance in the meantime. |
Closes #9555
Fixes #5884
This PR continues the work done in #9555. Since the discussion on that PR is really long at this point, and this is a different implementation based on
scipy.spatial.distance.cdist
, I decided to create a new one to keep discussions separate.Up until 4859b81 (at the time of opening this PR), there are minor changes to the existing tests. I'm planning to clean up the tests, but they were very helpful catching the issues with the edge cases, good job on that @marcelobeckmann.
However, I've changed a few design decisions:
categorical_features
is mandatory, and if not passed, the data is assumed to be numerical. If string columns are present, it'll simply fail to convert them to floats (a numpy error).categorical_features
now also accepts a callable, which can be created usingmake_column_selector
the same way as it's used inColumnTransformer
.MinMaxScaler
instead of its own scaler.Before I finalize the user guide and fix the tests, there are a few issues I'd like us to discuss, and I'll leave inline comments to start those discussions.