Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] FEA Gower distance #16834

Open
wants to merge 235 commits into
base: main
Choose a base branch
from
Open

Conversation

adrinjalali
Copy link
Member

Closes #9555
Fixes #5884

This PR continues the work done in #9555. Since the discussion on that PR is really long at this point, and this is a different implementation based on scipy.spatial.distance.cdist, I decided to create a new one to keep discussions separate.

Up until 4859b81 (at the time of opening this PR), there are minor changes to the existing tests. I'm planning to clean up the tests, but they were very helpful catching the issues with the edge cases, good job on that @marcelobeckmann.

However, I've changed a few design decisions:

  • categorical_features is mandatory, and if not passed, the data is assumed to be numerical. If string columns are present, it'll simply fail to convert them to floats (a numpy error).
  • categorical_features now also accepts a callable, which can be created using make_column_selector the same way as it's used in ColumnTransformer.
  • It uses a MinMaxScaler instead of its own scaler.

Before I finalize the user guide and fix the tests, there are a few issues I'd like us to discuss, and I'll leave inline comments to start those discussions.

@adrinjalali
Copy link
Member Author

I'm not sure why the CI fails, and I can't figure why the link to the gower_distances is not working in metrics.rst, otherwise this is ready for another round of reviews.

@cmarmo do you happen to know how to fix the sphinx issues here?

@NicolasHug
Copy link
Member

I can't figure why the link to the gower_distances is not working in metrics.rst

it's because it's not included in the API ref (classes.rst) i think

@cmarmo
Copy link
Member

cmarmo commented Apr 15, 2020

@cmarmo do you happen to know how to fix the sphinx issues here?

Not sure... maybe try to round to 0.667... in the Gower distance example?
I was wrong, the solution is in @jnothman comment

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making a start.

sklearn/metrics/pairwise.py Outdated Show resolved Hide resolved
cols = []

col_idx = _get_column_indices(X, cols)
X_cat = _safe_indexing(X, col_idx, axis=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care that this is (I think) always a copying operation even if col_idx is empty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't that be an issue with _safe_indexing?

sklearn/metrics/pairwise.py Outdated Show resolved Hide resolved
sklearn/metrics/pairwise.py Show resolved Hide resolved
sklearn/metrics/pairwise.py Show resolved Hide resolved
sklearn/metrics/pairwise.py Outdated Show resolved Hide resolved
returns 1-S.
"""
def _nanmanhatan(x, y):
return np.nansum(np.abs(x - y))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is using this row-wise function call worthwhile relative to something more vectorised?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kinda easy to do more vectorized when X and Y are of the same size, otherwise I'm not sure if it's worth the complexity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this implementation is significantly faster than the cosine distances for instance. So I don't think we should worry about the speed too much?

sklearn/metrics/pairwise.py Outdated Show resolved Hide resolved
return np.sum(~_object_dtype_isnan(x) & ~_object_dtype_isnan(y))

def _nanhamming(x, y):
return np.sum(x != y) - np.sum(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or _non_nans(x, y) - np.sum(x == y)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seem to help the time at least when I test it.

doc/modules/metrics.rst Outdated Show resolved Hide resolved
@adrinjalali adrinjalali changed the title [WIP] FEA Gower distance [MRG] FEA Gower distance May 29, 2020
@adrinjalali
Copy link
Member Author

I think this is up to be reviewed. It's at a good state IMO.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterating

if not hasattr(X, "shape"):
X = check_array(X, dtype=np.object, force_all_finite=False)

cols = categorical_features
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what is gained by aliasing here.


def gower_distances(X, Y=None, categorical_features=None, scale=True,
min_values=None, scale_factor=None):
"""Compute the distances between the observations in X and Y,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP257: this should be a one-line summary


min_values : ndarray of shape (n_features,), default=None
Per feature adjustment for minimum. Equivalent to
``min_values - X.min(axis=0) * scale_factor``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it a function of itself?

and ``scale_factor`` have to be provided as well.

min_values : ndarray of shape (n_features,), default=None
Per feature adjustment for minimum. Equivalent to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "Equivalent to" apply to the case where min_values=None? Use the words "if None"

if X is None or len(X) == 0:
raise ValueError("X can not be None or empty")

if scale:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it's worth running locals().update(_precompute_metric_params(X, Y, 'gower', **locals()) to reduce code duplication... and move to a more elegant solution eventually?

dtype=float,
force_all_finite=False)
if scale:
scale_data = X_num if Y_num is X_num else np.vstack((X_num, Y_num))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the else case should never be executed.

@@ -1744,6 +1952,17 @@ def pairwise_distances(X, Y=None, metric="euclidean", *, n_jobs=None,
check_non_negative(X, whom=whom)
return X
elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
if metric == 'gower':
"""
# These convertions are necessary for matrices with string values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# These convertions are necessary for matrices with string values
# These conversions are necessary for matrices with string values

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear why this code is commented out.

@AlistairLR112
Copy link

any word on this?

@jnothman
Copy link
Member

@adrinjalali do you think you'll be available to help deliver this over September?

@adrinjalali
Copy link
Member Author

@jnothman yeah I have this and sample props on my plate as major things I'm going to focus on. The new job started 1st of August, and I'm slowly finding my routine and settling down. So I'm optimistic that the answer is yes.

Base automatically changed from master to main January 22, 2021 10:52
@jnothman
Copy link
Member

@adrinjalali I'm happy to push this to approval if you're up to making the changes! :)

@adrinjalali
Copy link
Member Author

adrinjalali commented Mar 21, 2021

@jnothman It'll be tough for me to touch this in the next 2 months, after that I'm happy to work on it. I'm happy to review if you push changes here.

@KendallPark
Copy link

KendallPark commented Aug 13, 2021

I'm curious what the status of this is? I've had to roll out my own (non-optimized) implementation of Gower's distance in the meantime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Gower Similarity Coefficient