[MRG] FEA Gower distance #16834

adrinjalali · 2020-04-03T14:56:06Z

This PR continues the work done in #9555. Since the discussion on that PR is really long at this point, and this is a different implementation based on scipy.spatial.distance.cdist, I decided to create a new one to keep discussions separate.

Up until 4859b81 (at the time of opening this PR), there are only minor changes to the existing tests. I'm planning to clean up the tests, but they were very helpful in catching issues with the edge cases; good job on that, @marcelobeckmann.

However, I've changed a few design decisions:

- categorical_features must be passed explicitly for any column to be treated as categorical; if it is not passed, the data is assumed to be numerical. If string columns are present, it'll simply fail to convert them to floats (a numpy error).
- categorical_features now also accepts a callable, which can be created using make_column_selector the same way as it's used in ColumnTransformer.
- It uses a MinMaxScaler instead of its own scaler.
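To make the decisions above concrete, here is a rough sketch of how a selector callable and a MinMaxScaler could fit together. It is not the code in this PR: `gower_sketch` is a hypothetical helper, it ignores missing values and the `Y` argument, and the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.preprocessing import MinMaxScaler


def gower_sketch(df, categorical_features):
    """Pairwise Gower distances between the rows of df (illustrative only)."""
    # A selector callable returns the names of the categorical columns.
    cat_cols = list(categorical_features(df))
    num_cols = [c for c in df.columns if c not in cat_cols]

    n_samples, n_features = df.shape
    total = np.zeros((n_samples, n_samples))

    if num_cols:
        # Scale each numerical column to [0, 1], then sum absolute differences.
        X_num = MinMaxScaler().fit_transform(df[num_cols].to_numpy(dtype=float))
        total += np.abs(X_num[:, None, :] - X_num[None, :, :]).sum(axis=2)

    if cat_cols:
        # Each categorical mismatch contributes 1 to the sum.
        X_cat = df[cat_cols].to_numpy(dtype=object)
        total += (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)

    # Average the per-feature contributions.
    return total / n_features


df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "color": pd.Series(["red", "red", "blue"], dtype=object),
})
D = gower_sketch(df, make_column_selector(dtype_include=object))
```

In this toy example the first two rows differ only in the scaled age (0.5), so their distance is 0.5 / 2 = 0.25.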

Before I finalize the user guide and fix the tests, there are a few issues I'd like us to discuss, and I'll leave inline comments to start those discussions.

adrinjalali · 2020-04-15T14:20:28Z

I'm not sure why the CI fails, and I can't figure out why the link to gower_distances is not working in metrics.rst; otherwise this is ready for another round of reviews.

@cmarmo do you happen to know how to fix the sphinx issues here?

NicolasHug · 2020-04-15T14:25:30Z

> I can't figure out why the link to gower_distances is not working in metrics.rst

It's because it's not included in the API reference (classes.rst), I think.

cmarmo · 2020-04-15T16:58:13Z

> @cmarmo do you happen to know how to fix the sphinx issues here?

~~Not sure... maybe try to round to 0.667... in the Gower distance example?~~
I was wrong; the solution is in @jnothman's comment.

jnothman

Making a start.

jnothman · 2020-04-22T00:32:29Z

sklearn/metrics/pairwise.py

+        cols = []
+
+    col_idx = _get_column_indices(X, cols)
+    X_cat = _safe_indexing(X, col_idx, axis=1)


Do we care that this is (I think) always a copying operation even if col_idx is empty?

adrinjalali · 2020-05-28

Shouldn't that be an issue with _safe_indexing?
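For reference, the copying behaviour in question can be seen with plain NumPy indexing. This is only a sketch of the general NumPy rule; it does not exercise _safe_indexing itself, and whether _safe_indexing takes this path for every input type is not checked here.

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)

# Indexing with a list of column indices ("fancy" indexing) allocates a new
# array, even when the list is empty.
empty_selection = X[:, []]
some_selection = X[:, [0]]

print(empty_selection.shape)                # (3, 0)
print(np.shares_memory(X, some_selection))  # False
```

With an empty index list the copy holds no data, so the cost is mostly the call overhead rather than memory.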

jnothman · 2020-04-22T00:43:34Z

sklearn/metrics/pairwise.py

+    returns 1-S.
+    """
+    def _nanmanhatan(x, y):
+        return np.nansum(np.abs(x - y))


Is using this row-wise function call worthwhile relative to something more vectorised?

adrinjalali · 2020-05-29

It's fairly easy to make this more vectorized when X and Y are the same size; otherwise I'm not sure it's worth the complexity.

adrinjalali · 2020-05-29

Also, this implementation is significantly faster than the cosine distances, for instance, so I don't think we should worry about the speed too much.
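To illustrate the trade-off being discussed, here is a sketch that compares a row-wise callable passed to cdist with a broadcast version. Both function names are hypothetical, and no timings are claimed; the broadcast version trades Python-level calls for a large intermediate array.

```python
import numpy as np
from scipy.spatial.distance import cdist


def nanmanhattan_rowwise(X, Y):
    # One Python-level call per pair of rows, as cdist does with a callable.
    return cdist(X, Y, metric=lambda x, y: np.nansum(np.abs(x - y)))


def nanmanhattan_broadcast(X, Y):
    # Builds an (n_x, n_y, n_features) intermediate, so memory grows quickly.
    return np.nansum(np.abs(X[:, None, :] - Y[None, :, :]), axis=2)


X = np.array([[1.0, np.nan], [2.0, 3.0]])
Y = np.array([[0.0, 1.0], [np.nan, np.nan], [2.0, 5.0]])

D_rowwise = nanmanhattan_rowwise(X, Y)
D_broadcast = nanmanhattan_broadcast(X, Y)
```

Broadcasting works for X and Y of different sizes as well, but for large inputs the intermediate array would probably need chunking, which is where the extra complexity comes in.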

jnothman · 2020-04-22T00:47:44Z

sklearn/metrics/pairwise.py

+        return np.sum(~_object_dtype_isnan(x) & ~_object_dtype_isnan(y))
+
+    def _nanhamming(x, y):
+        return np.sum(x != y) - np.sum(


or _non_nans(x, y) - np.sum(x == y)?

adrinjalali · 2020-05-29

It doesn't seem to help the timing, at least when I test it.

adrinjalali · 2020-05-29T14:27:13Z

I think this is up to be reviewed. It's at a good state IMO.

jnothman

iterating

jnothman · 2020-05-31T13:01:13Z

sklearn/metrics/pairwise.py

+    if not hasattr(X, "shape"):
+        X = check_array(X, dtype=np.object, force_all_finite=False)
+
+    cols = categorical_features


I don't understand what is gained by aliasing here.

jnothman · 2020-05-31T13:02:43Z

sklearn/metrics/pairwise.py

+
+def gower_distances(X, Y=None, categorical_features=None, scale=True,
+                    min_values=None, scale_factor=None):
+    """Compute the distances between the observations in X and Y,


PEP257: this should be a one-line summary

jnothman · 2020-05-31T13:04:40Z

sklearn/metrics/pairwise.py

+
+    min_values : ndarray of shape (n_features,), default=None
+        Per feature adjustment for minimum. Equivalent to
+        ``min_values - X.min(axis=0) * scale_factor``


why is it a function of itself?

jnothman · 2020-05-31T13:05:01Z

sklearn/metrics/pairwise.py

+        and ``scale_factor`` have to be provided as well.
+
+    min_values : ndarray of shape (n_features,), default=None
+        Per feature adjustment for minimum. Equivalent to


Does "Equivalent to" apply to the case where min_values=None? Use the words "if None"
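The docstring wording appears to mirror the description of MinMaxScaler's `min_` attribute, where the leading "min" is the lower bound of `feature_range` rather than the attribute itself. That origin is an assumption on my part. The relationship in MinMaxScaler can be checked with a short sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
scaler = MinMaxScaler().fit(X)

# With the default feature_range=(0, 1):
#   scale_ = 1 / (data_max_ - data_min_)
#   min_   = 0 - data_min_ * scale_
expected_min = 0.0 - X.min(axis=0) * scaler.scale_

# The transform is then an affine map using the two fitted attributes.
X_scaled = X * scaler.scale_ + scaler.min_
```

If the parameter is meant to carry the same quantity, spelling out the lower bound (0 here) in the docstring would avoid the apparent self-reference.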

jnothman · 2020-05-31T13:10:07Z

sklearn/metrics/pairwise.py

+    if X is None or len(X) == 0:
+        raise ValueError("X can not be None or empty")
+
+    if scale:


I wonder whether it's worth running locals().update(_precompute_metric_params(X, Y, 'gower', **locals())) to reduce code duplication... and move to a more elegant solution eventually?
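One caveat worth checking before going this way: as far as I know, in CPython writing to the mapping returned by locals() inside a function does not rebind that function's local variables. The sketch below uses hypothetical function names and a made-up parameter dict to show the difference from unpacking the result explicitly.

```python
def rebind_via_locals():
    scale_factor = None
    # Attempt to overwrite a local variable through the locals() mapping.
    locals().update({"scale_factor": 2.0})
    return scale_factor


def rebind_explicitly(params):
    # Assigning from the returned dict to a named local does take effect.
    scale_factor = params.get("scale_factor")
    return scale_factor


print(rebind_via_locals())                       # None
print(rebind_explicitly({"scale_factor": 2.0}))  # 2.0
```

So the precomputed values would likely need to be assigned back by name, or kept in a dict and read from there.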

jnothman · 2020-05-31T13:10:47Z

sklearn/metrics/pairwise.py

+                                             dtype=float,
+                                             force_all_finite=False)
+        if scale:
+            scale_data = X_num if Y_num is X_num else np.vstack((X_num, Y_num))


the else case should never be executed.

jnothman · 2020-05-31T13:12:10Z

sklearn/metrics/pairwise.py

@@ -1744,6 +1952,17 @@ def pairwise_distances(X, Y=None, metric="euclidean", *, n_jobs=None,
        check_non_negative(X, whom=whom)
        return X
    elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
+        if metric == 'gower':
+            """
+            # These convertions are necessary for matrices with string values


Suggested change

-            # These convertions are necessary for matrices with string values
+            # These conversions are necessary for matrices with string values

It's not clear why this code is commented out.

AlistairLR112 · 2020-07-20T12:59:14Z

Any word on this?

jnothman · 2020-08-21T02:43:32Z

@adrinjalali do you think you'll be available to help deliver this over September?

adrinjalali · 2020-08-21T12:25:09Z

@jnothman Yeah, I have this and sample props on my plate as the major things I'm going to focus on. The new job started on the 1st of August, and I'm slowly finding my routine and settling down. So I'm optimistic that the answer is yes.

jnothman · 2021-03-16T01:17:28Z

@adrinjalali I'm happy to push this to approval if you're up to making the changes! :)

adrinjalali · 2021-03-21T18:46:41Z

@jnothman It'll be tough for me to touch this in the next 2 months; after that I'm happy to work on it. I'm also happy to review if you push changes here.

KendallPark · 2021-08-13T04:30:41Z

I'm curious what the status of this is. I've had to roll my own (non-optimized) implementation of Gower's distance in the meantime.

marcelobeckmann and others added 30 commits April 10, 2019 07:32

Fix flake8 errors

e5fdbbb

Test rebase

dcf96f4

Test rebase

41a2748

Test CI

d3221a7

Test rebase

da71fba

Merge after remote pull

a63c43f

Test rebase

e50d9d9

Test rebase

47b20a9

Test rebase

12b773b

Test CI

a32f8e7

Test rebase

3480bf2

Test rebase

7be14ba

Text rebase

3d1f2bc

Fix CI errors

181a750

Improve test coverage

b8da4c9

Merge branch 'master' into HEAD

e31f72b

Test CI

5096b76

Test rebase

1230b6f

More changes

16b756f

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into b5584

db6303b

Fix merge

7cb7ce9

Improve test coverage

9679345

Test rebase

066d9fa

Test rebase

348bf40

Test CI

1b6f8b6

Test rebase

89b8884

Test rebase

ab8a61d

Test rebase

ef90d8e

Test rebase

dd1fdcd

Test rebase

71ce0c5

adrinjalali added 4 commits April 15, 2020 11:57

simplify tests

b0e3b11

fix sphinx warning

c0502b9

skip metrics.rst on pandas-less CI

adb7854

enable ellipsis

7287b1a

jnothman reviewed Apr 22, 2020

adrinjalali and others added 10 commits May 28, 2020 14:00

Merge remote-tracking branch 'upstream/master' into b5584

f31ca7b

fix example in metrics.rst

330f57a

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

Merge branch 'b5584' of github.com:adrinjalali/scikit-learn into b5584

aa8758d

fix merge

0e8feb8

require scaling factors when Y is given and scale=True

9635f76

fix docstring

3708a67

Joel's optimization and categorical features consistency

9262412

remove unused import

d8b445e

remove doctest directive

3d433fc

fix metrics.rst doctest

7b6278c

adrinjalali changed the title ~~[WIP] FEA Gower distance~~ [MRG] FEA Gower distance May 29, 2020

jnothman reviewed May 31, 2020

Base automatically changed from master to main January 22, 2021 10:52

iamDecode mentioned this pull request Jan 24, 2022

Introduce k-nearest neighbors estimators iamDecode/sklearn-pmml-model#38

Merged
