[MRG] FEA Gower distance #16834

Open
wants to merge 235 commits into
base: main
Changes from 225 commits
e5fdbbb
Fix flake8 errors
marcelobeckmann Dec 20, 2018
dcf96f4
Test rebase
marcelobeckmann Dec 20, 2018
41a2748
Test rebase
marcelobeckmann Apr 10, 2019
d3221a7
Test CI
marcelobeckmann Apr 25, 2019
da71fba
Test rebase
marcelobeckmann Apr 25, 2019
a63c43f
Merge after remote pull
marcelobeckmann Apr 25, 2019
e50d9d9
Test rebase
marcelobeckmann Apr 25, 2019
47b20a9
Test rebase
marcelobeckmann Dec 20, 2018
12b773b
Test rebase
marcelobeckmann Apr 10, 2019
a32f8e7
Test CI
marcelobeckmann Apr 25, 2019
3480bf2
Test rebase
marcelobeckmann Apr 25, 2019
7be14ba
Test rebase
marcelobeckmann Apr 25, 2019
3d1f2bc
Text rebase
marcelobeckmann Apr 26, 2019
181a750
Fix CI errors
marcelobeckmann Apr 26, 2019
b8da4c9
Improve test coverage
marcelobeckmann Apr 30, 2019
e31f72b
Merge branch 'master' into HEAD
jnothman Apr 30, 2019
5096b76
Test CI
marcelobeckmann Apr 25, 2019
1230b6f
Test rebase
marcelobeckmann Apr 25, 2019
16b756f
More changes
marcelobeckmann Apr 30, 2019
db6303b
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann May 1, 2019
7cb7ce9
Fix merge
marcelobeckmann May 1, 2019
9679345
Improve test coverage
marcelobeckmann May 1, 2019
066d9fa
Test rebase
marcelobeckmann Dec 20, 2018
348bf40
Test rebase
marcelobeckmann Apr 10, 2019
1b6f8b6
Test CI
marcelobeckmann Apr 25, 2019
89b8884
Test rebase
marcelobeckmann Apr 25, 2019
ab8a61d
Test rebase
marcelobeckmann Apr 25, 2019
ef90d8e
Test rebase
marcelobeckmann Dec 20, 2018
dd1fdcd
Test rebase
marcelobeckmann Apr 25, 2019
71ce0c5
Test rebase
marcelobeckmann Apr 25, 2019
705fec9
Fix CI errors
marcelobeckmann Apr 26, 2019
9e5a2ac
Improve test coverage
marcelobeckmann Apr 30, 2019
1ed4550
TST Remove np.seterr calls in test files (#13712)
aditya1702 Apr 26, 2019
a3a3135
FIX Correct brier_score_loss when there's only one class in y_true (#…
qinhanmin2014 Apr 26, 2019
ecb50be
CI skip HashVectorizer test on pypy (#13729)
glemaitre Apr 26, 2019
57693b1
MAINT removed close_figure helper (#13730)
NicolasHug Apr 26, 2019
9fd98c7
[MRG+2] Faster Gradient Boosting Decision Trees with binned features …
NicolasHug Apr 26, 2019
ed2ce90
DOC Fixing language in Hamming loss docstring. (#13735)
mitar Apr 27, 2019
6708e0d
FEA OPTICS: add extract_xi method (#12077)
adrinjalali Apr 27, 2019
2ca1fa4
DOC new convention is :pr: not :issue:
jnothman Apr 27, 2019
fb21d0f
DOC what's new cleaning (#13706)
jnothman Apr 27, 2019
8cc70af
FIX euclidean_distances float32 numerical instabilities (#13554)
jeremiedbb Apr 29, 2019
a7654e4
DOC Update release dates
jnothman Apr 27, 2019
52bb273
DOC Add commit contributors
jnothman Apr 27, 2019
fcc9519
DOC bump version
jnothman Apr 27, 2019
1df1fea
DOC move 0.20 to previous releases
jnothman Apr 29, 2019
0339b55
Added distance_threshold parameter to hierarchical clustering (#9069)
VathsalaAchar Apr 29, 2019
19f1c57
DOC add missing kernels to pairwise_kernels (#13746)
hossein-pourbozorg Apr 30, 2019
dcf3a37
DOC more ambiguous May release date for 0.21
jnothman Apr 30, 2019
2b1a697
TST Ignore Kmeans test failures on MacOS (#12648)
qinhanmin2014 Apr 30, 2019
2cb2802
FIX Optics paper typo which resulted in undersized clusters (#13750)
qinhanmin2014 Apr 30, 2019
84dfcf1
TST use approximate equality for float comparison (#13749)
jnothman Apr 30, 2019
d798f06
More changes
marcelobeckmann Apr 30, 2019
4889385
Improve test coverage
marcelobeckmann May 1, 2019
1cd6979
Merge branch 'b5584' of https://github.com/marcelobeckmann/scikit-lea…
marcelobeckmann May 1, 2019
206cd26
Merge branch 'master' of https://github.com/marcelobeckmann/scikit-learn
marcelobeckmann May 1, 2019
43b77ef
Test rebase
marcelobeckmann Dec 20, 2018
6d847d4
Test rebase
marcelobeckmann Apr 10, 2019
9379e2c
Test CI
marcelobeckmann Apr 25, 2019
4bf77e7
Test rebase
marcelobeckmann Apr 25, 2019
992b5cb
Test rebase
marcelobeckmann Apr 25, 2019
dbc6f55
Test rebase
marcelobeckmann Dec 20, 2018
da825de
Test rebase
marcelobeckmann Apr 25, 2019
3090915
Test rebase
marcelobeckmann Apr 25, 2019
4d10175
Fix CI errors
marcelobeckmann Apr 26, 2019
460484f
Improve test coverage
marcelobeckmann Apr 30, 2019
8b7f236
TST Remove np.seterr calls in test files (#13712)
aditya1702 Apr 26, 2019
c699f8d
FIX Correct brier_score_loss when there's only one class in y_true (#…
qinhanmin2014 Apr 26, 2019
ddf9022
CI skip HashVectorizer test on pypy (#13729)
glemaitre Apr 26, 2019
b3ad764
MAINT removed close_figure helper (#13730)
NicolasHug Apr 26, 2019
b99aacc
[MRG+2] Faster Gradient Boosting Decision Trees with binned features …
NicolasHug Apr 26, 2019
8f1bcd3
DOC Fixing language in Hamming loss docstring. (#13735)
mitar Apr 27, 2019
cda1b54
FEA OPTICS: add extract_xi method (#12077)
adrinjalali Apr 27, 2019
6b10f24
DOC new convention is :pr: not :issue:
jnothman Apr 27, 2019
5209834
DOC what's new cleaning (#13706)
jnothman Apr 27, 2019
cc0184f
FIX euclidean_distances float32 numerical instabilities (#13554)
jeremiedbb Apr 29, 2019
ceb4b44
DOC Update release dates
jnothman Apr 27, 2019
172d21f
DOC Add commit contributors
jnothman Apr 27, 2019
f3b1544
DOC bump version
jnothman Apr 27, 2019
8ed3ecb
DOC move 0.20 to previous releases
jnothman Apr 29, 2019
0c4d489
Added distance_threshold parameter to hierarchical clustering (#9069)
VathsalaAchar Apr 29, 2019
6c1054e
DOC add missing kernels to pairwise_kernels (#13746)
hossein-pourbozorg Apr 30, 2019
ca12d35
DOC more ambiguous May release date for 0.21
jnothman Apr 30, 2019
9c09d9b
TST Ignore Kmeans test failures on MacOS (#12648)
qinhanmin2014 Apr 30, 2019
ed74af0
FIX Optics paper typo which resulted in undersized clusters (#13750)
qinhanmin2014 Apr 30, 2019
f40273f
TST use approximate equality for float comparison (#13749)
jnothman Apr 30, 2019
7dd2a9b
Improve test coverage
marcelobeckmann May 1, 2019
c5a4472
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann May 1, 2019
3a9f576
Improve test coverage
marcelobeckmann May 1, 2019
d3b10fe
Fix flake8 errors
marcelobeckmann May 1, 2019
fcb4763
Use _precompute_metric_params
marcelobeckmann May 7, 2019
6f2d98d
Fix flake8 errors
marcelobeckmann May 7, 2019
5c6c30d
Fix flake8 errors
marcelobeckmann May 7, 2019
745de05
Add ranges as parameters for data scale
marcelobeckmann May 11, 2019
6fa6e88
Fix flake8 errors
marcelobeckmann May 11, 2019
077b3cb
Fix flake8 errors
marcelobeckmann May 11, 2019
ded653a
Fix incorrect replace in metrics.rst
marcelobeckmann May 12, 2019
eb1ee32
Fix flake8 errors
marcelobeckmann May 12, 2019
782eb3d
Fix flake8 errors
marcelobeckmann May 12, 2019
4c03f5c
Update with master branch
marcelobeckmann May 20, 2019
3ca56d5
Update with master branch
marcelobeckmann May 20, 2019
29d82d5
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann May 21, 2019
16d339d
Merge with master branch
marcelobeckmann May 21, 2019
6ea57ac
Merge with master branch
marcelobeckmann May 21, 2019
16b9377
Fix utf-8 encoding
marcelobeckmann May 21, 2019
1474df8
Fix incorrect merge
marcelobeckmann May 22, 2019
49a5ac2
Merge branch 'b5584' of https://github.com/marcelobeckmann/scikit-lea…
marcelobeckmann May 22, 2019
c92d47d
Fix merge issues in pairwise.py
jnothman May 22, 2019
67491ce
Remove bak files
marcelobeckmann May 22, 2019
e123d36
Provide proper support to pairwise_distances method
marcelobeckmann Jun 13, 2019
a993bbe
Fix flake8 errors
marcelobeckmann Jun 14, 2019
5e4cf76
Fix flake8 errors
marcelobeckmann Jun 14, 2019
66650fa
Fix flake8 errors
marcelobeckmann Jun 14, 2019
23966ff
Fix flake8 errors
marcelobeckmann Jun 14, 2019
ae7f556
Apply minor fixes, tests, and comments
marcelobeckmann Jun 17, 2019
d257bba
Simplified range calculation
marcelobeckmann Jun 18, 2019
52ad60f
Fix flake8 error
marcelobeckmann Jun 18, 2019
336c183
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jun 20, 2019
c4959fa
Remove gower for sparse matrix tests
marcelobeckmann Jun 21, 2019
faa404f
Fix flake8 errors
marcelobeckmann Jun 21, 2019
4a2d89e
Remove unnecessary if
marcelobeckmann Jun 21, 2019
5b84803
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jun 21, 2019
cc58403
Makes a proper user of precomputed parameters
marcelobeckmann Jul 7, 2019
f69fd04
Fix flake8 errors
marcelobeckmann Jul 7, 2019
098bef9
Fix flake8 errors
marcelobeckmann Jul 7, 2019
bab9ca0
Add more tests cases
marcelobeckmann Jul 7, 2019
bba8828
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jul 7, 2019
26779a0
Remove variables and simplify code readability
marcelobeckmann Jul 29, 2019
87e2f63
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Jul 29, 2019
ff6366b
Fix flake8 errors
marcelobeckmann Jul 29, 2019
0931f81
Improve robustness for pairwise_distances with gower
marcelobeckmann Aug 6, 2019
fa44c39
Fix flake8 errors
marcelobeckmann Aug 6, 2019
4a46ae1
Fix flake8 errors
marcelobeckmann Aug 6, 2019
38d99d5
Fix flake8 errors
marcelobeckmann Aug 6, 2019
a93efa5
Fix flake8 errors
marcelobeckmann Aug 6, 2019
512428d
Fix flake8 errors
marcelobeckmann Aug 6, 2019
6a403d5
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 6, 2019
29cd45e
Remove unnecessary conversion
marcelobeckmann Aug 6, 2019
a811d57
Improve robustness to test categorical values in other deployments
marcelobeckmann Aug 7, 2019
4d6d584
Fix flake8 errors
marcelobeckmann Aug 7, 2019
dbd4af5
Fix compilation error
marcelobeckmann Aug 7, 2019
091a7fa
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 7, 2019
117cca0
Fix flake8 errors
marcelobeckmann Aug 7, 2019
34e78ae
Detect incorrect NaN comparison in other deployments
marcelobeckmann Aug 8, 2019
0a802c3
Detect incorrect NaN comparison in other deployments
marcelobeckmann Aug 12, 2019
6df57c2
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 12, 2019
e610965
Detect test discrepancies in other deployments
marcelobeckmann Aug 12, 2019
1cddfdf
Detect test discrepancies in other deployments
marcelobeckmann Aug 12, 2019
850caa6
Detect test discrepancies in other deployments
marcelobeckmann Aug 12, 2019
545e496
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
6b438cf
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
b9d2188
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
df73f9e
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
bc08577
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 13, 2019
a73852e
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
127bc7b
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
8cd9ca3
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
e8c6624
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
a339e48
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 13, 2019
27b7fd9
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
8e37937
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
d11d8e7
Detect test discrepancies in other deployments
marcelobeckmann Aug 13, 2019
c375fee
Use the _object_dtype_isnan to detect nan in mixed matrices data
marcelobeckmann Aug 21, 2019
462c6f3
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Aug 21, 2019
8f4e9de
Fix flake8 error
marcelobeckmann Aug 21, 2019
58770f0
Fix code after code review
marcelobeckmann Sep 11, 2019
b3270a8
Merge with head
marcelobeckmann Sep 11, 2019
188e0ca
Fix flake8 errors
marcelobeckmann Sep 11, 2019
eb1ab6b
Remove files added incorrectly
marcelobeckmann Sep 19, 2019
eab56c4
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Sep 19, 2019
5f3421e
Fix flake8 errors
marcelobeckmann Sep 19, 2019
9317415
Changes after code review
marcelobeckmann Sep 20, 2019
d16f833
Fix flake8 errors
marcelobeckmann Sep 20, 2019
b23fc65
Fix flake8 errors
marcelobeckmann Sep 20, 2019
da6b46d
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Sep 20, 2019
d84be70
New proposal to avoid ZeroDivisionError
marcelobeckmann Sep 23, 2019
e67579d
New proposal to avoid ZeroDivisionError
marcelobeckmann Sep 23, 2019
e5167e0
Fix flake8 errors
marcelobeckmann Sep 23, 2019
7de895b
Improve categorical detection given hints from code review
marcelobeckmann Oct 23, 2019
c88cf0f
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 23, 2019
08e692a
Fix flake8 errors
marcelobeckmann Oct 23, 2019
19e4f0b
Fix flake8 errors
marcelobeckmann Oct 23, 2019
7d480b2
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 23, 2019
14d0d8b
Revert problematic merge with other's failures
marcelobeckmann Oct 24, 2019
3b3bb54
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 24, 2019
82707d2
Fix merge conflicts
marcelobeckmann Oct 24, 2019
e187e01
Fix flake8 errors
marcelobeckmann Oct 24, 2019
7370840
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 25, 2019
a86ba38
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 29, 2019
c0f3ee2
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Oct 31, 2019
d1a116f
Propose fix after code review
marcelobeckmann Nov 12, 2019
8ddfb1b
Propose fix after code review
marcelobeckmann Nov 14, 2019
b37f750
Propose fix after code review
marcelobeckmann Nov 15, 2019
88f835d
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 15, 2019
77d925f
Fix unit test errors in other types of deployment
marcelobeckmann Nov 15, 2019
72bc1dc
Improve performance for nan columns
marcelobeckmann Nov 20, 2019
984a6a0
Fix code after review
marcelobeckmann Nov 20, 2019
cf861bd
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 20, 2019
1510744
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 21, 2019
988028a
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 22, 2019
8454f97
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 22, 2019
f1d840d
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 26, 2019
a8f2a65
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
marcelobeckmann Nov 26, 2019
37359f0
Make some prints to figure out the unit test error in some specifc pl…
marcelobeckmann Nov 26, 2019
63c179e
Revert improvement to check full nan columns
marcelobeckmann Nov 27, 2019
8786f5d
Merge remote-tracking branch 'upstream/master' into b5584
adrinjalali Mar 29, 2020
c1f3599
UG minor cleanup
adrinjalali Mar 29, 2020
fa281c3
prototype implementation, changes to _safe_indexing
adrinjalali Mar 30, 2020
c8e840d
handling some edge cases, passing some of the tests
adrinjalali Mar 31, 2020
4859b81
fix edge cases and tests
adrinjalali Apr 3, 2020
6ace7da
fix boolean mask issue
adrinjalali Apr 7, 2020
3232262
inverse -> complement and tests
adrinjalali Apr 7, 2020
89fae67
Merge remote-tracking branch 'upstream/master' into b5584
adrinjalali Apr 7, 2020
2ff6e0a
remove unused import
adrinjalali Apr 7, 2020
ab8a1f8
fix duplicate function names
adrinjalali Apr 7, 2020
d8c2fb5
ENH UG
adrinjalali Apr 7, 2020
3d6cd99
minor UG issue
adrinjalali Apr 7, 2020
08177af
add make_column_selector example to UG
adrinjalali Apr 8, 2020
58f4b81
trying to fix the missing link
adrinjalali Apr 8, 2020
0312f1e
add a few new tests
adrinjalali Apr 8, 2020
80e71a0
more test cleanup
adrinjalali Apr 14, 2020
b0e3b11
simplify tests
adrinjalali Apr 15, 2020
c0502b9
fix sphinx warning
adrinjalali Apr 15, 2020
adb7854
skip metrics.rst on pandas-less CI
adrinjalali Apr 15, 2020
7287b1a
enable ellipsis
adrinjalali Apr 15, 2020
f31ca7b
Merge remote-tracking branch 'upstream/master' into b5584
adrinjalali May 28, 2020
330f57a
fix example in metrics.rst
adrinjalali May 28, 2020
aa8758d
Merge branch 'b5584' of github.com:adrinjalali/scikit-learn into b5584
adrinjalali May 28, 2020
0e8feb8
fix merge
adrinjalali May 28, 2020
9635f76
require scaling factors when Y is given and scale=True
adrinjalali May 29, 2020
3708a67
fix docstring
adrinjalali May 29, 2020
9262412
Joel's optimization and categorical features consistency
adrinjalali May 29, 2020
d8b445e
remove unused import
adrinjalali May 29, 2020
3d433fc
remove doctest directive
adrinjalali May 29, 2020
7b6278c
fix metrics.rst doctest
adrinjalali May 29, 2020
9 changes: 9 additions & 0 deletions doc/conftest.py
@@ -50,6 +50,13 @@ def setup_compose():
        raise SkipTest("Skipping compose.rst, pandas not installed")


def setup_metrics():
    try:
        import pandas  # noqa
    except ImportError:
        raise SkipTest("Skipping metrics.rst, pandas not installed")


def setup_impute():
    try:
        import pandas  # noqa
@@ -82,6 +89,8 @@ def pytest_runtest_setup(item):
        setup_working_with_text_data()
    elif fname.endswith('modules/compose.rst') or is_index:
        setup_compose()
    elif fname.endswith('modules/metrics.rst') or is_index:
        setup_metrics()
    elif IS_PYPY and fname.endswith('modules/feature_extraction.rst'):
        raise SkipTest('FeatureHasher is not compatible with PyPy')
    elif fname.endswith('modules/impute.rst'):
55 changes: 55 additions & 0 deletions doc/modules/metrics.rst
@@ -93,6 +93,61 @@ is equivalent to :func:`linear_kernel`, only slower.)
      Information Retrieval. Cambridge University Press.
      https://nlp.stanford.edu/IR-book/html/htmledition/the-vector-space-model-for-scoring-1.html

.. _gower_distances:

Gower distances
-----------------

The function :func:`~sklearn.metrics.pairwise.gower_distances` computes the
distances between the observations in X and Y, which may contain combinations
of numerical, boolean, or categorical attributes, using an implementation of
Gower Similarity.

.. math::

    g(\mathbf{x}, \mathbf{y}) = \frac{\sum_i s(x_i, y_i)}{|\{i \mid x_i \neq \text{missing} \land y_i \neq \text{missing}\}|}

Where:

:math:`x, y` : array_like of shape (n_features,) are the observations to be compared.

:math:`s(x_i, y_i)` : Calculates the distance as:

- :math:`s(x_i, y_i) := 0`, if either :math:`x_i` or :math:`y_i` is missing.
- :math:`s(x_i, y_i) := \text{int}(x_i \neq y_i)`, if :math:`i` represents a
  boolean or categorical attribute.
- :math:`s(x_i, y_i) := abs(x_i - y_i)`, if :math:`i` represents a numerical
  attribute.


The Gower formula combines a Manhattan (L1) distance for numeric features
with Hamming distance for categorical features to obtain a general coefficient
for categorical and numeric data.

The :func:`gower_distances` function expects the user to specify the
categorical features; otherwise it assumes all features are numerical. If
the data is a `pandas.DataFrame`, you can use
:func:`~sklearn.compose.make_column_selector` to select features::

    >>> import pandas as pd  # doctest: +ELLIPSIS
    >>> from sklearn.compose import make_column_selector as selector
    >>> from sklearn.metrics.pairwise import gower_distances
    >>> X = pd.DataFrame(
    ...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
    ...      'expert_rating': [5, 3, 4, 5],
    ...      'user_rating': [4, 5, 4, 3]})
    >>> gower_distances(X, categorical_features=selector(dtype_include=object))
    array([[0. , 0.5 , 0.5 , 0.5 ],
           [0.5 , 0. , 0.666... , 1. ],
           [0.5 , 0.666... , 0. , 0.666... ],
           [0.5 , 1. , 0.666... , 0. ]])

.. topic:: References:

    * Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its
      Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.
      http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf

.. _linear_kernel:

Linear kernel
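The per-feature rule and the example output above can be checked by hand. The following is a minimal pure-Python sketch, not the implementation in this PR: `min_max_scale` and `gower_pair` are hypothetical helper names, missing values are not handled, and numeric columns are assumed to be min-max scaled over the given rows (which is what `scale=True` is taken to mean here). A categorical mismatch counts as 1, which is what the example output implies.

```python
def min_max_scale(column):
    """Scale a numeric column to [0, 1] using its own min and max."""
    lo, hi = min(column), max(column)
    span = hi - lo
    # A constant column has zero range; map it to 0.0 to avoid dividing by zero.
    return [0.0 if span == 0 else (v - lo) / span for v in column]


def gower_pair(x, y, is_categorical):
    """Gower distance between two rows that contain no missing values."""
    total = 0.0
    for xi, yi, cat in zip(x, y, is_categorical):
        if cat:
            total += float(xi != yi)  # Hamming part: 1 if the categories differ
        else:
            total += abs(xi - yi)     # Manhattan part on the scaled values
    return total / len(x)


city = ['London', 'London', 'Paris', 'Sallisaw']
expert = min_max_scale([5, 3, 4, 5])  # [1.0, 0.0, 0.5, 1.0]
user = min_max_scale([4, 5, 4, 3])    # [0.5, 1.0, 0.5, 0.0]
rows = list(zip(city, expert, user))
is_categorical = [True, False, False]

# Full pairwise matrix, one row of D per observation.
D = [[gower_pair(a, b, is_categorical) for b in rows] for a in rows]
for row in D:
    print([round(d, 3) for d in row])
```

For the second and third rows, for instance, the city differs (1), and both scaled ratings differ by 0.5, giving (1 + 0.5 + 0.5) / 3 = 0.666..., in line with the doctest output.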
214 changes: 210 additions & 4 deletions sklearn/metrics/pairwise.py
@@ -21,16 +21,21 @@

from ..utils.validation import _num_samples
from ..utils.validation import check_non_negative
from ..utils.validation import check_consistent_length
from ..utils import check_array
from ..utils import gen_even_slices
from ..utils import gen_batches, get_chunk_n_rows
from ..utils import is_scalar_nan
from ..utils import _safe_indexing
from ..utils import _get_column_indices
from ..utils.extmath import row_norms, safe_sparse_dot
from ..preprocessing import normalize
from ..utils._mask import _get_mask

from ._pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
from ..exceptions import DataConversionWarning
from ..utils.fixes import _object_dtype_isnan
from ..preprocessing import MinMaxScaler


# Utility Functions
@@ -544,7 +549,7 @@ def pairwise_distances_argmin_min(X, Y, axis=1, metric="euclidean",
        Valid values for metric are:

        - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
-         'manhattan']
+         'manhattan', 'gower']

        - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
          'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
@@ -632,7 +637,7 @@ def pairwise_distances_argmin(X, Y, axis=1, metric="euclidean",
        Valid values for metric are:

        - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
-         'manhattan']
+         'manhattan', 'gower']

        - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
          'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
@@ -829,6 +834,172 @@ def cosine_distances(X, Y=None):
    return S


def _split_categorical_numerical(X, categorical_features):
    # the following bit is done before check_pairwise_array to avoid converting
    # numerical data to object dtype. First we split the data into categorical
    # and numerical, then we do check_array

    if X is None:
        return None, None

    # TODO: this should be more like check_array(..., accept_pandas=True)

Member Author commented:
here I'm basically converting the data to a numpy array only if it's not a pandas dataframe or an array already. Feels like it should be a check_array(X, ..., accept_pandas=True) call.

Member commented:
Why avoid calling check_array if X is already an array?

Member Author commented:
Fair, but still need to avoid calling it if it's a pandas DF.

    if not hasattr(X, "shape"):
        X = check_array(X, dtype=np.object, force_all_finite=False)

    if callable(categorical_features):
        cols = categorical_features(X)
    else:
        cols = categorical_features
    if cols is None:
        cols = []

    col_idx = _get_column_indices(X, cols)
    X_cat = _safe_indexing(X, col_idx, axis=1)

Member commented:
Do we care that this is (I think) always a copying operation even if col_idx is empty?

Member Author commented:
shouldn't that be an issue with _safe_indexing?

    X_num = _safe_indexing(X, col_idx, axis=1, complement=True)

    return X_cat, X_num
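The split into categorical and numerical parts amounts to selecting a set of column indices and its complement. A minimal sketch on a list-of-rows table, assuming integer column positions only; `split_columns` is a hypothetical name and does not reproduce the behaviour of `_get_column_indices` or `_safe_indexing`:

```python
def split_columns(rows, categorical_idx):
    """Return (categorical_part, numerical_part) of a list-of-rows table."""
    n_features = len(rows[0]) if rows else 0
    cat_set = set(categorical_idx)
    cat_idx = [j for j in range(n_features) if j in cat_set]
    # The numerical columns are the complement of the categorical ones.
    num_idx = [j for j in range(n_features) if j not in cat_set]
    cat = [[row[j] for j in cat_idx] for row in rows]
    num = [[row[j] for j in num_idx] for row in rows]
    return cat, num


rows = [['London', 5, 4], ['Paris', 4, 4]]
X_cat, X_num = split_columns(rows, categorical_idx=[0])
print(X_cat)  # [['London'], ['Paris']]
print(X_num)  # [[5, 4], [4, 4]]
```

With an empty `categorical_idx`, every column ends up in the numerical part, which matches the `cols = []` fallback in the function above.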


def gower_distances(X, Y=None, categorical_features=None, scale=True,
                    min_values=None, scale_factor=None):
    """Compute the distances between the observations in X and Y,

Member commented:
PEP257: this should be a one-line summary

    that may contain mixed types of data, using an implementation
    of the Gower formula.

    Parameters
    ----------
    X : {array-like, pandas.DataFrame} of shape (n_samples, n_features)

    Y : {array-like, pandas.DataFrame} of shape (n_samples, n_features), \
        default=None

    categorical_features : array-like of str, array-like of int, \
            array-like of bool, slice or callable, default=None
        Indexes the data on its second axis. Integers are interpreted as
        positional columns, while strings can reference DataFrame columns
        by name.
        A callable is passed the input data `X` and can return any of the
        above. To select multiple columns by name or dtype, you can use
        :obj:`~sklearn.compose.make_column_selector`.

        By default all non-numeric columns are considered categorical.

    scale : bool, default=True
        Indicates if the numerical columns will be scaled between 0 and 1.
        If False, it is assumed the numerical columns are already scaled.
        The scaling factors, i.e. min and max, are taken from both ``X`` and
        ``Y``.

    min_values : ndarray of shape (n_features,), default=None
        Per feature adjustment for minimum. Equivalent to

Member commented:
Does "Equivalent to" apply to the case where min_values=None? Use the words "if None"

        ``min_values - X.min(axis=0) * scale_factor``

Member commented:
why is it a function of itself?

        If provided, ``scale_factor`` should be provided as well.

    scale_factor : ndarray of shape (n_features,), default=None
        Per feature relative scaling of the data. Equivalent to
        ``(max_values - min_values) / (X.max(axis=0) - X.min(axis=0))``
        If provided, ``min_values`` should be provided as well.

    Returns
    -------
    distances : ndarray of shape (n_samples_X, n_samples_Y)

    References
    ----------
    Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its
    Properties.

    Notes
    -----
    Categorical ordinal attributes should be treated as numeric for the purpose
    of Gower similarity.

    Current implementation does not support sparse matrices.

    All the non-numerical types (e.g., str), are treated as categorical
    features.

    This implementation modifies Gower's original similarity measure in
    the following aspects:

    * The values in the original similarity S range between 0 and 1. To
      guarantee this, it is assumed the numerical features of X and Y are
      scaled between 0 and 1.

    * Different from the original similarity S, this implementation
      returns 1-S.
    """
    def _nanmanhatan(x, y):
        return np.nansum(np.abs(x - y))

Member commented:
Is using this row-wise function call worthwhile relative to something more vectorised?

Member Author commented:
It's kinda easy to do more vectorized when X and Y are of the same size, otherwise I'm not sure if it's worth the complexity.

Member Author commented:
Also, this implementation is significantly faster than the cosine distances for instance. So I don't think we should worry about the speed too much?

    def _non_nans(x, y):
        return np.sum(~_object_dtype_isnan(x) & ~_object_dtype_isnan(y))

    def _nanhamming(x, y):
        return np.sum(x != y) - np.sum(

Member commented:
or _non_nans(x, y) - np.sum(x == y)?

Member Author commented:
doesn't seem to help the time at least when I test it.

            _object_dtype_isnan(x) | _object_dtype_isnan(y))

    if issparse(X) or issparse(Y):
        raise TypeError("Gower distance does not support sparse matrices")

    if X is None or len(X) == 0:
        raise ValueError("X can not be None or empty")

    if scale:

Member commented:
I wonder whether it's worth running locals().update(_precompute_metric_params(X, Y, 'gower', **locals()) to reduce code duplication... and move to a more elegant solution eventually?

        if (scale_factor is None) != (min_values is None):
            raise ValueError("min_value and scale_factor should be provided "
                             "together.")
    X_cat, X_num = _split_categorical_numerical(X, categorical_features)
    Y_cat, Y_num = _split_categorical_numerical(Y, categorical_features)

    if min_values is not None:
        min_values = np.asarray(min_values)
        scale_factor = np.asarray(scale_factor)
        check_consistent_length(min_values, scale_factor,
                                np.ndarray(shape=(X_num.shape[1], 0)))

    if X_num.shape[1]:
        X_num, Y_num = check_pairwise_arrays(X_num, Y_num, precomputed=False,
                                             dtype=float,
                                             force_all_finite=False)
        if scale:
            scale_data = X_num if Y_num is X_num else np.vstack((X_num, Y_num))

Member commented:
the else case should never be executed.

            if scale_factor is None:
                trs = MinMaxScaler().fit(scale_data)
            else:
                trs = MinMaxScaler()
                trs.scale_ = scale_factor
                trs.min_ = min_values
            X_num = trs.transform(X_num)
            Y_num = trs.transform(Y_num)

        nan_manhatan = distance.cdist(X_num, Y_num, _nanmanhatan)
        valid_num = distance.cdist(X_num, Y_num, _non_nans)
    else:
        nan_manhatan = valid_num = None

    if X_cat.shape[1]:
        X_cat, Y_cat = check_pairwise_arrays(X_cat, Y_cat, precomputed=False,
                                             dtype=np.object,
                                             force_all_finite=False)
        nan_hamming = distance.cdist(X_cat, Y_cat, _nanhamming)
        valid_cat = distance.cdist(X_cat, Y_cat, _non_nans)
    else:
        nan_hamming = valid_cat = None

    # based on whether there are categorical and/or numerical data present,
    # we compute the distance metric
    # Division by zero and nans warnings are ignored since they are expected
    with np.errstate(divide='ignore', invalid='ignore'):
        if valid_num is not None and valid_cat is not None:
            D = (nan_manhatan + nan_hamming) / (valid_num + valid_cat)
        elif valid_num is not None:
            D = nan_manhatan / valid_num
        else:
            D = nan_hamming / valid_cat
    return D
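How the three helpers combine for a single pair of rows can be sketched in pure Python. This is an illustration under stated assumptions, not the PR's code: `is_missing` and `nan_gower_pair` are hypothetical names, missing values are represented by `None` or `float('nan')`, and numeric values are assumed to be already scaled to [0, 1].

```python
import math


def is_missing(v):
    """True for None or a float NaN."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def nan_gower_pair(x, y, is_categorical):
    """Gower distance over the features that are present in both rows."""
    total, n_valid = 0.0, 0
    for xi, yi, cat in zip(x, y, is_categorical):
        if is_missing(xi) or is_missing(yi):
            continue  # contributes neither to the distance nor to the count
        total += float(xi != yi) if cat else abs(xi - yi)
        n_valid += 1
    # No shared feature: the distance is undefined, mirroring the 0 / 0 case.
    return total / n_valid if n_valid else float('nan')


is_categorical = [True, False, False]
a = ['London', 1.0, 0.5]
b = ['Paris', float('nan'), 1.0]
print(nan_gower_pair(a, b, is_categorical))  # (1 + 0.5) / 2 = 0.75
```

A pair with no feature observed in both rows yields NaN rather than raising, which is consistent with the division warnings being suppressed via `np.errstate` in the function above.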


# Paired distances
def paired_euclidean_distances(X, Y):
"""
@@ -905,7 +1076,7 @@ def paired_cosine_distances(X, Y):
    'l2': paired_euclidean_distances,
    'l1': paired_manhattan_distances,
    'manhattan': paired_manhattan_distances,
-   'cityblock': paired_manhattan_distances}
+   'cityblock': paired_manhattan_distances, }


def paired_distances(X, Y, metric="euclidean", **kwds):
@@ -1298,6 +1469,7 @@ def chi2_kernel(X, Y=None, gamma=1.):
    'l2': euclidean_distances,
    'l1': manhattan_distances,
    'manhattan': manhattan_distances,
    'gower': gower_distances,
    'precomputed': None,  # HACK: precomputed is always allowed, never called
    'nan_euclidean': nan_euclidean_distances,
}
@@ -1322,6 +1494,7 @@ def distance_metrics():
    'l1'            metrics.pairwise.manhattan_distances
    'l2'            metrics.pairwise.euclidean_distances
    'manhattan'     metrics.pairwise.manhattan_distances
    'gower'         metrics.pairwise.gower_distances
    'nan_euclidean' metrics.pairwise.nan_euclidean_distances
    =============== ========================================

@@ -1400,7 +1573,7 @@ def _pairwise_callable(X, Y, metric, force_all_finite=True, **kwds):
                  'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto',
                  'russellrao', 'seuclidean', 'sokalmichener',
                  'sokalsneath', 'sqeuclidean', 'yule', "wminkowski",
-                 'nan_euclidean', 'haversine']
+                 'nan_euclidean', 'haversine', 'gower']

_NAN_METRICS = ['nan_euclidean']

@@ -1429,6 +1602,28 @@ def _check_chunk_size(reduced, chunk_size):
def _precompute_metric_params(X, Y, metric=None, **kwds):
    """Precompute data-derived metric parameters if not provided
    """
    if metric == 'gower':
        categorical_features = kwds.get('categorical_features', None)

        _, X_num = _split_categorical_numerical(X, categorical_features)
        _, Y_num = _split_categorical_numerical(Y, categorical_features)

        scale = kwds.get('scale', True)
        if not scale:
            return {'min_values': None, 'scale_factor': None, 'scale': False}

        scale_factor = kwds.get('scale_factor', None)
        min_values = kwds.get('min_values', None)
        if min_values is None:
            data = X_num if Y is X or Y is None else np.vstack((X_num, Y_num))
            trs = MinMaxScaler().fit(data)
            min_values = trs.min_
            scale_factor = trs.scale_

        return {'min_values': min_values,
                'scale_factor': scale_factor,
                'scale': True}

    if metric == "seuclidean" and 'V' not in kwds:
        if X is Y:
            V = np.var(X, axis=0, ddof=1)
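The relation between the fitted data range and the two precomputed parameters can be made explicit. The sketch below assumes the convention used by `MinMaxScaler` with its default `feature_range=(0, 1)`, namely `scale_ = 1 / (max - min)` and `min_ = -min * scale_`, so that the transform is `x * scale_ + min_`. `fit_min_max` is a hypothetical helper, written in pure Python for a single feature.

```python
def fit_min_max(values):
    """Return (min_value, scale_factor) so that v * scale_factor + min_value is in [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    scale_factor = 1.0 / span if span else 1.0  # constant feature: keep the scale at 1
    min_value = -lo * scale_factor
    return min_value, scale_factor


# Fit on X and Y together so that both are scaled consistently.
x_col, y_col = [3.0, 5.0], [4.0, 7.0]
min_value, scale_factor = fit_min_max(x_col + y_col)
scaled_x = [v * scale_factor + min_value for v in x_col]
scaled_y = [v * scale_factor + min_value for v in y_col]
print(min_value, scale_factor)  # -0.75 0.25
print(scaled_x, scaled_y)       # [0.0, 0.5] [0.25, 1.0]
```

Fitting on the stacked data, as the `np.vstack((X_num, Y_num))` branch does, keeps every scaled value in [0, 1]; fitting on `X` alone would let values of `Y` outside the range of `X` fall outside that interval.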
@@ -1721,6 +1916,17 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=None,
        check_non_negative(X, whom=whom)
        return X
    elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
        if metric == 'gower':
            """
            # These convertions are necessary for matrices with string values

Member commented:
Suggested change
-           # These convertions are necessary for matrices with string values
+           # These conversions are necessary for matrices with string values

Member commented:
It's not clear why this code is commented out.

            if not isinstance(X, np.ndarray):
                X = np.asarray(X, dtype=np.object)
            if Y is not None and not isinstance(Y, np.ndarray):
                Y = np.asarray(Y, dtype=np.object)
            """
            params = _precompute_metric_params(X, Y, metric=metric, **kwds)
            kwds.update(**params)

        func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
    elif callable(metric):
        func = partial(_pairwise_callable, metric=metric,