Implement Gower Similarity Coefficient #5884

raghavrv · 2015-11-19T15:04:22Z

As suggested by @lesshaste

Paper - http://cbio.ensmp.fr/~jvert/svn/bibli/local/Gower1971general.pdf

I can implement this if there is sufficient interest?

lesshaste · 2015-11-19T15:15:33Z

Thanks.

The R documentation for daisy might be relevant too, as it is a popular use case for the Gower coefficient: https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html

agramfort · 2015-11-19T15:35:03Z

suggested where? in what context?

lesshaste · 2015-11-19T15:39:18Z

@agramfort I suggested it on gitter. The main interest of this coefficient is for data whose variables have mixed types (that is, categorical, numerical, ordinal). One popular use case is the daisy() function in the R package cluster, mentioned before, for clustering data with mixed types (see page 27 of https://cran.r-project.org/web/packages/cluster/cluster.pdf). More generally, http://www.clustan.talktalk.net/gower_similarity.html claims "Gower's General Similarity Coefficient is one of the most popular measures of proximity for mixed data types.", which seems plausible.
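For readers unfamiliar with the coefficient, here is a minimal sketch of the dissimilarity form for two records, assuming only numeric and categorical features (ordinal features are not handled). The function name `gower_distance`, its parameters, and the toy data are illustrative only; this is not the code discussed in this thread.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Gower dissimilarity between two records with mixed feature types.

    is_categorical[k] says whether feature k is categorical.
    ranges[k] is the observed range (max - min) of numeric feature k;
    it is ignored for categorical features.
    """
    total = 0.0
    for x, y, categorical, value_range in zip(a, b, is_categorical, ranges):
        if categorical:
            # Categorical features contribute 0 on a match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif value_range > 0:
            # Numeric features contribute the range-normalised absolute difference.
            total += abs(x - y) / value_range
        # A constant numeric feature (range 0) contributes nothing.
    return total / len(a)


# Toy data (hypothetical): age, income, favourite colour.
rows = [
    (25, 30000.0, "red"),
    (35, 50000.0, "blue"),
    (45, 40000.0, "red"),
]
is_categorical = [False, False, True]
ranges = [20.0, 20000.0, 0.0]  # max - min per numeric column over `rows`

d01 = gower_distance(rows[0], rows[1], is_categorical, ranges)
d02 = gower_distance(rows[0], rows[2], is_categorical, ranges)
```

The similarity form is simply one minus this value. Every feature gets equal weight here; the original paper also allows per-feature weights.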

agramfort · 2015-11-19T15:42:11Z

is there a benchmark or convincing example that would motivate this?

lesshaste · 2015-11-19T15:54:22Z

@agramfort I think it's more that we currently have no other way of calculating a dissimilarity coefficient for mixed data types, and this appears to be the standard one. I can find lots of examples and questions/answers online where people explain what the Gower coefficient is or suggest its use for mixed data types, but nothing I could call a benchmark yet. The original paper has been cited 2298 times according to Google Scholar.

agramfort · 2015-11-19T17:09:52Z

ok I am convinced :)

lesshaste · 2015-11-19T17:28:00Z

@agramfort Great! This change would complement #4899 nicely which introduces native categorical variable support for trees.

Having said that, I now realise that scikit-learn has no native support for ordinals at all currently so this part of my suggestion would be slightly ahead of its time. I suppose one could regard it in a positive way as the first step in support for ordinal features.

raghavrv · 2015-11-20T09:30:12Z

@amueller To be tagged with [New Feature]...

marcelobeckmann · 2017-01-17T21:10:14Z

Hi,

In order to contribute somehow, I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module, because pdist internally makes several numerical transformations that will fail if you use a matrix with mixed data.

The results I have obtained with this so far are the same as those from R's daisy function.

The source code is available at this jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/

Feel free to use it

ashimb9 · 2017-07-16T07:06:19Z

I was just wondering if there was any update on this? Plus, is the issue noted by @marcelobeckmann still relevant?

agramfort · 2017-07-16T12:24:37Z

@ashimb9 it seems we need someone to integrate the code from @marcelobeckmann

ashimb9 · 2017-07-16T21:47:27Z

@agramfort Hmm, in that case I am going to have a go when I have some free time. By the way, do you happen to know anything about the current state of the issue noted above: "in the pdist module, because internally the pdist makes several numerical transformations that will fail if you use a matrix with mixed data"

marcelobeckmann · 2017-07-17T09:32:01Z

Hi, there are some private functions (e.g., _convert_to_double, _copy_array_if_base_present) in pdist that assume the underlying data is completely numeric, which is not true when you have a Dataframe with categorical data.
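The general failure mode can be illustrated without touching pdist itself. The sketch below only assumes NumPy and does not reproduce what the private helpers actually do: it shows that forcing a mixed array to a float dtype raises an error, and that splitting the columns by type first avoids it.

```python
import numpy as np

# A mixed array: one numeric column and one categorical column.
mixed = np.array([[1.0, "a"], [2.0, "b"]], dtype=object)

# Forcing the whole array to float fails on the string column.
try:
    np.asarray(mixed, dtype=np.float64)
    converted = True
except ValueError:
    converted = False

# Splitting by column type keeps the numeric part convertible.
numeric_part = mixed[:, [0]].astype(np.float64)
categorical_part = mixed[:, [1]]
```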

I volunteer to integrate this code and make it available in a fork, you can assign this ticket to me.

jnothman · 2017-07-17T10:18:21Z

The github assign feature only works for team members


marcelobeckmann · 2017-07-17T10:21:37Z

No worries, I'll fork it and you can get the code later. For me the important thing is to contribute. I'll let you know when it's done.

ashimb9 · 2017-07-17T21:07:15Z

Thanks @marcelobeckmann for taking this up. While you are at it (and if it is feasible for you), I was wondering if you would consider also adding support for Gower calculation on data with NaN values, as implemented in the daisy function in R (which you have also referenced above)?
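For reference, the usual way missing values enter the Gower dissimilarity is through a 0/1 indicator per feature and pair, so that a feature is skipped whenever either record lacks it. The symbols below are notation chosen for this note, not taken from the thread:

$$
d_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk}\, d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}},
\qquad
\delta_{ijk} =
\begin{cases}
0 & \text{if feature } k \text{ is missing in record } i \text{ or } j,\\
1 & \text{otherwise,}
\end{cases}
$$

where $d_{ijk} = |x_{ik} - x_{jk}| / R_k$ for a numeric feature with range $R_k$, and $d_{ijk}$ is 0 for a categorical match and 1 for a mismatch. If every $\delta_{ijk}$ is 0 for a pair, the ratio is undefined.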

marcelobeckmann · 2017-08-09T17:26:24Z

I finished the integration of Gower into sklearn.metrics.pairwise (also handling NaN values). I'm going to prepare some unit tests before submitting my forked code.

ashimb9 · 2017-08-09T18:45:30Z

@marcelobeckmann Great! Thank you so much, especially for including NaN support! :)

PS: If I may suggest, you might want to consider initiating a pull request so the reviewers can begin looking at your code while you work on the unit tests and so forth.

marcelobeckmann · 2017-08-17T13:40:23Z

I made a pull request some days ago, b5884.

jnothman · 2017-08-17T13:41:54Z

Yes, it's in the queue to be reviewed.


marcelobeckmann · 2017-10-03T19:01:50Z

I made the changes required by CI, and all the checks have passed.

pierrewessman · 2017-10-12T07:04:35Z

@marcelobeckmann great work! You might want to change row 659 to something like:

    ranges_of_numeric[col] = (1 - min / max, 0)[max == 0] if (max!=0) else 0.0

I'm getting division-by-zero warnings in your second test case otherwise.
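A likely reason the extra guard helps: an expression of the form `(a, b)[condition]` builds the whole tuple before indexing, so the division runs even when the denominator is zero, whereas a conditional expression evaluates only the chosen branch. The sketch below uses plain Python floats and hypothetical function names. With plain floats the eager form raises `ZeroDivisionError`, while with NumPy scalars it would emit a warning instead.

```python
def eager_ratio(lo, hi):
    # Both tuple elements are computed before indexing,
    # so lo / hi is evaluated even when hi == 0.
    return (1 - lo / hi, 0.0)[hi == 0]


def guarded_ratio(lo, hi):
    # Only the selected branch is evaluated, so hi == 0 never reaches the division.
    return 1 - lo / hi if hi != 0 else 0.0


safe_value = guarded_ratio(0.0, 0.0)
normal_value = guarded_ratio(2.0, 4.0)
```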

marcelobeckmann · 2017-11-10T18:59:35Z

Hi, I changed the code to avoid warnings as proposed by Pierre Wessman, and CI is green. I need someone to review my code.

Ali-ry · 2017-11-30T11:50:42Z

@marcelobeckmann and potentially others.

Hi Marcelo, I have a few quick questions regarding your implementation of the Gower coefficient, which you have placed here: https://sourceforge.net/projects/gower-distance-4python/files/.

Do I need a pandas DataFrame to feed the original data into the function, or can I use a NumPy array too?
I am importing my data into a NumPy array. All columns are real-valued apart from the first column, which is the unique ID. I am running into two issues:

Firstly, when I run the function, it returns a Data Conversion Warning saying that dtype U7 was converted to object. I assumed this was because the array entries for some reason appear in quotation marks and hence are strings. So I cast the array entries to int32, for example, and it still gives the conversion warning, saying int32 was converted to object.
Secondly, and probably linked to the above, every time I run the function and plot the result, I get a different visualisation (a different spread of the points).

Would you be able to advise me on the above please?

Thanks very much

erickalfaro · 2019-03-16T03:30:41Z

Any updates? @marcelobeckmann

marcelobeckmann · 2019-03-16T10:18:26Z

Work in progress after review.

lsabi · 2019-05-10T10:07:33Z

Has the PR been approved? @marcelobeckmann

marcelobeckmann · 2019-05-11T09:57:28Z

Not yet, work is in progress after some recent code review.

lsabi · 2019-05-11T12:19:34Z

Too bad, I need it.

Is the function itself available somewhere, so that I can use it on my own (for research purposes)?

Thanks

marcelobeckmann · 2019-05-13T11:16:40Z

You can take the latest commit of this function in this PR:
#9555

lsabi · 2019-05-13T13:04:23Z

I managed to make it work locally. Thanks!

PhysB · 2019-05-30T17:20:16Z

Just a quick +1 on this ticket! Thanks for all the work on this.

willbarnett · 2019-11-13T20:41:00Z

Bump. This would be a great addition. I can't believe it has taken 4 years for a relatively simple calculation to make it into sklearn!!

jnothman · 2019-11-13T20:46:06Z

Or you could say: thanks for your dedicated persistence over four years of volunteered effort!

willbarnett · 2019-11-13T21:17:33Z

> Or you could say: thanks for your dedicated persistence over four years of volunteered effort!

You are right, sorry. I didn't mean to come across as rude. I greatly appreciate the effort. I've been using this locally for a while now, and it would be great to see it added. It's the only distance metric that I know of for mixed data types.

jnothman · 2019-11-13T21:49:02Z

Aside from the volunteer effort, and that the core devs have not considered this urgent, there are indeed challenges around how to handle mixed types, and around how to perform the scaling in a train-test setup.
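To make the train-test scaling question concrete: the numeric ranges used for normalisation have to come from somewhere, and if they are recomputed on each new batch the same pair of points can get a different distance. One possible convention, sketched below with hypothetical names, is to learn the ranges on the training rows only and reuse them later, clipping differences that exceed the training range. The clipping is an assumption of this sketch, not something agreed in this thread.

```python
def fit_ranges(train_rows, numeric_columns):
    """Learn (min, max) per numeric column from the training rows only."""
    ranges = {}
    for col in numeric_columns:
        values = [row[col] for row in train_rows]
        ranges[col] = (min(values), max(values))
    return ranges


def scaled_difference(x, y, lo, hi):
    """Range-normalised absolute difference, using a previously fitted range."""
    value_range = hi - lo
    if value_range == 0:
        return 0.0
    # Unseen data may fall outside the training range; clip so the
    # per-feature contribution stays within [0, 1].
    return min(abs(x - y) / value_range, 1.0)


train = [(0.0, "a"), (10.0, "b")]
fitted = fit_ranges(train, numeric_columns=[0])
lo, hi = fitted[0]

inside = scaled_difference(0.0, 5.0, lo, hi)
outside = scaled_difference(-5.0, 20.0, lo, hi)
```

Without the clipping, `outside` would be 2.5, which breaks the usual bound of 1 on each feature's contribution.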

mohyneenm · 2019-11-14T08:26:52Z

Looking forward to it in sklearn.

bzip2 · 2020-01-23T16:19:05Z

Someone who claims to have "borrowed ideas" from this thread has released a package on github to calculate Gower distance (similarity, technically). Speaking of distance and similarity, the example is identical to the one from @marcelobeckmann. I've only glanced at the code so far, but here's a glimpse:

From @marcelobeckmann's notebook:

    # This is to normalize the numeric values between 0 and 1.
    X_num = np.divide(X_num ,max_of_numeric,out=np.zeros_like(X_num), where=max_of_numeric!=0)

From "Michael Yan":

    # This is to normalize the numeric values between 0 and 1.
    Z_num = np.divide(Z_num ,num_max,out=np.zeros_like(Z_num), where=num_max!=0)

marcelobeckmann · 2020-01-25T09:33:13Z

Hi guys, thanks for keeping an eye on this.

I'm glad people are taking the code and trying to improve it; that is the purpose of open source, although some credit would be appreciated.

Hopefully this code will become part of scikit-learn, if PR #9555 is accepted.

Best regards,

Marcelo Beckmann

Bortrex · 2020-03-25T11:30:32Z

Good luck in the process!!

wgova · 2021-09-21T03:54:54Z

@marcelobeckmann - what is the data size limitation for this implementation?

How different is it from https://pypi.org/project/gower/?

marcelobeckmann · 2021-09-29T07:25:01Z

Hi, you can ask your sklearn algorithm to use any distance, including Gower, row by row, that is, pair by pair (that's why we have the pairwise module). That way the size of the data won't be an issue, although it is more expensive computationally. That said, I'm no longer involved in this project, and I'm not aware of the decisions to move to scipy. Best regards,
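The trade-off described here can be sketched in plain Python: computing distances one pair at a time needs only constant extra memory, at the cost of recomputing distances on every query. The names `nearest_row` and `mismatch_fraction` are hypothetical, and the metric is a simple stand-in for a per-pair Gower function, not scikit-learn API.

```python
def nearest_row(query, rows, metric):
    """Find the closest row to `query` without building a distance matrix."""
    best_index, best_distance = None, float("inf")
    for index, row in enumerate(rows):
        # One pair at a time: nothing but the running best is stored.
        distance = metric(query, row)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


def mismatch_fraction(a, b):
    # Stand-in metric: fraction of positions where the records differ.
    return sum(x != y for x, y in zip(a, b)) / len(a)


rows = [("a", "x"), ("b", "y"), ("a", "y")]
result = nearest_row(("a", "y"), rows, mismatch_fraction)
```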


k3ybladewielder · 2022-04-07T19:51:21Z

Any news? Can I help?

msat59 · 2023-05-02T15:04:57Z

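Why isn't this implementation (https://github.com/adrinjalali/scikit-learn/blob/7b6278cd157415c87dd3c596e885b4d1de3e0d45/sklearn/metrics/pairwise.py#L870) used?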

marcelobeckmann · 2023-05-02T16:58:46Z

Someone proposed abandoning this implementation in favour of a new one that would use scipy to combine Manhattan and another distance (I can't remember the second distance right now) to provide the Gower calculation, and the scikit-learn reviewers agreed with that. After that I became busy with other projects, and Covid came and messed up everything for everyone.

The person who proposed the new approach never had time to implement it anyway, but anyone is very welcome to go ahead and continue with the existing PR.

Unfortunately, I don't have time to volunteer on this anymore. I wish I could buy more time, like I buy more books.


ashimb9 · 2023-05-02T17:07:16Z

Hi Marcelo,

I remember this PR from the very beginning, many years ago. It is very unfortunate that the PR did not get merged after all the work you put in.

That being said, have you released or considered releasing this as a standalone package?

I very much appreciate all the work you put into this!

Best,

Ashim


marcelobeckmann · 2023-05-02T21:33:30Z

Actually, there is one standalone package derived from this code (https://pypi.org/project/gower/). The person who prepared that package also credited me as the provider of the core code. That's the beauty of open source. `pip install gower` does the job.


ashimb9 · 2023-05-03T13:06:03Z

Great to hear that someone released a package and you received the appropriate credit. :)


lesshaste · 2023-05-03T13:24:40Z

Do you think there are any prospects of it being included in scikit learn?

marcelobeckmann · 2023-05-09T06:49:25Z

Just adding that I strongly recommend becoming a scikit-learn volunteer. It is an experience that adds a lot to your knowledge and career, and you can mention it in your CV. Anyone is welcome to pick an open issue and create a pull request with code to solve it, or to review a pull request.

On Thu 4 May 2023, 17:04, Marcelo Beckmann wrote:

> Hi, I think this is not a priority for scikit-learn anymore, but scikit-learn is mostly built by volunteers, and nothing is stopping anyone from championing this PR and getting it approved. Unfortunately, I no longer have time for this.

raghavrv changed the title ~~Implement Gower Similarity Metric~~ Implement Gower Similarity Coefficient Nov 19, 2015

amueller added New Feature Need Contributor labels Oct 7, 2016

ashimb9 mentioned this issue Aug 20, 2017

Implement Gower similarity coeficient #9555

Closed

lesteve added help wanted and removed Need Contributor labels Oct 18, 2017

adrinjalali linked a pull request Apr 3, 2020 that will close this issue

[MRG] FEA Gower distance #16834

Open

cmarmo removed the help wanted label Jun 4, 2020

cmarmo added module:metrics module:neighbors labels Dec 9, 2021
