Implement Gower Similarity Coefficient #5884
Comments
Thanks. This documentation for |
suggested where? in what context?
|
@agramfort I suggested it on gitter. This coefficient is mainly of interest when the variables have mixed types (that is, categorical, numerical, ordinal). One popular use case is in the R package |
is there a benchmark or convincing example that would motivate this?
|
@agramfort I think it's more that we currently have no other way of calculating a dissimilarity coefficient for mixed data types, and this appears to be the standard one. I can find lots of examples and questions/answers online where people explain what the Gower coefficient is or suggest its use for mixed data types, but nothing I could call a benchmark yet. The original paper has been cited 2298 times according to Google Scholar. |
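For readers new to the coefficient, here is a minimal sketch of the Gower distance for records that mix numeric and categorical features. It is an illustration under stated assumptions, not code from this thread: the function name `gower_distance`, the `is_categorical` flags and the toy rows are all made up for the example, and ordinal variables are not handled. Numeric features contribute their absolute difference divided by the column range; categorical features contribute 0 on a match and 1 otherwise; the result is the mean over features. The Gower similarity is one minus this value.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Gower distance between two records (hypothetical helper, no missing values)."""
    total = 0.0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if cat:
            # Categorical feature: simple match / mismatch.
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numeric feature: absolute difference scaled by the column range.
            total += abs(x - y) / r
    return total / len(a)


# Toy data: (age, marital status, income). Values are invented.
rows = [
    (25, "single", 1500.0),
    (40, "married", 3000.0),
    (25, "single", 2500.0),
]
is_categorical = [False, True, False]

# Column ranges for numeric features; 0.0 is an unused placeholder for categorical ones.
ranges = [
    0.0 if cat else max(r[k] for r in rows) - min(r[k] for r in rows)
    for k, cat in enumerate(is_categorical)
]

d01 = gower_distance(rows[0], rows[1], is_categorical, ranges)
d02 = gower_distance(rows[0], rows[2], is_categorical, ranges)
```

Rows 0 and 1 differ maximally on every feature, so `d01` is 1.0, while rows 0 and 2 differ only in income.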
ok I am convinced :)
|
@agramfort Great! This change would complement #4899 nicely which introduces native categorical variable support for trees. Having said that, I now realise that scikit-learn has no native support for ordinals at all currently so this part of my suggestion would be slightly ahead of its time. I suppose one could regard it in a positive way as the first step in support for ordinal features. |
@amueller To be tagged with |
Hi, in order to contribute somehow, I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module, because internally pdist makes several numerical transformations that will fail if you use a matrix with mixed data. The results I have obtained with this so far are the same as those from R's daisy function. The source code is available in this Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/ Feel free to use it |
I was just wondering if there was any update on this? Plus, is the issue noted by @marcelobeckmann still relevant? |
@ashimb9 it seems we need someone to integrate the code from @marcelobeckmann |
@agramfort Hmm, in that case I am going to have a go when I have some free time. By the way, do you happen to know anything about the current state of the issue noted above: "in the pdist module, because internally the pdist makes several numerical transformations that will fail if you use a matrix with mixed data" |
Hi, there are some private functions (e.g., _convert_to_double, _copy_array_if_base_present) in pdist that assume the underlying data is completely numeric, which is not true when you have a DataFrame with categorical data. I volunteer to integrate this code and make it available in a fork; you can assign this ticket to me. |
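The kind of failure described above can be reproduced without touching SciPy's private helpers. The sketch below only assumes that such helpers end up coercing the input to a floating-point array; it uses plain `np.asarray` and invented data to show that the coercion raises once a categorical column is present.

```python
import numpy as np

# Mixed-type table stored as an object array: (age, marital status).
mixed = np.array([[25, "single"], [40, "married"]], dtype=object)

# Coercing the whole table to double fails on the string column.
try:
    np.asarray(mixed, dtype=np.double)
    converted = True
except ValueError:
    converted = False

# The numeric column alone converts without trouble.
numeric_only = np.asarray(mixed[:, :1], dtype=np.double)
```

This is why a mixed-type metric has to split the columns by type before any numeric conversion takes place.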
The github assign feature only works for team members
|
No worries, I'll fork it and you can get the code later. For me the important thing is to contribute. I'll let you know when it's done. |
Thanks @marcelobeckmann for taking this up. While you are at it (and if it is feasible for you), I was wondering if you would consider adding support for gower calculation on data with NaN values also, as implemented in the daisy package in R (which you have also referenced above)? |
I finished the integration of Gower into sklearn.metrics.pairwise (including the treatment of NaN values). I'm going to prepare some unit tests before submitting my forked code. |
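One common way to treat missing values in Gower's coefficient is to skip any feature where either record is missing and to average over the features that were actually compared. The sketch below illustrates that idea only; it is not the code from the pull request, and the helper names (`_is_missing`, `gower_distance_nan`) and data are hypothetical.

```python
import math


def _is_missing(v):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def gower_distance_nan(a, b, is_categorical, ranges):
    """Gower distance that ignores features missing in either record."""
    total, compared = 0.0, 0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if _is_missing(x) or _is_missing(y):
            continue  # this feature gets weight zero
        compared += 1
        if cat:
            total += 0.0 if x == y else 1.0
        elif r > 0:
            total += abs(x - y) / r
    # With nothing to compare, the distance is undefined.
    return total / compared if compared else math.nan


is_categorical = [False, True, False]
ranges = [15.0, 0.0, 1500.0]  # assumed precomputed column ranges

a = (25.0, "single", float("nan"))
b = (40.0, "married", 3000.0)
d_ab = gower_distance_nan(a, b, is_categorical, ranges)  # income is skipped
```

Here only age and marital status are compared, so the distance is averaged over two features rather than three.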
@marcelobeckmann Great! Thank you so much, especially for including NaN support! :) PS: If I may suggest, you might want to consider initiating a pull request so the reviewers can begin looking at your code while you work on the unit tests and so forth. |
I made a pull request some days ago, b5884. |
Yes, it's in the queue to be reviewed.
|
I made the changes required by CI, and all the checks have passed. |
@marcelobeckmann great work! You might want to change row 659 to something like: I'm getting division-by-zero warnings in your second test case otherwise. |
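The snippet suggested for row 659 did not survive in this thread, so the following is not that change. It is only one common way to avoid division-by-zero warnings when a numeric column is constant (range 0): divide only where the range is non-zero and leave the contribution at zero elsewhere. The function name `scaled_abs_diff` is hypothetical.

```python
import numpy as np


def scaled_abs_diff(x, y, ranges):
    """Per-feature |x - y| / range, with 0 where the range is 0."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    ranges = np.asarray(ranges, dtype=float)
    out = np.zeros_like(diff)
    # `where` skips the division for zero ranges, so no warning is emitted
    # and those positions keep the zeros already stored in `out`.
    np.divide(diff, ranges, out=out, where=ranges != 0)
    return out


# Second feature is constant in the data, so its range is 0.
result = scaled_abs_diff([1.0, 5.0], [3.0, 5.0], [4.0, 0.0])
```

A constant column cannot distinguish any two rows, so a zero contribution is a natural choice for it.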
Hi, I changed the code to avoid warnings as proposed by Pierre Wessman, and CI is green. I need someone to review my code. |
@marcelobeckmann and potentially others. Hi Marcelo (or potentially others), I have a few quick questions regarding your implementation of the Gower coefficient, which you have placed here: https://sourceforge.net/projects/gower-distance-4python/files/.
Would you be able to advise me on the above, please? Thanks very much |
Any updates? @marcelobeckmann |
Work in progress after review. |
Has the PR been approved? @marcelobeckmann |
Not yet, work is in progress after some recent code review. |
Too bad, I need it. Is the function alone available somewhere, so I can use it on my own (for research purposes)? Thanks |
You can take the latest commit of this function in this PR: |
I managed to make it work locally. Thanks! |
Just a quick +1 on this ticket! Thanks for all the work on this. |
Bump. This would be a great addition. I can't believe it has taken 4 years for a relatively simple calculation to make it into sklearn!! |
Or you could say: thanks for your dedicated persistence over four years of
volunteered effort!
|
You are right, sorry. I didn't mean to come across as rude. I greatly appreciate the effort. I've been using this locally for a while now, and it would be great to see it added. It's the only distance metric that I know of for mixed data types. |
Aside from the volunteer effort, and that the core devs have not considered
this urgent, there are indeed challenges around how to handle mixed types,
and around how to perform the scaling in a train-test setup.
|
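To make the train-test scaling question concrete: the numeric ranges that Gower divides by have to come from somewhere, and one option is to estimate them on the training rows and reuse them for new rows. The sketch below assumes exactly that. The function names and the `clip` option are hypothetical choices made for the illustration, not decisions taken in this thread.

```python
def fit_ranges(train_rows, is_categorical):
    """Per-column range (max - min) of numeric columns, from training rows only."""
    ranges = []
    for k, cat in enumerate(is_categorical):
        if cat:
            ranges.append(0.0)  # placeholder, unused for categorical columns
        else:
            col = [row[k] for row in train_rows]
            ranges.append(max(col) - min(col))
    return ranges


def gower_distance_fitted(a, b, is_categorical, ranges, clip=True):
    """Gower distance using ranges learned elsewhere (e.g. on training data)."""
    total = 0.0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if cat:
            total += 0.0 if x == y else 1.0
        elif r > 0:
            d = abs(x - y) / r
            # New data can fall outside the training range; clipping keeps
            # each feature's contribution within [0, 1].
            total += min(d, 1.0) if clip else d
    return total / len(a)


train = [(20.0, "a"), (30.0, "b")]
is_categorical = [False, True]
ranges = fit_ranges(train, is_categorical)

# Two unseen rows, one of them outside the training range.
new_a, new_b = (20.0, "a"), (45.0, "a")
clipped = gower_distance_fitted(new_a, new_b, is_categorical, ranges)
unclipped = gower_distance_fitted(new_a, new_b, is_categorical, ranges, clip=False)
```

Without clipping the distance exceeds 1 here, which shows why the behaviour on unseen data needs an explicit decision.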
Looking forward to it in sklearn. |
Someone who claims to have "borrowed ideas" from this thread has released a package on github to calculate Gower distance (similarity, technically). Speaking of distance and similarity, the example is identical to the one from @marcelobeckmann. I've only glanced at the code so far, but here's a glimpse: From @marcelobeckmann's notebook:
From "Michael Yan":
|
Hi guys, thanks for keeping an eye on this. I'm glad people are taking the code and trying to improve it; that is the purpose of open source, although some credit is appreciated. Hopefully this code will become part of scikit-learn if PR #9555 is accepted. Best regards, Marcelo Beckmann |
Good luck in the process!! |
@marcelobeckmann - what is the data size limitation for this implementation? How different is it from https://pypi.org/project/gower/? |
Hi,
You can ask your sklearn algorithm to use any distance, including Gower,
row by row, i.e. pair by pair (that's why we have the pairwise module). That
way the size of the data won't be an issue, although it is more expensive in
terms of computation.
That said, I'm no longer involved in this project, and I'm not aware of
the decisions about moving to scipy.
Best regards,
|
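The row-by-row idea can be sketched independently of scikit-learn: instead of materialising the full n x n matrix, yield one row of distances at a time and reduce it immediately. Everything below (function names, data, the nearest-neighbour reduction) is a hypothetical illustration of that trade-off, not the pairwise module's implementation.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Gower distance between two records (no missing values)."""
    total = 0.0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if cat:
            total += 0.0 if x == y else 1.0
        elif r > 0:
            total += abs(x - y) / r
    return total / len(a)


def iter_distance_rows(rows, is_categorical, ranges):
    """Yield distances from each record to all records, one row at a time."""
    for a in rows:
        yield [gower_distance(a, b, is_categorical, ranges) for b in rows]


rows = [
    (25, "single", 1500.0),
    (40, "married", 3000.0),
    (25, "single", 2500.0),
    (38, "married", 2900.0),
]
is_categorical = [False, True, False]
ranges = [15, 0.0, 1500.0]  # age and income ranges of the rows above

# Nearest neighbour of each record; only one distance row is alive at a time.
nearest = []
for i, dist_row in enumerate(iter_distance_rows(rows, is_categorical, ranges)):
    candidates = [(d, j) for j, d in enumerate(dist_row) if j != i]
    nearest.append(min(candidates)[1])
```

Memory stays proportional to one row of distances, while the number of pairwise evaluations is still quadratic in the number of records.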
Any news? Can I help? |
Why isn't this implementation (https://github.com/adrinjalali/scikit-learn/blob/7b6278cd157415c87dd3c596e885b4d1de3e0d45/sklearn/metrics/pairwise.py#L870) used? |
Someone proposed abandoning this implementation in favour of a new one
that would use scipy to combine Manhattan distance with another distance (I
can't remember the second distance right now) to provide the Gower
calculation, and the scikit-learn reviewers agreed with that. After that I
became busy with other projects, and then Covid came and messed up
everything for everyone.
The person who proposed the new approach never had time to implement it
anyway, but anyone is very welcome to pick up the existing PR and keep it
going.
Unfortunately I don't have time to volunteer on this anymore. I wish I
could buy more time, like I buy more books.
|
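The thread does not say which second distance was meant, so the following is only a guess at what such a decomposition could look like: a Manhattan distance on range-scaled numeric columns plus a simple mismatch count (Hamming-style) on categorical columns, divided by the total number of features. All names and numbers are hypothetical, and the numeric ranges are assumed to be non-zero.

```python
def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


def mismatches(u, v):
    """Number of positions where two categorical vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def gower_from_parts(num_a, num_b, cat_a, cat_b, ranges):
    """Gower distance assembled from a numeric part and a categorical part."""
    # Scaling each numeric column by its range turns Manhattan distance
    # into the sum of Gower's numeric contributions.
    scaled_a = [x / r for x, r in zip(num_a, ranges)]
    scaled_b = [x / r for x, r in zip(num_b, ranges)]
    n_features = len(num_a) + len(cat_a)
    return (manhattan(scaled_a, scaled_b) + mismatches(cat_a, cat_b)) / n_features


# (age, income) numeric; (marital status,) categorical.
d = gower_from_parts([25, 1500.0], [40, 3000.0], ["single"], ["married"], [15, 1500.0])
```

The appeal of this split is that each part can reuse an existing, well-tested numeric or categorical distance routine.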
Hi Marcelo,
I remember this PR from its very early days, many years ago. It is very
unfortunate that the PR did not get merged after all the work you put in.
That being said, have you released or considered releasing this as a
standalone package?
I very much appreciate all the work you put into this!
Best,
Ashim
|
Actually, there is one standalone package derived from this code (
https://pypi.org/project/gower/).
The person who prepared that package also credited me as the provider of
the core code. That's the beauty of open source.
Running pip install gower does the job.
|
Great to hear that someone released a pkg and you received the appropriate
credit. :)
|
Do you think there are any prospects of it being included in scikit learn? |
Just adding that I strongly recommend becoming a scikit-learn volunteer. It's
an experience that adds a lot to your knowledge and career, and you
can mention it in your CV.
Anyone is welcome to pick an open issue and create a pull request with code
to solve that issue, or to act as a reviewer on a pull request.
On Thu 4 May 2023, 17:04 Marcelo Beckmann wrote:
> Hi,
> I think this is not a priority for scikit-learn anymore, but scikit-learn
> code is mostly the work of volunteers, and nothing is stopping anyone from
> championing this PR and getting it approved. Unfortunately, I no longer have
> time available for this.
|
As suggested by @lesshaste
Paper - http://cbio.ensmp.fr/~jvert/svn/bibli/local/Gower1971general.pdf
I can implement this if there is sufficient interest?
@jnothman @amueller @agramfort