Implement Gower Similarity Coefficient #5884

Open
raghavrv opened this issue Nov 19, 2015 · 61 comments · May be fixed by #16834

Comments

@raghavrv
Member

As suggested by @lesshaste

Paper - http://cbio.ensmp.fr/~jvert/svn/bibli/local/Gower1971general.pdf

I can implement this if there is sufficient interest?

@jnothman @amueller @agramfort

@raghavrv changed the title from "Implement Gower Similarity Metric" to "Implement Gower Similarity Coefficient" on Nov 19, 2015
@lesshaste

Thanks.

The R documentation for daisy might be relevant too: https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html. It is a popular use case for the Gower coefficient.

@agramfort
Member

agramfort commented Nov 19, 2015 via email

@lesshaste

@agramfort I suggested it on Gitter. The main interest for this coefficient is when the variables have mixed types (that is, categorical, numerical, or ordinal). One popular use case is the daisy() function in R's cluster package, mentioned above, for clustering data with mixed types (see page 27 of https://cran.r-project.org/web/packages/cluster/cluster.pdf). More generally, http://www.clustan.talktalk.net/gower_similarity.html claims "Gower's General Similarity Coefficient is one of the most popular measures of proximity for mixed data types", which seems plausible.
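
For readers unfamiliar with the coefficient, here is a minimal sketch of the idea for a single pair of records. It assumes the simplest form: a categorical feature scores 1 on an exact match and 0 otherwise, a numeric feature scores 1 minus the absolute difference divided by that feature's range, and all features are weighted equally. Ordinal features are not treated here. The name `gower_similarity` and its parameters are illustrative only; they are not part of scikit-learn or of any implementation discussed in this thread.

```python
def gower_similarity(a, b, is_categorical, ranges):
    """Gower similarity between two records with mixed feature types.

    a, b            -- sequences holding one value per feature
    is_categorical  -- sequence of bool, True where the feature is categorical
    ranges          -- per-feature range (max - min over the data set) for
                       numeric features; the entry is ignored for categorical ones
    """
    scores = []
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if cat:
            # Categorical: exact match or nothing.
            scores.append(1.0 if x == y else 0.0)
        elif r == 0:
            # A zero range means every record shares this value.
            scores.append(1.0)
        else:
            # Numeric: range-normalised closeness in [0, 1].
            scores.append(1.0 - abs(x - y) / r)
    # Equal weights: the coefficient is the mean of the per-feature scores.
    return sum(scores) / len(scores)


# Hypothetical records: (colour, price, rating)
similarity = gower_similarity(
    ("red", 10.0, 1.0),
    ("red", 20.0, 3.0),
    is_categorical=(True, False, False),
    ranges=(None, 40.0, 4.0),
)
```

The per-feature scores in this example are 1, 0.75 and 0.5, so the similarity is 0.75; the corresponding dissimilarity is one minus that value.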

@agramfort
Member

agramfort commented Nov 19, 2015 via email

@lesshaste

@agramfort I think it's more that we currently have no other way of calculating a dissimilarity coefficient for mixed data types, and this appears to be the standard one. I can find lots of examples and questions/answers online where people explain what the Gower coefficient is or suggest its use for mixed data types, but nothing I could call a benchmark yet. The original paper has been cited 2298 times according to Google Scholar.

@agramfort
Member

agramfort commented Nov 19, 2015 via email

@lesshaste

@agramfort Great! This change would nicely complement #4899, which introduces native categorical variable support for trees.

Having said that, I now realise that scikit-learn has no native support for ordinals at all currently so this part of my suggestion would be slightly ahead of its time. I suppose one could regard it in a positive way as the first step in support for ordinal features.

@raghavrv
Member Author

@amueller To be tagged with [New Feature]...

@marcelobeckmann

Hi,

To contribute, I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module, because pdist internally performs several numerical transformations that fail on a matrix with mixed data.

The results I have obtained so far match those of R's daisy function.

The source code is available in this Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/

Feel free to use it.

@ashimb9
Contributor

ashimb9 commented Jul 16, 2017

I was just wondering if there was any update on this? Plus, is the issue noted by @marcelobeckmann still relevant?

@agramfort
Member

@ashimb9 it seems we need someone to integrate the code from @marcelobeckmann

@ashimb9
Contributor

ashimb9 commented Jul 16, 2017

@agramfort Hmm, in that case I am going to have a go when I have some free time. By the way, do you happen to know anything about the current state of the issue noted above: "in the pdist module, because internally the pdist makes several numerical transformations that will fail if you use a matrix with mixed data"

@marcelobeckmann

Hi, there are some private functions (e.g., _convert_to_double, _copy_array_if_base_present) in pdist that assume the underlying data is completely numeric, which is not true when you have a DataFrame with categorical data.

I volunteer to integrate this code and make it available in a fork; you can assign this ticket to me.
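
To illustrate the kind of failure described here, the sketch below forces a mixed-type array to double precision using plain NumPy. The helper `to_double` is a hypothetical stand-in for a "convert everything to float64" step; it is not the SciPy private function, whose exact behaviour may differ between versions.

```python
import numpy as np

# One categorical column and one numeric column, stored as Python objects.
mixed = np.array([["red", 1.5], ["blue", 2.0]], dtype=object)


def to_double(X):
    # Hypothetical stand-in for a numeric-only conversion step.
    return np.ascontiguousarray(X, dtype=np.double)


try:
    to_double(mixed)
    converted = True
except ValueError:
    # "red" and "blue" cannot be parsed as floats, so the conversion fails.
    converted = False

# The numeric column on its own converts without trouble.
numeric_only = to_double(mixed[:, 1:])
```

This is why a mixed-type metric has to split the columns by type before any numeric conversion is applied, rather than passing the whole matrix through a numeric-only code path.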

@jnothman
Member

jnothman commented Jul 17, 2017 via email

@marcelobeckmann

No worries, I'll fork it and you can get the code later. For me, the important thing is to contribute. I'll let you know when it's done.

@ashimb9
Contributor

ashimb9 commented Jul 17, 2017

Thanks @marcelobeckmann for taking this up. While you are at it (and if it is feasible for you), I was wondering if you would consider adding support for Gower calculation on data with NaN values as well, as implemented in the daisy function in R (which you have also referenced above)?
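
For context, Gower's coefficient accommodates missing values through a per-feature indicator that drops a feature from both the numerator and the denominator whenever it cannot be compared. A sketch of the general form follows, where $s_{ijk}$ is the similarity of records $i$ and $j$ on feature $k$ and $p$ is the number of features. How daisy or the implementation in this thread handle the details is not shown here.

```latex
S_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}},
\qquad
\delta_{ijk} =
\begin{cases}
0 & \text{if feature } k \text{ is missing in record } i \text{ or } j,\\
1 & \text{otherwise.}
\end{cases}
```

The coefficient is undefined for a pair of records with no comparable feature, since the denominator is then zero.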

@marcelobeckmann

marcelobeckmann commented Aug 9, 2017

I finished the integration of Gower into sklearn.metrics.pairwise (including the treatment of NaN values). I'm going to prepare some unit tests before submitting my forked code.

@ashimb9
Contributor

ashimb9 commented Aug 9, 2017

@marcelobeckmann Great! Thank you so much, especially for including NaN support! :)

PS: If I may suggest, you might want to consider initiating a pull request so the reviewers can begin looking at your code while you work on the unit tests and so forth.

@marcelobeckmann

I made a pull request some days ago, b5884.

@jnothman
Member

jnothman commented Aug 17, 2017 via email

@marcelobeckmann

marcelobeckmann commented Oct 3, 2017

I made the changes required by CI, and all the checks have passed.

@pierrewessman

pierrewessman commented Oct 12, 2017

@marcelobeckmann great work! You might want to change row 659 to something like:

    ranges_of_numeric[col] = (1 - min / max) if max != 0 else 0.0

I'm getting division-by-zero warnings in your second test case otherwise.
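
A vectorised way to avoid the same warning is to let NumPy skip the zero denominators, using the `out` and `where` arguments of `np.divide`. The sketch below is illustrative only: the function name `scale_by_max` and the sample data are assumptions, and it is not the code from the PR.

```python
import numpy as np


def scale_by_max(X, col_max):
    """Divide each column of X by its maximum, leaving 0 where the maximum is 0."""
    X = np.asarray(X, dtype=float)
    col_max = np.asarray(col_max, dtype=float)
    # Positions excluded by `where` keep the value already in `out`, so
    # pre-filling with zeros defines the result for all-zero columns.
    out = np.zeros_like(X)
    np.divide(X, col_max, out=out, where=(col_max != 0))
    return out


# The second column is all zeros, so its maximum is zero.
X = [[2.0, 0.0], [4.0, 0.0]]
scaled = scale_by_max(X, np.max(X, axis=0))
```

Because the division is never evaluated for the masked column, no warning is raised and that column stays at zero.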

@marcelobeckmann

Hi, I changed the code to avoid warnings as proposed by Pierre Wessman, and CI is green. I need someone to review my code.

@Ali-ry

Ali-ry commented Nov 30, 2017

@marcelobeckmann and potentially others.

Hi Marcelo (or potentially others), I have a few quick questions regarding your implementation of the Gower coefficient, which you have placed here: https://sourceforge.net/projects/gower-distance-4python/files/.

  1. Do I need a pandas DataFrame to feed the original data into the function, or can I use a NumPy array too?

  2. I am importing my data into a NumPy array. All columns are real-valued apart from the first column, which is the unique ID. I am getting two issues:

  • Firstly, when I run the function, it returns a DataConversionWarning saying that dtype U7 was converted to object. I assumed this was because the array entries for some reason appear in quotation marks and hence are strings. So I cast the array entries to int32, for example, and it still gives the conversion warning, saying int32 was converted to object.

  • Secondly, and probably linked to the above, every time I run the function and plot the result I receive a different visualisation (a different spread of the points).

Would you be able to advise me on the above, please?

Thanks very much
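
One possible explanation for the first symptom, offered as an assumption since the data is not shown: a NumPy array holds a single dtype, so when a string ID column shares an array with numeric columns, every entry is stored as a string. The sketch below uses made-up IDs and values to show the effect and one way to separate the ID column before computing distances.

```python
import numpy as np

# Hypothetical data: a string ID followed by two numeric columns.
raw = np.array([["ID00001", 1.5, 2.0],
                ["ID00002", 3.0, 4.5]])

# Mixing strings and numbers in one array yields a Unicode string dtype,
# so the numbers are stored as text such as "1.5".
is_string_array = raw.dtype.kind == "U"

# Keep the IDs aside and convert only the numeric columns back to floats.
ids = raw[:, 0]
X = raw[:, 1:].astype(float)
```

With the ID column removed, `X` is an ordinary float array. An identifier that is unique per row carries no similarity information in any case, so it is usually left out of a distance computation.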

@erickalfaro

Any updates? @marcelobeckmann

@marcelobeckmann

Work in progress after review.

@lsabi

lsabi commented May 10, 2019

Has the PR been approved? @marcelobeckmann

@marcelobeckmann

Not yet, work is in progress after some recent code review.

@lsabi

lsabi commented May 11, 2019

Too bad, I need it.

Is the function itself available somewhere, so I can use it on my own (for research purposes)?

Thanks

@marcelobeckmann

You can take the latest commit of this function from this PR: #9555

@lsabi

lsabi commented May 13, 2019

I managed to make it work locally. Thanks!

@PhysB

PhysB commented May 30, 2019

Just a quick +1 on this ticket! Thanks for all the work on this.

@willbarnett

Bump. This would be a great addition. I can't believe it has taken 4 years for a relatively simple calculation to make it into sklearn!!

@jnothman
Member

jnothman commented Nov 13, 2019 via email

@willbarnett

> Or you could say: thanks for your dedicated persistence over four years of volunteered effort!

You are right, sorry. I didn't mean to come across as rude. I greatly appreciate the effort. I've been using this locally for a while now, and it would be great to see it added. It's the only distance metric that I know of for mixed data types.

@jnothman
Member

jnothman commented Nov 13, 2019 via email

@mohyneenm

Looking forward to it in sklearn.

@bzip2

bzip2 commented Jan 23, 2020

Someone who claims to have "borrowed ideas" from this thread has released a package on GitHub to calculate Gower distance (similarity, technically). Speaking of distance and similarity, the example is identical to the one from @marcelobeckmann. I've only glanced at the code so far, but here's a glimpse:

From @marcelobeckmann's notebook:

    # This is to normalize the numeric values between 0 and 1.
    X_num = np.divide(X_num ,max_of_numeric,out=np.zeros_like(X_num), where=max_of_numeric!=0)

From "Michael Yan":

    # This is to normalize the numeric values between 0 and 1.
    Z_num = np.divide(Z_num ,num_max,out=np.zeros_like(Z_num), where=num_max!=0)

@marcelobeckmann

Hi guys, thanks for keeping an eye on this.

I'm glad people are taking the code and trying to improve it; that is the purpose of open source, although some credit would be appreciated.

Hopefully this code will become part of scikit-learn if PR #9555 is accepted.

Best regards,

Marcelo Beckmann

@Bortrex

Bortrex commented Mar 25, 2020

Good luck in the process!!

@adrinjalali linked a pull request that will close this issue on Apr 3, 2020
@cmarmo removed the help wanted label on Jun 4, 2020
@wgova

wgova commented Sep 21, 2021

@marcelobeckmann - what is the data size limitation for this implementation?

How different is it from https://pypi.org/project/gower/?

@marcelobeckmann

marcelobeckmann commented Sep 29, 2021 via email

@k3ybladewielder

Any news? Can I help?

@msat59

msat59 commented May 2, 2023

Why isn't this implementation used?

@marcelobeckmann

marcelobeckmann commented May 2, 2023 via email

@ashimb9
Contributor

ashimb9 commented May 2, 2023 via email

@marcelobeckmann

marcelobeckmann commented May 2, 2023 via email

@ashimb9
Contributor

ashimb9 commented May 3, 2023 via email

@lesshaste

Do you think there is any prospect of it being included in scikit-learn?

@marcelobeckmann

marcelobeckmann commented May 9, 2023 via email
