
why the medoids are outside of clusters #88

Open
dinani65 opened this issue Jan 4, 2021 · 7 comments
@dinani65
dinani65 commented Jan 4, 2021

Hi,
I have used sklearn_extra to cluster my data based on cosine similarity. The data consists of 100-dimensional vectors.
After clustering, I reduce the dimensionality to visualize the result. I am a bit confused about `kmedoids.cluster_centers_`: does it return the medoids of the clusters? Shouldn't the medoids lie approximately in the middle of their clusters?
When I visualize the clusters, the `kmedoids.cluster_centers_` are outside of the clusters.
[image: cluster visualization with the medoids outside the clusters]

@TimotheeMathieu
Contributor

TimotheeMathieu commented Jan 4, 2021

Visualizing a 100-dimensional vector is a very tricky question. I don't know how you reduced the dimension, but because of the curse of dimensionality the geometry is not very intuitive; informally, we often say that in very high dimension all the points are far from one another. Even if a point is at the center of its cluster in the original space, its projection is not necessarily at the center of the projected cluster. The fact that you use cosine distance does not help, because cosine distance is not as intuitive as Euclidean distance either.

Example :

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_distances
d = 100
N = 100
np.random.seed(42)

X = np.random.normal(size=[N,d])

y = X[:, 0] > 0 # say the two clusters are the two half-spaces


D1 = cosine_distances(X[y], X[y]) # distance matrix within cluster 1
center1 = np.argmin(np.sum(D1, axis=1)) # point that minimizes the inertia


D2 = cosine_distances(X[~y], X[~y]) # distance matrix within cluster 2
center2 = np.argmin(np.sum(D2, axis=1)) # point that minimizes the inertia

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[center1][0]], [X[center1][1]], c="lime")
plt.scatter([X[center2][0]], [X[center2][1]], c="lime")

I obtain the following figure; the medoids are in green:

[image: two half-space clusters with the computed medoids in green]

EDIT: Note that it can also be a convergence problem; KMedoids is not very stable in high dimension (no clustering algorithm really is). It could also be a bug, but we can't conclude that it is a bug from what you describe alone.
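The "all points are far from one another" remark above can be checked numerically. A minimal NumPy sketch (Gaussian data, hypothetical helper `distance_spread`): as the dimension grows, the gap between the nearest and farthest pairwise distances shrinks relative to the distances themselves, so "near" and "far" lose meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n=200):
    """Relative spread (max - min) / min of all pairwise Euclidean distances."""
    X = rng.normal(size=(n, d))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D = D[np.triu_indices(n, k=1)]  # keep each pair once, drop the diagonal
    return (D.max() - D.min()) / D.min()

print(distance_spread(2))     # large relative spread in 2 dimensions
print(distance_spread(1000))  # much smaller spread in 1000 dimensions
```

With i.i.d. Gaussian points, distances concentrate around a common value in high dimension, which is one reason projected medoids can look misplaced.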

@kno10
Contributor

kno10 commented Feb 2, 2021

You used cosine similarity. Distance does not matter, only angle.
Depending on how your data was before you projected it, these two points may well have had the smallest sum of angles.
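The point that "only angle matters" with cosine similarity can be illustrated in a few lines. A sketch with a hand-rolled `cosine_distance` helper (a hypothetical name, using only NumPy): rescaling either vector by a positive factor leaves the cosine distance unchanged, because the magnitudes cancel.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b); depends only on the angle."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)

# Positive rescaling cancels in the numerator and denominator,
# so the distance is identical up to floating-point error.
print(np.isclose(cosine_distance(a, b), cosine_distance(5 * a, 0.1 * b)))  # True
```

So a point that looks far from a cluster in a Euclidean scatter plot may still have the smallest summed angular distance within it.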

@TimotheeMathieu
Contributor

With the Euclidean distance, we observe the same phenomenon:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
d = 100
N = 100
np.random.seed(42)

X = np.random.normal(size=[N,d])

y = X[:, 0] > 0 # say the two clusters are the two half-spaces


D1 = euclidean_distances(X[y], X[y]) # distance matrix within cluster 1
center1 = np.argmin(np.sum(D1, axis=1)) # point that minimizes the inertia


D2 = euclidean_distances(X[~y], X[~y]) # distance matrix within cluster 2
center2 = np.argmin(np.sum(D2, axis=1)) # point that minimizes the inertia

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[center1][0]], [X[center1][1]], c="lime")
plt.scatter([X[center2][0]], [X[center2][1]], c="lime")

[image: the same two clusters with Euclidean medoids in green]

@kno10
Contributor

kno10 commented Feb 2, 2021

center1 and center2 are indices into the y and ~y subsets in your code.
X[center1] is wrong; you want X[y][center1] and X[~y][center2].
Hence you plot the wrong points.
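The fix described here can be sketched in index form: `center1` indexes the boolean-masked subset, so plotting it against the full array requires either indexing the subset directly or translating the subset index back to a full-array index (`np.flatnonzero` is one way to do that; both give the same point).

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 100))
y = X[:, 0] > 0

# Medoid of cluster 1, computed within the subset X[y]
D1 = np.linalg.norm(X[y][:, None] - X[y][None, :], axis=-1)
center1 = np.argmin(D1.sum(axis=1))  # index into the *subset* X[y]

# Wrong: X[center1] indexes the full array with a subset index.
# Right: either index the subset directly...
medoid = X[y][center1]
# ...or translate the subset index back to a full-array index first.
full_index = np.flatnonzero(y)[center1]
print(np.array_equal(medoid, X[full_index]))  # True
```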

@TimotheeMathieu
Contributor

Thanks for catching the bug. Here is the result with the bug corrected; the medoids are still not in the middle of the clusters:

[image: corrected plot, medoids still off-center in the 2-D projection]

@kno10
Contributor

kno10 commented Feb 2, 2021

Because you only plot 2 of the 100 dimensions. The medoid is central with respect to all 100 dimensions jointly, not in each pair of dimensions individually.
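This can be checked directly: the medoid minimizes the summed distance in the full 100-dimensional space, but when the same cost is measured only in the two plotted coordinates, the full-space medoid is generally no better than (and at best ties) the 2-D optimum. A NumPy sketch, assuming Gaussian data as in the examples above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))

def sum_dist(P):
    """For each point, the summed Euclidean distance to all other points."""
    return np.linalg.norm(P[:, None] - P[None, :], axis=-1).sum(axis=1)

medoid = np.argmin(sum_dist(X))   # medoid in the full 100-D space
proj_costs = sum_dist(X[:, :2])   # the same cost, in the 2 plotted dimensions

# The full-dimensional medoid need not be the most central point of the
# projection; its projected cost is at least the 2-D minimum.
print(proj_costs[medoid] >= proj_costs.min())  # True
```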

@TimotheeMathieu
Contributor

TimotheeMathieu commented Feb 3, 2021

Yes, that was exactly my point, thank you.
