
Why are you doing the K-means clustering? #32

Open
AakashKeswani opened this issue Oct 10, 2023 · 4 comments

Comments

@AakashKeswani
Contributor

Hi,

First I'd like to say thanks for publishing this repo! It's very helpful.

My question specifically refers to this description in the README:

> After the specified Optuna trials are complete, a 3-step KMeans clustering method is used to select the optimal parameter(s):
>
> 1. Each trial is placed in its nearest neighbor cluster based on its distance correlation to the target. The optimal number of clusters is determined using the elbow method. The cluster with the highest average correlation is selected with respect to its membership. In other words, a weighted score is used to select the cluster with the highest correlation but also with the most trials.
> 2. After the best correlation cluster is selected, the parameters of the trials within the cluster are also clustered. Again, the best cluster of indicator parameter(s) is selected with respect to its membership.
> 3. Finally, the centered best trial is selected from the best parameter cluster.
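
To make sure I'm reading step 1 correctly, here's a rough sketch of what I understand it to be doing (my own reconstruction in scikit-learn, not your actual code; the elbow heuristic and the exact weighted score are guesses at the details):

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_best_correlation_cluster(correlations, max_k=10):
    """Cluster per-trial correlations and return the indices of the best cluster."""
    X = np.asarray(correlations, dtype=float).reshape(-1, 1)

    # Elbow heuristic (a guess at the details): fit KMeans over a range of k
    # and take the k at the point of maximum curvature of the inertia curve.
    ks = range(1, min(max_k, len(X)) + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]
    k = int(np.argmax(np.diff(inertias, 2))) + 2 if len(inertias) > 2 else 1

    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # "Weighted score": mean correlation scaled by cluster membership, so a
    # large cluster of good trials can beat a tiny cluster of great ones.
    best_cluster, best_score = 0, -np.inf
    for c in range(k):
        members = km.labels_ == c
        score = X[members].mean() * (members.sum() / len(X))
        if score > best_score:
            best_cluster, best_score = c, score
    return np.where(km.labels_ == best_cluster)[0]
```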

Since you are clustering by the correlation, and then picking the cluster with the best mean correlation to the target, I'm not sure what this step is achieving on its own. Why not just use the parameters from the trial with the highest correlation itself?

I can see how this would be useful if you were clustering by the parameters instead of the correlations. (That way you avoid outlier/overfit parameters by making sure you're using a cluster of similar parameters that all have a high correlation.) But neither the description nor the implementation seems to actually use the parameter values in the clustering; they only cluster the scores.

Alternatively, a k-fold optimization could also help control for overfitting, though I guess the user can implement that themselves if they want to.
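
For what it's worth, something like this is what I have in mind for the k-fold idea (a user-side sketch, nothing from this repo; `indicator_fn` and the contiguous folds are just stand-ins):

```python
import numpy as np

def kfold_correlation(prices, target, param, indicator_fn, k=5):
    """Mean |correlation| of indicator_fn(prices, param) vs. target across k folds."""
    folds = np.array_split(np.arange(len(prices)), k)
    scores = []
    for idx in folds:
        # A robust parameter has to correlate on every slice, not just overall;
        # assumes indicator_fn returns a series aligned with its input slice.
        ind = indicator_fn(prices[idx], param)
        scores.append(abs(np.corrcoef(ind, target[idx])[0, 1]))
    return float(np.mean(scores))
```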

Thanks again!
-Aakash

@jmrichardson
Owner

Hi @AakashKeswani

The first cluster is based on the correlation to the target, as you mentioned. The goal is to cluster the trials by correlation and choose the cluster with the highest mean correlation, which effectively gives you the set of trials that performed best. The second clustering is by trial parameters (not correlation), to avoid overfit parameters as you suggested: within the best-correlation cluster, it lets you choose a trial that, in theory, has close neighbors in parameter space which also performed well. Hope this makes sense.
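
Roughly, the second step does something like this (a simplified sketch, not the exact `optimize.py` code; "best with respect to membership" is reduced here to just picking the most populated cluster):

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_centered_trial(param_matrix, n_clusters=3):
    """Return the index of the 'centered' trial in the most populated parameter cluster."""
    X = np.asarray(param_matrix, dtype=float)  # rows = trials, cols = parameter values
    km = KMeans(n_clusters=min(n_clusters, len(X)), n_init=10, random_state=0).fit(X)

    # "Best with respect to membership", simplified to the largest cluster.
    best = int(np.argmax(np.bincount(km.labels_)))

    # The centered trial: the member closest to the chosen cluster's centroid.
    members = np.where(km.labels_ == best)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[best], axis=1)
    return int(members[np.argmin(dists)])
```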

@AakashKeswani
Contributor Author

Okay, that makes more sense, thanks!

The only issue with this is that the parameters would need to be normalized (or at least variance-scaled) before clustering, since k-means uses an isotropic (Euclidean) distance.
I'm not sure it makes much of a difference here, though, since most parameters are just numbers of days, so they're on comparable scales.

@jmrichardson
Owner

Yes, you are correct: if the clustered parameters have different scales, one parameter can disproportionately influence the clustering result, leading to biased or incorrect clusters. Although many parameters are in days, there are many indicators with different units. If I recall correctly, the code uses a min-max scaler, but I'm not sure as I don't have time to review at the moment. Happy to accept a PR if you can review the optimize.py code; if not, I will try to have a look in a few weeks. Thanks for pointing this out.
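
For reference, the fix under discussion would look something like this if it turns out to be missing (MinMaxScaler matches my recollection above, but that needs verifying against `optimize.py`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Two parameters on very different scales, e.g. a lookback in days and a ratio.
params = np.array([[10, 0.5], [200, 0.6], [15, 0.4], [190, 0.7]])

# Scaling first keeps the 'days' column from dominating the Euclidean distance.
model = make_pipeline(MinMaxScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(params)
```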
