
Why are you doing the K-means clustering? #32

Open
AakashKeswani opened this issue Oct 10, 2023 · 4 comments

Comments

@AakashKeswani
Contributor

Hi,

First I'd like to say thanks for publishing this repo! It's very helpful.

My question specifically refers to this description in the README:

> After the specified Optuna trials are complete, a 3-step KMeans clustering method is used to select the optimal parameter(s):
>
> 1. Each trial is placed in its nearest neighbor cluster based on its distance correlation to the target. The optimal number of clusters is determined using the elbow method. The cluster with the highest average correlation is selected with respect to its membership. In other words, a weighted score is used to select the cluster with the highest correlation but also with the most trials.
> 2. After the best correlation cluster is selected, the parameters of the trials within the cluster are also clustered. Again, the best cluster of indicator parameter(s) is selected with respect to its membership.
> 3. Finally, the centered best trial is selected from the best parameter cluster.
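
To make sure I'm reading step 1 correctly, here's a rough sketch of what I understand it to be doing (my own reconstruction in scikit-learn, not your actual code; the elbow heuristic and the exact weighted score are guesses at the details):

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_best_correlation_cluster(correlations, max_k=10):
    """Cluster per-trial correlations and return the indices of the best cluster."""
    X = np.asarray(correlations, dtype=float).reshape(-1, 1)

    # Elbow heuristic (a guess at the details): fit KMeans over a range of k
    # and take the k at the point of maximum curvature of the inertia curve.
    ks = range(1, min(max_k, len(X)) + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]
    k = int(np.argmax(np.diff(inertias, 2))) + 2 if len(inertias) > 2 else 1

    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # "Weighted score": mean correlation scaled by cluster membership, so a
    # large cluster of good trials can beat a tiny cluster of great ones.
    best_cluster, best_score = 0, -np.inf
    for c in range(k):
        members = km.labels_ == c
        score = X[members].mean() * (members.sum() / len(X))
        if score > best_score:
            best_cluster, best_score = c, score
    return np.where(km.labels_ == best_cluster)[0]
```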

Since you are clustering by the correlation, and then picking the cluster with the best mean correlation to the target, I'm not sure what this step is achieving on its own. Why not just use the parameters from the trial with the highest correlation itself?

I can see how this would be useful if you were clustering by the parameters instead of the correlations. (That way you avoid outlier/overfit parameters by making sure you're using a cluster of similar parameters that all have a high correlation.) But neither the description nor the implementation seems to actually use the parameter values in the clustering; they only cluster the scores.

Alternatively, a k-fold optimization could also help control for overfitting, though I guess the user can implement that themselves if they want to.
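
For what it's worth, something like this is what I have in mind for the k-fold idea (a user-side sketch, nothing from this repo; `indicator_fn` and the contiguous folds are just stand-ins):

```python
import numpy as np

def kfold_correlation(prices, target, param, indicator_fn, k=5):
    """Mean |correlation| of indicator_fn(prices, param) vs. target across k folds."""
    folds = np.array_split(np.arange(len(prices)), k)
    scores = []
    for idx in folds:
        # A robust parameter has to correlate on every slice, not just overall;
        # assumes indicator_fn returns a series aligned with its input slice.
        ind = indicator_fn(prices[idx], param)
        scores.append(abs(np.corrcoef(ind, target[idx])[0, 1]))
    return float(np.mean(scores))
```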

Thanks again!
-Aakash

@jmrichardson
Owner

Hi @AakashKeswani

The first cluster is based on the correlation to the target, as you mentioned. The goal is to cluster the trials by correlation and choose the cluster with the highest mean correlation, which effectively gives you the set of trials that performed best. The second clustering is by trial parameters (not correlation), to avoid overfit parameters as you suggested: within the best-correlation cluster, it lets you choose a trial that, in theory, has close neighbors in parameter space which also performed well. Hope this makes sense.
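
Roughly, the second step does something like this (a simplified sketch, not the exact `optimize.py` code; "best with respect to membership" is reduced here to just picking the most populated cluster):

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_centered_trial(param_matrix, n_clusters=3):
    """Return the index of the 'centered' trial in the most populated parameter cluster."""
    X = np.asarray(param_matrix, dtype=float)  # rows = trials, cols = parameter values
    km = KMeans(n_clusters=min(n_clusters, len(X)), n_init=10, random_state=0).fit(X)

    # "Best with respect to membership", simplified to the largest cluster.
    best = int(np.argmax(np.bincount(km.labels_)))

    # The centered trial: the member closest to the chosen cluster's centroid.
    members = np.where(km.labels_ == best)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[best], axis=1)
    return int(members[np.argmin(dists)])
```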

@AakashKeswani
Contributor Author

Okay, that makes more sense, thanks!

The only issue with this is that the parameters would need to be normalized (or at least variance-scaled) before clustering, since k-means uses an isotropic (Euclidean) distance.
I'm not sure it makes much of a difference here, though, since most parameters are just numbers of days, so they're on comparable scales.

@jmrichardson
Owner

Yes, you are correct: if the clustered parameters have different scales, one parameter can disproportionately influence the clustering result, leading to biased or incorrect clusters. Although many parameters are in days, there are many indicators with different units. If I recall correctly, the code uses a min-max scaler, but I'm not sure as I don't have time to review at the moment. Happy to accept a PR if you can review the optimize.py code; if not, I will try to have a look in a few weeks. Thanks for pointing this out.
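
For reference, the fix under discussion would look something like this if it turns out to be missing (MinMaxScaler matches my recollection above, but that needs verifying against `optimize.py`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Two parameters on very different scales, e.g. a lookback in days and a ratio.
params = np.array([[10, 0.5], [200, 0.6], [15, 0.4], [190, 0.7]])

# Scaling first keeps the 'days' column from dominating the Euclidean distance.
model = make_pipeline(MinMaxScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(params)
```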
