Feature Clustering: A Potential Addition to scikit-learn's Dimensionality Reduction Techniques #25992

EwoutH · 2023-03-28T07:18:55Z


Hello scikit-learn community!

I came across an interesting article discussing feature clustering as a solution to many high-dimensional machine learning problems. Feature clustering is an unsupervised technique that groups features of a dataset into homogeneous clusters, providing an alternative method for data reduction, acceleration of algorithms, and mitigation of issues related to the "curse of dimensionality". It has also found applications in synthetic data generation using GANs.

Considering scikit-learn's focus on machine learning and its existing set of clustering algorithms and dimensionality reduction techniques, I think feature clustering could be a valuable addition to the library. The original article suggests two Python implementations for feature clustering: one based on hierarchical clustering and another using connected components from graph theory. Both methods are more accessible than techniques like PCA, as they do not involve complex linear algebra or calculus. However, it's important to note that feature clustering might not always be the best solution, depending on the problem at hand and the underlying data structure.
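To make the connected-components idea concrete, here is a minimal sketch of how I understand it. This is not the article's code: the helper name `cluster_features_by_correlation`, the use of absolute Pearson correlation as the link criterion, and the 0.8 threshold are all my own assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def cluster_features_by_correlation(X, threshold=0.8):
    """Group columns of X that are linked by |correlation| > threshold.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    corr = np.corrcoef(X, rowvar=False)        # (n_features, n_features)
    adjacency = np.abs(corr) > threshold       # edge between strongly correlated features
    graph = csr_matrix(adjacency.astype(np.int8))
    n_clusters, labels = connected_components(graph, directed=False)
    return n_clusters, labels


# Synthetic data: columns 0-1 are near-copies of one signal, columns 2-3 of another.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.05 * rng.normal(size=200),
    base[:, 1], base[:, 1] + 0.05 * rng.normal(size=200),
])

n_clusters, labels = cluster_features_by_correlation(X)
print(n_clusters, labels)
```

Each connected component of the thresholded correlation graph becomes one feature cluster; how to summarize a cluster afterwards (mean, first member, etc.) is a separate choice.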

I would like to start a discussion on the possibility of implementing feature clustering in scikit-learn, and gather your thoughts on its potential benefits, drawbacks, and any modifications that might be necessary for a seamless integration with the library's existing functionalities.

Here's the link to the original article for further reading: Feature Clustering: A Simple Solution to Many Machine Learning Problems

adrinjalali · 2023-03-28T14:24:49Z

Maintainer

I'm a bit skeptical of that article. When somebody says "No linear algebra or calculus is required: the method is essentially math-free.", it makes me cringe. It's not math-free; we're just hiding it.

I'd be curious to see what @GaelVaroquaux thinks about this.

1 reply

EwoutH Mar 28, 2023
Author

The article itself isn't the best, but I think the idea of feature clustering for dimension reduction is interesting.

GaelVaroquaux · 2023-03-28T19:11:56Z

Maintainer

Maths is a tool to analyze an algorithm. Sure, anyone can write an algorithm. The question is: how do you characterize its properties?

The inclusion criteria for scikit-learn ( https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms ) specify that the method must be published and well cited for this specific purpose.

Another reason for scholarship work is to know the prior art and position oneself relative to it. Feature clustering for dimension reduction is a classic trick that has been published many times, and it is implemented in scikit-learn:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.FeatureAgglomeration.html#sklearn.cluster.FeatureAgglomeration

Two examples of publications using it (with a clear selection bias: I'm an author on both 😄):
https://dl.acm.org/doi/abs/10.5555/3042573.3042749
and
https://www.sciencedirect.com/science/article/pii/S0031320311001439
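
For reference, a minimal usage sketch of `FeatureAgglomeration`. The synthetic data and `n_clusters=5` are arbitrary choices for illustration, not a recommendation:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

# Synthetic data (assumption for illustration): 100 samples, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Merge the 20 features into 5 clusters with hierarchical (agglomerative) clustering.
agglo = FeatureAgglomeration(n_clusters=5)
X_reduced = agglo.fit_transform(X)   # each output column pools one cluster (mean by default)

print(X_reduced.shape)       # (100, 5)
print(agglo.labels_.shape)   # (20,): cluster index assigned to each original feature

# Map the reduced representation back to the original 20-feature space.
X_approx = agglo.inverse_transform(X_reduced)
print(X_approx.shape)        # (100, 20)
```

Since it is a regular transformer, it can also be placed in a `Pipeline` ahead of a supervised estimator.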

