Replies: 2 comments 1 reply
-
I'm a bit skeptical of that article. When somebody says "No linear algebra or calculus is required: the method is essentially math-free.", it makes me cringe. It's not math-free; we're just hiding it. I'd be curious to see what @GaelVaroquaux thinks about this.
-
Maths is a tool to analyze an algorithm. Sure, anyone can write an algorithm. The question is: how do you characterize its properties? The inclusion criteria for scikit-learn (https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) specify that a method must be published and well cited, for this specific purpose. Another reason for scholarly work is to know the prior art and to position oneself relative to it. Feature clustering for dimension reduction is a classic trick that has been published many times, and it is implemented in scikit-learn. Two examples of publications using it (with a clear selection bias, as I'm an author on them 😄):
-
Hello scikit-learn community!
I came across an interesting article discussing feature clustering as a solution to many high-dimensional machine learning problems. Feature clustering is an unsupervised technique that groups features of a dataset into homogeneous clusters, providing an alternative method for data reduction, acceleration of algorithms, and mitigation of issues related to the "curse of dimensionality". It has also found applications in synthetic data generation using GANs.
Considering scikit-learn's focus on machine learning and its existing set of clustering algorithms and dimensionality reduction techniques, I think feature clustering could be a valuable addition to the library. The original article suggests two Python implementations for feature clustering: one based on hierarchical clustering and another using connected components from graph theory. Both methods are more accessible than techniques like PCA, as they do not involve complex linear algebra or calculus. However, it's important to note that feature clustering might not always be the best solution, depending on the problem at hand and the underlying data structure.
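To make the two approaches concrete, here is a minimal sketch of what they might look like. This is not the article's code: the toy data, the `0.8` correlation threshold, and the helper names `make_data` and `cluster_by_connected_components` are assumptions made for illustration. The hierarchical variant uses scikit-learn's existing `FeatureAgglomeration` estimator, and the graph variant uses SciPy's `connected_components`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import FeatureAgglomeration


def make_data(n_samples=200, seed=0):
    """Toy data (hypothetical): 6 features built from 2 latent signals,
    each signal repeated 3 times with a little independent noise."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, 2))
    noise = 0.1 * rng.normal(size=(n_samples, 6))
    # Columns 0-2 follow the first signal, columns 3-5 the second.
    return np.repeat(latent, 3, axis=1) + noise


def cluster_by_connected_components(X, threshold=0.8):
    """Link two features when |correlation| > threshold (an assumed rule);
    each connected component of that graph is one feature cluster."""
    corr = np.corrcoef(X, rowvar=False)
    adjacency = csr_matrix((np.abs(corr) > threshold).astype(int))
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels


X = make_data()

# Graph-based variant: the number of clusters falls out of the threshold.
n_clusters, cc_labels = cluster_by_connected_components(X)

# Hierarchical variant: the number of clusters is chosen up front, and each
# cluster of features is pooled (mean by default) into one new feature.
agglo = FeatureAgglomeration(n_clusters=2)
X_reduced = agglo.fit_transform(X)

print("connected components:", n_clusters, cc_labels)
print("hierarchical labels: ", agglo.labels_)
print("reduced shape:       ", X_reduced.shape)
```

The main design difference is what you have to choose: a similarity threshold for the graph variant, or a target number of clusters for the hierarchical one. Neither choice is math-free, which is part of why the benefits and drawbacks are worth discussing.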
I would like to start a discussion on the possibility of implementing feature clustering in scikit-learn, and gather your thoughts on its potential benefits, drawbacks, and any modifications that might be necessary for a seamless integration with the library's existing functionalities.
Here's the link to the original article for further reading: Feature Clustering: A Simple Solution to Many Machine Learning Problems