kmeans-feature-importance

kmeans_interp is a wrapper around sklearn.cluster.KMeans that adds a feature_importances_ property, which serves as a cluster-based feature weighting technique. Features are weighted using one of two methods: wcss_min or unsup2sup.

Refer to the notebook 2021_07_12_interpretable_kmeans.ipynb for a direct demo.
Refer to my TDS article for more details: Interpretable K-Means: Clusters Feature Importances.

Methodology

WCSS Minimizer

This method directly analyzes each centroid's sub-optimal position. K-Means aims to minimize the Within-Cluster Sum of Squares (WCSS) and, consequently, to maximize the Between-Cluster Sum of Squares, assuming that the distance metric used is Euclidean. The Euclidean distance between a cluster centroid C_j and a point p_i is the square root of the sum of squared differences over all features.
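In standard notation, the distance can be written as follows. The symbols D (number of features) and k (feature index) are assumptions introduced here for illustration:

```latex
d(C_j, p_i) = \sqrt{\sum_{k=1}^{D} \left( p_{i,k} - C_{j,k} \right)^2}
```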

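The WCSS of a single cluster C_j containing m points is the sum of the squared distances from each of those points to the centroid (please excuse the index i being reused with a different meaning).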
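Written out under the same assumed notation (D features indexed by k, and the m points of the cluster indexed by i), the per-cluster WCSS is:

```latex
\mathrm{WCSS}(C_j) = \sum_{i=1}^{m} d(C_j, p_i)^2
                   = \sum_{i=1}^{m} \sum_{k=1}^{D} \left( p_{i,k} - C_{j,k} \right)^2
```

The inner sum splits by feature, which is what makes it possible to ask how much each feature contributes on its own.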

We then look for the feature d_i that was responsible for the largest share of WCSS minimization (WCSS being the sum of the squared distances from each data point to its cluster centroid) by finding the maximum absolute centroid movement along each dimension.
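As a rough illustration of this idea, the sketch below ranks features per cluster by the absolute value of each centroid coordinate. It is not the package's code: the helper name wcss_min_importances is hypothetical, and measuring "movement" as the absolute coordinate assumes the features are centered and on comparable scales.

```python
import numpy as np
from sklearn.cluster import KMeans


def wcss_min_importances(centroids, feature_names):
    """Rank features per cluster by absolute centroid coordinate, largest first.

    Hypothetical helper for illustration only; it assumes features are centered
    and comparably scaled, so a large |coordinate| means a large movement.
    """
    abs_centroids = np.abs(np.asarray(centroids, dtype=float))
    ranked = {}
    for label, row in enumerate(abs_centroids):
        order = np.argsort(row)[::-1]  # feature indices, descending by |coordinate|
        ranked[label] = [(feature_names[i], float(row[i])) for i in order]
    return ranked


# Toy data: one blob displaced along feature "a", another along feature "b".
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[5.0, 0.0, 0.0], scale=0.1, size=(20, 3)),
    rng.normal(loc=[0.0, -5.0, 0.0], scale=0.1, size=(20, 3)),
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
importances = wcss_min_importances(km.cluster_centers_, ["a", "b", "c"])
```

On this toy data, each cluster's top-ranked feature is the one its blob was displaced along; which cluster gets which label is arbitrary.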

Unsupervised 2 Supervised

Another interpretation approach is to convert the unsupervised classification problem into a supervised classification setting using an easily interpretable model such as a tree-based model (we will be using a Random Forest classifier). The steps are as follows:

Change the cluster labels into One-vs-All for each label
Train a classifier to discriminate between each cluster and all other clusters
Extract the feature importances from the model (We will be using sklearn.ensemble.RandomForestClassifier)
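The three steps above can be sketched as follows. This is a minimal illustration rather than the package's implementation: the helper name unsup2sup_importances and its parameters are hypothetical, and only standard scikit-learn calls are used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier


def unsup2sup_importances(X, labels, feature_names, random_state=0):
    """One-vs-all random forests per cluster (hypothetical helper for illustration)."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    ranked = {}
    for label in np.unique(labels):
        # Step 1: one-vs-all target, 1 for this cluster and 0 for all others.
        binary_target = (labels == label).astype(int)
        # Step 2: train a classifier to separate this cluster from the rest.
        clf = RandomForestClassifier(n_estimators=50, random_state=random_state)
        clf.fit(X, binary_target)
        # Step 3: read the impurity-based importances, sorted in descending order.
        order = np.argsort(clf.feature_importances_)[::-1]
        ranked[int(label)] = [
            (feature_names[i], float(clf.feature_importances_[i])) for i in order
        ]
    return ranked


# Toy data: the clusters differ only along feature "a"; "noise" carries no signal.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[10.0, 0.0], scale=1.0, size=(100, 2)),
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
ranked = unsup2sup_importances(X, km.labels_, ["a", "noise"])
```

Unlike the centroid-based ranking, these importances come from a fitted model, so they depend on the classifier's hyperparameters and random seed.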

Usage

Clone the repository: git clone https://github.com/YousefGh/kmeans-feature-importance.git
Move kmeans_interp folder to your project directory
Follow the instructions below

You can instantiate KMeansInterp in the same way sklearn.cluster.KMeans is instantiated, but you also need to pass the feature names to the ordered_feature_names parameter, in the same order as the features (columns) of X.

import pandas as pd
from kmeans_interp.kmeans_feature_imp import KMeansInterp

X = pd.DataFrame(...)  # placeholder: a DataFrame makes the feature (column) order explicit

kms = KMeansInterp(
    n_clusters=5,
    ordered_feature_names=X.columns.tolist(),
    feature_importance_method='wcss_min',  # or 'unsup2sup'
).fit(X.values)

# feature_importances_ is a dictionary keyed by cluster label: [0] selects cluster 0,
# and [:10] takes its 10 most important features
kms.feature_importances_[0][:10]  # Features here are words
# [('film', 0.39589216529770005),
#  ('award', 0.1605575985825074),
#  ('actor', 0.12619074083837967),
#  ('oscar', 0.1178746877093894),
#  ('star', 0.1048044246433086),
#  ('actress', 0.0805780582173732),
#  ('movie', 0.07849181814402928),
#  ('director', 0.07750076034520005),
#  ('year', 0.05714139742209183),
#  ('won', 0.05598607819724065)]

The method was applied to a natural language processing (NLP) example, where it can be viewed as an unsupervised, cluster-based keyword extraction technique.

Example Output

I chose to apply the interpretation technique to an NLP problem because the feature importances (words) are easy to relate to. It can be viewed as a corpus-based keyword extraction technique: similar documents are clustered together using K-Means, and then the methods above are applied. The dataset I used is the Kaggle BBC-News dataset. It presents a classification problem, but the categories are used here only as a final comparison.
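To make the setup concrete, here is a minimal sketch of how documents could be turned into a feature matrix with an ordered list of word names before clustering. The four-document corpus is invented for illustration and is not the BBC-News data. Plain sklearn.cluster.KMeans stands in for KMeansInterp, and the top words are read off the absolute centroid values as an approximation of the wcss_min idea.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus (an assumption for illustration, not the BBC-News dataset).
docs = [
    "film actor oscar award",
    "actor film director oscar",
    "match goal team league",
    "team goal player match",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()

# Word names in column order; this is the kind of list that
# ordered_feature_names expects. vocabulary_ maps each word to its column index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Top 3 words per cluster by absolute centroid value.
top_words = {
    label: [feature_names[i] for i in np.argsort(np.abs(centroid))[::-1][:3]]
    for label, centroid in enumerate(km.cluster_centers_)
}
```

Here the words shared by both documents of a topic rank above the words that appear in only one of them, which is the behavior one would hope for from a keyword extractor.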

Category Cluster Feature Comparison `wcss_min`

Category Cluster Feature Comparison `unsup2sup`

Requirements

scikit-learn~=0.24.2
numpy

References

Y. Liu, Z. Li, H. Xiong, X. Gao and J. Wu, "Understanding of Internal Clustering Validation Measures," 2010 IEEE International Conference on Data Mining, 2010, pp. 911-916, doi: 10.1109/ICDM.2010.35.
Kriegel, HP., Schubert, E. & Zimek, A. The (black) art of runtime evaluation: Are we comparing algorithms or implementations?. Knowl Inf Syst 52, 341–378 (2017). https://doi.org/10.1007/s10115-016-1004-2
Ng, A., & Piech, C. (2021). CS221. Retrieved 18 July 2021, from https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
Ismaili, Oumaima & Lemaire, Vincent & Cornuéjols, Antoine. (2014). A Supervised Methodology to Measure the Variables Contribution to a Clustering. 159-166. 10.1007/978-3-319-12637-1_20.
