
Short_Customer_Segmentation

This project presents a short segmentation of a small mall's customers using clustering and exploratory analysis.

See the file Mall_Customer_Segmentation.pdf for a presentation of this analysis, or read on below for slices of that presentation together with the source code and plots.

Table of contents:

  • Introduction
  • Descriptive and Exploratory Analysis
  • K-Means Clustering
  • Conclusions

Introduction:

The data set used in this project is in the file 'Mall_Customers.csv'.

The source code of this data analysis is in:

  • Jupyter notebook: 'mall_customer_segmentation_clustering.ipynb'
  • Python script: 'mall_customer_segmentation_clustering.py'


Descriptive and Exploratory Analysis:

First, a descriptive analysis to get to know the data: distributions, counts, summary statistics, and visualisations.

# plots to visualize the data :
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# load the data set :
data = pd.read_csv('Mall_Customers.csv')

# count of customers by gender, with the counts labelled on the bars :
ax = sns.countplot(x='Gender', data=data, palette='pastel')
for container in ax.containers:
    ax.bar_label(container)
plt.show()

# distributions of the three numeric features :
fig, axs = plt.subplots(ncols=3, figsize=(15, 4))
sns.histplot(x='Age', data=data, color='purple', ax=axs[0])
sns.histplot(x='Annual Income (k$)', data=data, color='purple', ax=axs[1])
sns.histplot(x='Spending Score (1-100)', data=data, color='purple', ax=axs[2])
plt.show()

[Plots: gender count bar chart; histograms of Age, Annual Income (k$), and Spending Score (1-100)]

The data set has no missing values and no significant outliers.
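These checks can be done quickly with pandas; the exact checks used in the notebook are not shown in this README, so below is only a minimal sketch of my own, using a 1.5 * IQR rule for outliers:

# hedged sketch (not from the original notebook) : checking missing values and outliers
# missing values per column :
print(data.isnull().sum())

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] :
numeric = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outliers.sum())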

I used one-hot encoding for the 'Gender' values.

I used a StandardScaler and then plotted Pearson and Spearman correlation heatmaps.

# preparing the data for analysis :
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

df = data.drop(columns='CustomerID')
one_hot_encoded_data = pd.get_dummies(df, columns=['Gender'], dtype=int)
to_scal_data = one_hot_encoded_data.drop(columns=['Gender_Female', 'Gender_Male'])

# standardization to the same scale :
scaler = StandardScaler()
scaled_data = scaler.fit_transform(to_scal_data)

# reattach the one-hot gender columns to the scaled features :
d_ready = np.append(scaled_data, one_hot_encoded_data['Gender_Female'].values.reshape(-1, 1), axis=1)
d_ready = np.append(d_ready, one_hot_encoded_data['Gender_Male'].values.reshape(-1, 1), axis=1)
dfull_scaled = pd.DataFrame(d_ready, columns=['Age', 'Annual Income (k$)', 'Spending Score (1-100)', 'Gender_Female', 'Gender_Male'])

# Pearson correlation heatmap :
corr_p = dfull_scaled.corr()
plt.subplots(figsize=(12, 9))
sns.heatmap(corr_p, cmap="crest", vmax=0.9, fmt='.1f', annot=True)
plt.show()

# Spearman correlation heatmap :
plt.subplots(figsize=(12, 9))
corr_s = dfull_scaled.corr('spearman', numeric_only=False)
sns.heatmap(corr_s, cmap="crest", vmax=0.9, fmt='.1f', annot=True)
plt.show()

# testing if the 'Age' and 'Spending Score' correlation is statistically significant :
corr_bi = stats.pointbiserialr(dfull_scaled['Age'], dfull_scaled['Spending Score (1-100)'])
print(corr_bi)
age_spend = stats.spearmanr(dfull_scaled['Age'], dfull_scaled['Spending Score (1-100)'])
print(age_spend.pvalue)
corr_bi2 = stats.spearmanr(dfull_scaled['Age'], dfull_scaled['Annual Income (k$)'])
print(corr_bi2.pvalue)
corr_bi3 = stats.spearmanr(dfull_scaled['Spending Score (1-100)'], dfull_scaled['Annual Income (k$)'])
print(corr_bi3.pvalue)

Spearman's Correlation:

[Plot: Spearman correlation heatmap]

The correlation between the 'Age' feature and the 'Spending Score (1-100)' feature is statistically significant, with a p-value far below 0.05: SignificanceResult(statistic=-0.3272268460390901, pvalue=2.2502957035652467e-06)

Here is a scatter plot showing this negative correlation:

[Plot: Age vs. Spending Score (1-100) scatter]
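The code for this figure is not included above; a minimal seaborn sketch that would reproduce this kind of plot could look like the following (the fitted trend line is my addition for emphasis, not necessarily part of the original figure):

# hedged sketch: scatter of Age vs Spending Score with a fitted trend line :
sns.regplot(x='Age', y='Spending Score (1-100)', data=data,
            scatter_kws={'color': 'purple'}, line_kws={'color': 'black'})
plt.show()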

K-Means Clustering:

The k-means clustering method was used to segment the customers. The right parameter 'k' for the clustering was chosen with the help of 'elbow' scores.

# K-Means Clustering :
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

# two features for the 2-D clustering (assumed: the scaled 'Age' and 'Spending Score (1-100)', matching the scatter plot below) :
data_2col = dfull_scaled[['Age', 'Spending Score (1-100)']]

# looking for the right parameter k with the elbow (distortion) score :
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1, 10)).fit(data_2col)
visualizer.show()

[Plot: distortion elbow score for k = 1..10]

# a second check with the Calinski-Harabasz score :
visualizer = KElbowVisualizer(model, k=(2, 10), metric='calinski_harabasz', timings=False)
visualizer.fit(data_2col)
visualizer.show()

[Plot: Calinski-Harabasz elbow score for k = 2..10]

# clustering with the chosen k and plotting the labels :
kmeans = KMeans(n_clusters=2, init='k-means++', random_state=0).fit(data_2col)

sns.scatterplot(data=data, x="Age", y="Spending Score (1-100)", hue=kmeans.labels_)
plt.show()

[Plot: 2-cluster k-means on Age vs. Spending Score (1-100)]

# 3-dimensional clustering on all three scaled features :
data3d = data.drop(columns=['CustomerID', 'Gender'])
kmeans3d = KMeans(n_clusters=3, init='k-means++', random_state=42)
y = kmeans3d.fit_predict(scaled_data)
data3d['cluster'] = y

color_list = ['deeppink', 'blue', 'red', 'orange', 'darkviolet', 'brown']
fig = plt.figure(figsize=(10, 7))
ax = plt.axes(projection="3d")
# data for 3-dimensional scattered points, one colour per cluster :
for i in range(data3d.cluster.nunique()):
    label = "cluster=" + str(i + 1)
    ax.scatter3D(data3d[data3d.cluster == i]['Spending Score (1-100)'], data3d[data3d.cluster == i]['Annual Income (k$)'], data3d[data3d.cluster == i]['Age'], c=color_list[i], label=label)

ax.set_xlabel('Spending Score (1-100)')
ax.set_ylabel('Annual Income (k$)')
ax.set_zlabel('Age')
plt.legend()
plt.title("Kmeans Clustering Of Mall's Customers")
plt.show()

I think that 3 subgroups of the data better show the main segmentation of the customers (a per-cluster summary sketch follows after the plot):

  • Red cluster: customers with a low spending score, varied age, and mostly high annual income
  • Pink cluster: customers with low or average annual income, age below 40, and an average spending score
  • Blue cluster: customers with a high spending score, age below 30, and high or average annual income

[Plot: 3-D k-means clusters over Spending Score, Annual Income, and Age]
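To back up this reading of the clusters numerically, one can profile them with a groupby; this is a minimal sketch I added here (not part of the original notebook), averaging the raw features per cluster:

# hedged sketch: per-cluster means of the raw features and cluster sizes :
print(data3d.groupby('cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean())
print(data3d['cluster'].value_counts())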

Conclusions:

[Slide: conclusions; see Mall_Customer_Segmentation.pdf for the full presentation]

Thank you for reading!
