Client clustering

Introduction

Hello everyone, this is a small client clustering project that uses KMeans as the clustering algorithm. The clients were divided into 4 main groups, each with its own preferences for the company's products and places of purchase. More details below.

The project consists of three files:

marketing_campaign.csv - Dataset;
Marketing campaign.ipynb - The main notebook that contains all the clustering steps;
sup_defs.py - A file with auxiliary plotting functions (I think it is better to keep a large amount of repetitive code in a separate file).

What do we have?

Dataset from Kaggle: Customer Personality Analysis (https://www.kaggle.com/imakash3011/customer-personality-analysis)

Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.

Attributes

People

ID: Customer's unique identifier
Year_Birth: Customer's birth year
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Customer's yearly household income
Kidhome: Number of children in customer's household
Teenhome: Number of teenagers in customer's household
Dt_Customer: Date of customer's enrollment with the company
Recency: Number of days since customer's last purchase
Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years

Promotion

NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

NumWebPurchases: Number of purchases made through the company’s website
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s website in the last month

Clustering algorithm

First, I checked the data for missing values. The Income column had 24 of them. It was easiest to remove those rows, since dropping so few has little effect on the result.
From the birth year, you can calculate the client's age (this column is only approximately useful, since we do not know when the dataset was created).
The number of days each client has been using the company's services is also calculated.
The number of goods purchased and the amount spent on them are also good indicators. Together they give the average price per item and reduce the number of columns.
StandardScaler needs numeric input, so the 'Marital_Status' and 'Education' columns were simplified and converted to numeric values.
Abnormal values (outliers) were found with a plot matrix and removed.
The correlation matrix shows which indicators affect each other. There are a lot of them, so a dimensionality reduction algorithm (PCA) is needed.
The data is transformed with StandardScaler().
The data is then transformed with PCA. Keeping 80% of the dataset's information (explained variance) will be enough.
Using the Elbow Method, we determine the recommended number of clusters. It turns out to be 4 groups of clients.
Using KMeans(), we divide the clients into 4 groups.
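
The scaling, PCA, Elbow Method, and KMeans steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function names `cluster_clients` and `elbow_inertias`, the random seed, `n_init=10`, and the synthetic data standing in for the prepared client features are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_clients(features, n_clusters=4, variance=0.80, seed=0):
    """Scale the features, reduce them with PCA, then cluster with KMeans."""
    scaled = StandardScaler().fit_transform(features)
    # A float n_components keeps the fewest principal components whose
    # cumulative explained variance reaches that fraction.
    pca = PCA(n_components=variance, svd_solver="full")
    reduced = pca.fit_transform(scaled)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(reduced)
    return reduced, labels


def elbow_inertias(reduced, k_values, seed=0):
    """Return the KMeans inertia for each candidate k (input for an elbow plot)."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(reduced).inertia_
        for k in k_values
    ]


# Synthetic stand-in data (assumption): 4 blobs in 6 dimensions,
# NOT the Kaggle dataset.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(4, 6))
features = np.vstack([c + rng.normal(size=(50, 6)) for c in centers])

reduced, labels = cluster_clients(features)
inertias = elbow_inertias(reduced, range(1, 8))
```

Plotting `inertias` against the candidate values of k and looking for the point where the curve flattens is the usual way to read the elbow.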

Client groups

Group 1

Average income: 42527.09;
Family size: 3-4 persons;
Preferred shopping place: Website and store;
Average spent: 127.89;
Average number of items purchased: 6.8;
Participation in campaigns: Low;
Note: The company does not offer the goods this group needs: it consists of families with children, while the company's main product is wine. This group is not the target audience and can only be attracted by expanding the range of products, which can only be considered in the very long term.

Group 2

Average income: 76820.05;
Family size: 1-2 persons;
Preferred shopping place: Website, catalog and store;
Average spent: 1407.60;
Average number of items purchased: 19.37;
Participation in campaigns: High; use of discounts is very low;
Note: This group makes up the company's main clients and is the main target of promotional campaigns. It rarely visits the website, so it may be worth considering new approaches to keeping the audience on the site (simplified functionality, updated recommender systems, etc.).

Group 3

Average income: 30069.88;
Family size: 2-3 persons;
Preferred shopping place: Website and store;
Average spent: 101.85;
Average number of items purchased: 5.9;
Participation in campaigns: Low;
Note: This is the youngest group. Over time, clients from this group will move into groups 1 and 4.

Group 4

Average income: 59994.39;
Family size: 2-3 persons;
Preferred shopping place: Website and store;
Average spent: 831.56;
Average number of items purchased: 18.35;
Participation in campaigns: Moderate, high interest in discounts;
Note: This audience is very similar to group 2, but buys cheaper goods and has one child in the family. It often participates in promotional campaigns and is interested in discounted products.
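
Per-group averages like the ones listed for the four groups can be produced with a pandas group-by once each client has a cluster label. The sketch below is only an illustration: the column names `Cluster`, `Spent`, and `Items`, the helper `profile_clusters`, and the toy numbers are assumptions, not the notebook's actual names or results.

```python
import pandas as pd


def profile_clusters(df, cluster_col="Cluster"):
    """Summarise each cluster by size and by average income, spend, and items."""
    return (
        df.groupby(cluster_col)
        .agg(
            clients=("Income", "size"),   # number of rows in the cluster
            avg_income=("Income", "mean"),
            avg_spent=("Spent", "mean"),
            avg_items=("Items", "mean"),
        )
        .round(2)
    )


# Toy data with hypothetical values, only to show the shape of the output.
toy = pd.DataFrame(
    {
        "Cluster": [0, 0, 1],
        "Income": [10.0, 20.0, 30.0],
        "Spent": [1.0, 3.0, 5.0],
        "Items": [2, 4, 6],
    }
)
profile = profile_clusters(toy)
```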

Visualizations

The algorithm split the data into 4 almost identical clusters. Now we need to evaluate our groups.

Plotting the money spent in the store against income clearly shows the division of customers into clusters. The graph shows the usual relationship between these indicators, but group 1 is of particular interest. The income of this group is higher than that of group 3, yet the average amount of purchased goods is approximately equal. This group deserves more attention: most likely the company does not offer the goods it needs, or the promotional campaigns do not take it into account.

Plotting the number of purchases against their price shows the difference between groups 2 and 4: group 2 buys more expensive goods.
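
The price compared here is an average price per item, which can be derived from the total amount spent and the number of goods purchased. A minimal sketch under stated assumptions: which dataset columns are summed, the new column names `Spent`, `Items`, and `AvgPrice`, and the helper `add_purchase_features` are guesses for illustration and may differ from the notebook.

```python
import pandas as pd

# Assumption: total spend is the sum of all Mnt* columns, and the number of
# items is the sum of web, catalog, and store purchases.
SPEND_COLS = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]
PURCHASE_COLS = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]


def add_purchase_features(df):
    """Return a copy of df with total spend, item count, and average price."""
    out = df.copy()
    out["Spent"] = out[SPEND_COLS].sum(axis=1)
    out["Items"] = out[PURCHASE_COLS].sum(axis=1)
    # Leave the average price undefined (NaN) for clients with no purchases.
    out["AvgPrice"] = (out["Spent"] / out["Items"]).where(out["Items"] > 0)
    return out


# Two hypothetical clients, only to exercise the function.
toy = pd.DataFrame(
    {
        "MntWines": [10, 5], "MntFruits": [0, 0], "MntMeatProducts": [20, 0],
        "MntFishProducts": [0, 0], "MntSweetProducts": [0, 0],
        "MntGoldProds": [10, 0],
        "NumWebPurchases": [2, 0], "NumCatalogPurchases": [1, 0],
        "NumStorePurchases": [1, 0],
    }
)
result = add_purchase_features(toy)
```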

Returning to group 1, we can assume that the company offers few products for families with children, so expanding the range is one option for attracting this group.

The best-selling product is wine. The distribution of clusters is approximately the same for all products, but meat products stand out: their main consumer is the group with the highest income. Most likely, the price of meat products is higher than in other stores.

The group with the highest income (Group 2) shows the greatest interest in campaigns. Most likely, these campaigns are designed with these customers in mind.

I think we can conclude that the last campaign was the most successful and attracted a record number of customers. It seems to me that campaigns should focus not only on group 2, but also on group 4. This group has a lower income, one child in the family, and an interest in discounts, but it is just as loyal as group 2.

Conclusions

The KMeans algorithm did a good job and divided the clients into groups. Each group has its own characteristics (driven largely by its income) and its own preferences when buying the company's products. These client groups can be further divided into subgroups by age, number of children, and so on for a more detailed study of clients.
