
Customer Segmentation Based on Purchasing Behavior

Table of Contents

Project Overview

Data Sources

Data Description

Tools

EDA Steps

Data Preprocessing Steps and Inspiration

Graphs/Visualizations

Reasons for Choosing the Algorithm for the Project

Assumptions

Model Evaluation Metrics

Results

Recommendations

Limitations

Future Possibilities of the Project

References

Project Overview

The objective of this project is to analyze customer purchasing behavior to enhance strategic decision-making and operational efficiency for an online retail store. By segmenting customers based on their purchasing patterns, the project aims to provide insights for targeted marketing, personalized customer interactions, and optimized business strategies.

Data Sources

The primary dataset used for this analysis is the OnlineRetail.csv file, containing transactional data from an online retail store.

OnlineRetail.csv Dataset

Data Description

The dataset OnlineRetail.csv consists of 541,909 entries and includes the following columns:

  1. InvoiceNo: Unique identifier for each transaction.
  2. StockCode: Product identifier.
  3. Description: Name or description of the product.
  4. Quantity: Number of products purchased per transaction.
  5. InvoiceDate: Date and time of the transaction.
  6. UnitPrice: Price per unit of the product.
  7. CustomerID: Unique identifier for each customer.
  8. Country: Country or region where the customer resides.
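
For orientation, a minimal load-and-inspect step is sketched below; the Latin-1 encoding is an assumption (this public file is often not plain UTF-8):

```python
import pandas as pd

# Load the transactions; the encoding is an assumption, and InvoiceDate
# is parsed into a datetime column up front.
df = pd.read_csv("OnlineRetail.csv", encoding="ISO-8859-1",
                 parse_dates=["InvoiceDate"])
print(df.shape)   # expected: (541909, 8)
df.info()
```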


Tools

Python: data cleaning and analysis (Download Python)

Jupyter Notebook: interactive data analysis and visualization (Install Jupyter)

Libraries

The analysis relies on standard Python data-science packages such as pandas, NumPy, Matplotlib, Seaborn, and scikit-learn; each can be installed with pip if it is not already available.

EDA Steps

Exploratory Data Analysis (EDA) involved exploring the transactional data to answer key questions, such as:

  1. What are the overall sales trends?
  2. How do sales vary by country and product?
  3. What are the peak sales periods?
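
A pandas sketch of how these questions can be answered, assuming the df loaded above; the TotalPrice column is created formally during preprocessing below:

```python
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# 1. Overall sales trend: total revenue per month.
monthly_sales = df.set_index("InvoiceDate")["TotalPrice"].resample("M").sum()

# 2. Sales by country and by product.
sales_by_country = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False)
top_products = df.groupby("Description")["TotalPrice"].sum().nlargest(10)

# 3. Peak sales periods: distinct invoices per hour of the day.
invoices_per_hour = df.groupby(df["InvoiceDate"].dt.hour)["InvoiceNo"].nunique()
```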

Data Preprocessing Steps and Inspiration

  1. Data Cleaning

    a. Handling Missing Values: Removed records with missing CustomerID or Description.
    b. Removing Duplicates: Eliminated duplicate entries to ensure unique transactions.
    c. Standardizing Text Data: Converted product descriptions to lowercase and removed surrounding whitespace.
    d. Removing Outliers: Used the Interquartile Range (IQR) method to identify and remove outliers in fields like Quantity, UnitPrice, and TotalPrice (a consolidated code sketch of all three preprocessing steps follows step 3 below).

  2. Data Transformation

    a. Standardization of Product Descriptions and Stock Codes: Mapped unique stock codes and descriptions to ensure consistency.
    b. Feature Engineering: Created a TotalPrice feature by multiplying Quantity by UnitPrice.

  3. Date Handling

    a. Invoice Date Conversion: Converted InvoiceDate to datetime format.
    b. Filtering Data by Date: Excluded transactions from incomplete periods for accurate analysis.
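
A hedged sketch of this preprocessing pipeline, continuing from the df loaded earlier. The IQR multiplier of 1.5, the mode-based description mapping, and dropping the final partial month are conventional assumptions, not confirmed details of the project:

```python
# --- 1. Data cleaning ---
df = df.dropna(subset=["CustomerID", "Description"])   # drop unidentifiable rows
df = df.drop_duplicates()                              # drop repeated records
df["Description"] = df["Description"].str.lower().str.strip()

# --- 2. Data transformation ---
# Map each StockCode to its most frequent Description so one product code
# carries one canonical name (assumed interpretation of "standardization").
canonical = df.groupby("StockCode")["Description"].agg(lambda s: s.mode().iloc[0])
df["Description"] = df["StockCode"].map(canonical)
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]    # engineered feature

def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows within [Q1 - k*IQR, Q3 + k*IQR] for one column."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]

for col in ["Quantity", "UnitPrice", "TotalPrice"]:
    df = remove_iqr_outliers(df, col)

# --- 3. Date handling ---
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
# Assumption: the "incomplete period" is the final partial month of data.
df = df[df["InvoiceDate"].dt.to_period("M") < df["InvoiceDate"].max().to_period("M")]
```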

Graphs/Visualizations

Total Monthly Sales Trends

Countrywise Average Values

Average Cart Value by Country

Top 5 Countries by Sales

Monthly Sales Trends - UK

Monthly Sales Trends for Countries other than UK

Total Sales For Each Country Except UK

Customers per Month

Number of Customers by Country

Number of Transactions per Hour

Number of Transactions by Time of Day

Reasons for Choosing the Algorithm for the Project

K-Means Clustering

  1. Suitability for Customer Segmentation:
    a. Simplicity and Efficiency: Effective for large datasets with numerical attributes.
    b. Scalability: Handles extensive transactional data efficiently.

  2. Data Characteristics:
    a. Numerical Data Handling: Ideal for metrics like TotalPrice, Frequency, and Recency.
    b. Standardization Ready: Data were standardized for optimal performance.

  3. Analytical Goals:
    a. Customer Insights: Identifies actionable customer segments.
    b. RFM Analysis: Utilizes Recency, Frequency, and Monetary metrics for segmentation (see the sketch below).
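
A hedged sketch of building the RFM table from the cleaned transactions; the snapshot date of one day after the last invoice is a common convention, assumed here:

```python
import pandas as pd

snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, number of distinct
# invoices, and total spend.
rfm = df.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)
```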

Assumptions

  1. Data Distribution and Scale: Assumes normalized numerical data with equal variance across features.
  2. Cluster Assumptions: Assumes spherical clusters with similar density.
  3. Independence of Observations: Treats each transaction or customer record independently.
  4. Algorithm-Specific Assumptions: Relies on multiple initializations for robust clustering.
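
In practice these assumptions were addressed by standardizing the features and restarting K-Means from several initial centroid placements. A minimal sketch, in which the cluster count of 4 is an assumption for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Equal-variance assumption: put the RFM features on one scale.
X = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

# Multiple n_init restarts make the result robust to initial centroid placement.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["Cluster"] = kmeans.fit_predict(X)
```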

Model Evaluation Metrics

  1. Silhouette Score: Assesses cluster cohesion and separation (higher is better).
  2. Davies-Bouldin Index: Measures the average similarity of each cluster to its most similar cluster (lower is better).
  3. Calinski-Harabasz Index: Evaluates the ratio of between-cluster to within-cluster variance (higher is better).
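
All three metrics are available in scikit-learn; a sketch, reusing X and kmeans from the clustering sketch above:

```python
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

labels = kmeans.labels_
print("Silhouette score:       ", silhouette_score(X, labels))
print("Davies-Bouldin index:   ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz index:", calinski_harabasz_score(X, labels))
```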

Results

Customer Segment Distribution Analysis based on RFM:

RFM Customer Segment Distribution

Notable Segments:

  1. 24% in the Dormant segment.
  2. 14.90% in the Top Customers segment.
  3. 18.20% in the Faithful Customers segment.

K-means Clustering Results:

  1. Silhouette Score: 0.565 (indicating moderate cluster separation).
  2. Davies-Bouldin Index: 0.639 (indicating good cluster distinction).
  3. Calinski-Harabasz Index: 3333.416 (indicating well-defined clusters).

Distortion Score Elbow Method for KMeans Clustering

Silhouette Score for Different Numbers of Clusters

3D View of Customer Clusters

KMeans Customer Clusters Distribution
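
The elbow and silhouette plots above come from sweeping the candidate cluster count; a sketch of that sweep, where the range 2 to 10 is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                       # distortion for the elbow plot
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)      # highest-silhouette candidate
```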

Recommendations

  1. Targeted Marketing: Use segmentation insights to tailor marketing campaigns.
  2. Inventory Management: Optimize inventory based on purchasing trends.
  3. Customer Engagement: Enhance engagement strategies for different segments.

Limitations

  1. Data Quality: Potential inaccuracies due to missing or incorrect data.
  2. Cluster Assumptions: Real-world data may not adhere to spherical clusters.
  3. Model Sensitivity: Initial centroid placement can affect clustering results.

Future Possibilities of the Project

  1. Integration with Predictive Analytics: Forecast future purchasing behaviors.
  2. Dynamic Clustering: Implement real-time segmentation.
  3. Enhanced Personalization: Develop personalized engagement strategies.

References

  1. James, Gareth, et al. An Introduction to Statistical Learning. Springer Texts in Statistics, 2013.
  2. Scikit-learn documentation: KMeans, Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index.
  3. VanderPlas, Jake. Python Data Science Handbook. O'Reilly Media, 2016.