
Customer Segmentation Based on Purchasing Behavior

Table of Contents

Project Overview

Data Sources

Data Description

Tools

EDA Steps

Data Preprocessing Steps and Inspiration

Graphs/Visualizations

Reasons for Choosing the Algorithm for the Project

Assumptions

Model Evaluation Metrics

Results

Recommendations

Limitations

Future Possibilities of the Project

References

Project Overview

The objective of this project is to analyze customer purchasing behavior to enhance strategic decision-making and operational efficiency for an online retail store. By segmenting customers based on their purchasing patterns, the project aims to provide insights for targeted marketing, personalized customer interactions, and optimized business strategies.

Data Sources

The primary dataset used for this analysis is the OnlineRetail.csv file, containing transactional data from an online retail store.

OnlineRetail.csv Dataset

Data Description

The dataset OnlineRetail.csv consists of 541,909 entries and includes the following columns:

  1. InvoiceNo: Unique identifier for each transaction.
  2. StockCode: Product identifier.
  3. Description: Name or description of the product.
  4. Quantity: Number of products purchased per transaction.
  5. InvoiceDate: Date and time of the transaction.
  6. UnitPrice: Price per unit of the product.
  7. CustomerID: Unique identifier for each customer.
  8. Country: Country or region where the customer resides.
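
For orientation, a minimal load-and-inspect step is sketched below; the Latin-1 encoding is an assumption (this public file is often not plain UTF-8):

```python
import pandas as pd

# Load the transactions; the encoding is an assumption, and InvoiceDate
# is parsed into a datetime column up front.
df = pd.read_csv("OnlineRetail.csv", encoding="ISO-8859-1",
                 parse_dates=["InvoiceDate"])
print(df.shape)   # expected: (541909, 8)
df.info()
```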


Tools

Python: data cleaning and analysis (Download Python)

Jupyter Notebook: interactive data analysis and visualization (Install Jupyter)

Libraries

The analysis relies on standard Python data-science packages such as pandas, NumPy, Matplotlib, Seaborn, and scikit-learn; each can be installed with pip if it is not already available.

EDA Steps

Exploratory Data Analysis (EDA) involved exploring the transactional data to answer key questions, such as:

  1. What are the overall sales trends?
  2. How do sales vary by country and product?
  3. What are the peak sales periods?
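
A pandas sketch of how these questions can be answered, assuming the df loaded above; the TotalPrice column is created formally during preprocessing below:

```python
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# 1. Overall sales trend: total revenue per month.
monthly_sales = df.set_index("InvoiceDate")["TotalPrice"].resample("M").sum()

# 2. Sales by country and by product.
sales_by_country = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False)
top_products = df.groupby("Description")["TotalPrice"].sum().nlargest(10)

# 3. Peak sales periods: distinct invoices per hour of the day.
invoices_per_hour = df.groupby(df["InvoiceDate"].dt.hour)["InvoiceNo"].nunique()
```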

Data Preprocessing Steps and Inspiration

  1. Data Cleaning

    a. Handling Missing Values: Removed records with missing CustomerID or Description.
    b. Removing Duplicates: Eliminated duplicate entries to ensure unique transactions.
    c. Standardizing Text Data: Converted product descriptions to lowercase and removed surrounding whitespace.
    d. Removing Outliers: Used the Interquartile Range (IQR) method to identify and remove outliers in fields like Quantity, UnitPrice, and TotalPrice (a consolidated code sketch of all three preprocessing steps follows step 3 below).

  2. Data Transformation

    a. Standardization of Product Descriptions and Stock Codes: Mapped unique stock codes and descriptions to ensure consistency.
    b. Feature Engineering: Created a TotalPrice feature by multiplying Quantity by UnitPrice.

  3. Date Handling

    a. Invoice Date Conversion: Converted InvoiceDate to datetime format.
    b. Filtering Data by Date: Excluded transactions from incomplete periods for accurate analysis.
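
A hedged sketch of this preprocessing pipeline, continuing from the df loaded earlier. The IQR multiplier of 1.5, the mode-based description mapping, and dropping the final partial month are conventional assumptions, not confirmed details of the project:

```python
# --- 1. Data cleaning ---
df = df.dropna(subset=["CustomerID", "Description"])   # drop unidentifiable rows
df = df.drop_duplicates()                              # drop repeated records
df["Description"] = df["Description"].str.lower().str.strip()

# --- 2. Data transformation ---
# Map each StockCode to its most frequent Description so one product code
# carries one canonical name (assumed interpretation of "standardization").
canonical = df.groupby("StockCode")["Description"].agg(lambda s: s.mode().iloc[0])
df["Description"] = df["StockCode"].map(canonical)
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]    # engineered feature

def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows within [Q1 - k*IQR, Q3 + k*IQR] for one column."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]

for col in ["Quantity", "UnitPrice", "TotalPrice"]:
    df = remove_iqr_outliers(df, col)

# --- 3. Date handling ---
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
# Assumption: the "incomplete period" is the final partial month of data.
df = df[df["InvoiceDate"].dt.to_period("M") < df["InvoiceDate"].max().to_period("M")]
```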

Graphs/Visualizations

Total Monthly Sales Trends

Countrywise Average Values

Average Cart Value by Country

Top 5 Countries by Sales

Monthly Sales Trends - UK

Monthly Sales Trends for Countries other than UK

Total Sales For Each Country Except UK

Customers per Month

Number of Customers by Country

Number of Transactions per Hour

Number of Transactions by Time of Day

Reasons for Choosing the Algorithm for the Project

K-Means Clustering

  1. Suitability for Customer Segmentation:
    a. Simplicity and Efficiency: Effective for large datasets with numerical attributes.
    b. Scalability: Handles extensive transactional data efficiently.

  2. Data Characteristics:
    a. Numerical Data Handling: Ideal for metrics like TotalPrice, Frequency, and Recency.
    b. Standardization Ready: Data were standardized for optimal performance.

  3. Analytical Goals:
    a. Customer Insights: Identifies actionable customer segments.
    b. RFM Analysis: Utilizes Recency, Frequency, and Monetary metrics for segmentation (see the sketch below).
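
A hedged sketch of building the RFM table from the cleaned transactions; the snapshot date of one day after the last invoice is a common convention, assumed here:

```python
import pandas as pd

snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, number of distinct
# invoices, and total spend.
rfm = df.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)
```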

Assumptions

  1. Data Distribution and Scale: Assumes normalized numerical data with equal variance across features.
  2. Cluster Assumptions: Assumes spherical clusters with similar density.
  3. Independence of Observations: Treats each transaction or customer record independently.
  4. Algorithm-Specific Assumptions: Relies on multiple initializations for robust clustering.
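
In practice these assumptions were addressed by standardizing the features and restarting K-Means from several initial centroid placements. A minimal sketch, in which the cluster count of 4 is an assumption for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Equal-variance assumption: put the RFM features on one scale.
X = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

# Multiple n_init restarts make the result robust to initial centroid placement.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["Cluster"] = kmeans.fit_predict(X)
```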

Model Evaluation Metrics

  1. Silhouette Score: Assesses cluster cohesion and separation (higher is better).
  2. Davies-Bouldin Index: Measures the average similarity of each cluster to its most similar cluster (lower is better).
  3. Calinski-Harabasz Index: Evaluates the ratio of between-cluster to within-cluster variance (higher is better).
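
All three metrics are available in scikit-learn; a sketch, reusing X and kmeans from the clustering sketch above:

```python
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

labels = kmeans.labels_
print("Silhouette score:       ", silhouette_score(X, labels))
print("Davies-Bouldin index:   ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz index:", calinski_harabasz_score(X, labels))
```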

Results

Customer Segment Distribution Analysis based on RFM:

RFM Customer Segment Distribution

Notable Segments:

  1. 24% in the Dormant segment.
  2. 14.90% in the Top Customers segment.
  3. 18.20% in the Faithful Customers segment.

K-means Clustering Results:

  1. Silhouette Score: 0.565 (indicating moderate cluster separation).
  2. Davies-Bouldin Index: 0.639 (indicating good cluster distinction).
  3. Calinski-Harabasz Index: 3333.416 (indicating well-defined clusters).

Distortion Score Elbow Method for KMeans Clustering

Silhouette Score for Different Numbers of Clusters

3D View of Customer Clusters

KMeans Customer Clusters Distribution
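
The elbow and silhouette plots above come from sweeping the candidate cluster count; a sketch of that sweep, where the range 2 to 10 is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                       # distortion for the elbow plot
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)      # highest-silhouette candidate
```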

Recommendations

  1. Targeted Marketing: Use segmentation insights to tailor marketing campaigns.
  2. Inventory Management: Optimize inventory based on purchasing trends.
  3. Customer Engagement: Enhance engagement strategies for different segments.

Limitations

  1. Data Quality: Potential inaccuracies due to missing or incorrect data.
  2. Cluster Assumptions: Real-world data may not adhere to spherical clusters.
  3. Model Sensitivity: Initial centroid placement can affect clustering results.

Future Possibilities of the Project

  1. Integration with Predictive Analytics: Forecast future purchasing behaviors.
  2. Dynamic Clustering: Implement real-time segmentation.
  3. Enhanced Personalization: Develop personalized engagement strategies.

References

  1. James, Gareth, et al. An Introduction to Statistical Learning. Springer Texts in Statistics, 2013.
  2. Scikit-learn documentation: KMeans, Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index.
  3. VanderPlas, Jake. Python Data Science Handbook. O'Reilly Media, 2016.