Machine Learning Project (Regression & Clustering) on Kalbe Nutritionals

Project Description

Data scientist in Kalbe Nutritionals got a new project from :

Inventory Team to predict sum of quantity from all products, so they could create sufficient daily inventory.
Marketing Team to create cluster or segment of customer to get personalized promotion and sales treatment

The challenges in this project include :

Perform Data Ingestion into dbeaver and PostgreSQL.
Perform Exploratory Data Analysis (EDA) to know the average age of customer based on marital status and gender, also the best-selling store and product name.
Visualizing Data Dashboard using Tableau
Make Machine Learning Regression Model (Time Series) with ARIMA
Make Machine Learning Clustering Model with KMeans Algorithm

Project Goal

Estimate the quantity of products sold for daily stock inventory
Create a segment of customers for marketing promotion

Tools & Library Used

Dataset Used

Project Result

Data Ingestion and Exploratory Data Analysis using PosgreSQL & DBeaver

Click here to get on SQL data

On DBeaver we import dataset in Postgres public schema. There are 4 csv file that is used :

The challenge results are :
Query 1 : Average customer age based on their marital status

Query 2 : Average customer age based on their gender

Query 3 : Store name with the highest total quantity

Query 4 : The best-selling product with the highest total amount

Dashboard Visualization Using Tableau

Click here to get on Dashboard Visualization

The dashboard visualization on entire worksheet :

Worksheet 1 : Total quantity from month to month

Insight :

Sales trends fluctuate slightly, and show a gradual decline starting from June.
The highest total quantity sold is in March 2022 with 1,753 items
The lowest total quantity sold is in December 2022 with 1,409 items

Worksheet 2 : Total amount from day to day

Insight :

Daily revenue trends fluctuate heavily
The highest total amount is in March 2022 with Rp 976,500
The lowest total amount is in August 2022 with Rp 123,600

Worksheet 3 : Total quantity by product

Insight :

The highest selling product in 2022 is Thai Tea with 2,853 items sold.
The lowest selling product in 2022 is Cashew with 627 items sold.

Worksheet 4 : Total sales amount by store name

Insight :

The best-selling store in 2022 is Lingga with sales revenue reached Rp 25,294,100.
The lowest-selling store in 2022 is Buana Indah with sales revenue reached Rp 10,629,900.

Daily Product Quantity Prediction Using Time Series Arima

Click here to get on ARIMA and KMeans Clustering

Data Cleaning

Fix data type error
Fix data type error for 'Latitude', 'Longitude', and 'Income' from coma (,) to point (.)

Before cleaning :

After Cleaning :

Drop Missing Values :
Drop missing values on 'Marital Status' for 3 row

Data Merging

Merging all dataset into one data :

Create New Dataset

Create new dataset by grouping date and total quantity :

Here is the plot of new dataset :

Seasonal Decomposition

According to trend, seasonality, and residual, the sales trend are quite fluctuative and indicate downtrend.

Augmented Dicky-Fuller (ADF) Test

Based on ADF statistic, the p-value is 0.0 < 0.05, then reject H0 and accept H1. Therefore data is stationary.

Data Training & Testing

Splitting the data with 80% training and 20% testing. Blue line is data training, and green line is data testing.

Find p,d,q for ARIMA Model

Model 1 - Auto-fit ARIMA
Get result with p,d,q = 1,0,1. ARIMA (1,0,1) means there is no Differencing (0) because it is stationary, with Autoregression for 1 lag and 1 order Moving Average.

Model 2 - ACF & PACF Plot
Get result with p,d,q = 28,0,28 from ACF and PACF plot.

Model 3 - Autocorrelation Plot
Get result with p,d,q = 44,0,44 from Autocorrelation plot.

ARIMA Modelling Plot
Plot Data Train, Test, and Model Prediction

Forecast Quantity Sales With The Best Parameter
Model 2 with p,d,q (28,0,28) show the best metric evaluation because has the lowest MAE, MSE, RMSE, and MAPE.

Prediction for quantity on January 2023 is 50 pcs/day

Customer Segmentation Using KMeans Clustering

Elbow Method

Find k =4 in Elbow line plot, and input cluster to dataset.

There are 4 cluster of segmentation :

Conclusion

Cluster 0 is the cluster with the most largest number of customers, but has the second lowest average of quantity and total amount. The strategy is give special offering and discount for new member
Cluster 1 is the cluster with the second fewest number of customers, and the lowest average of quantity and total amount. One of the strategy is collaborate with influencers to promote products.
Cluster 2 is the customer that valuable to the business. the strategy is offer loyalty membership
Cluster 3 has the second largest average of quantity and total amount and has potential of upselling.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
Dataset		Dataset
images		images
ARIMA & KMeans for Sales Prediction & Segmentation.ipynb		ARIMA & KMeans for Sales Prediction & Segmentation.ipynb
README.md		README.md
Report of ARIMA & KMeans Clustering.pdf		Report of ARIMA & KMeans Clustering.pdf
Sales Dashboard.png		Sales Dashboard.png
Script EDA DBeaver.sql		Script EDA DBeaver.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset

Dataset

images