

triyoza/Buyer-Rating-Prediction


A Classification Model Using the CRISP-DM Methodology to Predict Buyer Ratings in E-Commerce

  • A final project in the Sharing Vision Data Science Bootcamp by Triyoza Aprianda

Introduction

CRISP-DM (CRoss-Industry Standard Process for Data Mining) phases:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Feature Engineering
  • Modeling
  • Evaluation
  • Deployment (this project stops at the evaluation stage)

Business Understanding

Divided into three parts, as defined by the instructor:

  • Business Objectives: A marketplace company wants to create a guideline containing tips for sellers on how to get a 5 rating from buyers.
  • Model Objectives: Create a classification engine to determine whether a buyer gives a 5 rating to the purchased item (label 1) or a rating below 5 (label 0).
  • Model Success Criteria: Recall > 0.6, Precision > 0.6, FPR < 0.45. The model should meet or exceed these criteria; if none does, select the model with the best performance.
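As a sanity check on these metrics, they can be computed from a confusion matrix with scikit-learn; the toy arrays below are illustrative, not the project's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels/predictions, purely for illustration
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
fpr = fp / (fp + tn)                         # false positive rate

print(recall, precision, fpr)  # 0.8 0.8 0.333...
```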

Data Understanding

Data Description

Data used:

  • 'model_development_set.csv', used for model development.
  • 'back_testing_set.csv', used to test the final model by predicting the 'label' column (buyer rating), which the instructor withheld in this set.
  • 'model_development_set.csv' consists of 13,645 rows and 40 columns (features) with numeric, categorical, and datetime data types.

Numeric Features

[Figure: summary of the numeric features]

Categorical Features

[Figure: summary of the categorical features]

Datetime Feature

[Figure: the datetime features]

Exploratory Data Analysis (EDA)

Numerical Features

Multivariate Numerical

One of the regression plots (regplot) showing a high correlation between a feature pair:

[Figure: regplot of a highly correlated feature pair]

Numeric-Label

[Figure: numeric feature distributions by label]

  • From the resulting plots, each numeric feature generally has a higher average value at label 0 (rating below 5).
  • The features 'price' and 'description length' show the opposite pattern, with a higher average value at label 1 (rating 5).

Categorical features

Countplot and stacked barplot

[Figure: countplots and stacked bar plots of the categorical features]

The bar plots and stacked bar plots show that, for each categorical feature, label 1 (rating 5) is the majority in every category.

Datetime Feature

From the datetime features, the order-processing duration can be derived by taking the difference between pairs of datetime columns, yielding new elapsed-time columns:

[Figure: elapsed-time columns derived from the datetime features]
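A minimal sketch of this derivation with pandas; the column names ('order_purchase_ts', 'order_delivered_ts', 'delivery_days') and the toy values are hypothetical, since the real feature names only appear in the figures:

```python
import pandas as pd

# Toy data with hypothetical column names
df = pd.DataFrame({
    "order_purchase_ts":  ["2022-01-01 10:00", "2022-01-03 09:00"],
    "order_delivered_ts": ["2022-01-05 15:00", "2022-01-04 12:00"],
})
for col in ["order_purchase_ts", "order_delivered_ts"]:
    df[col] = pd.to_datetime(df[col])

# Elapsed processing time in days, as a new numeric feature
df["delivery_days"] = (
    df["order_delivered_ts"] - df["order_purchase_ts"]
).dt.total_seconds() / 86400

print(df["delivery_days"].tolist())  # [4.2083..., 1.125]
```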

  • Some of the resulting joint plots:

[Figure: joint plots of elapsed time versus label]

  • The blue dots lie farther from zero and correspond to a rating below 5 (label 0): the longer an order takes to process, the more the buyer's rating tends to fall below 5.
  • The orange dots cluster near zero and correspond to a rating of 5 (label 1): the faster an order is processed, the more likely the buyer gives a rating of 5.

Insights from EDA

  • The numeric features 'price' and 'description length' are unusual: their average value is higher at a rating of 5 than below 5, the opposite of the other numeric features.
  • In the categorical features, every category contains more ratings of 5 than below 5, so the relationship with the label is less clear.
  • Of the three data types, the datetime features influence the rating most, via the elapsed time needed to process the order: the longer a processing stage takes, the more the rating tends to fall below 5.

Data Preparation and Feature Engineering

Train Test Split

  • Drop unnecessary features
  • Perform a train-test split:

[Figure: train-test split code]
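The split itself can be sketched as follows; the toy DataFrame, its column names, and the 80/20 ratio are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the development set; real columns differ
df = pd.DataFrame({
    "price": [10.0, 25.0, 7.5, 40.0, 12.0, 30.0],
    "category": ["a", "b", "a", "c", "b", "a"],
    "label": [1, 0, 1, 1, 0, 1],
})
X = df.drop(columns=["label"])
y = df["label"]

# Stratify on the label to keep the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```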

Missing Value Handling

Percentage of missing values for each feature

[Figure: percentage of missing values per feature]

Handling:

  • SimpleImputer (strategy = 'median') for numeric features
  • SimpleImputer (strategy = 'most_frequent') for categorical features
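A sketch of both imputers on toy columns (the column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({"price": [10.0, np.nan, 30.0, 20.0]})
cat = pd.DataFrame({"category": ["a", "b", np.nan, "a"]})

# Median for numeric, most frequent for categorical (fit on train only)
num_imp = SimpleImputer(strategy="median")
cat_imp = SimpleImputer(strategy="most_frequent")

num_filled = num_imp.fit_transform(num)
cat_filled = cat_imp.fit_transform(cat)
print(num_filled.ravel())  # NaN replaced by the median, 20.0
print(cat_filled.ravel())  # NaN replaced by the mode, "a"
```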

Transformation

  • Scaling for numeric features
  • One-hot encoding for categorical features
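Both transformations can be combined in a scikit-learn ColumnTransformer; this is a generic sketch on toy data, not the project's exact code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "category": ["a", "b", "a"],
})

# Scale the numeric column, one-hot encode the categorical one
pre = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])
Xt = pre.fit_transform(X)
print(Xt.shape)  # 1 scaled column + 2 one-hot columns -> (3, 3)
```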

Feature Selection

  • Multicollinearity Reduction
  • Mutual Information
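Mutual information scores can be computed with scikit-learn; the sketch below uses synthetic data with one feature deliberately tied to the label. (Multicollinearity reduction typically means dropping one feature from each highly correlated pair first.)

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.1, 500),  # strongly tied to the label
    "noise": rng.normal(0, 1, 500),              # unrelated to the label
})

# Higher mutual information = more useful feature; keep the top scorers
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(X.columns, mi.round(3))))
```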

Testing Set

Apply every step (missing-value handling, transformation, and feature selection) to the testing set exactly as fitted on the training set, without re-fitting.
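The key point is to call transform, not fit_transform, on the testing set; a minimal sketch with assumed toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"price": [10.0, 20.0, np.nan, 30.0]})
X_test = pd.DataFrame({"price": [np.nan, 50.0]})

imp = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Fit the statistics (median, mean, std) on the training set only...
Xtr = scaler.fit_transform(imp.fit_transform(X_train))
# ...then reuse them on the test set: transform, never fit_transform
Xte = scaler.transform(imp.transform(X_test))
print(Xte.ravel())
```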

Modeling

  • Determining the model
  • Hyperparameter tuning using GridSearchCV
  • Getting the best parameters
  • Fitting to a training set
  • Check performance (train and test)
  • Repeat until a model meets the success criteria, or keep the model with the best performance
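The tuning loop looks roughly like this; LogisticRegression stands in here because the same GridSearchCV pattern applies to every model tried, and the data and parameter grid are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the project used its prepared training set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Score on recall, one of the stated success criteria
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```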

Final classification model

  • The classification models trained were Logistic Regression, Decision Tree, Random Forest, AdaBoost, and XGBoost.
  • XGBoost was chosen because it had the best performance.
  • Hyperparameter

[Figure: tuned hyperparameters of the final XGBoost model]

  • Final XGBoost performance

[Figure: final XGBoost performance metrics]

Because it exceeded the model success criteria specified at the beginning, the final XGBoost model is used to predict the 'label' column in 'back_testing_set.csv'.

Evaluation

  • Use 'back_testing_set.csv', transformed and feature-selected in the same way as the training set.
  • The predicted 'label' column contains 0 and 1 (0: the buyer rated below 5; 1: the buyer rated 5); it is appended as its own column and exported to a CSV file.

[Figure: predicted label column for the backtesting set]
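A sketch of this final step; the model, the random features, and the output filename 'back_testing_predictions.csv' are all stand-ins for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-ins for the fitted final model and the transformed backtesting features
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(100, 3)), rng.integers(0, 2, 100)
X_back = rng.normal(size=(10, 3))

model = LogisticRegression().fit(X_train, y_train)

# Attach predictions as a 'label' column and export to CSV
out = pd.DataFrame(X_back, columns=["f1", "f2", "f3"])
out["label"] = model.predict(X_back)
out.to_csv("back_testing_predictions.csv", index=False)
print(out["label"].tolist())
```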

Closing

  • After submission, the instructor compares the obtained label column with the original hidden label column.
  • The performance of my model on the backtesting set: Recall 0.68, Precision 0.68, FPR 0.45.

[Figure: backtesting performance screenshot]

Thank you