

triyoza/Buyer-Rating-Prediction


A Classification Model Using the CRISP-DM Methodology to Predict Buyer Ratings in E-Commerce

  • A final project in the Sharing Vision Data Science Bootcamp by Triyoza Aprianda

Introduction

CRISP-DM (CRoss-Industry Standard Process for Data Mining) phases:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Feature Engineering
  • Modeling
  • Evaluation
  • Deployment (this project stops at the evaluation stage)

Business Understanding

Divided into three parts, as defined by the instructor:

  • Business Objectives: A marketplace company wants to create a guideline containing tips for sellers on how to get a 5 rating from buyers.
  • Model Objectives: Create a classification engine to determine whether a buyer gives a 5 rating to the purchased item (label 1) or a rating below 5 (label 0).
  • Model Success Criteria: Recall > 0.6, Precision > 0.6, FPR < 0.45. The model should meet or exceed these criteria; if none does, select the model with the best performance.
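As a sanity check on these metrics, they can be computed from a confusion matrix with scikit-learn; the toy arrays below are illustrative, not the project's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels/predictions, purely for illustration
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
fpr = fp / (fp + tn)                         # false positive rate

print(recall, precision, fpr)  # 0.8 0.8 0.333...
```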

Data Understanding

Data Description

Data used:

  • 'model_development_set.csv', used for model development.
  • 'back_testing_set.csv', used to test the final model by predicting the 'label' column (buyer rating), which the instructor withheld in this set.
  • 'model_development_set.csv' consists of 13,645 rows and 40 columns (features) with numeric, categorical, and datetime data types.

Numeric Features

[Figure: summary of the numeric features]

Categorical Features

[Figure: summary of the categorical features]

Datetime Feature

[Figure: the datetime features]

Exploratory Data Analysis (EDA)

Numerical Features

Multivariate Numerical

One of the regression plots (regplot) showing a high correlation between a feature pair:

[Figure: regplot of a highly correlated feature pair]

Numeric-Label

[Figure: numeric feature distributions by label]

  • From the resulting plots, each numeric feature generally has a higher average value at label 0 (rating below 5).
  • The features 'price' and 'description length' show the opposite pattern, with a higher average value at label 1 (rating 5).

Categorical features

Countplot and stacked barplot

[Figure: countplots and stacked bar plots of the categorical features]

The bar plots and stacked bar plots show that, for each categorical feature, label 1 (rating 5) is the majority in every category.

Datetime Feature

From the datetime features, the order-processing duration can be derived by taking the difference between pairs of datetime columns, yielding new elapsed-time columns:

[Figure: elapsed-time columns derived from the datetime features]
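A minimal sketch of this derivation with pandas; the column names ('order_purchase_ts', 'order_delivered_ts', 'delivery_days') and the toy values are hypothetical, since the real feature names only appear in the figures:

```python
import pandas as pd

# Toy data with hypothetical column names
df = pd.DataFrame({
    "order_purchase_ts":  ["2022-01-01 10:00", "2022-01-03 09:00"],
    "order_delivered_ts": ["2022-01-05 15:00", "2022-01-04 12:00"],
})
for col in ["order_purchase_ts", "order_delivered_ts"]:
    df[col] = pd.to_datetime(df[col])

# Elapsed processing time in days, as a new numeric feature
df["delivery_days"] = (
    df["order_delivered_ts"] - df["order_purchase_ts"]
).dt.total_seconds() / 86400

print(df["delivery_days"].tolist())  # [4.2083..., 1.125]
```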

  • Some of the resulting joint plots:

[Figure: joint plots of elapsed time versus label]

  • The blue dots lie farther from zero and correspond to a rating below 5 (label 0): the longer an order takes to process, the more the buyer's rating tends to fall below 5.
  • The orange dots cluster near zero and correspond to a rating of 5 (label 1): the faster an order is processed, the more likely the buyer gives a rating of 5.

Insights from EDA

  • The numeric features 'price' and 'description length' are unusual: their average value is higher at a rating of 5 than below 5, the opposite of the other numeric features.
  • In the categorical features, every category contains more ratings of 5 than below 5, so the relationship with the label is less clear.
  • Of the three data types, the datetime features influence the rating most, via the elapsed time needed to process the order: the longer a processing stage takes, the more the rating tends to fall below 5.

Data Preparation and Feature Engineering

Train Test Split

  • Drop unnecessary features
  • Perform a train-test split:

[Figure: train-test split code]
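The split itself can be sketched as follows; the toy DataFrame, its column names, and the 80/20 ratio are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the development set; real columns differ
df = pd.DataFrame({
    "price": [10.0, 25.0, 7.5, 40.0, 12.0, 30.0],
    "category": ["a", "b", "a", "c", "b", "a"],
    "label": [1, 0, 1, 1, 0, 1],
})
X = df.drop(columns=["label"])
y = df["label"]

# Stratify on the label to keep the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```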

Missing Value Handling

Percentage of missing values for each feature

[Figure: percentage of missing values per feature]

Handling:

  • SimpleImputer (strategy = 'median') for numeric features
  • SimpleImputer (strategy = 'most_frequent') for categorical features
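A sketch of both imputers on toy columns (the column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({"price": [10.0, np.nan, 30.0, 20.0]})
cat = pd.DataFrame({"category": ["a", "b", np.nan, "a"]})

# Median for numeric, most frequent for categorical (fit on train only)
num_imp = SimpleImputer(strategy="median")
cat_imp = SimpleImputer(strategy="most_frequent")

num_filled = num_imp.fit_transform(num)
cat_filled = cat_imp.fit_transform(cat)
print(num_filled.ravel())  # NaN replaced by the median, 20.0
print(cat_filled.ravel())  # NaN replaced by the mode, "a"
```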

Transformation

  • Scaling for numeric features
  • One-hot encoding for categorical features
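Both transformations can be combined in a scikit-learn ColumnTransformer; this is a generic sketch on toy data, not the project's exact code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "category": ["a", "b", "a"],
})

# Scale the numeric column, one-hot encode the categorical one
pre = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])
Xt = pre.fit_transform(X)
print(Xt.shape)  # 1 scaled column + 2 one-hot columns -> (3, 3)
```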

Feature Selection

  • Multicollinearity Reduction
  • Mutual Information
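Mutual information scores can be computed with scikit-learn; the sketch below uses synthetic data with one feature deliberately tied to the label. (Multicollinearity reduction typically means dropping one feature from each highly correlated pair first.)

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.1, 500),  # strongly tied to the label
    "noise": rng.normal(0, 1, 500),              # unrelated to the label
})

# Higher mutual information = more useful feature; keep the top scorers
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(X.columns, mi.round(3))))
```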

Testing Set

Apply every step (missing-value handling, transformation, and feature selection) to the testing set exactly as fitted on the training set, without re-fitting.
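The key point is to call transform, not fit_transform, on the testing set; a minimal sketch with assumed toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"price": [10.0, 20.0, np.nan, 30.0]})
X_test = pd.DataFrame({"price": [np.nan, 50.0]})

imp = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Fit the statistics (median, mean, std) on the training set only...
Xtr = scaler.fit_transform(imp.fit_transform(X_train))
# ...then reuse them on the test set: transform, never fit_transform
Xte = scaler.transform(imp.transform(X_test))
print(Xte.ravel())
```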

Modeling

  • Determining the model
  • Hyperparameter tuning using GridSearchCV
  • Getting the best parameters
  • Fitting to a training set
  • Check performance (train and test)
  • Repeat until a model meets the success criteria, or keep the model with the best performance
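The tuning loop looks roughly like this; LogisticRegression stands in here because the same GridSearchCV pattern applies to every model tried, and the data and parameter grid are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the project used its prepared training set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Score on recall, one of the stated success criteria
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```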

Final classification model

  • The classification models trained were Logistic Regression, Decision Tree, Random Forest, AdaBoost, and XGBoost.
  • XGBoost was chosen because it had the best performance.
  • Hyperparameter

[Figure: tuned hyperparameters of the final XGBoost model]

  • Final XGBoost performance

[Figure: final XGBoost performance metrics]

Because it exceeded the model success criteria specified at the beginning, the final XGBoost model is used to predict the 'label' column in 'back_testing_set.csv'.

Evaluation

  • Use 'back_testing_set.csv', transformed and feature-selected in the same way as the training set.
  • The predicted 'label' column contains 0 and 1 (0: the buyer rated below 5; 1: the buyer rated 5); it is appended as its own column and exported to a CSV file.

[Figure: predicted label column for the backtesting set]
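A sketch of this final step; the model, the random features, and the output filename 'back_testing_predictions.csv' are all stand-ins for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-ins for the fitted final model and the transformed backtesting features
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(100, 3)), rng.integers(0, 2, 100)
X_back = rng.normal(size=(10, 3))

model = LogisticRegression().fit(X_train, y_train)

# Attach predictions as a 'label' column and export to CSV
out = pd.DataFrame(X_back, columns=["f1", "f2", "f3"])
out["label"] = model.predict(X_back)
out.to_csv("back_testing_predictions.csv", index=False)
print(out["label"].tolist())
```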

Closing

  • After submission, the instructor compares the obtained label column with the original hidden label column.
  • The performance of my model on the backtesting set: Recall 0.68, Precision 0.68, FPR 0.45.

[Figure: backtesting performance screenshot]

Thank you