cwong690/Shut-the-Fraud-Down

Shut the FRAUD Down!

A fraud detection model with an interactive Flask app to stream events and automatically detect fraudulent cases!

Ben Weintraub | Cindy Wong | Tyler Woods

Overview

Premise: A new e-commerce site needs a data scientist to detect fraudsters, which requires building a machine learning model. However, not all errors cost the same: false positives erode consumer trust, while false negatives cost money.

The model does not predict a ground truth; instead, it flags the cases with a high potential for fraud. The interactive portion of the web app shows users which cases are the top priorities to check, along with the attributes of each case.
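The asymmetry between false positives and false negatives can be made concrete by weighting each kind of error when choosing a flagging threshold. The sketch below is only an illustration: the two cost constants, the toy labels, and the probabilities are all made up, and `total_cost` and `best_threshold` are hypothetical helper names, not functions from this repository.

```python
# Hypothetical per-error costs; the README gives no actual figures.
COST_FALSE_POSITIVE = 5.0    # assumed cost of flagging a legitimate event
COST_FALSE_NEGATIVE = 100.0  # assumed cost of missing a fraudulent event


def total_cost(y_true, y_prob, threshold):
    """Total cost of the errors made when flagging events at `threshold`."""
    cost = 0.0
    for label, prob in zip(y_true, y_prob):
        flagged = prob >= threshold
        if flagged and label == 0:
            cost += COST_FALSE_POSITIVE   # legitimate event flagged
        elif not flagged and label == 1:
            cost += COST_FALSE_NEGATIVE   # fraudulent event missed
    return cost


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(y_true, y_prob, t))


# Toy labels (1 = fraud) and made-up predicted fraud probabilities.
y_true = [0, 0, 0, 1, 0, 1]
y_prob = [0.05, 0.20, 0.45, 0.40, 0.10, 0.90]
candidates = [0.1, 0.3, 0.5, 0.7]

for t in candidates:
    print(f"threshold {t}: cost {total_cost(y_true, y_prob, t)}")
print("lowest-cost threshold:", best_threshold(y_true, y_prob, candidates))
```

With a missed fraud assumed to cost far more than a false alarm, the lowest-cost threshold tends to sit below 0.5, which matches the idea of flagging high-potential cases for review rather than issuing a final verdict.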

Data Preparation

Dataset

[Figure: dataset info]

Quick statistics of the columns

[Figure: dataset statistics]

EDA

We started with a heatmap of the correlations between the target value (fraud or not fraud) and all the other columns.

[Figure: correlation heatmap]

Some columns appear to correlate more strongly with fraud than others. The bar plots below show the number of fraudulent cases for different values within each category. For example, the higher the delivery method number, the less likely the case is fraudulent.

[Figures: bar plots of fraud counts by feature]

  • Channels vs Fraud
  • Delivery Method vs Fraud
  • Gross Profits vs Fraud
  • FB Published vs Fraud
  • Ticket Length vs Fraud
  • User Type vs Fraud
  • Sale Duration vs Fraud
  • Gmail vs Fraud
  • Previous Payout vs Fraud

Models

A Random Forest model was used to estimate feature importances:

[Figure: feature importance]
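As a rough illustration of how feature importances can be read off a Random Forest, here is a minimal sketch using scikit-learn. It assumes scikit-learn is the library in use (the README does not say), runs on synthetic data from `make_classification` rather than the project's dataset, and the feature names are hypothetical labels borrowed from the EDA plots.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names, loosely based on the EDA plots; the columns
# themselves are synthetic and carry no real meaning.
feature_names = [
    "delivery_method", "user_type", "sale_duration",
    "previous_payouts", "channels",
]

# Synthetic stand-in for the real event data.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3,
    n_redundant=0, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>18}: {importance:.3f}")
```

Impurity-based importances can favor high-cardinality features, so a ranking like this is best treated as a starting point for investigation rather than a final answer.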


Three separate models were built to determine the best one for this problem. Below are the ROC curves for each model.


  • Logistic Regression: [Figure: ROC curve]
  • Random Forest Classifier: [Figure: ROC curve (k-folds)]
  • Gradient Boosting

Combined ROC Curves Comparison

[Figure: combined ROC curves]
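A comparison like the one above can be sketched with scikit-learn as follows. This is an assumption-laden illustration, not the project's training code: the data is synthetic and imbalanced via `make_classification`, the hyperparameters are library defaults, and the `results` dictionary is a hypothetical structure for holding each curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the event data (about 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive (fraud) class for each test event.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    results[name] = {"fpr": fpr, "tpr": tpr, "auc": roc_auc_score(y_test, scores)}
    print(f"{name}: AUC = {results[name]['auc']:.3f}")
```

Each `fpr`/`tpr` pair can be passed to a plotting library to overlay the curves on a single axis for a combined comparison.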

Fraudulent Activity Detector

The model and the data are stored in MongoDB and accessed through PyMongo. Below is an example of a record, stored in dictionary form, inserted into MongoDB:

[Figure: example MongoDB record]
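Storing a record like the one above might look roughly like this. The sketch is hedged throughout: `build_record` and `save_record` are hypothetical helpers, the field names (`fraud_probability`, `predicted_at`, and the event keys) are assumptions rather than the project's schema, and `collection` is expected to be a PyMongo collection or any object exposing `insert_one`.

```python
from datetime import datetime, timezone


def build_record(event, fraud_probability):
    """Combine a raw event with its prediction into one dictionary."""
    record = dict(event)  # copy so the caller's dictionary is not modified
    record["fraud_probability"] = fraud_probability  # assumed field name
    record["predicted_at"] = datetime.now(timezone.utc)  # assumed field name
    return record


def save_record(collection, record):
    """Insert the record into a collection that provides insert_one()."""
    return collection.insert_one(record)


# Hypothetical event; the keys are illustrative, not the real schema.
event = {"delivery_method": 1, "user_type": 3, "channels": 8}
record = build_record(event, 0.82)
print(record)
```

With PyMongo installed and a server running, a collection could be obtained with something like `MongoClient("mongodb://localhost:27017")["fraud"]["events"]`, where the database and collection names are placeholders.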

The Fraudulent Activity Detector allows clients to quickly view the latest records and the predicted risk level for each case. There are four risk levels: low risk, medium risk, high risk, and unable to predict risk.

The chart below shows how many cases fall into each category. It is highly recommended that an employee check the cases that could not receive a prediction, as well as the high-risk cases.

Descriptions of the events are recorded as well in order to track down any leads.

[Figure: number of cases per risk level]
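One way the four risk levels could be derived from a predicted fraud probability is sketched below. The thresholds are hypothetical (the README does not state the cut-offs the app uses), `None` stands in for a case the model could not score, and `risk_level` and `needs_review` are illustrative names.

```python
from collections import Counter

# Hypothetical cut-offs; the actual values used by the app are not documented.
MEDIUM_RISK_THRESHOLD = 0.3
HIGH_RISK_THRESHOLD = 0.7


def risk_level(fraud_probability):
    """Map a predicted fraud probability (or None) to a risk label."""
    if fraud_probability is None:
        return "unable to predict risk"
    if fraud_probability >= HIGH_RISK_THRESHOLD:
        return "high risk"
    if fraud_probability >= MEDIUM_RISK_THRESHOLD:
        return "medium risk"
    return "low risk"


def needs_review(level):
    """High-risk and unscored cases are the ones suggested for manual checks."""
    return level in {"high risk", "unable to predict risk"}


# Made-up probabilities; None marks an event that could not be scored.
probabilities = [0.05, 0.95, None, 0.45, 0.71, 0.10]
levels = [risk_level(p) for p in probabilities]
counts = Counter(levels)
print(counts)
print("cases to review:", sum(needs_review(level) for level in levels))
```

The per-level counts are the kind of data a category chart would display, and `needs_review` mirrors the recommendation to route both high-risk and unscored cases to an employee.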

Future Work

  • KNN
  • Better Model
  • Clean up files