Steve Martinelli edited this page Jul 24, 2018 · 4 revisions

Short Name

Fraud prediction using skewed data

Short Description

This pattern walks you through the end-to-end process of building and evaluating different predictive models, and shows how sampling techniques affect the accuracy of those models.

Offering Type

Data Science

Introduction

Predictive analytics uses historical data to predict future events. Typically, historical data is used to build a mathematical model that captures important trends. That predictive model is then applied to current data to predict what will happen next, or to suggest actions to take for optimal outcomes. We use the same approach to address the credit card fraud detection problem. Fraudulent transactions are costly, but it is too expensive and inefficient to investigate every transaction for fraud. Even if it were possible, investigating innocent customers could make for a very poor customer experience, leading some clients to leave the business. Using a predictive model, we can instead automatically identify and prioritize likely fraudulent activity, so that fraud units investigate only those incidents likely to require it. Compared with the other solutions available, this approach is efficient and accurate, and it reduces the scope for human error. We aim to minimize both the cases where a transaction is predicted as fraud but is actually legitimate (False Positives) and the cases where a transaction is fraud but is not predicted as such (False Negatives).
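To make the two error types concrete, here is a minimal sketch in plain Python that counts False Positives and False Negatives. The labels are made-up toy data (1 = fraud, 0 = legitimate), and the helper names `confusion_counts` and `summarize` are illustrative only; they do not come from the pattern's notebook. The sketch also shows why accuracy alone can mislead on skewed data.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels where 1 = fraud and 0 = legitimate."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # flagged, but legitimate
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # fraud that was missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(actual, predicted):
    """Return accuracy, precision, recall and the two error counts."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Made-up toy labels: 1000 transactions, 10 of them fraudulent.
actual = [1] * 10 + [0] * 990

# A "model" that never flags fraud still scores 99% accuracy on this data.
never_fraud = [0] * 1000
# A hypothetical model that catches 8 of 10 frauds and raises 5 false alarms.
some_model = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 985

baseline = summarize(actual, never_fraud)
candidate = summarize(actual, some_model)
print(baseline)
print(candidate)
```

The two accuracy figures are almost identical, while the False Negative counts (10 versus 2) differ sharply, which is why the error counts are worth tracking separately when fraud is rare.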

We will highlight the methodology to handle skewed data using different sampling techniques and generate accurate predictions using different statistical algorithms.

Authors

By Sharath Kumar RK, Manjula G Hosurmath, and Vishal Chahal

Code

https://github.com/IBM/xgboost-smote-detect-fraud

Demo

N/A

Video

https://youtu.be/LZYnfrnkmwk

Overview

Credit card fraud is a growing worldwide problem that costs billions of dollars per year. It is a wide-ranging term for theft and fraud committed using or involving a payment card, such as a credit card or debit card, as a fraudulent source of funds in a transaction. The purpose may be to obtain goods without paying, or to obtain unauthorized funds from an account. According to 2016 data released by ACI Worldwide and financial industry consultant Aite Group, nearly 1 in 3 consumers globally have been victimized by card fraud in the past five years. The benchmark survey also reported that 14 of the 17 countries surveyed experienced an increase in card fraud between 2014 and 2016. A 2016 iovation/Aite Group study on the projected impact of financial fraud reports that credit card fraud losses may climb to as much as $10 billion in the United States alone by 2020. Technology is therefore needed to bring these alarming numbers down.

When the reader has completed this code pattern, they will understand how to:

  • Build predictive models using Bagging & Boosting statistical techniques.
  • Run different statistical models and evaluate the results.
  • Sample the data to create a balance between the majority & minority populations to handle skewed data.
  • Demonstrate how the sampling techniques can give a lift to the accuracy of the predictive model.
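The objectives above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions, not the pattern's notebook: it uses scikit-learn's `RandomForestClassifier` as a bagging-style ensemble and `GradientBoostingClassifier` as a stand-in for XGBoost, generates a skewed data set with `make_classification` instead of loading real transactions, and balances the training set with a simple random under-sampling helper (`undersample`, a hypothetical name).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transaction data: roughly 2% of rows are "fraud" (label 1).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)
# Stratify so the rare class appears in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def undersample(X, y, rng):
    """Randomly drop majority rows until both classes have the same count."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = rng.permutation(np.concatenate([minority, kept]))
    return X[idx], y[idx]


rng = np.random.default_rng(0)
X_bal, y_bal = undersample(X_train, y_train, rng)

models = {
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}
training_sets = {
    "original": (X_train, y_train),
    "undersampled": (X_bal, y_bal),
}

# Fit each model on each training set and score it on the untouched test set.
results = {}
for model_name, model in models.items():
    for set_name, (X_fit, y_fit) in training_sets.items():
        pred = model.fit(X_fit, y_fit).predict(X_test)
        results[(model_name, set_name)] = {
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
        }

for key, scores in results.items():
    print(key, scores)
```

Only the training data is resampled; the test split keeps its original skew so the scores reflect the class balance the model would face in practice. Precision and recall are reported instead of plain accuracy because a model that never predicts fraud would already be about 98% accurate on this data.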

Flow

  1. User logs in to Watson Studio and creates an instance that includes object storage.
  2. User uploads the CSV file to the object storage.
  3. User imports a Jupyter Notebook from the URL.
  4. User runs the statistical models and sampling techniques in the notebook.
  5. User exports the predictive modelling results to the object storage.

Included components

  • IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.

  • IBM Cloud Object Storage: An IBM Cloud service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market. This code pattern uses Object Storage (Swift API).

  • Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Featured technologies

  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
  • Analytics: Analytics delivers the value of data for the enterprise.
  • Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.
  • Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Blog

Risk mitigation is a key concern in the financial domain, and one of the most prominent risks is fraudulent transactions, which can lead to breaches of data and systems. Fraudulent transactions are far fewer than legitimate ones, which makes the data highly skewed. Can we predict fraud in such imbalanced data with good accuracy? Do we need to use different sampling techniques, and can they give a lift in accuracy? The answer to both questions is yes, and we demonstrate the methodology to handle skewed data using different sampling techniques and generate accurate predictions with different statistical algorithms.

Our aim is to equip developers with techniques such as Bagging and Boosting, which let them balance accuracy against computation power, and with under-sampling and over-sampling (SMOTE) techniques for handling skewed data.
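For readers new to over-sampling, the idea behind SMOTE can be sketched in a few lines of NumPy: each synthetic minority row is placed at a random point on the line segment between a real minority row and one of its nearest minority neighbours. This is a simplified illustration on made-up data, not the implementation used in the pattern's notebook or in any library; the function name `smote_like` and its parameters are assumptions.

```python
import numpy as np


def smote_like(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a row
    and one of its k nearest minority-class neighbours (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    k = min(k, n - 1)  # cannot have more neighbours than other rows

    # Pairwise distances between all minority rows.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # real row each synthetic row starts from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to move towards
    neigh = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # how far along the segment, in [0, 1)
    return X_minority[base] + gap * (X_minority[neigh] - X_minority[base])


# Made-up two-feature data: 20 "fraud" rows and 980 "legitimate" rows.
data_rng = np.random.default_rng(1)
X_fraud = data_rng.normal(loc=3.0, size=(20, 2))
X_legit = data_rng.normal(loc=0.0, size=(980, 2))

# Generate enough synthetic fraud rows to match the majority class.
X_new = smote_like(X_fraud, n_new=len(X_legit) - len(X_fraud))
X_balanced = np.vstack([X_legit, X_fraud, X_new])
y_balanced = np.array([0] * len(X_legit) + [1] * (len(X_fraud) + len(X_new)))
print(X_balanced.shape, y_balanced.mean())
```

Because every synthetic row is a blend of two real minority rows, the new points stay inside the region the minority class already occupies. As with under-sampling, this kind of resampling is normally applied to the training data only, so that evaluation still reflects the real class balance.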

View the entire Fraud prediction using imbalanced data pattern, including demos, code and more.

Links