This repository contains a solution for the analysis of financial data and the detection of fraud using graph machine learning. The solution uses a Relational Graph Convolutional Network (RGCN) that derives graph features from neighbourhood information to make fraud detection more effective.
The steps are as follows:
- Pre-processing
- Making edgelists using the user identity columns
- Generating a multi-dimensional heterogeneous graph from the data and these edgelists
- Using the Deep Learning models to generate predictions
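The edgelist step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes a transaction table keyed by `TransactionID` with identity columns such as `card1` and `P_emaildomain` (column names from the IEEE-CIS data), and the helper name `build_edgelists` is hypothetical. Each identity column yields one list of (transaction, identity value) pairs.

```python
import csv
import io
from collections import defaultdict


def build_edgelists(rows, id_col, identity_cols):
    """Return {identity column: [(transaction id, identity value), ...]}.

    Rows with an empty value in an identity column contribute no edge
    for that column.
    """
    edgelists = defaultdict(list)
    for row in rows:
        for col in identity_cols:
            value = row.get(col, "")
            if value != "":
                edgelists[col].append((row[id_col], value))
    return dict(edgelists)


# Tiny made-up sample; the values are illustrative only.
sample_csv = """TransactionID,card1,P_emaildomain
1001,4497,gmail.com
1002,4497,
1003,2755,gmail.com
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
edgelists = build_edgelists(rows, "TransactionID", ["card1", "P_emaildomain"])

for col, edges in edgelists.items():
    print(col, edges)
```

In this sample, transactions 1001 and 1002 share the same `card1` value, so in the resulting graph they become connected through that card node.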
This repo is refactored from the model used in awslabs/sagemaker-graph-fraud-detection and is implemented with Deep Graph Library (DGL) and PyTorch. Unlike Amazon's implementation, this repo does not require SageMaker for training. It can be run directly on the free Google Colab or on a local machine.
Beyond fraudulent transaction detection, the repo can also be applied to other heterogeneous-graph scenarios, such as game account theft or fraudulent online shopping orders.
Many online businesses lose billions of dollars to fraud each year, but machine learning-based fraud detection models can help businesses predict which interactions or users are likely to be fraudulent in order to reduce losses.
This repo formulates fraud detection as a classification task on heterogeneous interaction networks. The machine learning model is a graph neural network (GNN) that learns latent representations of users or transactions, which can then be classified as fraud or not.
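To make the idea of learning representations from neighbourhood information concrete, here is a minimal pure-Python sketch of one relational graph convolution step: each node combines its own transformed features with the mean of its neighbours' transformed features, using a separate weight matrix per relation. The function names, relation names, and identity weights are assumptions for illustration; the notebooks use DGL and PyTorch layers, whose details may differ.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def rgcn_layer(features, edges_by_relation, rel_weights, self_weight):
    """One relational graph convolution step.

    features: {node: feature vector}
    edges_by_relation: {relation: [(src, dst), ...]}, messages flow src -> dst
    rel_weights: {relation: weight matrix}
    self_weight: weight matrix for the self-connection
    """
    # Start from the self-connection term.
    out = {node: matvec(self_weight, h) for node, h in features.items()}

    for rel, edges in edges_by_relation.items():
        # Group incoming neighbours per destination node for this relation.
        incoming = {}
        for src, dst in edges:
            incoming.setdefault(dst, []).append(src)
        for dst, sources in incoming.items():
            # Mean-normalise the messages by the number of neighbours.
            for src in sources:
                msg = matvec(rel_weights[rel], features[src])
                out[dst] = [o + m / len(sources) for o, m in zip(out[dst], msg)]

    # ReLU non-linearity.
    return {node: [max(0.0, v) for v in h] for node, h in out.items()}


# Hypothetical toy graph: two transactions that used the same card.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = {"t1": [1.0, 2.0], "t2": [3.0, 4.0], "card": [0.0, 0.0]}
edges_by_relation = {
    "uses_card": [("t1", "card"), ("t2", "card")],
    "card_used_by": [("card", "t1"), ("card", "t2")],
}
rel_weights = {"uses_card": identity, "card_used_by": identity}

updated = rgcn_layer(features, edges_by_relation, rel_weights, identity)
print(updated)
```

With identity weights, the card node ends up with the average of the two transaction feature vectors, which shows how information from related transactions is pooled. In a trained model the weight matrices are learned, and stacking several such layers lets information travel across multiple hops.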
This repo constructs a heterogeneous graph from the transaction data in the IEEE-CIS Fraud Detection dataset. The following are defined as features of nodes and edges, respectively:
- NODE: Number of associated cards, number of associated addresses, days between transactions, match status (name, card, address, email, etc.), Vesta engineered rich features, etc.
- EDGE: Purchaser and recipient email domain, Product, Card information, Address, Device information, Network connection information (IP, ISP, Proxy, etc), Digital signature (UA/browser/os/version, etc)
If you want to run the code locally rather than on Colab, please skip the first two cells in each notebook.
The dataset used is the IEEE-CIS Fraud Detection dataset provided by Vesta on Kaggle. First, download the dataset from Kaggle. This link provides some additional information about the dataset.
Then put all of the CSV files into the ./ieee-data/ folder.
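Before opening the notebooks, a quick check that the files landed in the right place can save a failed run. The sketch below is an optional helper, not part of the repo: the function name `missing_files` is hypothetical, and the file list assumes the names used in the Kaggle competition download.

```python
from pathlib import Path

# File names as distributed in the Kaggle competition download (assumed here;
# adjust the list if the notebooks read a different subset).
EXPECTED_FILES = [
    "train_transaction.csv",
    "train_identity.csv",
    "test_transaction.csv",
    "test_identity.csv",
]


def missing_files(data_dir="./ieee-data/", expected=EXPECTED_FILES):
    """Return the expected CSV files that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in expected if not (folder / name).is_file()]


missing = missing_files()
if missing:
    print("Missing from ./ieee-data/:", ", ".join(missing))
else:
    print("All expected CSV files found.")
```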
Before feeding the data to the model, we need to pre-process it. Open 10_data_loader.ipynb and follow the instructions inside. The compiled data will be saved into the ./data/ folder.
Open 20_modeling.ipynb and follow the instructions inside. CPU training is recommended; training on GPUs may require resolving additional environment issues.
The trained models and related files will be saved into the ./model/ folder. You can also visualize the training process with 30_visual.ipynb. The related graphs and training records are saved in the ./output/ folder.
The architecture of the solution is as follows:
To run the code, simply run the Jupyter notebooks in this order:
- DataPrep
- Modelling
- Visualization
The constructed heterogeneous graph contains a total of 726,345 Nodes and 19,518,802 Edges.
Two RGCNs (Relational Graph Convolutional Networks) were developed and tested:
- Shallow RGCN
- Deep RGCN
Despite the heavy class imbalance, the Deep RGCN outperformed the Shallow RGCN on every metric. The evaluation scores were as follows (Shallow RGCN scores in parentheses):
- F1: 0.6228 (Shallow network 0.48)
- Accuracy: 98% (Shallow network 97.48%)
- Precision: 0.8872 (Shallow network 0.8240)
- Recall: 0.4798 (Shallow network 0.3410)
Because the data is highly imbalanced, we need to make a trade-off between Recall and Precision. Since misclassifying non-fraud transactions as fraud seriously affects the user experience, Precision is the priority. After training, the Precision is 0.86 and the ROC AUC is 0.92.
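One common way to act on a precision-first priority is to tune the decision threshold on the model's fraud scores. The sketch below is a generic illustration under that assumption, not the procedure used in the notebooks; the function names and the sample scores are made up. It picks the lowest threshold that reaches a target precision, which keeps recall as high as possible because recall can only fall as the threshold rises.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target_precision):
    """Smallest candidate threshold whose precision meets the target, or None."""
    for threshold in sorted(set(scores)):
        precision, _ = precision_recall_at(scores, labels, threshold)
        if precision >= target_precision:
            return threshold
    return None


# Made-up scores and labels for illustration (1 = fraud).
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

threshold = lowest_threshold_for_precision(scores, labels, target_precision=0.8)
print(threshold, precision_recall_at(scores, labels, threshold))
```

In practice the threshold should be chosen on a validation split rather than on the test data.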
Confusion Matrix:
| Type | Labels Positive | Labels Negative |
|---|---|---|
| Predicted Positive | 1435 | 240 |
| Predicted Negative | 2629 | 113804 |
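As a worked check, the standard metrics can be recomputed directly from the four counts in the confusion matrix above. The variable names are illustrative, and the formulas are the usual definitions of precision, recall, F1, and accuracy.

```python
# Counts taken from the confusion matrix above.
tp, fp = 1435, 240
fn, tn = 2629, 113804

total = tp + fp + fn + tn
precision = tp / (tp + fp)          # flagged transactions that are fraud
recall = tp / (tp + fn)             # fraud transactions that were flagged
f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
accuracy = (tp + tn) / total
fraud_rate = (tp + fn) / total      # share of positive labels in this split

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f} fraud_rate={fraud_rate:.4f}")
```

The precision from these counts rounds to 0.86, in line with the figure quoted above. The small share of positive labels also shows why accuracy alone says little on data this imbalanced.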
Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library