
Ethereum-Fraud-Detection

Detecting Fraudulent Blockchain Accounts on Ethereum with Supervised Machine Learning


Introduction

Since 2021, more than 46,000 people have lost over $1 billion to cryptocurrency scams, nearly 60 times more than in 2018 [1]. The Federal Trade Commission (FTC) found that the top cryptocurrencies used to pay scammers were Bitcoin (70%), Tether (10%) and Ethereum (9%) [1]. The recent collapse of FTX, a crypto exchange that misused more than $1 billion of clients' funds, makes it ever more important to stay vigilant when navigating the cryptocurrency world [2]. To help deter fraudulent scams, we used supervised machine learning techniques, namely Logistic Regression, Naive Bayes, SVM, XGBoost, LightGBM, MLP, TabNet and stacking, to detect and predict fraudulent Ethereum accounts. This adds business value by enhancing fraudulent-account detection on crypto exchanges and in crypto wallets, enabling people to navigate the cryptocurrency world confidently and safeguard their personal assets. We set an objective of achieving an F1 score above 90% in predicting fraudulent accounts on the Ethereum blockchain.


Data

There are two data sources: Kaggle and Etherscan.

Kaggle

The Kaggle dataset was downloaded from https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset and can be found in ./Data/address_data_k.csv

Etherscan

Data were mined from Etherscan at https://etherscan.io/accounts/label/phish-hack (the labelled data has since been taken off Etherscan, but we saved a copy) and can be found in ./Data/address_data_e.csv

Combined without Time Series

Data from Kaggle and Etherscan are combined and can be found in ./Data/address_data_combined.csv

Time-Series

One key aspect of the dataset that we realised was missing was a time-series element. Although each observation in our data is a user account, the data was generated by aggregating individual transactions, and valuable information could have been "flattened out" in the process. The flow of Ethereum transactions is intrinsically time-series data, carrying signals such as the seasonality of transactions that could be used in our model. This information was extracted using the tsfresh library; the raw transaction data can be found in ./Data/Transaction_data and the newly extracted features in ./Data/new_ts_features_only.csv.
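A minimal sketch of that extraction step, assuming a toy per-transaction table whose column names ("address", "timestamp", "value") are illustrative rather than the repository's actual schema:

```python
import pandas as pd
from tsfresh import extract_features

# One row per raw transaction: the account it belongs to, when it
# happened, and the Ether value transferred (toy values).
tx = pd.DataFrame({
    "address":   ["0xabc", "0xabc", "0xabc", "0xdef", "0xdef"],
    "timestamp": [1, 2, 5, 1, 3],
    "value":     [0.5, 1.2, 0.3, 3.0, 0.7],
})

# tsfresh aggregates each account's transaction sequence into a wide set
# of time-series features (trends, autocorrelation, entropy, ...),
# producing one feature row per account.
ts_features = extract_features(tx, column_id="address", column_sort="timestamp")
print(ts_features.shape)
```

Each account then contributes one row of engineered time-series features, which can be joined back onto the per-account dataset.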

Combined with Time Series

Data from Kaggle and Etherscan, including the time-series features, can be found in ./Data/address_data_combined_ts.csv

Data Description

We started with a Kaggle dataset of 9841 observations. Each observation is a unique Ethereum account, and each variable is an aggregate statistic over all transactions performed by that account, such as the total Ether value received or the average time between transactions. The data also distinguishes between account-to-account transactions and account-to-smart-contract transactions. However, the dataset was highly imbalanced, with only 2179 out of 9841 accounts (22.14%) marked as fraudulent. To address the imbalance, we leveraged the API provided by Etherscan, a "Block Explorer and Analytics Platform for Ethereum", which let us retrieve the transactions made by any given account address on the Ethereum blockchain (see the sketch below). As a result, the number of fraudulent accounts in our dataset climbed to 4339, making the combined dataset less imbalanced (45.97% fraud).
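A hedged sketch of retrieving an account's transaction list through the Etherscan API; the zero address and YOUR_API_KEY are placeholders:

```python
import requests

def get_transactions(address: str, api_key: str) -> list:
    # Etherscan's account module returns the list of normal transactions
    # for an address (up to 10,000 records per call).
    resp = requests.get(
        "https://api.etherscan.io/api",
        params={
            "module": "account",
            "action": "txlist",
            "address": address,
            "startblock": 0,
            "endblock": 99999999,
            "sort": "asc",
            "apikey": api_key,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("result", [])

txs = get_transactions("0x0000000000000000000000000000000000000000", "YOUR_API_KEY")
print(len(txs))
```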


Machine Learning Models

Random Forest

./Models/Random_Forest_Model.ipynb

Logistic Regression

./Models/Logistic_Regression_Model.ipynb

Naive Bayes

./Models/Naive_Bayes_Model.ipynb

Support Vector Machine (SVM)

./Models/Support_Vector_Machine_Model.ipynb

Multi-Layer Perceptron (MLP)

./Models/Multi_Layer_Perceptron_Model.ipynb

eXtreme Gradient Boosting (XGBoost)

./Models/XGBoost_Model.ipynb

TabNet

./Models/TabNet_Model.ipynb

LightGBM

./Models/LightGBM_Model.ipynb

Stacked Ensemble Model without Time Series

./Models/Final_Stacking_Model.ipynb

Stacked Ensemble Model with Time Series

./Models/Final_Stacking_Model_w_ts.ipynb
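As a rough sketch of the stacked ensemble idea (not the repository's exact pipeline), scikit-learn's StackingClassifier can combine the tuned base learners. TabNet and the MLP are omitted here for brevity, the hyperparameters are taken from the tables below, and the logistic-regression meta-learner is an assumption:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Base learners with the tuned parameters reported below; a logistic
# regression combines their out-of-fold predictions as the meta-learner.
stack = StackingClassifier(
    estimators=[
        ("svm", SVC(C=1000, gamma=1, probability=True)),
        ("xgb", XGBClassifier(learning_rate=0.05, max_depth=8, n_estimators=1000)),
        ("lgbm", LGBMClassifier(learning_rate=0.2, max_depth=6, num_leaves=20)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # 5-fold out-of-fold predictions feed the meta-learner
)
# stack.fit(X_train, y_train); stack.score(X_test, y_test)
```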


Model Performance without Time Series

| Model | F1 | Recall | Precision | Accuracy | Time taken | ROC-AUC | Optimal Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.8360 | 0.8420 | 0.8301 | 0.8479 | 84.85 | 0.8475 | 'C': 1000, 'penalty': 'l1', 'solver': 'liblinear' |
| Naive Bayes | 0.7797 | 0.8241 | 0.7398 | 0.7855 | 1.54 | 0.7883 | 'var_smoothing': 0.0533669923120631 |
| SVM | 0.9243 | 0.9149 | 0.9340 | 0.9427 | 3.28 | 0.9374 | 'C': 1000, 'gamma': 1 |
| XGBoost | 0.9358 | 0.9177 | 0.9546 | 0.9519 | 2.79 | 0.9453 | 'learning_rate': 0.05, 'max_depth': 8, 'n_estimators': 1000 |
| MLP | 0.8505 | 0.8346 | 0.8670 | 0.8879 | 1.91 | 0.8777 | 'input_dim': 12, 'H': 60, 'activation': 'relu', 'dropout_probability': 0.2, 'num_epochs': 75, 'num_layers': 10 |
| TabNet | 0.9147 | 0.8903 | 0.9405 | 0.9366 | 56.93 | 0.9277 | 'gamma': 1.0, 'lambda_sparse': 0, 'momentum': 0.4, 'n_steps': 8, 'optimizer_params': {'lr': 0.025} |
| LightGBM | 0.9376 | 0.9198 | 0.9561 | 0.9532 | 0.06 | 0.9468 | 'bagging_fraction': 0.95, 'bagging_freq': 1, 'feature_fraction': 0.95, 'learning_rate': 0.2, 'max_bin': 300, 'max_depth': 6, 'min_gain_to_split': 0, 'num_leaves': 20 |
| Stacking | 0.9371 | 0.9226 | 0.9521 | 0.9527 | 198.35 | 0.9469 | SVM, XGBoost, MLP, TabNet and LightGBM |
| Stacking (excluding MLP) | 0.9379 | 0.9240 | 0.9521 | 0.9532 | 185.72 | 0.9476 | SVM, XGBoost, TabNet and LightGBM |
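As a sanity check, each model's F1 score is the harmonic mean of its precision and recall; for LightGBM, for instance:

$$F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.9561 \times 0.9198}{0.9561 + 0.9198} \approx 0.9376$$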

Model Performance with Time Series

| Model | F1 | Recall | Precision | Accuracy | Time taken | ROC-AUC | Optimal Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SVM | 0.9208 | 0.9220 | 0.9196 | 0.9284 | 5.43 | 0.9278 | 'C': 1000, 'gamma': 1 |
| XGBoost | 0.9323 | 0.9310 | 0.9335 | 0.9389 | 7.74 | 0.9383 | 'learning_rate': 0.05, 'max_depth': 8, 'n_estimators': 1000 |
| MLP | 0.8364 | 0.8079 | 0.8668 | 0.8573 | 2.43 | 0.8529 | 'input_dim': 12, 'H': 60, 'activation': 'relu', 'dropout_probability': 0.2, 'num_epochs': 75, 'num_layers': 10 |
| TabNet | 0.8968 | 0.8578 | 0.9396 | 0.9109 | 89.65 | 0.9062 | 'gamma': 1.0, 'lambda_sparse': 0, 'momentum': 0.4, 'n_steps': 8, 'optimizer_params': {'lr': 0.025} |
| LightGBM | 0.9314 | 0.9326 | 0.9302 | 0.9379 | 0.012 | 0.9375 | 'bagging_fraction': 0.95, 'bagging_freq': 1, 'feature_fraction': 0.95, 'learning_rate': 0.2, 'max_bin': 300, 'max_depth': 6, 'min_gain_to_split': 0, 'num_leaves': 20 |
| Stacking | 0.9323 | 0.9347 | 0.9298 | 0.9387 | 311.96 | 0.9383 | SVM, XGBoost, MLP, TabNet and LightGBM |