
BERT4ETH

This is the repo for the code (TensorFlow version) and datasets used in the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted by the ACM Web Conference (WWW) 2023. Here you can find our slides.

If you find this repository useful, please give us a star : ) Thank you!

Update (10/30): I've recently added a section (Section 5.5) to the arXiv version of the paper discussing the multi-hop modeling capability of BERT4ETH.

BERT4ETH-PyTorch: Here you can find the PyTorch implementation: https://github.com/Bayi-Hu/BERT4ETH_PyTorch

Some notes:

Note 1: The master branch hosts the basic BERT4ETH. If you wish to run the basic model, there is no need to download the ERC-20 log dataset. Advanced features such as in/out separation and the ERC-20 log can be found in the old branch, but they are not recommended due to their computational and memory inefficiency.

Note 2: Although BERT4ETH is a sequential model, it is able to capture three-hop relationships from a graph perspective. (For more details, please refer to our slides and multi_hop_modeling.png.)

Note 3: The results reported in our paper are the best results among five pre-training runs. Outcomes may vary slightly across pre-training runs, checkpoint steps, and runs of the cascaded MLP classifier training. Below are our recent results on the phishing detection task with fixed training:

Getting Started

Requirements

  • Python >= 3.6
  • TensorFlow >= 2

I use Python 3.9, TensorFlow 2.9.2 with CUDA 11.2, and NumPy 1.19.5.

Preprocess dataset

Step 1: Download the dataset from Google Drive.

Step 2: Unzip the dataset under the directory "BERT4ETH/Data/":

cd BERT4ETH/Data; # Labels are already included
unzip ...;

Step 3: Transaction Sequence Generation

cd Model;
python gen_seq.py --bizdate=bert4eth_exp
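
Conceptually, this step builds one chronologically ordered transaction sequence per account. A minimal sketch of that idea (the field names are illustrative, not the script's actual schema):

# Illustrative sketch of per-account sequence generation (not the actual gen_seq.py logic).
# Assumes a list of transaction dicts with hypothetical keys: "from", "to", "timestamp", "value".
from collections import defaultdict

def build_sequences(transactions):
    """Group transactions by account and sort each group chronologically."""
    seqs = defaultdict(list)
    for tx in transactions:
        # Each transaction appears in both the sender's and receiver's sequence.
        seqs[tx["from"]].append(tx)
        seqs[tx["to"]].append(tx)
    for account in seqs:
        seqs[account].sort(key=lambda tx: tx["timestamp"])
    return seqs

txs = [
    {"from": "0xA", "to": "0xB", "timestamp": 2, "value": 1.0},
    {"from": "0xB", "to": "0xC", "timestamp": 1, "value": 0.5},
]
print(build_sequences(txs)["0xB"])  # 0xB's transactions, oldest first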

Pre-training

Step 1: Pre-training Data Generation from Sequence

python gen_pretrain_data.py --bizdate=bert4eth_exp \
                            --max_seq_length=100 \
                            --dupe_factor=10 \
                            --masked_lm_prob=0.8
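
The two key flags mirror BERT-style masked-LM data generation: each sequence is duplicated dupe_factor times with different random masks, and each address is masked with probability masked_lm_prob. A rough sketch of that idea (names and structure are illustrative, not the script's actual code):

# Illustrative BERT-style masking sketch (not the actual gen_pretrain_data.py logic).
import random

MASK = "[MASK]"

def make_masked_instances(seq, dupe_factor=10, masked_lm_prob=0.8, rng=random.Random(0)):
    """Create dupe_factor masked copies of one address sequence."""
    instances = []
    for _ in range(dupe_factor):
        tokens, labels = [], []
        for addr in seq:
            if rng.random() < masked_lm_prob:
                tokens.append(MASK)
                labels.append(addr)   # the model must recover the original address
            else:
                tokens.append(addr)
                labels.append(None)   # not a prediction target
        instances.append((tokens, labels))
    return instances

print(make_masked_instances(["0xA", "0xB", "0xC"], dupe_factor=2))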

Step 2: Pre-train BERT4ETH

python run_pretrain.py --bizdate=bert4eth_exp \
                       --max_seq_length=100 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \
                       --neg_share=True \
                       --checkpointDir=bert4eth_exp
Parameter               Description
bizdate                 The signature for this experiment run.
max_seq_length          The maximum input sequence length of BERT4ETH.
masked_lm_prob          The probability of masking an address.
epoch                   Number of training epochs, default = 5.
batch_size              Batch size, default = 256.
learning_rate           Learning rate for the optimizer (Adam), default = 1e-4.
num_train_steps         The maximum number of training steps, default = 1000000.
save_checkpoints_steps  How often (in steps) to save checkpoints, default = 8000.
neg_strategy            Strategy for negative sampling, default = zip; options: uniform, zip, freq (sketched below).
neg_share               Whether to enable the in-batch sharing strategy, default = True.
neg_sample_num          The number of negative samples per batch, default = 5000.
checkpointDir           The directory in which to save checkpoints.
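
The three neg_strategy options differ only in how the sampling distribution over the address vocabulary is built. A minimal sketch of what each strategy might look like (the rank-based Zipf form here is an assumption, not the exact formulation in run_pretrain.py):

# Illustrative sketch of the three negative-sampling strategies (uniform / zip / freq).
import numpy as np

def negative_sampling_probs(freqs, strategy="zip"):
    """Return a sampling distribution over the address vocabulary."""
    freqs = np.asarray(freqs, dtype=np.float64)
    if strategy == "uniform":
        probs = np.ones_like(freqs)               # every address equally likely
    elif strategy == "freq":
        probs = freqs                             # proportional to raw frequency
    elif strategy == "zip":
        ranks = (-freqs).argsort().argsort() + 1  # rank 1 = most frequent address
        probs = 1.0 / ranks                       # Zipf-like: p proportional to 1/rank
    else:
        raise ValueError(strategy)
    return probs / probs.sum()

rng = np.random.default_rng(0)
probs = negative_sampling_probs([100, 10, 5, 1], strategy="zip")
negatives = rng.choice(4, size=5000, p=probs)     # 5000 shared negatives for one batch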

Step 3: Output Representation

python output_embed.py --bizdate=bert4eth_exp \
                       --init_checkpoint=bert4eth_exp/model_104000 \
                       --max_seq_length=100 \
                       --neg_sample_num=5000 \
                       --neg_strategy=zip \
                       --neg_share=True

I have generated a version of the embedding file; you can unzip it under the directory "Model/inter_data/" and test the results.
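
Once unzipped, the representations can be inspected directly. A hypothetical loader sketch (the actual file names and layout are whatever output_embed.py writes):

# Hypothetical loader sketch; file names below are placeholders, not the real layout.
import numpy as np

embeddings = np.load("Model/inter_data/embeddings.npy")             # hypothetical: N x D array
addresses = open("Model/inter_data/addresses.txt").read().split()   # hypothetical: N addresses

embed_of = dict(zip(addresses, embeddings))  # address -> representation vector
print(embeddings.shape)                      # (num_accounts, hidden_size)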

Testing on output account representation

Phishing Account Detection

python run_phishing_detection.py --init_checkpoint=bert4eth_exp/model_104000 # Random Forest (RF)

python run_phishing_detection_dnn.py --init_checkpoint=bert4eth_exp/model_104000 # DNN, better than RF
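
Both scripts train a classifier on the frozen account representations. A minimal scikit-learn sketch of the Random Forest variant (file names are the same hypothetical placeholders as above):

# Minimal sketch: Random Forest on frozen account embeddings (not the actual script).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X = np.load("Model/inter_data/embeddings.npy")      # hypothetical filename, as above
y = np.load("Model/inter_data/phisher_labels.npy")  # hypothetical: 1 = phishing account

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))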

De-anonymization (ENS dataset)

python run_dean_ENS.py --metric=euclidean \
                       --init_checkpoint=bert4eth_exp/model_104000

De-anonymization (Tornado Cash)

python run_dean_Tornado.py --metric=euclidean \
                           --init_checkpoint=bert4eth_exp/model_104000
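
Under the hood, both de-anonymization tasks reduce to matching accounts by embedding distance: for each query account, candidates are ranked by --metric (Euclidean here), and a correct top-ranked match counts as a de-anonymized pair. A self-contained sketch of that ranking step (the exact pairing protocol is described in the paper):

# Sketch: rank candidate accounts by Euclidean distance to a query account's embedding.
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from nearest to farthest (Euclidean)."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return np.argsort(dists)

query = np.array([0.1, 0.2])
cands = np.array([[0.1, 0.2], [1.0, 1.0]])
print(rank_candidates(query, cands))  # [0 1]: index 0 is the nearest match (hit@1)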

Fine-tuning for phishing account detection

python gen_finetune_phisher_data.py --bizdate=bert4eth_exp \
                                    --max_seq_length=100

python run_finetune_phisher.py --init_checkpoint=bert4eth_exp/model_104000 \
                               --bizdate=bert4eth_exp \
                               --max_seq_length=100 \
                               --checkpointDir=tmp
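
Fine-tuning swaps the masked-LM head for a small classification head on top of the pre-trained encoder. A Keras-style sketch of such a head (the layer sizes and hidden_size are assumptions, not the exact head in run_finetune_phisher.py):

# Sketch of a binary classification head over a pooled sequence representation.
# hidden_size and the two-layer shape are assumptions, not the script's exact head.
import tensorflow as tf

hidden_size = 64
head = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(hidden_size,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(account is a phisher)
])
head.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
             loss="binary_crossentropy", metrics=["accuracy"])
# head.fit(pooled_embeddings, labels, ...) with the encoder frozen or trained jointly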

Citation

@inproceedings{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={2189--2197},
  year={2023}
}

Q&A

If you have any questions, you can either open an issue or contact me (sihaohu@gatech.edu), and I will reply as soon as I see the issue or email.