⚙️ MiDas 🔎

by Truong Giang Nguyen, Thanh Le-Cong, Hong Jin Kang, Ratnadira Widyasari, Chengran Yang, Zhipeng Zhao, Bowen Xu, Jiayuan Zhou, Xin Xia, Ahmed E. Hassan, Xuan-Bach D. Le, David Lo

Welcome to the source code repo of MiDas, a vulnerability-fixing commit detection tool introduced in our paper "Multi-granularity Detector for Vulnerability Fixes"!

Please cite the following article if you find MiDas useful:

Truong Giang Nguyen, Thanh Le-Cong, Hong Jin Kang, Ratnadira Widyasari, Chengran Yang, Zhipeng Zhao, Bowen Xu, Jiayuan Zhou, Xin Xia, Ahmed E. Hassan, Xuan-Bach D. Le, and David Lo. "Multi-granularity Detector for Vulnerability Fixes". IEEE Transactions on Software Engineering, 2023.

@article{nguyen2023midas,
  author={Nguyen, Truong Giang and Le-Cong, Thanh and Kang, Hong Jin and Widyasari, Ratnadira and Yang, Chengran and Zhao, Zhipeng and Xu, Bowen and Zhou, Jiayuan and Xia, Xin and Hassan, Ahmed E. and Le, Xuan-Bach D. and Lo, David},
  journal={IEEE Transactions on Software Engineering}, 
  title={Multi-granularity Detector for Vulnerability Fixes}, 
  year={2023},
  volume={},
  number={},
}

📃 Introduction

MiDas is a novel transformer-based technique for detecting vulnerability-fixing commits. MiDas extracts information from a commit at multiple levels of granularity (i.e., commit level, file level, hunk level, and line level).

MiDas consists of seven feature extractors, each defined by a combination of granularity and CodeBERT representation. Note that the indices are not contiguous: there is no feature extractor 4.

Feature extractor index	Granularity	CodeBERT representation
1	Commit	Context-dependent
2	File	Context-dependent
3	Hunk	Context-dependent
5	Commit	Context-free
6	File	Context-free
7	Hunk	Context-free
8	Line	Context-free

To replicate the training process of MiDas, follow the steps below:

    1. Finetune CodeBERT for each feature extractor
    2. Save commit embedding vectors represented by CodeBERT
    3. Train feature extractors
    4. Run inference with the feature extractors to extract each commit's features
    5. Train neural classifier
    6. Apply adjustment function 
    7. Evaluate MiDas

🔧 Prerequisites

Create a directory to store embedding vectors ("finetuned_embeddings"), a "model" folder to store saved models, and a "features" folder to store the extracted features, following this hierarchy:

    MiDas
        model
        features
        ...
    finetuned_embeddings
        variant_1
        variant_2
        variant_3
        variant_5
        variant_6
        variant_7
        variant_8
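
The layout above can also be created with a short script. This is a minimal sketch, not part of the repository: the function name create_layout is hypothetical, and it assumes that "root" is the directory that contains the MiDas checkout, so that "finetuned_embeddings" ends up next to the MiDas folder as shown in the hierarchy.

```python
from pathlib import Path

# Variant indices taken from the feature extractor table (there is no variant 4).
VARIANTS = (1, 2, 3, 5, 6, 7, 8)


def create_layout(root: Path) -> list[Path]:
    """Create the directories listed in the hierarchy under `root` and return them."""
    dirs = [root / "MiDas" / "model", root / "MiDas" / "features"]
    dirs += [root / "finetuned_embeddings" / f"variant_{i}" for i in VARIANTS]
    for d in dirs:
        # exist_ok=True makes the call safe to repeat on an existing layout.
        d.mkdir(parents=True, exist_ok=True)
    return dirs
```

For example, calling create_layout(Path("..")) from inside the MiDas folder would create the nine directories relative to its parent.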

Note: If you run MiDas in a Docker container, pass the following options to docker run: "-e LANG=C.UTF-8 -e LC_ALL=C.UTF-8" to avoid errors when writing to files, "--shm-size 16G" to avoid shared-memory problems, and "--gpus all" to make all GPUs available to the container.

📁 Dataset

Download the dataset from https://zenodo.org/record/5565182#.Yv3lHuxBxO8 and place it inside the MiDas folder.

🚀 Replication

Note: The current code base requires two GPUs to run. We will try to make it more flexible.

Finetune CodeBERT

There is one Python script to finetune CodeBERT for each of the seven feature extractors:

Feature extractor index	Finetuning script
1	python variant_1_finetune.py
2	python variant_2_finetune.py
3	python variant_3_finetune_separate.py
5	python variant_5_finetune.py
6	python variant_6_finetune.py
7	python variant_7_finetune_separate.py
8	python variant_8_finetune_separate.py

Saving embedding vectors

After finetuning, run the following scripts to save embedding vectors corresponding to each feature extractor:

Feature extractor index	Saving embeddings script
1	python preprocess_finetuned_variant_1.py
2	python preprocess_finetuned_variant_2.py
3	python preprocess_finetuned_variant_3.py
5	python preprocess_finetuned_variant_5.py
6	python preprocess_finetuned_variant_6.py
7	python preprocess_finetuned_variant_7.py
8	python preprocess_finetuned_variant_8.py

Train feature extractors

Next, train the seven feature extractors:

Feature extractor index	Extractor training script
1	python variant_1.py
2	python variant_2.py
3	python variant_3.py
5	python variant_5.py
6	python variant_6.py
7	python variant_7.py
8	python variant_8.py
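
The three tables above list 21 scripts in total. As a convenience, they could be driven from one small helper. This is a sketch under stated assumptions rather than a script shipped with MiDas: the function names are hypothetical, the script names are copied from the tables, and it assumes each script runs without extra arguments from inside the MiDas folder.

```python
import subprocess
import sys

VARIANTS = (1, 2, 3, 5, 6, 7, 8)
# Variants whose finetuning script carries the "_separate" suffix in the table.
SEPARATE_FINETUNE = {3, 7, 8}


def stage_scripts(variant: int) -> list[str]:
    """Return the finetuning, embedding-saving, and training scripts for one variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown feature extractor index: {variant}")
    suffix = "_finetune_separate" if variant in SEPARATE_FINETUNE else "_finetune"
    return [
        f"variant_{variant}{suffix}.py",
        f"preprocess_finetuned_variant_{variant}.py",
        f"variant_{variant}.py",
    ]


def pipeline_scripts() -> list[str]:
    """List all scripts stage by stage, in the order the tables present them."""
    per_variant = [stage_scripts(v) for v in VARIANTS]
    return [scripts[stage] for stage in range(3) for scripts in per_variant]


def run_pipeline(dry_run: bool = True) -> list[list[str]]:
    """Build the commands and, unless dry_run is set, execute them in order."""
    commands = [[sys.executable, script] for script in pipeline_scripts()]
    if not dry_run:
        for cmd in commands:
            # check=True aborts on the first failure so later stages do not
            # run on missing or partial output.
            subprocess.run(cmd, check=True)
    return commands
```

With the default dry_run=True, run_pipeline() only returns the commands, which is useful for checking the order before starting the long-running jobs.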

Infer feature extractors and train neural classifier

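Run the following two commands. The first runs the trained feature extractors to extract commit features, and the second trains the neural classifier: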

python3 feature_extractor_infer.py

python3 ensemble_classifier.py --model_path model/patch_ensemble_model.sav --java_result_path probs/prob_ensemble_classifier_test_java.txt --python_result_path probs/prob_ensemble_classifier_test_python.txt

Apply adjustment function

Simply run:

python adjustment_runner.py

Evaluate MiDas

The evaluation script is evaluator.py. Run it with the parameter "--rq <rq_number>" to evaluate MiDas on the corresponding research question:

RQ1: Performance of MiDas on Java and Python projects

python evaluator.py --rq 1

RQ2: Performance of MiDas with/without adjustment function

python evaluator.py --rq 2

RQ3: Performance of MiDas when incrementally adding granularities

To obtain the performance of MiDas using only the line level, run:

python evaluator.py --rq 3 --mode 1

To obtain the performance of MiDas using only line level + hunk level, run two commands:

python ensemble_classifier.py --ablation_study True -v 1 -v 2 -v 5 -v 6 --model_path model/test_ablation_line_hunk_model.sav --java_result_path probs/test_ablation_line_hunk_java.txt --python_result_path probs/test_ablation_line_hunk_python.txt

python evaluator.py --rq 3 --mode 2

To obtain the performance of MiDas using only line level + hunk level + file level, run two commands:

python ensemble_classifier.py --ablation_study True -v 1 -v 5 --model_path model/test_ablation_line_hunk_file_model.sav --java_result_path probs/test_ablation_line_hunk_file_java.txt --python_result_path probs/test_ablation_line_hunk_file_python.txt

python evaluator.py --rq 3 --mode 3
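
The two ensemble_classifier.py commands above follow a pattern that can be derived from the feature extractor table. The sketch below rebuilds them, assuming that each "-v" flag names a feature extractor to exclude from the ensemble. This reading is inferred from the commands themselves (the line + hunk run lists the commit-level and file-level extractors) and is not stated elsewhere in this README; the function names are hypothetical.

```python
# Granularity of each feature extractor, copied from the table in the Introduction.
GRANULARITY = {
    1: "commit", 2: "file", 3: "hunk",
    5: "commit", 6: "file", 7: "hunk", 8: "line",
}


def excluded_variants(kept: set[str]) -> list[int]:
    """Indices of the extractors whose granularity is not in `kept`."""
    return sorted(i for i, g in GRANULARITY.items() if g not in kept)


def ablation_command(kept: set[str], tag: str) -> list[str]:
    """Build the ensemble_classifier.py command for one ablation setting."""
    cmd = ["python", "ensemble_classifier.py", "--ablation_study", "True"]
    for i in excluded_variants(kept):
        # Assumption: -v marks an extractor to leave out of the ensemble.
        cmd += ["-v", str(i)]
    cmd += [
        "--model_path", f"model/test_ablation_{tag}_model.sav",
        "--java_result_path", f"probs/test_ablation_{tag}_java.txt",
        "--python_result_path", f"probs/test_ablation_{tag}_python.txt",
    ]
    return cmd
```

For instance, " ".join(ablation_command({"line", "hunk", "file"}, "line_hunk_file")) yields the command shown for the line + hunk + file setting.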

The performance of MiDas using all granularities is obtained in RQ1 (we hope you run it successfully).
