[Data Science for Cybersecurity 108-2 @ NCCU] Malware Detection Using Static Analysis and Model Ensemble

yujunkuo/DS4CS-Final

Data Science for Cybersecurity - Final Project

Topic

Malware Detection with Static Analysis & Model Ensemble

Introduction

Malware detection can be implemented in many different ways, and static analysis is one of them.

Static analysis classifies a program as malicious or benign without executing it, using data extracted from its PE (Portable Executable) file, such as:

  • PE Section Headers
  • PE Imports
  • PE as Image

We can train different models on different parts of the PE data to make predictions.

Therefore, in this project, I want to know whether a model ensemble can achieve better accuracy on malware detection with static analysis.

In other words, will a model ensemble perform better than an individual model?

Literature Review

1. Ensembling ConvNets using Keras

Reference Link

2. Keras: Multiple Inputs and Mixed Data

Reference Link

Dataset

[1] Angelo Oliveira, "Malware Analysis Datasets: PE Section Headers", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/2czh-es14. Accessed: Jun. 13, 2020.

[2] Angelo Oliveira, "Malware Analysis Datasets: Top-1000 PE Imports", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/004e-v304. Accessed: Jun. 13, 2020.

[3] Angelo Oliveira, "Malware Analysis Datasets: Raw PE as Image", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/8brp-j220. Accessed: Jun. 13, 2020.

Data Preprocessing

  1. Merge the DataFrames on the hash value and drop duplicate observations

  2. Add a new column computed from the original columns

  3. Perform EDA (Exploratory Data Analysis)

  4. Because the data is imbalanced, it needs to be resampled. I use ADASYN for oversampling.
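
The merge and resampling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `hash` and `malware`, the toy values, and the derived `size_ratio` column are all assumptions, and the helper names `merge_on_hash` and `oversample` are hypothetical. The oversampling helper assumes the `imbalanced-learn` package is available.

```python
import pandas as pd


def merge_on_hash(frames, key="hash", label="malware"):
    """Inner-join per-sample tables on the file hash, keeping one label column."""
    merged = frames[0].drop_duplicates(subset=key)
    for frame in frames[1:]:
        # Drop the repeated label so the join does not create malware_x / malware_y.
        right = frame.drop_duplicates(subset=key).drop(columns=[label], errors="ignore")
        merged = merged.merge(right, on=key, how="inner")
    return merged.reset_index(drop=True)


def oversample(features, labels, seed=0):
    """Oversample the minority class with ADASYN (needs imbalanced-learn)."""
    # Imported lazily so merge_on_hash works without imbalanced-learn installed.
    from imblearn.over_sampling import ADASYN

    return ADASYN(random_state=seed).fit_resample(features, labels)


# Toy stand-ins for two of the datasets; the values are made up.
sections = pd.DataFrame({
    "hash": ["a", "b", "b", "c"],
    "virtual_size": [4096, 8192, 8192, 512],
    "size_of_data": [2048, 8192, 8192, 1024],
    "malware": [1, 0, 0, 1],
})
imports = pd.DataFrame({
    "hash": ["a", "b", "d"],
    "GetProcAddress": [1, 0, 1],
    "malware": [1, 0, 1],
})

merged = merge_on_hash([sections, imports])
# Hypothetical derived column (step 2): ratio of in-memory size to on-disk size.
merged["size_ratio"] = merged["virtual_size"] / merged["size_of_data"]
print(merged)
```

On the real data, `oversample` would typically be applied to the training split only, so that synthetic samples do not leak into the validation set.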

Individual Models Training Result

  • First Model: PE Section Headers

    • Standardization
    • Build DNN model
      • Dense(32) + Dense(32) + Dense(64) + Dropout(0.2) + Dense(1)
      • Use the Adam optimizer with learning rate = 0.0003 and early stopping
    • Result: Training Accuracy: 89.71%, Validation Accuracy: 58.46%
  • Second Model: Top-1000 PE Imports (with PCA)

    • Build DNN model
      • Dense(64) + Dense(64) + Dropout(0.4) + Dense(32) + Dense(32) + Dropout(0.2) + Dense(1)
      • Use the Adam optimizer with learning rate = 0.0001 and early stopping
    • Result: Training Accuracy: 97.97%, Validation Accuracy: 94.26%
  • Third Model: Raw PE as Image

    • Min-Max Normalization (From [0, 255] to [0, 1])
    • Reshape to (32, 32, 1)
    • Build CNN model
      • Input + Conv2D(32, 4×4) + Conv2D(64, 4×4) + MaxPooling2D(2×2) + Conv2D(128, 4×4) + Conv2D(128, 4×4) + MaxPooling2D(2×2) + Flatten + Dense(256) + Dropout(0.4) + Dense(1)
      • Use the Adam optimizer with learning rate = 0.000003 and early stopping
    • Result: Training Accuracy: 95.11%, Validation Accuracy: 85.1%
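
As a rough sketch of how two of the architectures listed above might look in Keras: the layer sizes, dropout rates, and learning rates come from the list, but everything else is assumed. In particular, the ReLU and sigmoid activations, the binary cross-entropy loss, the early-stopping `patience`, the feature count of 5, and the reading of the kernel and pool sizes as 4×4 and 2×2 are illustrative guesses. The random arrays only exercise the code and carry no meaning.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_section_dnn(n_features):
    """Dense(32) + Dense(32) + Dense(64) + Dropout(0.2) + Dense(1)."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(32, activation="relu"),  # activations are assumed
        layers.Dense(32, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0003),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model


def build_image_cnn():
    """CNN over 32x32 grayscale renderings of the raw PE bytes."""
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 1)),
        layers.Conv2D(32, (4, 4), activation="relu"),
        layers.Conv2D(64, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (4, 4), activation="relu"),
        layers.Conv2D(128, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.000003),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model


def images_to_tensor(pixels):
    """Min-max normalize bytes from [0, 255] to [0, 1] and reshape to (N, 32, 32, 1)."""
    return (pixels.astype("float32") / 255.0).reshape(-1, 32, 32, 1)


# Early stopping on validation loss; the patience value is an assumption.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

rng = np.random.default_rng(0)
images = images_to_tensor(rng.integers(0, 256, size=(8, 1024)))
labels = rng.integers(0, 2, size=(8,))

dnn = build_section_dnn(n_features=5)
cnn = build_image_cnn()
# Two epochs on random data, only to show where the callback plugs in.
cnn.fit(images, labels, validation_split=0.25, epochs=2,
        callbacks=[early_stop], verbose=0)
```

The Top-1000 PE Imports model would follow the same pattern as `build_section_dnn`, with its own layer list and learning rate.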

Model Ensemble

First, combine the three models, then add a Dense (fully connected) layer with 16 neurons, followed by a Dense layer with 1 neuron as the final output.
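
One way to wire such an ensemble with the Keras functional API is sketched below. It assumes the three branch outputs are concatenated before the Dense(16) and Dense(1) head; the source does not say which layer of each model is merged. The `small_branch` stand-ins, the input sizes (5 section features, 100 import components), and the activations are placeholders rather than the project's trained models.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def small_branch(input_shape, name):
    """Stand-in for one trained individual model; emits a single sigmoid score."""
    inputs = keras.Input(shape=input_shape, name=f"{name}_input")
    x = layers.Flatten()(inputs)
    x = layers.Dense(8, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name=name)


def build_ensemble(branches):
    """Concatenate branch outputs, then Dense(16) + Dense(1) as the final head."""
    merged = layers.concatenate([branch.output for branch in branches])
    x = layers.Dense(16, activation="relu")(merged)  # activation is assumed
    output = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model([branch.input for branch in branches], output,
                        name="ensemble")
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0003),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model


branches = [
    small_branch((5,), "sections"),       # PE section header features
    small_branch((100,), "imports"),      # PCA-reduced import features
    small_branch((32, 32, 1), "image"),   # raw PE as image
]
ensemble = build_ensemble(branches)

# Random inputs, one array per branch, just to check the wiring.
rng = np.random.default_rng(0)
batch = [rng.random((4, 5)), rng.random((4, 100)), rng.random((4, 32, 32, 1))]
scores = ensemble.predict(batch, verbose=0)
```

Each sample must be present in all three datasets for this to work, which is why the tables are merged on the hash value first.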

Results of the ensemble model:

  • Use the Adam optimizer with learning rate = 0.0003 and early stopping
  • Result: Training Accuracy: 98.58%, Validation Accuracy: 95.99%

| Model | Training Accuracy | Validation Accuracy |
| --- | --- | --- |
| PE Section Headers with DNN | 89.71% | 58.46% |
| Top-1000 PE Imports with DNN | 97.97% | 94.26% |
| Raw PE as Image with CNN | 95.11% | 85.1% |
| Ensemble Model | 98.58% | 95.99% |

Conclusion

The training results show that the model ensemble does improve malware detection accuracy: it reaches 95.99% validation accuracy, compared with 94.26% for the best individual model (Top-1000 PE Imports).

Compared with the individual models, the ensemble has the highest accuracy on both the training data and the validation data.
