avdhoot0303/Malware-detection-of-PE-files

Malware-detection-of-PE-files



💻 Working demo of the FastAPI service using its Swagger UI.

This project is a malware detection system built with machine learning and a CNN. The trained models are deployed with FastAPI. The main steps taken to get the results are:

  1. Dataset collection
  2. Feature selection
  3. Data preprocessing
  4. Model building
  5. Deploy to FastAPI

🧠 The project uses two different models:
1. RandomForestClassifier: the first model is trained on characteristics of the different sections of portable executable (PE) files, which lets us classify whether a given input file is malicious or not.
2. CNN model: this model is trained on 9639 malware images from 25 different malware families. We use it to classify the malware detected by the first model into one of those 25 families.
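
The hand-off between the two models can be sketched as follows. This is a minimal illustration, not the project's actual code: `analyse_file`, `is_malicious` and `predict_family` are hypothetical names, and the two callables stand in for the trained RandomForestClassifier and CNN.

```python
from typing import Callable

# Hypothetical stand-ins: in the project these would wrap the trained
# RandomForestClassifier and the CNN; here they are plain callables.
Detector = Callable[[bytes], bool]         # True means "malicious"
FamilyClassifier = Callable[[bytes], str]  # returns a family label


def analyse_file(data: bytes, is_malicious: Detector,
                 predict_family: FamilyClassifier) -> dict:
    """Run stage 1, and run stage 2 only when stage 1 flags the file."""
    if not is_malicious(data):
        # Benign files never reach the family classifier.
        return {"malicious": False, "family": None}
    return {"malicious": True, "family": predict_family(data)}
```

Keeping the second stage behind the first means the CNN only ever sees files that were already flagged, which matches the order described above.
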
Starting with the first model: since we are working with PE files, we need to understand their structure and which characteristics matter most. The structure is explained [here](https://tech-zealots.com/malware-analysis/pe-portable-executable-structure-malware-analysis-part-2/).
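
To make the PE structure a little more concrete, here is a small sketch that reads the section table using only the standard library. It is an illustration of the file layout, not the feature extraction used in this project (which relies on pefile); the function name `read_sections` and the choice of returned fields are assumptions made for this example.

```python
import struct


def read_sections(data: bytes) -> list:
    """Return name, sizes and characteristics flags for each PE section."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature: not a PE file")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF file header: 20 bytes right after the 4-byte signature.
    (_machine, n_sections, _timestamp, _symtab, _n_symbols,
     opt_header_size, _file_flags) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    # The section table follows the optional header; 40 bytes per entry.
    table = pe_offset + 4 + 20 + opt_header_size
    sections = []
    for i in range(n_sections):
        (name, virtual_size, _vaddr, raw_size, _raw_ptr, _reloc_ptr,
         _line_ptr, _n_reloc, _n_line, flags) = struct.unpack_from(
            "<8sIIIIIIHHI", data, table + 40 * i)
        sections.append({
            "name": name.rstrip(b"\x00").decode("ascii", errors="replace"),
            "virtual_size": virtual_size,
            "raw_size": raw_size,
            "characteristics": flags,
        })
    return sections
```

Calling `read_sections(open(path, "rb").read())` on a PE file yields one dictionary per section. Values like these are the kind of per-section characteristics a classifier can be trained on.
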

📚 The datasets used:

  1. The dataset used to train the first model is available at https://www.kaggle.com/amauricio/pe-files-malwares
  2. The second dataset, the malware image dataset, is a pre-generated dataset derived from Microsoft's 2015 Kaggle competition dataset. You can download it from https://www.dropbox.com/s/ep8qjakfwh1rzk4/malimg_dataset.zip?dl=0

    These two datasets are used to build two separate models that work together as a pipeline. The datasets are somewhat old: they were published in 2017 and were relevant when this project was made, so you may want to change the data source or build a dataset yourself using the utility functions in this project's scripts. Since there wasn't enough time to test the entire project on thousands of PE files, we didn't add retraining, but you can find that code in the scripts and run it once enough new data is available.
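
The image dataset comes pre-generated, so this README does not show how a binary becomes an image. One common approach, assumed here rather than taken from this project's scripts, is to treat each byte as an 8-bit grayscale pixel and lay the bytes out in rows of a fixed width. The function name `bytes_to_rows` and the default width of 256 are hypothetical choices for this sketch.

```python
def bytes_to_rows(data: bytes, width: int = 256) -> list:
    """Lay raw bytes out as rows of 0-255 grayscale pixel values.

    The last row is zero-padded so that every row has the same width.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width      # bytes needed to fill the last row
    padded = data + bytes(padding)      # bytes(n) is n zero bytes
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]
```

The resulting list of rows can be handed to an imaging or array library to save or resize the picture before it is used for training.
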

⚙️ Requirements

First, you need Python 3.6+ to install all the dependencies. Here are the requirements and dependencies you need to install in order to run this on your end:

  1. Everything is written in Python, so the basic requirement is to have Python installed; alternatively, you can use Colab notebooks.
  2. Install the pefile module:
    pip install pefile
    The pefile module takes a portable executable file as input and produces a dump containing almost all of the file's metadata.
    For more on the pefile module and usage examples, see the original repository: https://github.com/erocarrera/pefile
  3. To build the CNN model you need TensorFlow as the backend with the Keras wrapper. We used colabcode for this; since these libraries are pre-installed there, we only have to import them.
  4. To deploy with FastAPI you need the fastapi library:
    pip install fastapi
  5. You will also need an ASGI server for production, such as Uvicorn or Hypercorn:
    pip install uvicorn

For more information on FastAPI, see https://fastapi.tiangolo.com/

To run this code in Colab notebooks, use the FastAPI for server notebook in the notebooks folder. It contains all the modules that need to be installed and imported.