bliutech/nlp-pdf-malware-detection

NLP-based Malware Detection of PDFs

Malware hidden in Portable Document Format (PDF) files is a serious threat to the average Internet user: a PDF can execute purposefully embedded JavaScript, which gives attackers a way to obscure malicious scripts and data. While several existing machine learning models are designed for PDF malware detection, the use of transformers to statically analyze PDFs for malware has not yet been explored. Thanks to their attention mechanisms and ability to process data in parallel, transformers hold great potential for analyzing large quantities of data in detail without being excessively computationally demanding. By preprocessing PDFs as byte strings, generating meaningful word embeddings using one-hot encoding and variable n-grams, and feeding these results to a fine-tuned transformer model, we have produced a model that classifies a testing set of PDFs as malicious or benign with 96.67% accuracy. These results suggest that this is a feasible method of performing robust static analysis on PDF files. However, it is important to continue refining the current model and to explore additional methods of improving its accuracy and precision on a varied dataset.

This repository contains the scripts, models, and data related to this research project. The data comes from the CIC-Evasive-PDFMal2022 dataset, which can be requested here.

Repository Structure

  • csv_generator.py: generates the relevant CSV file for training/validation data
  • demo.py: demonstrates model inferencing on a sample PDF
  • preprocessing.py: converts PDF to variable n-gram byte string
  • split.py: creates training/validation data split
  • train.py: runs model training on preprocessed data
  • val.py: runs model validation on inferences
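To make the preprocessing idea more concrete, here is a minimal sketch of turning raw PDF bytes into a string of byte n-gram tokens. It is not taken from preprocessing.py: the function names, the hex token format, and the reading of "variable n-gram" as "n-grams of several lengths" are all assumptions made for illustration.

```python
from typing import Iterable, List


def byte_ngrams(data: bytes, sizes: Iterable[int] = (1, 2)) -> List[str]:
    """Return hex-encoded byte n-grams for each requested size.

    Hypothetical helper: the token format and default sizes are
    assumptions, not the repository's actual preprocessing.
    """
    tokens: List[str] = []
    for n in sizes:
        # Slide a window of n bytes across the input.
        for i in range(len(data) - n + 1):
            tokens.append(data[i:i + n].hex())
    return tokens


def pdf_to_ngram_string(data: bytes, sizes: Iterable[int] = (1, 2)) -> str:
    """Join the n-gram tokens into one whitespace-separated string."""
    return " ".join(byte_ngrams(data, sizes))


if __name__ == "__main__":
    # The first four bytes of a PDF header.
    print(pdf_to_ngram_string(b"%PDF"))  # 25 50 44 46 2550 5044 4446
```

A whitespace-separated token string like this can be fed to a text tokenizer in the same way as ordinary words, which is what makes NLP-style models applicable to binary files.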

Installation

In order to run the scripts within this repository, first set up a virtual environment using the following command.

python3 -m virtualenv venv

In order to activate the virtual environment, run the following command.

source venv/bin/activate

Once the virtual environment is activated, you can install all of the necessary dependencies using the following command.

pip install -r requirements.txt

Additionally, two directories by the names of data and results should be placed at the root of the repository (these are included in the gitignore).
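If you prefer to create these directories from Python rather than by hand, the following small sketch does so. The helper name `ensure_dirs` is hypothetical and not part of the repository; only the directory names data and results come from the instructions above.

```python
from pathlib import Path
from typing import Iterable, List


def ensure_dirs(root: str = ".", names: Iterable[str] = ("data", "results")) -> List[Path]:
    """Create the expected directories under root if they do not exist.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    created: List[Path] = []
    for name in names:
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Run from the root of the repository.
    for path in ensure_dirs():
        print(f"ready: {path}")
```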

Demo

In order to run the demo, ensure that data/dummy.pdf (replace it with whichever PDF you want to run inference on) and results/model_weights.pth are placed correctly. To access a sample dummy.pdf, visit here; to access the model, visit here. You can then run the following command to perform the demo.

python3 demo.py

Preprocessing

Preprocessing the CIC-Evasive-PDFMal2022 dataset takes a few stages. First, run split.py on the relevant zip files to generate the required training-validation split (a 90-10 ratio is recommended). Use the following command.

python3 split.py -t 90
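As an illustration of what a 90-10 split involves, here is a minimal sketch that shuffles a list of file names and cuts it at a given percentage. It is not the code in split.py: the function name, the seeded shuffle, and the integer rounding are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_files(
    paths: Sequence[str], train_percent: int = 90, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle paths reproducibly and split them into (training, validation).

    Hypothetical helper; split.py may shuffle and round differently.
    """
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    shuffled = list(paths)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    files = [f"sample_{i}.pdf" for i in range(20)]
    train, val = split_files(files, train_percent=90)
    print(len(train), len(val))  # 18 2
```

Fixing the seed means the same files land in the validation set on every run, which keeps later accuracy numbers comparable.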

Next, run csv_generator.py on both the training and validation datasets produced by the split in order to generate CSV files for them. Use the following command.

python3 csv_generator.py

This will output a training.csv and a testing.csv, which you should place in the data directory.
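For readers who want a picture of what such a CSV might look like, here is a minimal sketch that writes preprocessed samples with their labels. The column names `text` and `label`, and the convention that 1 means malicious and 0 means benign, are assumptions; the files produced by csv_generator.py may be laid out differently.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_labeled_csv(rows: Iterable[Tuple[str, int]], out: TextIO) -> int:
    """Write (text, label) pairs as CSV and return the number of samples.

    Hypothetical layout: a header row followed by one row per PDF.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["text", "label"])
    count = 0
    for text, label in rows:
        writer.writerow([text, label])
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    # Two toy samples: label 1 = malicious, label 0 = benign (assumed).
    write_labeled_csv([("25 50 2550", 1), ("44 46 4446", 0)], buffer)
    print(buffer.getvalue(), end="")
```

To write to disk instead, pass a file opened with `open("data/training.csv", "w", newline="")` in place of the in-memory buffer.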

Training

In order to run the training script, make sure that data/training.csv is created (you can access a copy of the training data here). You can then begin training using the following command.

python3 train.py

Validation

In order to run the validation script, make sure that data/testing.csv has been created and that results/model_weights.pth is placed correctly (you can access a copy of the validation data here and a copy of the model here). You can then begin validation using the following command.

python3 val.py
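The accuracy and precision mentioned above can be computed from true labels and model predictions as in the sketch below. This is a generic illustration, not the code in val.py; the function name and the convention that label 1 marks a malicious PDF are assumptions.

```python
from typing import Sequence, Tuple


def accuracy_and_precision(
    labels: Sequence[int], preds: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, precision) for binary predictions.

    Precision is reported as 0.0 when nothing was predicted positive.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(1 for l, p in zip(labels, preds) if l == p)
    true_pos = sum(1 for l, p in zip(labels, preds) if l == positive and p == positive)
    predicted_pos = sum(1 for p in preds if p == positive)
    accuracy = correct / len(labels)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


if __name__ == "__main__":
    # Toy values only; these are not results from the project.
    acc, prec = accuracy_and_precision([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    print(f"accuracy={acc:.2f} precision={prec:.2f}")
```

For a malware detector, precision answers "of the PDFs flagged as malicious, how many really were?", which is why it is worth tracking alongside plain accuracy.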

Authors

"NLP-based Malware Detection of PDFs" was developed by Benson Liu, Caolinn Hukill, Juliet Zhang, & Salma Alandary for ECE 188: Computer Security, taught at UCLA in Fall 2022. For any questions or additional information about this project, please contact the authors.

About

ECE 188: Computer Security. Repository for "NLP-based Malware Detection on PDFs". Utilizing NLP techniques & transformer models to perform malware detection in PDFs.
