
Using machine learning models to predict the probability of a Windows system getting infected by various families of malware, based on different properties of that system.


RachanaJayaram/MalwarePrediction


MalwarePrediction

This project looks at different machine learning techniques that can be used to predict the probability of a system getting hit by various families of malware, based on different properties of that system. Given a dataset of these properties and the recorded machine infections, the proposed solution is to use a gradient boosting framework, namely LightGBM, to build a model that predicts whether a system will soon be hit with malware.


The main notebook, MalwareDetection_ExploratoryTerritory.ipynb, contains everything from data cleaning and feature engineering to the LGBM implementation and the Kaggle submission, along with detailed instructions.
It is recommended to run the notebook in Google Colab with a TPU as the hardware accelerator.
Link to notebook: https://colab.research.google.com/drive/1KkgpJfH5LvAtgoi2_H0Pr7Kjr5PaKab5
The project write-up is in project_writeup.pdf.

Data sets

Data source for train and test data: Kaggle link for Microsoft Malware Prediction

  • Each row in this dataset corresponds to a system, uniquely identified by a MachineIdentifier.
  • HasDetections is the target and indicates whether malware was detected on the system.
  • HasDetections is missing in the test dataset and must be predicted by a model trained on the train dataset.

Data source for Antivirus Signature vs Timestamp : Kaggle link
This dataset consists of mappings from antivirus signature versions to timestamps. The antivirus signature version ('AvSigVersion') is updated approximately every 2 hours, and about 95% of users' antivirus installations pick up these updates regularly, which makes the signature version a trustworthy timestamp for each observation. In other words, the signature version a system had when it was sampled can be mapped to the approximate time at which it was sampled. The timestamps in this dataset are provided by Microsoft, which derived them in the manner explained above, by approximating the sampling time from the 'AvSigVersion' of each observation.
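The mapping described above can be sketched as a simple lookup followed by a few derived time features. This is a minimal illustration, not the notebook's implementation: the dictionary `AVSIG_TIMESTAMPS`, its three entries, the function `time_features`, and the feature names are all hypothetical, and the version strings and dates are made up rather than taken from the dataset.

```python
from datetime import datetime
from typing import Optional

# Hypothetical excerpt of a signature-version -> timestamp mapping.
# These entries are invented for illustration only.
AVSIG_TIMESTAMPS = {
    "1.273.1735.0": datetime(2018, 8, 10, 6, 0),
    "1.275.1140.0": datetime(2018, 9, 5, 14, 0),
    "1.277.515.0": datetime(2018, 10, 2, 22, 0),
}


def time_features(av_sig_version: str, reference: datetime) -> Optional[dict]:
    """Map an AvSigVersion to its timestamp and derive simple time features.

    Returns None when the version is not present in the mapping, so the
    caller can decide how to treat unknown or malformed versions.
    """
    timestamp = AVSIG_TIMESTAMPS.get(av_sig_version)
    if timestamp is None:
        return None
    return {
        "month": timestamp.month,
        "day_of_week": timestamp.weekday(),  # Monday is 0
        # Whole days between the signature timestamp and a reference date.
        "days_before_reference": (reference - timestamp).days,
    }


features = time_features("1.275.1140.0", reference=datetime(2018, 9, 30))
unknown = time_features("0.0.0.0", reference=datetime(2018, 9, 30))
```

Numeric features of this kind can be fed to a tree-based model alongside the original system properties, which is one way to give a non-sequential model some awareness of time.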

Pickled Objects : Kaggle link
train_df.pkl : Pickle of training data after preprocessing.
test_df.pkl : Pickle of testing data after preprocessing.
LGBMModel.pkl : Pickle of LGBM model after training.
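A minimal sketch of loading these objects, assuming they were written with Python's standard pickle protocol and downloaded to the working directory. The helper name `load_pickle` is hypothetical. Unpickling also needs the libraries that defined the objects to be installed (presumably pandas and LightGBM here), and pickle files should only be loaded from a source you trust, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a single pickled object from disk (trusted files only)."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Example usage, assuming the files sit in the current directory:
# train_df = load_pickle("train_df.pkl")
# test_df = load_pickle("test_df.pkl")
# model = load_pickle("LGBMModel.pkl")
```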


Models Tried

  1. LSTM : Available in LSTM.py. The AUC was approximately 0.55.
  2. LSTM-CNN : Available in LSTM_CNN.py. Results were similar to those of the LSTM.
  3. LightGBM : Available in MalwareDetection_ExploratoryTerritory.ipynb. An AUC of 0.67 was obtained.
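To put the AUC figures above in context: AUC is the probability that a randomly chosen infected system receives a higher score than a randomly chosen clean one, so 0.5 corresponds to random guessing. The sketch below computes it directly from that definition. It is a small pure-Python illustration (the function name `roc_auc` is hypothetical) that is quadratic in the number of samples, so it is meant for understanding the metric, not for scoring the full dataset.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC as the fraction of (positive, negative) pairs
    in which the positive example gets the higher score.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# A model that ranks most infected systems above clean ones.
informative = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# A model that gives every system the same score carries no ranking information.
uninformative = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```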

Observation :

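Neural networks are not appropriate for this problem: both the LSTM and the LSTM-CNN scored close to chance level, well below LightGBM.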


Time Series
This kind of malware risk detection is in essence a time series problem, with the sampling date of each data point greatly influencing some of the system's properties. The given dataset is also split into train and test in such a way that a majority of entries in the train data are from August and September 2018, while the test data is mostly from October and November 2018 (as seen in LGBM_EDA.ipynb).
However, this dataset poses the following problems for a traditional time series approach:
• New systems are added to the dataset with time.
• There are systems that occasionally go offline for variable durations of time. No data from these systems is recorded in this period.
• Systems receive OS patches, bug fixes and OS upgrades over time, thereby changing their properties.
This is intuitive, as newer versions of operating systems and antivirus software are released to combat ever-improving malware.
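Because the test period lies after the training period, one common way to get a validation score that reflects this shift is to hold out the most recent training observations instead of a random sample. The sketch below illustrates that idea under stated assumptions: it is not taken from the project's notebooks, and the function name `time_ordered_split` and the `sampled_at` field are hypothetical.

```python
from datetime import datetime


def time_ordered_split(rows, timestamp_key, holdout_fraction=0.2):
    """Split rows so that the newest observations form the holdout set.

    rows: list of dicts, each carrying a datetime under timestamp_key.
    Returns (older_rows, newest_rows).
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[timestamp_key])
    n_holdout = max(1, int(len(ordered) * holdout_fraction))
    return ordered[:-n_holdout], ordered[-n_holdout:]


# Toy observations spread over August to November 2018.
rows = [{"id": i, "sampled_at": datetime(2018, 8 + i % 4, 1)} for i in range(10)]
train_rows, holdout_rows = time_ordered_split(rows, "sampled_at")
```

With this split, every holdout observation is at least as recent as every training observation, which mimics the relationship between the train and test data described above.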

Training data: [figure: timeseries1]

Testing data: [figure: timeseries2]


Light Gradient Boosting Machine

Given the shortcomings of a plain time series perspective on the problem, it is best to have a final model that is not strictly a time series approach to malware prediction, but that can accommodate features indicative of time. Based on this, we pursued two approaches: LSTM and gradient-boosted decision trees. To capture the time series aspect of the problem, we engineer new features from the Antivirus Signature vs Timestamp dataset.

The implementation of the LGBM model, along with detailed comments, is in MalwareDetection_ExploratoryTerritory.ipynb.

The LGBM model gives an AUC of 0.67, a clear improvement over the LSTM, whose AUC stayed close to 0.5 (chance level).
