
Code for the work presented in the paper "Interpreting Machine Learning Malware Detectors Which Leverage N-gram Analysis", published in the proceedings of the FPS 2019 conference in Toulouse.


Pipeline:

  1. preproc2.py: This program creates a csv file for each batch of files in the training set. The batch size is specified by a global variable "batchSize". Each of the resulting csv files has two columns: the first is a sorted list of the ngrams which appear in the batch, and the second is the number of files each ngram appears in (a sketch of this step appears after this list).

  2. Merge2.py: This program merges two of the csv files created in step 1, acting essentially as the merge subroutine of a merge sort. It was run in parallel using the bash script MergeScript to obtain a csv with all the ngrams in the entire dataset and the number of files each appears in (see the merge sketch after this list).

  3. selbyfreq.py: Uses the total ngram frequency csv file obtained after step 2 to create a new csv file which contains only the ngrams which appear in at least 100 files.

  4. featEx.py: Uses the frequent-ngram file obtained in step 3 to create a feature vector for each sample (see the extraction sketch after this list). This was done in parallel using a script similar to featExScript. (See the comments in featEx.py for how to modify featExScript to extract different features.)

  5. mkbatch.py: Due to memory constraints, further analysis of the features can only be done in batches. This script creates 20 batch files containing the names of the sample files in each batch, such that each batch file has roughly the same class proportions as the entire dataset (see the batching sketch after this list).

  6. vectoArr.py: Using the batch files created in step 5 and the feature vectors created in step 4, this script creates a feature array file and a label vector file for the batch specified by arg1. It was run in parallel using vectoArrSript; however, only two batches could be processed at a time due to memory constraints.

  7. MI.py and Chi.py: Using the label and feature array files created in step 6, these scripts calculate batch-wise MI and Chi^2 scores and save them to csv files (see the scoring sketch after this list). Each batch was processed sequentially due to memory constraints.

  8. avgMI.py and avgChi.py: Using the batch-wise score files created in step 7, these scripts find the average scores across all batches to approximate the actual scores across the dataset and save the results to another csv file.

  9. scorestofeats.py: This script uses the scores obtained in step 8 to create a file which lists the features whose score is at least the minimum specified by arg1. (There are two sets of features: those with high MI scores, used for training the Neural Net and Random Forest, and those with high Chi^2 scores, used for training the Logistic Regressor.)

  10. featEx.py is used again, this time with the files created in step 9, to create the final feature vector for each sample. This was done in parallel using the script featExScript, which was run twice (modified slightly between runs) to create two sets of vectors: one for the NN/RF and one for the Log-Reg.

  11. vectoArr2.py: This script uses the vector files created in step 10 to create a file containing the final feature array used for training. It is run twice with minor modifications to create two feature arrays: one for the NN (using the file containing the ngrams with high MI scores) and one for the Logistic Regressor (using the file containing the ngrams with high Chi^2 scores).

  12. splitdataset.py: Splits the NN data set into a training portion and a test portion with equivalent class distributions (a split-and-train sketch combining steps 12 and 13 appears after this list).

  13. NN.py, LogReg.py, and randFor.py: Train and test a Neural Network, a Logistic Regression model, and a Random Forest model respectively, the NN and RF on the NN data set and the Logistic Regressor on the Logistic Regression data set.

  14. avgWeights.py: Reads the coefficients of the Logistic Regression model from a file created by LogReg.py and averages them.

  15. NNAnalysis.py: Performs LRP and averages the absolute LRP values of the input nodes across all classes, as well as averaging the LRP values within each class separately (a sketch of one common LRP propagation rule appears after this list).

  16. DTAnalysis: Prints the contributions of each feature in the tree with the highest predicted probability of the true class for some sample of interest.

  17. SingleInDepthAnalysis.py: Performs LRP on a single sample and prints the relevances of the internal nodes, as well as the relevances of the input nodes to the most relevant internal node.
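
The sketches below illustrate a few of the steps above; unless a name appears in the repository listing, the file names, formats, and parameters they use are illustrative assumptions rather than the exact choices of the original scripts. First, a minimal sketch of the batch-wise document-frequency count of step 1 (preproc2.py), assuming byte n-grams written out as hex strings:

```python
import csv
from collections import Counter

N = 4            # assumed n-gram length
batchSize = 500  # assumed value; preproc2.py uses a global of the same name

def ngrams_in_file(path, n=N):
    """Return the set of distinct byte n-grams appearing in one file."""
    with open(path, "rb") as f:
        data = f.read()
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def count_batch(paths, out_csv):
    """Write a csv of (n-gram, number of files in the batch containing it)."""
    doc_freq = Counter()
    for path in paths:
        doc_freq.update(ngrams_in_file(path))   # a set, so each file counts once
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for gram in sorted(doc_freq):           # sorted keys let step 2 merge batches
            writer.writerow([gram.hex(), doc_freq[gram]])
```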
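
A minimal sketch of the merge subroutine of step 2 (Merge2.py), assuming both inputs are sorted by n-gram as produced above; counts of n-grams present in both files are summed:

```python
import csv

def merge_counts(csv_a, csv_b, csv_out):
    """Merge two sorted n-gram count csvs, summing counts of shared n-grams."""
    with open(csv_a, newline="") as fa, open(csv_b, newline="") as fb, \
         open(csv_out, "w", newline="") as fo:
        ra, rb, writer = csv.reader(fa), csv.reader(fb), csv.writer(fo)
        a, b = next(ra, None), next(rb, None)
        while a is not None and b is not None:
            if a[0] == b[0]:                      # n-gram in both files: sum counts
                writer.writerow([a[0], int(a[1]) + int(b[1])])
                a, b = next(ra, None), next(rb, None)
            elif a[0] < b[0]:
                writer.writerow(a)
                a = next(ra, None)
            else:
                writer.writerow(b)
                b = next(rb, None)
        for row, reader in ((a, ra), (b, rb)):    # drain whichever file remains
            while row is not None:
                writer.writerow(row)
                row = next(reader, None)
```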
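
A minimal sketch of the per-sample feature extraction of steps 4 and 10 (featEx.py), assuming count-valued features and a hex-encoded n-gram list; whether the real features are counts or binary indicators is an assumption:

```python
import csv

def load_selected(path):
    """Read the selected n-grams (hex-encoded, one per row) from a csv."""
    with open(path, newline="") as f:
        return [bytes.fromhex(row[0]) for row in csv.reader(f)]

def extract(sample_path, selected):
    """Return the sample's feature vector: one count per selected n-gram."""
    with open(sample_path, "rb") as f:
        data = f.read()
    return [data.count(gram) for gram in selected]
```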
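
A minimal sketch of the stratified batching of step 5 (mkbatch.py), assuming trainLabels.csv holds one (sample name, class label) pair per row; the output names batch0.txt through batch19.txt are stand-ins:

```python
import csv
from collections import defaultdict

NUM_BATCHES = 20

# Group sample names by class label.
by_class = defaultdict(list)
with open("trainLabels.csv", newline="") as f:
    for name, label in csv.reader(f):
        by_class[label].append(name)

# Deal each class out round-robin so every batch keeps the dataset's proportions.
batches = [[] for _ in range(NUM_BATCHES)]
for names in by_class.values():
    for i, name in enumerate(names):
        batches[i % NUM_BATCHES].append(name)

for i, batch in enumerate(batches):
    with open(f"batch{i}.txt", "w") as f:
        f.write("\n".join(batch))
```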
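
A minimal sketch of the batch-wise scoring of step 7, using scikit-learn's mutual_info_classif and chi2; the .npy file names and formats are assumptions about how step 6 stores its arrays:

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

X = np.load("featArr_batch0.npy")   # assumed feature array for one batch
y = np.load("labels_batch0.npy")    # assumed label vector for the same batch

mi_scores = mutual_info_classif(X, y, discrete_features=True)
chi_scores, _ = chi2(X, y)          # chi2 also returns p-values, dropped here

np.savetxt("MI_batch0.csv", mi_scores, delimiter=",")
np.savetxt("Chi_batch0.csv", chi_scores, delimiter=",")
```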
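
A minimal sketch combining steps 12 and 13 for the Logistic Regressor: a class-stratified train/test split followed by one-vs-rest logistic regression (the one-vs-rest scheme is stated in the file listing below; the test fraction and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.loadtxt("ChiArr.csv", delimiter=",")   # samples used by the Log-Reg model
y = np.loadtxt("ChiLabels.csv", delimiter=",")

# stratify=y keeps the class proportions equal in both portions (step 12).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))                  # accuracy on the held-out portion
```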
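
Steps 15 and 17 rely on Layer-wise Relevance Propagation, which redistributes a prediction's relevance backwards through the network layer by layer. Below is a minimal sketch of the epsilon rule, one common LRP propagation rule, for a single dense layer; the choice of rule and the toy weights are assumptions, since the actual scripts read the trained network from NNmodelBest.json and NNweightsBest.hdf5:

```python
import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    """Redistribute the output relevance R_out of one dense layer to its inputs a."""
    z = a @ W + b                    # pre-activations: z_j = sum_i a_i W_ij + b_j
    z = z + eps * np.sign(z)         # epsilon term stabilises small denominators
    s = R_out / z                    # relevance per unit of pre-activation
    return a * (W @ s)               # R_i = a_i * sum_j W_ij * s_j

rng = np.random.default_rng(0)
a = rng.random(8)                        # toy input activations
W = rng.standard_normal((8, 4))          # toy weights: 8 inputs, 4 outputs
b = np.zeros(4)
R_out = np.array([0.0, 1.0, 0.0, 0.0])   # put all relevance on one output node
print(lrp_epsilon(a, W, b, R_out))       # input relevances, summing to ~1 here
```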

Other Files:

FeatureMap: code for finding code snippets associated with a particular n-gram feature

trainLabels.csv: the labels of the entire dataset used in this work

o100.csv: the ngrams which appear in over 100 files, and the number of files each appears in

FinalFeatureSetLogReg.csv and FinalFeatureSetNN_RF.csv: contain the final sets of ngrams used in the LogReg and NN/Random Forest feature sets respectively.

ChiArr.csv and ChiLabels.csv: the samples and labels of the dataset used by the Logistic Regression model

LogReg.coef: the coefficients of the logistic regressors (there is one regressor for each class, since one-vs-rest classification was used)

MIArr.csv and MILabels.csv: the samples and labels of the dataset used by the Random Forest and the Neural Network, before the train/test split

MITestArr.csv, MITestLabels.csv, MITrainArr.csv, and MITrainLabels.csv: the samples and labels of the dataset used by the Neural Network after the train/test split

LogReg.joblib: the scikit-learn logistic regression object, saved using the joblib library

RT_CLF.joblib: the scikit-learn random forest object, saved using the joblib library (a loading sketch follows)
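
A minimal sketch, using the standard joblib API, of how these two saved models might be reloaded for analysis:

```python
from joblib import load

log_reg = load("LogReg.joblib")      # one-vs-rest logistic regression
rand_forest = load("RT_CLF.joblib")  # random forest classifier

# One coefficient row per binary sub-classifier (one per class).
print(log_reg.coef_.shape)
```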

NNmodelBest.json and NNweightsBest.hdf5: the files containing the architecture and weights of the Neural Network, respectively

AvgCoefMax15.csv: The 15 n-grams with the highest average coefficients across all 9 binary sub-classifiers

AvgCoefMin15.csv: The 15 n-grams with the lowest average coefficients across all 9 binary sub-classifiers

midrelevances.csv: the inputs to the hidden layer and the relevances of the hidden layer's nodes when classifying a single sample

firstrelevances.csv: the relevances of the input layer to the most relevant node in the hidden layer when classifying a single sample, together with that sample's inputs to the input layer

LRPs: the files X_Max_50.csv contain the LRP data for the 50 most relevant nodes in class X, and X_Mean_All.csv contains the LRP data of all nodes for class X. If X == Abs, the file contains the LRP data for the entire dataset.
