Barry0922/Multi-label-Android-Malware-Classification

This is a multi-label Android malware classifier, written for a final dissertation at King's College London.

This program is built on Python 2 and runs on Linux.
------------------------------------------------------------------------------------------------------------------

This program uses static features extracted by a static analysis tool (Drebin) to train a Random Forest ML model that assigns behavior labels to unknown Android malware.

--------------------------------------------------------------------------------------------------------------------------

The "ex" folder includes demo files for building the Random Forest model.

The whole process is described below.

Step 1. Use "a_creat_file.py" to construct the feature matrix (.json) for ML model training.
   Demo run [python a_creat_file.py ./AMD_data/ ./]
  
   The AMD_data folder must be well organized; otherwise, this program will fail.
   The AMD dataset can be downloaded from https://drive.google.com/drive/folders/1cc1pe90eNxrl7kTOoccnUpv4WTZC0oc- 
   This link includes the extracted features of Android malware from AMD (Android Malware Dataset).
   (The whole dataset is too large to include in this repository.)

   This program generates two JSON files (dicts).
   One maps each application to its corresponding malware features, and the other maps malware family numbers to the corresponding malware family names.
   
   ex : {app01: feature01, feature02, feature03, feature04, malware_family, app02: feature01, feature02, malware_family, app03: feature01, feature02, feature03, feature04, feature05, malware_family}
   ex : {"0": "Obad", "1": "GingerMaster", "2": "Svpeng", "3": "FakeAngry", "4": "Jisut", "5": "Aples"} 
   
   The two output files of "a_creat_file.py" are the same as the two demo files in the 'ex' folder, "malware_feature_matrix.json" and "num_to_family.json", respectively.
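
The two dicts described above can be sketched as follows. This is only an illustration, not the code in "a_creat_file.py": the function name `build_outputs`, the input tuple layout, and the choice to store each app as a feature list followed by its family number are all assumptions made for the example.

```python
import json


def build_outputs(samples):
    """Build the two dicts from (app_id, features, family_name) tuples.

    The tuple layout is hypothetical; the real script reads the AMD_data folder.
    """
    family_to_num = {}
    feature_matrix = {}
    for app_id, features, family in samples:
        # Number families in order of first appearance: "0", "1", ...
        num = family_to_num.setdefault(family, str(len(family_to_num)))
        # Assumed layout: the feature list, then the family number as last item.
        feature_matrix[app_id] = list(features) + [num]
    # Invert the mapping so that numbers point to family names.
    num_to_family = {num: name for name, num in family_to_num.items()}
    return feature_matrix, num_to_family


samples = [
    ("app01", ["feature01", "feature02"], "Obad"),
    ("app02", ["feature01"], "GingerMaster"),
    ("app03", ["feature03"], "Obad"),
]
feature_matrix, num_to_family = build_outputs(samples)
print(json.dumps(num_to_family, sort_keys=True))
```

Both dicts can then be written with `json.dump` to produce files in the style of "malware_feature_matrix.json" and "num_to_family.json".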


Step 2. Use "b_create_behavior_labels.py" to create the behavior labels JSON file.
   Demo run [python b_create_behavior_labels.py ./ex/last.csv ./]
   
   It outputs a behavior labels JSON file from a CSV file, where the CSV file is user-customized.
   The output file is the same as the demo file in the 'ex' folder, 'behavior_last.json'.
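
As a rough sketch of this CSV-to-JSON step, the snippet below converts a small table into a dict of behavior labels. The layout of "last.csv" is not documented here, so the column layout (a family name followed by one 0/1 column per behavior), the behavior names, and the function name `csv_to_behavior_labels` are assumptions for illustration only.

```python
import csv
import json


def csv_to_behavior_labels(lines):
    """Map each family to the behaviors flagged with "1" in its row.

    `lines` is any iterable of CSV text lines (for example an open file).
    """
    reader = csv.reader(lines)
    header = next(reader)
    behaviors = header[1:]  # first column is assumed to hold the family name
    labels = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        family = row[0]
        labels[family] = [
            name for name, flag in zip(behaviors, row[1:]) if flag.strip() == "1"
        ]
    return labels


# Hypothetical CSV content; the real file is customized by the user.
demo_lines = [
    "family,steal_sms,root_exploit,ransom",
    "Obad,1,1,0",
    "Jisut,0,0,1",
]
behavior_labels = csv_to_behavior_labels(demo_lines)
print(json.dumps(behavior_labels, sort_keys=True))
```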

Step 3. Use "c_RandomForest.py" to build a Random Forest model.
   Demo run [python c_RandomForest.py ./ ./ex/malware_feature.json ./ex/behavior_last.json ./num_to_family.json ]
   
   This script loads the three JSON files mentioned above: the malware feature matrix (.json), the malware behaviors list (.json), and the family list (.json).
   
   After loading these three JSON files into the ML model, you can select which behavior labels to predict and which malware families to remove from the training data and place in the testing data.
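
Holding out whole malware families, as described above, could look like the sketch below. It is not taken from "c_RandomForest.py": the helper name `split_by_family` and the assumption that each app's entry ends with its family number are hypothetical.

```python
def split_by_family(feature_matrix, held_out_families):
    """Place every app of a held-out family in the test set, the rest in training."""
    held_out = set(str(f) for f in held_out_families)
    train, test = {}, {}
    for app_id, row in feature_matrix.items():
        family = str(row[-1])  # assumed: family number is the last item
        target = test if family in held_out else train
        target[app_id] = row[:-1]  # keep only the features
    return train, test


demo_matrix = {
    "app01": ["feature01", "feature02", "0"],
    "app02": ["feature01", "1"],
    "app03": ["feature03", "2"],
    "app04": ["feature02", "3"],
}
train_set, test_set = split_by_family(demo_matrix, ["0", "1", "2"])
print(sorted(train_set), sorted(test_set))
```

Because the split is by family rather than by sample, the test set contains only families the model never saw during training.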
   
   The program "c_RandomForest.py" writes detailed information to a log file, CSV files, and JSON files.
   These files can be kept for later use and analysis, depending on the user's purpose.

   The name of each output file is formed from the ML parameter settings.
   ex: filename = family-0-1-2_label-11_5_auto_100_20_2_timestamp_
   
   Malware families '0', '1', and '2' are removed from the training data and placed in the testing dataset.
   Behavior label '11' is trained and tested.
   min_sample_leaf = 5,    max_feature = auto,    estimators = 100,    max_depth = 20,    min_sample_split = 2
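
The naming scheme in the example above can be reproduced with a small helper. This is a sketch under assumptions: the function name `build_output_name`, the "-" separator between multiple behavior labels, and the timestamp format are not specified in this README.

```python
import time


def build_output_name(held_out_families, labels, min_sample_leaf, max_feature,
                      estimators, max_depth, min_sample_split, timestamp=None):
    """Join the run settings into one name, following the example filename."""
    if timestamp is None:
        # Assumed timestamp format; the README does not specify one.
        timestamp = time.strftime("%Y%m%d-%H%M%S")
    parts = [
        "family-" + "-".join(str(f) for f in held_out_families),
        "label-" + "-".join(str(b) for b in labels),
        str(min_sample_leaf),
        str(max_feature),
        str(estimators),
        str(max_depth),
        str(min_sample_split),
        str(timestamp),
    ]
    return "_".join(parts) + "_"


name = build_output_name([0, 1, 2], [11], 5, "auto", 100, 20, 2,
                         timestamp="timestamp")
print(name)
```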
                                
Finally, the RF model is built.