
Barry0922/Multi-label-Android-Malware-Ensembled-Classifier


This process aims to combine the results from different ML models and keep only the results that all models agree on, so the outcome is like the result of an ensembled model.
The programs below are built on Python 2 and a Linux system.
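
A minimal sketch of the agreement idea behind this process (the sample IDs below are hypothetical): a sample counts as mislabelled by the ensembled model only if every member model mislabelled it, i.e. the intersection of the per-model mislabelled-sample sets.

    # Hypothetical per-model mislabelled-sample sets
    mislabelled_per_model = {
        "label-8":         {"app_001", "app_017", "app_042"},
        "label-8-9":       {"app_001", "app_042", "app_099"},
        "label-8-9-10-11": {"app_001", "app_042"},
    }

    # A sample mislabelled by every model is mislabelled by the ensemble
    ensemble_mislabelled = set.intersection(*mislabelled_per_model.values())
    print(ensemble_mislabelled)   # {'app_001', 'app_042'}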

---------------------------------------------------------------------------------------------------
The demo files can be downloaded from the link below for readers/users to try out.
https://drive.google.com/drive/folders/1kv402CNoRrwphA7QRgFtqcweAGXIpEWQ?usp=sharing

There are 2 folders in the link.

The "CnC_server_demo" folder includes demo files for building ensembled model that is a multi-label model [C&C(label-8), C&C-CE(label-9), C&C-IN(label-10), C&C-SMS(label-11)].
The "CnC_server_dissertation" folder includes the final results of the "CnC_server_demo" folder that has been correctly processed, which let readers/users make a reference.

----------------------Below is the ensembled random forest model -----------------------------------

a. Build many models with "different ML parameter settings" and different label combinations by using "Multi-label-Android-Malware-Classification".
   Then, use "1_good_setting_tet-val.py" to find the model with the best setting in the validation results and select the models with the corresponding setting from the testing results.
   This process is like "training-validation-testing" in machine learning (a sketch follows this step).
  
   Demo run [python 1_good_setting_tet-val.py ./CnC_server_demo/CC-val/ ./CnC_server_demo/CC-test/]
   
   This program will choose the best-setting model from the validation results and select the same-setting models from the testing results.
   Then, it creates a folder "best_combinations_test" under the 'CC-test' folder where all best-setting models are placed.
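
   A minimal sketch of the selection logic in step a, assuming each parameter setting's validation and testing accuracies are already available as plain numbers (the hypothetical values below stand in for what "1_good_setting_tet-val.py" actually reads from the CC-val/ and CC-test/ result files):

       validation_accuracy = {            # hypothetical setting -> validation accuracy
           "trees100_depth10": 0.81,
           "trees300_depth20": 0.86,
           "trees500_depth30": 0.84,
       }
       testing_accuracy = {               # hypothetical setting -> testing accuracy
           "trees100_depth10": 0.79,
           "trees300_depth20": 0.85,
           "trees500_depth30": 0.82,
       }

       # Pick the setting that scores best on validation ...
       best_setting = max(validation_accuracy, key=validation_accuracy.get)
       # ... and keep the testing-set models trained with that same setting,
       # which the real script copies into best_combinations_test/.
       print(best_setting, testing_accuracy[best_setting])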
   

b. Using "2_group_label_.py" to group models that were selected in step.a into different folder.
   
   That is to say, in step a, we would find many best setting models with different label combinations.
   Thus, by using "2.group single label model”to group them into folder where inside models all have the common label.
   In other word, ensembled many models that have common label result output. 

   Demo run [python 2_group_label_.py ./CnC_server_demo/CC-test/best_combinations_test/] 
   Then, it would create a folder "all_combination" where there are many sub-folder are categorized by single-label.
   The models inside this single-label folder are related to this single-label. (stacking method)

   For example, 'label-8 folder' includes "label-8", "label-8-9", "label-8-10", "label-8-11", "label-8-9-10", "label-8-9-11", "label-8-10-11" and "label-8-9-10-11" models. 
   They all have common label. (label-8)
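
   A minimal sketch of the grouping idea in step b, assuming every best-setting model folder is named after its label combination (the folder names below are hypothetical examples); each model is placed in the group of every single label it contains:

       from collections import defaultdict

       model_folders = ["label-8", "label-8-9", "label-8-10-11",
                        "label-9-10", "label-8-9-10-11"]

       groups = defaultdict(list)         # single label -> models containing it
       for folder in model_folders:
           for label in folder.replace("label-", "").split("-"):
               groups["label-" + label].append(folder)

       print(groups["label-8"])
       # ['label-8', 'label-8-9', 'label-8-10-11', 'label-8-9-10-11']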


c. Using "3_read_log.py" to read all log files that under single-label folders and have common label outputs.
   Then, the program "3_read_log.py" would find the mislabeled samples via all log files and summarize these mislabeled samples into 'mis-labelled log' files for each model.
   Furthermore, by using all 'mis-labelled log' files to find what samples have been mislabelled by all models to form two txt files and two csv files, which stands for mis-labelled results of an ensembled model.
   
   Demo run [python 3_read_log.py ./CnC_server_demo/CC-test/best_combinations_test/all_combinations/label-8/ ]
   
   Before demo running, please revise 'command' variable in program '3_read_log.py'.
   For label-8, we should use 'C&C' to detect 'C&C' mislabelled samples because behavior label-8 is C&C (if it's for label-9, then revise command variable to 'C&C-CE')
   Thus, this program will search all log files and find any samples were mislabelled in label-8 ('C&C').
   
   Finally, it would create a summary folder that includes sub-folder "label-8" and it also create label-8_FN_.csv, label-8_FP_.csv, label-8_FN_.txt and label-8_FP_.txt.
   The two csv files include samples that stand for false negative samples and false positive samples from the ensembled model for label-8, respectively. 
   The two txt files mean the number of mislabelled false negative samples and false positive samples from the ensembled models for label-8, respectively.
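
   A minimal sketch of the intersection step in c, assuming (hypothetically) that every model's log lists one mislabelled sample per line as "<sample> <behaviour> <FP|FN>"; the real log format is whatever the classification programs write. Only samples mislabelled by every model in the label-8 group are kept, since those are the ensembled model's mistakes:

       import glob, os

       command = "C&C"    # behaviour string for label-8
       log_dir = "./CnC_server_demo/CC-test/best_combinations_test/all_combinations/label-8/"

       per_model_fn, per_model_fp = [], []
       for log_path in glob.glob(os.path.join(log_dir, "*", "*.log")):
           fn, fp = set(), set()
           with open(log_path) as log:
               for line in log:
                   parts = line.split()
                   if len(parts) == 3 and parts[1] == command:
                       (fn if parts[2] == "FN" else fp).add(parts[0])
           per_model_fn.append(fn)
           per_model_fp.append(fp)

       # Keep only the samples mislabelled by every model
       ensemble_fn = set.intersection(*per_model_fn) if per_model_fn else set()
       ensemble_fp = set.intersection(*per_model_fp) if per_model_fp else set()
       print(len(ensemble_fn), len(ensemble_fp))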

d. Using "4_analysis_FPnFT.py" to fetch txt files from 'step c' and get the json file from single-label model (label-8). 
   
   Then, it would calculate the final performance of the ensembled model.

   Demo run [python 4_analysis_FPnFT.py ./CnC_server_demo/CC-test/best_combinations_test/all_combinations/label-8/label-8/  ./CnC_server_demo/CC-test/best_combinations_test/all_combinations/summary/label-8/ ]
   
   Then, this program will fetch csv files and json files that all come from (argv [1]) label-8 model to get original accuracy, precision and recall. 
   Similarly, it would fetch label-8_FN.txt and label-8_FP.txt from (argv [2]) to get FP and FN of the ensembled model for label-8.

   Finally, it would calculate the final results of the ensembled model for label-8. and output the summary file.
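
   A minimal sketch of the recomputation in step d, assuming the single label-8 model's confusion-matrix counts are known (the numbers below are hypothetical) and that every sample removed from the FP/FN lists by the ensemble becomes a TN/TP respectively:

       orig = {"TP": 500, "FP": 90, "TN": 650, "FN": 72}   # hypothetical label-8 model counts
       ens_fp, ens_fn = 40, 55                             # counts read from the FP/FN txt files

       tp = orig["TP"] + (orig["FN"] - ens_fn)             # rescued false negatives become TP
       tn = orig["TN"] + (orig["FP"] - ens_fp)             # rescued false positives become TN
       total = sum(orig.values())

       accuracy  = (tp + tn) / float(total)
       precision = tp / float(tp + ens_fp)
       recall    = tp / float(tp + ens_fn)
       print(round(accuracy, 3), round(precision, 3), round(recall, 3))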

e. By repeating steps a to d for label-9, label-10 and label-11, we can get the FN.csv and FP.csv files for all label combinations.
     We can also obtain the performance of the ensembled model in terms of label-9, label-10 and label-11.
     Then, use "5_read_common_sample.py" to read all of the FN and FP csv files and calculate the final accuracy of the ensembled model (a sketch follows below).

     Demo run [python 5_read_common_sample.py ./advertising/test/best_combinations_test/all_combinations/summary/ ]
     The program merges all label-x FP .csv files and label-x FN .csv files into F_.csv, which stands for the mislabelled-sample list of the ensembled model.
     Thus, we can calculate the final accuracy.

     Specifically, F_.csv has 161 mislabelled samples; therefore, the final accuracy of this 'C&C Server Connection (label-8-9-10-11)' ensembled model is (1 - 161/1312) = 87.7%      (1312 is the total number of tested samples)
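
     A minimal sketch of the merge in step e, assuming each label-x FN/FP csv holds one sample identifier per row (first column); the union of all rows is the ensembled model's overall mislabelled-sample list (F_.csv), and the final accuracy is 1 minus that count divided by the total number of tested samples:

         import csv, glob

         summary_dir = "./CnC_server_demo/CC-test/best_combinations_test/all_combinations/summary/"
         total_samples = 1312

         mislabelled = set()
         for path in glob.glob(summary_dir + "*/label-*_F*_.csv"):
             with open(path) as f:
                 for row in csv.reader(f):
                     if row:
                         mislabelled.add(row[0])

         print("mislabelled samples:", len(mislabelled))
         print("final accuracy: %.3f" % (1.0 - len(mislabelled) / float(total_samples)))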


Thus, the whole process of the ensembled model is as below:

1. Build many multi-label models for every label combination with different ML parameter settings.
2. By step a, find the best setting from the validation dataset and select the models with that setting from the testing dataset.
3. By step b, group relevant models into a single-label folder (e.g. the label-8 folder includes the label-8, label-8-9, label-8-10, label-8-11, etc. models).
4. By step c, get all of the log files from the single-label folder (e.g. label-8).
      Then, it outputs label-8_FN.csv and label-8_FP.csv for label-8. 
      In fact, these two csv files are the final FP and FN of the ensembled model for label-8, because they are the common mislabelled samples on label-8 among the label-8-related models.
5. By step d, get the original accuracy, precision and recall from the label-8 model.
      Then, use those values together with the FN and FP to compute the final accuracy, precision and recall of the ensembled model for label-8.

6. Repeat steps 1 to 5 for label-9, label-10 and label-11 to get label-9_FN.csv, label-9_FP.csv, label-10_FN.csv, label-10_FP.csv, label-11_FN.csv and label-11_FP.csv.
    Then, use '5_read_common_sample.py' to merge all label-x_FN.csv files and label-x_FP.csv files into one file (F_.csv) that stands for the mislabelled samples of the ensembled model.
   


    

Done

---------------ensembling is done-------------
