qwqoro/ML-Corrupted

📊 [ML] Classification Problem Solution: Guessing the type of a corrupted file


[ Have a look at the related Jupyter Notebook: ML-Corrupted.ipynb ]
[!] This README is only a brief overview of the notebook contents!

Problem | Results | Approach | Resources


// Disclaimer:
   I created this out of interest in the summer of 2019, at the age of 15.
   Resources, links and libraries were updated in the spring of 2022.
→ Machine Learning level: Script Kiddie

📉 Problem

Solving CTF tasks eventually led me to corrupted files. When such files lack distinctive features, such as signatures, most tools identify them as "text" or "data", which, to be honest, is not useful at all.

To proceed, it would help to at least know their types. Some files may be too broken to fix unless you know the exact file format, which parts are missing and which parts you actually have; even so, guessing the type can still be useful, e.g. in criminal investigations, where it may be important to find the remains of what used to be evidence and analyze them thoroughly. Thus, my aim was to build models that guess the types of broken files.

📊 Results


[ Demo: test set of files + knnRecG8 (./guess) ]

πŸ” Approach

Data

> Resources  [ Every piece of data and all copyrights belong to the original owners ]

Features

A total of 7 DataFrames were made:

  1. S = Top 10 bytes in each part of a file (file is split into 3 parts)
  2. G4 = Top 20 byte 4-Grams (stride=1) in a file
  3. G6 = Top 20 byte 6-Grams (stride=2) in a file
  4. G8 = Top 20 byte 8-Grams (stride=4) in a file
  5. G4S = Top 10 byte 4-Grams (stride=1) in each part of a file (file is split into 3 parts)
  6. G6S = Top 10 byte 6-Grams (stride=2) in each part of a file (file is split into 3 parts)
  7. G8S = Top 10 byte 8-Grams (stride=4) in each part of a file (file is split into 3 parts)

S → Split, GN → N-Grams
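
As an illustration, below is a minimal sketch of such feature extraction using collections.Counter. This is an assumption about the implementation, not the notebook's actual code; top_ngrams, split_parts and sample.bin are hypothetical names.

from collections import Counter

def top_ngrams(data: bytes, n: int, stride: int, top: int):
    # Count byte n-grams sampled every `stride` bytes; return the `top` most common
    counts = Counter(data[i:i + n] for i in range(0, len(data) - n + 1, stride))
    return counts.most_common(top)

def split_parts(data: bytes, parts: int = 3):
    # Split a file's bytes into `parts` roughly equal chunks (remainder bytes are dropped)
    size = max(1, len(data) // parts)
    return [data[i * size:(i + 1) * size] for i in range(parts)]

with open("sample.bin", "rb") as f:                  # hypothetical input file
    raw = f.read()

g8 = top_ngrams(raw, n=8, stride=4, top=20)          # G8-style features
g8s = [top_ngrams(p, n=8, stride=4, top=10)          # G8S-style features
       for p in split_parts(raw)]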

Algorithms & boosts

Scikit-Learn + XGBoost + LightGBM + CatBoost:

from sklearn.ensemble import RandomForestClassifier      # Random Forest Classifier
from sklearn.neighbors import KNeighborsClassifier       # KNN Classifier
from xgboost import XGBClassifier                        # XGBoost Classifier
from lightgbm import LGBMClassifier                      # LightGBM Classifier
from catboost import CatBoostClassifier                  # CatBoost Classifier

from sklearn.model_selection import RandomizedSearchCV   # Randomized search on hyperparameters

Models were built with every algorithm listed above and trained on every DataFrame specified. For each combination there are two models: one with default settings and one with hyperparameters recommended by RandomizedSearchCV (see the sketch after the list below).

The total number of models is 70 (7 DataFrames * 5 Algorithms * 2 Sets of hyperparameters):

rfcS, rfcRecS, rfcG4, rfcRecG4, rfcG6, rfcRecG6, rfcG8, rfcRecG8, rfcG4S, rfcRecG4S, rfcG6S, rfcRecG6S, rfcG8S, rfcRecG8S,
knnS, knnRecS, knnG4, knnRecG4, knnG6, knnRecG6, knnG8, knnRecG8, knnG4S, knnRecG4S, knnG6S, knnRecG6S, knnG8S, knnRecG8S,
xgbS, xgbRecS, xgbG4, xgbRecG4, xgbG6, xgbRecG6, xgbG8, xgbRecG8, xgbG4S, xgbRecG4S, xgbG6S, xgbRecG6S, xgbG8S, xgbRecG8S,
lgbmS, lgbmRecS, lgbmG4, lgbmRecG4, lgbmG6, lgbmRecG6, lgbmG8, lgbmRecG8, lgbmG4S, lgbmRecG4S, lgbmG6S, lgbmRecG6S, lgbmG8S, lgbmRecG8S,
cbcS, cbcRecS, cbcG4, cbcRecG4, cbcG6, cbcRecG6, cbcG8, cbcRecG8, cbcG4S, cbcRecG4S, cbcG6S, cbcRecG6S, cbcG8S, cbcRecG8S
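
A minimal sketch of how one such pair of models can be produced, taking knnG8/knnRecG8 as the example. The synthetic data and the search space below are assumptions for illustration; in the notebook, X and y would come from the G8 DataFrame and the file-type labels.

from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in practice X holds the n-gram features and y the file-type labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Model with default settings (knnG8)
knnG8 = KNeighborsClassifier().fit(X_train, y_train)

# Model with hyperparameters recommended by RandomizedSearchCV (knnRecG8)
params = {"n_neighbors": range(1, 31),               # hypothetical search space
          "weights": ["uniform", "distance"],
          "p": [1, 2]}
search = RandomizedSearchCV(KNeighborsClassifier(), params,
                            n_iter=20, cv=3, random_state=42).fit(X_train, y_train)
knnRecG8 = search.best_estimator_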

Accuracy calculation & comparison

from sklearn.metrics import accuracy_score               # Accuracy score

# [ Accuracy Score = (number of correct predictions) / (total number of predictions) ]

Heatmaps were also used to visually compare the accuracy scores of different models' predictions on the validation sets. For example, one heatmap depicts the difference between the K-Nearest Neighbors model trained on G8 with default hyperparameters (knnG8) and the one with hyperparameters recommended by RandomizedSearchCV (knnRecG8).
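
Continuing the sketch above, a comparison along these lines could look as follows; the side-by-side confusion-matrix heatmaps are an assumption about the notebook's plots.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

models = [("knnG8", knnG8), ("knnRecG8", knnRecG8)]

# Accuracy scores on the validation set
for name, model in models:
    print(f"{name}: accuracy = {accuracy_score(y_val, model.predict(X_val)):.3f}")

# Heatmaps of the two models' confusion matrices, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, model) in zip(axes, models):
    sns.heatmap(confusion_matrix(y_val, model.predict(X_val)),
                annot=True, fmt="d", ax=ax)
    ax.set(title=name, xlabel="Predicted", ylabel="True")
plt.tight_layout()
plt.show()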

Libraries

Data processing    Algorithms & boosting    Visualization       Misc
numpy 1.21.6       scikit-learn 1.0.2       matplotlib 3.5.2    binascii
pandas 1.3.5       xgboost 1.6.1            seaborn 0.11.2      collections
                   lightgbm 3.3.2                               tqdm 4.35.0
                   catboost 1.0.6                               dill 0.3.5.1

Hardware

Information about the hardware used to run the notebook. The time spent on each crucial step can be seen in the notebook, next to the corresponding tqdm progress bar.

RAM: 12 GB
GPU: NVIDIA GeForce GTX 1060
Processor: Intel(R) Core(TM) i5-8300H

📚 Resources

Datasets:

[1] Chao Dong, Chen Change Loy, Xiaoou Tang. "Accelerating the Super-Resolution Convolutional Neural Network". In Proceedings of the European Conference on Computer Vision (ECCV), 2016. arXiv:1608.00367

[2] Zhifei Zhang, Yang Song, Hairong Qi. "Age Progression/Regression by Conditional Adversarial Autoencoder". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. arXiv:1702.08423

[3] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre. "HMDB: A Large Video Database for Human Motion Recognition". In IEEE International Conference on Computer Vision (ICCV), 2011.

[4] Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, Xavier Serra. "Learning Sound Event Classifiers from Web Audio with Noisy Labels". arXiv:1901.01189, 2019.

[5] Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, Mansour Ahmadi. "Microsoft Malware Classification Challenge". arXiv:1802.10135, 2018.