Advancing Cybersecurity with AI: This project fortifies phishing defense using cutting-edge models, trained on a diverse dataset of 737,000 URLs. It was the final project for the AI for Cybersecurity course in my Master's at uOttawa in 2023.

RimTouny/Phishing-Attack-Detection-using-Machine-Learning

Phishing Attack Detection using Machine Learning

  • Required libraries: scikit-learn, pandas, matplotlib.
  • Execute cells in a Jupyter Notebook environment.
  • The uploaded code has been executed and tested successfully within the Google Colab environment.

Binary-class classification problem

The task is to classify each URL as either Phishing or Benign.

Independent Variables:

The independent variables in the provided dataset can be categorized into three groups:

  • Length and Count Features: Measures of the length and count of different components in a URL, such as domain length, URL length, and the counts of digits, letters, path components, and various symbols.

  • Boolean Features: Binary indicators for the presence or absence of certain characteristics in a URL, such as whether it contains an IP address (ip), has redirection (redirection), uses IPv notation (ipv), is a shortened URL (short), is encoded (is_encoded), or has a suspicious top-level domain (sus).

  • Calculation-Based Features: Values calculated from the URL, including a malicious probability score (malicious_probability), the entropy of its characters (entropy), and the ratio of special characters and digits to the total number of characters in the URL (ratio).

Target variable:

  • 'Label' indicating the classification into two classes: 1 (Phishing) / 0 (Benign)

Key Tasks Undertaken

  1. Data Concatenation:

  2. Feature Extraction:

    • Defined a function for extracting features from URLs.
    • Extracted various features such as domain, path, first directory length, presence of IP address, URL length, etc.
    • Calculated character counts, character frequencies, and entropy; decoded URL-encoded strings; and checked for unusual characters.
    • Checked for URL shortening, special characters, and suspicious top-level domains.
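A minimal sketch of what such a feature extraction function could look like, using only the Python standard library. The feature names `ip`, `short`, `sus`, `is_encoded`, `entropy`, and `ratio` follow the variable descriptions above; the other names, the `SHORTENERS` and `SUSPICIOUS_TLDS` lists, and the "contains a percent sign" test for encoding are assumptions made for illustration, not the project's actual definitions.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical lookup lists for illustration; the project's real lists are not shown.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}
SUSPICIOUS_TLDS = {"tk", "xyz", "top", "zip"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def has_ip_host(host: str) -> bool:
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Return a small dictionary of lexical features for one URL."""
    # urlparse only recognises the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    digits = sum(ch.isdigit() for ch in url)
    letters = sum(ch.isalpha() for ch in url)
    specials = len(url) - digits - letters
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "count_digits": digits,
        "count_letters": letters,
        "count_dots": url.count("."),
        "ip": int(has_ip_host(host)),
        "short": int(host in SHORTENERS),
        "sus": int(host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS),
        "is_encoded": int("%" in url),  # crude proxy for percent-encoding
        "entropy": shannon_entropy(url),
        "ratio": (specials + digits) / len(url) if url else 0.0,
    }


features = extract_features("http://192.168.0.1/login?id=1")
print(features)
```

Applying `extract_features` to every row of a URL column (for example with `pandas.Series.apply`) would yield one feature row per URL.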
  3. Exploratory Data Analysis (EDA)

    • Checked dataset information, including data types and non-null counts.
    • Explored unique values in the dataset.
    • Computed and displayed descriptive statistics for numerical features.
    • Visualized data distribution using boxplots, violin plots, histograms, and correlation matrices.
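The tabular EDA checks listed above map onto a few pandas calls. The sketch below runs them on a tiny made-up frame; the column names and values are hypothetical and stand in for the real feature table. Plotting is omitted for brevity.

```python
import pandas as pd

# Hypothetical miniature feature table; the real dataset has far more rows and columns.
df = pd.DataFrame({
    "url_length": [23, 87, 41, 120, 35, 64],
    "count_digits": [0, 12, 3, 25, 1, 7],
    "entropy": [3.1, 4.6, 3.8, 4.9, 3.3, 4.2],
    "Label": [0, 1, 0, 1, 0, 1],
})

df.info()                     # data types and non-null counts
unique_counts = df.nunique()  # number of distinct values per column
summary = df.describe()       # count, mean, std, min, quartiles, max
corr = df.corr()              # Pearson correlation matrix

print(unique_counts, summary, corr, sep="\n\n")
```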
  4. Feature Engineering and Data Cleaning:

    • Handling Null Values and Duplicate Rows
    • Handling Repeated Maximum Values
    • Splitting Data: Train 80%, Test 20%
    • Oversampling with SMOTE: Used SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes in the training data by generating synthetic samples of the minority class.
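To illustrate the idea behind SMOTE, here is a simplified NumPy sketch of its core step: each synthetic row is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. This is a hand-written approximation, not the project's code; in practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used. The function name `smote_sketch`, the toy data, and the assumption that label 1 is the minority class are all illustrative. The pairwise distance matrix makes it suitable only for small demos.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples (O(n^2) memory).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                    # random minority rows
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                             # weight in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])


# Toy training split: 180 rows of class 0 and 20 rows of class 1.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 3))
y_train = np.array([0] * 180 + [1] * 20)

X_min = X_train[y_train == 1]
X_new = smote_sketch(X_min, n_new=160)

# Oversample the training data only; the test split stays untouched.
X_bal = np.vstack([X_train, X_new])
y_bal = np.concatenate([y_train, np.ones(len(X_new), dtype=int)])
print(np.bincount(y_bal))
```

Oversampling after the train/test split, as done here, keeps synthetic points out of the test set so that evaluation reflects the original class distribution.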
  5. Feature Selection: Developed a function to identify features dominated by a single, overwhelmingly repeated value.

    1. Evaluated the percentage of occurrences of the most frequent value in each feature.
    2. Removed features whose most frequent value was repeated in over 90% of rows, since such near-constant features carry little information.
    3. Reduced redundancy in the dataset, which improves model efficiency and computational performance.
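The 90% repetition filter can be sketched with pandas as follows. The function name `drop_near_constant` and the demo columns are hypothetical; only the threshold comes from the description above. The target column should be left out of the frame passed in.

```python
import pandas as pd


def drop_near_constant(df: pd.DataFrame, threshold: float = 0.90):
    """Drop columns whose most frequent value covers more than threshold of rows."""
    # value_counts sorts by frequency, so the first entry is the top share.
    top_share = {
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    }
    to_drop = [col for col, share in top_share.items() if share > threshold]
    return df.drop(columns=to_drop), to_drop


# Hypothetical feature frame: "ipv" is 95% zeros, "short" is 80% zeros.
demo = pd.DataFrame({
    "ipv": [0] * 95 + [1] * 5,
    "url_length": list(range(100)),
    "short": [0] * 80 + [1] * 20,
})

reduced, dropped = drop_near_constant(demo)
print(dropped, list(reduced.columns))
```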
  6. Modeling:

    • Model Training: Trained various classification models (Logistic Regression, SVM, Decision Tree, Random Forest, XGBoost, etc.) using LazyClassifier.

    • Bias-Variance Decomposition: Implemented a function for bias-variance decomposition to analyze and quantify the bias and variance for each model.

    • Performance Evaluation: Evaluated each model's performance using confusion matrices, F1 scores, classification reports, and other relevant metrics.
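One common way to estimate bias and variance for a classifier under 0-1 loss is to refit the model on bootstrap resamples of the training data and compare each refit's predictions with the majority ("main") prediction. The sketch below follows that approach on synthetic data; it is an assumption about how such a function might be written, not the project's implementation, and `bias_variance_01` is a hypothetical name. Note that under 0-1 loss the average loss does not generally equal bias plus variance.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bias_variance_01(model, X_train, y_train, X_test, y_test, rounds=20, seed=0):
    """Estimate average 0-1 loss, bias, and variance from bootstrap refits."""
    rng = np.random.default_rng(seed)
    preds = np.empty((rounds, len(y_test)), dtype=int)
    for r in range(rounds):
        # Bootstrap sample: draw training rows with replacement.
        idx = rng.integers(0, len(y_train), size=len(y_train))
        preds[r] = clone(model).fit(X_train[idx], y_train[idx]).predict(X_test)
    # Main prediction: majority vote across rounds (labels are 0/1).
    main = (preds.mean(axis=0) >= 0.5).astype(int)
    avg_loss = float((preds != y_test).mean())  # mean error of the refits
    bias = float((main != y_test).mean())       # error of the main prediction
    variance = float((preds != main).mean())    # disagreement with the main prediction
    return avg_loss, bias, variance


# Synthetic stand-in for the URL feature table.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

shallow = bias_variance_01(DecisionTreeClassifier(max_depth=2, random_state=0),
                           X_tr, y_tr, X_te, y_te)
deep = bias_variance_01(DecisionTreeClassifier(random_state=0),
                        X_tr, y_tr, X_te, y_te)
print("shallow tree (loss, bias, variance):", shallow)
print("deep tree    (loss, bias, variance):", deep)
```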

  7. Stacking and Voting Classifiers:

    • Selected the two models with the highest true positive rates and the two with the highest true negative rates.

    • Combined each pair of top models with a stacking classifier to create two ensemble models.

    • Applied soft voting to the predictions of the two ensemble models.

    • Re-evaluated the final integrated model on the test set and compared its performance with that of the best-performing individual model.
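The stacking-then-voting procedure described above can be sketched with scikit-learn's `StackingClassifier` and `VotingClassifier`. The candidate models, the synthetic data, and the logistic-regression meta-learner are assumptions chosen for illustration. For brevity the sketch ranks candidates on the test split; ranking on a separate validation split would avoid selection bias.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the URL feature table.
X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidate pool.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "nb": GaussianNB(),
}


def rates(model):
    """Return (true positive rate, true negative rate) on the test split."""
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    return tp / (tp + fn), tn / (tn + fp)


scores = {name: rates(m.fit(X_train, y_train)) for name, m in candidates.items()}
top_tpr = sorted(scores, key=lambda n: scores[n][0], reverse=True)[:2]
top_tnr = sorted(scores, key=lambda n: scores[n][1], reverse=True)[:2]


def make_stack(names):
    """Stack the named candidates under a logistic-regression meta-learner."""
    return StackingClassifier(
        estimators=[(n, candidates[n]) for n in names],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


# Soft voting averages the predicted probabilities of the two stacks.
final_model = VotingClassifier(
    estimators=[("tpr_stack", make_stack(top_tpr)), ("tnr_stack", make_stack(top_tnr))],
    voting="soft",
)
final_model.fit(X_train, y_train)

final_tpr, final_tnr = rates(final_model)
print("top TPR models:", top_tpr, "| top TNR models:", top_tnr)
print("ensemble TPR, TNR:", final_tpr, final_tnr)
```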


  8. Champion Model:
