Phishing Attack Detection using Machine Learning

Advancing Cybersecurity with AI: This project strengthens phishing defense by training several classification models and ensembles on a combined dataset of 737,000 URLs. It was the final project for the AI for Cybersecurity course in my Master's at uOttawa in 2023.

Required libraries: scikit-learn, pandas, matplotlib.
Execute cells in a Jupyter Notebook environment.
The uploaded code has been executed and tested successfully within the Google Colab environment.

Binary classification problem

The task is to classify each URL as Phishing or Benign.

Independent Variables:

The independent variables in the provided dataset can be categorized into three groups:

Length and Count Features: These include measures related to the length and count of different components in a URL, such as domain length, URL length, count of digits, letters, path components, and various symbols.
Boolean Features: These features are binary indicators, representing the presence or absence of certain characteristics in a URL, such as whether it contains an IP address (ip), has redirection (redirection), uses IPv notation (ipv), is a shortened URL (short), is encoded (is_encoded), or has a suspicious top-level domain (sus).
Calculation-Based Features: These features involve calculated values based on the URL, including a malicious probability score (malicious_probability), entropy of characters (entropy), and a ratio of special characters and digits to the total characters in the URL (ratio).
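
As an illustration of the calculation-based features, here is a minimal sketch of how the entropy and ratio values might be computed. The function names are hypothetical and the exact formulas used in the notebook are not stated in this README, so treat this as one reasonable reading of the descriptions above.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def special_digit_ratio(url: str) -> float:
    """Share of characters that are digits or non-alphanumeric symbols."""
    if not url:
        return 0.0
    flagged = sum(1 for ch in url if ch.isdigit() or not ch.isalnum())
    return flagged / len(url)
```

A URL made of one repeated character has entropy 0, while long random-looking strings score higher, which is why entropy is a common signal for machine-generated URLs.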

Target variable:

'Label' indicating the classification into two classes: 1 (Phishing) / 0 (Benign)

Key Tasks Undertaken

Data Concatenation:
- Concatenated multiple DataFrames vertically into a single DataFrame.
  - PhishStorm-URL dataset: 96,011 rows.
  - ISCX-URL2016 dataset: only the Phishing and Legitimate entries, extracted from 165,366 rows.
  - Malicious URL dataset: 651,191 rows.
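
The concatenation step could look roughly like the sketch below. The tiny frames, the `url`, `type`, and `label` column names, and the class strings are assumptions for illustration; the real datasets have their own schemas.

```python
import pandas as pd

# Tiny stand-ins for the three sources; rows and column names are hypothetical.
phishstorm = pd.DataFrame({"url": ["http://a.example/login"], "label": [1]})
iscx_raw = pd.DataFrame({
    "url": ["http://b.example/", "http://c.example/pay", "http://d.example/ad"],
    "type": ["benign", "phishing", "spam"],
})
malicious = pd.DataFrame({"url": ["http://e.example/verify"], "label": [1]})

# Keep only the phishing and legitimate rows, then map them to the 1/0 label.
iscx = iscx_raw[iscx_raw["type"].isin(["phishing", "benign"])].copy()
iscx["label"] = (iscx["type"] == "phishing").astype(int)
iscx = iscx[["url", "label"]]

# Stack the frames vertically and rebuild a clean 0..n-1 index.
combined = pd.concat([phishstorm, iscx, malicious], axis=0, ignore_index=True)
```

Passing `ignore_index=True` avoids duplicate index values, which otherwise carry over from the source frames.
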
Feature Extraction:
- Defined a function for extracting features from URLs.
- Extracted various features such as domain, path, first directory length, presence of IP address, URL length, etc.
- Calculated counts and frequencies of characters, entropy, URL decoding, and presence of unusual characters.
- Checked for URL shortening, special characters, and suspicious top-level domains.
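
A feature extraction function along these lines might be sketched as follows, using only the standard library. The shortener and suspicious-TLD sets are made-up placeholders, and the dictionary keys other than `ip`, `short`, `is_encoded`, and `sus` are hypothetical names; the notebook's actual function may differ.

```python
import ipaddress
from urllib.parse import unquote, urlparse

# Hypothetical lookup sets; the project's actual lists are not given here.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}
SUSPICIOUS_TLDS = {"zip", "tk", "xyz", "top"}


def extract_features(url: str) -> dict:
    """Return a flat dict of simple lexical features for one URL."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    path_parts = [p for p in parsed.path.split("/") if p]
    try:
        ipaddress.ip_address(host)  # raises ValueError for ordinary hostnames
        has_ip = 1
    except ValueError:
        has_ip = 0
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "first_dir_length": len(path_parts[0]) if path_parts else 0,
        "count_digits": sum(ch.isdigit() for ch in url),
        "count_letters": sum(ch.isalpha() for ch in url),
        "count_dots": url.count("."),
        "ip": has_ip,
        "short": int(host in SHORTENERS),
        "is_encoded": int(unquote(url) != url),
        "sus": int(host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS),
    }
```

Applied to a DataFrame column, the dictionaries can be expanded into feature columns, for example with `pd.DataFrame(df["url"].map(extract_features).tolist())`.
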
Exploratory Data Analysis (EDA):
- Checked dataset information, including data types and non-null counts.
- Explored unique values in the dataset.
- Computed and displayed descriptive statistics for numerical features.
- Visualized data distribution using boxplots, violin plots, histograms, and correlation matrices.
Feature Engineering and Data Cleaning:
- Handling null values and duplicate rows.
- Handling features dominated by a single repeated value.
- Splitting data: 80% train, 20% test.
- Oversampling with SMOTE: Used SMOTE (Synthetic Minority Over-sampling Technique) to balance the training data by generating synthetic samples of the minority class.
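
To make the split-then-oversample order concrete, here is a simplified, standard-library sketch. `smote_like` is a hypothetical helper that only illustrates SMOTE's core idea (interpolating between a minority sample and one of its nearest minority neighbours); it is not the implementation used in the notebook, and the toy data is invented.

```python
import random


def split_80_20(X, y, seed=0):
    """Shuffle row indices and cut them into 80% train and 20% test."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def smote_like(X, y, minority=1, k=3, seed=0):
    """Add synthetic minority points until both classes have equal counts."""
    rng = random.Random(seed)
    minority_pts = [x for x, lab in zip(X, y) if lab == minority]
    n_needed = (len(y) - len(minority_pts)) - len(minority_pts)
    if len(minority_pts) < 2 or n_needed <= 0:
        return list(X), list(y)
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority_pts)
        others = [p for p in minority_pts if p is not base]
        # k nearest minority neighbours by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
        y_out.append(minority)
    return X_out, y_out


# Toy imbalanced data: 30 minority rows out of 100.
X = [[float(i), float(i % 7)] for i in range(100)]
y = [1 if i % 10 < 3 else 0 for i in range(100)]

X_tr, y_tr, X_te, y_te = split_80_20(X, y)
X_bal, y_bal = smote_like(X_tr, y_tr)  # oversample the training split only
```

Oversampling after the split keeps synthetic points out of the test set, so test metrics reflect the original class distribution. In practice a library implementation such as imbalanced-learn's `SMOTE` is the usual choice.
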
Feature Selection: Developed a function to identify features dominated by a single repeated value.
1. Computed, for each feature, the percentage of rows taken up by its most frequent value.
2. Removed features whose most frequent value was repeated in over 90% of rows.
3. Used this 90% repetition threshold to exclude near-constant, less informative features.
4. Reduced redundancy in the dataset, which lowers the computational cost of training.
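
The near-constant filter described above can be sketched in a few lines of pandas. The function name and the toy columns are hypothetical; only the 90% threshold comes from the text.

```python
import pandas as pd


def drop_near_constant(df: pd.DataFrame, threshold: float = 0.90):
    """Drop columns whose most frequent value covers more than `threshold` of rows."""
    dominant_share = {
        # value_counts sorts descending, so the first entry is the top share.
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    }
    to_drop = [col for col, share in dominant_share.items() if share > threshold]
    return df.drop(columns=to_drop), to_drop


# Toy frame: `ip` is 0 in 95% of rows, `url_length` is fully varied.
features = pd.DataFrame({"ip": [0] * 19 + [1], "url_length": list(range(20))})
reduced, dropped = drop_near_constant(features)
```

The target column should be left out of the frame passed to such a function, since a heavily imbalanced label would otherwise be dropped as well.
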
Modeling:
- Model Training: Trained various classification models (Logistic Regression, SVM, Decision Tree, Random Forest, XGBoost, etc.) using LazyClassifier.
- Bias-Variance Decomposition: Implemented a function for bias-variance decomposition to analyze and quantify the bias and variance for each model.
- Performance Evaluation: Evaluated each model's performance using confusion matrices, F1 scores, classification reports, and other relevant metrics.
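
One common way to estimate bias and variance for a classifier under 0-1 loss is to retrain it on bootstrap samples and compare each round's predictions with the majority ("main") prediction. The sketch below assumes binary 0/1 labels and uses synthetic data; the function name is hypothetical and the notebook's own decomposition function may be defined differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bias_variance_01(make_model, X_train, y_train, X_test, y_test, rounds=20, seed=0):
    """Bootstrap estimate of average 0-1 loss, bias, and variance (binary labels)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.empty((rounds, len(X_test)), dtype=int)
    for r in range(rounds):
        idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
        preds[r] = make_model().fit(X_train[idx], y_train[idx]).predict(X_test)
    # Main prediction: majority vote across rounds for each test point.
    main_pred = (preds.mean(axis=0) >= 0.5).astype(int)
    avg_loss = float((preds != y_test).mean())
    bias = float((main_pred != y_test).mean())
    variance = float((preds != main_pred).mean())
    return avg_loss, bias, variance


# Synthetic stand-in for the URL feature matrix.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
loss, bias, var = bias_variance_01(
    lambda: DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te
)
```

A model with high bias and low variance is typically underfitting, while low bias with high variance points to overfitting, so comparing the two numbers across models helps explain differences in test performance.
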
Stacking and Voting Classifiers:
- Selecting the two models with the highest true positive rates and the two with the highest true negative rates.
- Combining each pair of top models using a stacking classifier approach to create ensemble models.
- Applying soft voting to the predictions from the two ensemble models.
- Re-evaluating the final integrated model on the test set and comparing its performance to the highest traditional model.
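
The ensemble described in these steps can be sketched with scikit-learn's `StackingClassifier` and `VotingClassifier`. The base models, the logistic-regression meta-learner, and the synthetic data below are placeholders chosen for illustration; in the project the pairs would be the models that actually scored the highest true positive and true negative rates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the URL feature matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Placeholder pairs: assumed best-TPR models and assumed best-TNR models.
tpr_pair = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("dt", DecisionTreeClassifier(random_state=0))]
tnr_pair = [("lr", LogisticRegression(max_iter=1000)),
            ("knn", KNeighborsClassifier())]

# One stacking ensemble per pair; the meta-learner is an assumption.
stack_tpr = StackingClassifier(estimators=tpr_pair,
                               final_estimator=LogisticRegression(max_iter=1000))
stack_tnr = StackingClassifier(estimators=tnr_pair,
                               final_estimator=LogisticRegression(max_iter=1000))

# Soft voting averages the predicted class probabilities of the two stacks.
final_model = VotingClassifier(
    estimators=[("stack_tpr", stack_tpr), ("stack_tnr", stack_tnr)],
    voting="soft",
)
final_model.fit(X_tr, y_tr)

# True positive and true negative rates on the held-out test set.
tn, fp, fn, tp = confusion_matrix(y_te, final_model.predict(X_te)).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```

Pairing a high-TPR stack with a high-TNR stack aims to balance missed phishing URLs against false alarms on benign ones.
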
Champion Model:

RimTouny/Phishing-Attack-Detection-using-Machine-Learning
