GitHub - Popseli/Prediction-of-Phishing-Webpages: This project develops an ML binary classification model to predict phishing webpages.

Project Overview

The aim of this project was to develop a binary classifier that distinguishes phishing webpages from legitimate ones using features based on URL and webpage structure. Phishing webpages belong to websites that attackers operate to collect users' sensitive information, such as email addresses, passwords and bank account details. The information is then used to impersonate victims in order to carry out malicious activities, including theft of money, further cyberattacks and cyberespionage. Because phishing webpages closely resemble legitimate webpages that prompt for sensitive information, users find it difficult to tell the two apart, and most users, even those with cybersecurity awareness, fall for phishing attacks.

The purpose of this project, therefore, is to build an automated ML-based classifier that exploits key differences in the structure of the two types of webpages to detect phishing webpages, thereby protecting online users from these attacks. The classifier can be used to build an application deployed as a built-in web browser feature or a browser plug-in, offering protection at the moment a user attempts to access any webpage that prompts for sensitive information. The dataset, feature descriptions and code for the project are available in the GitHub repository.

Objectives

To achieve the aim stated above, the following tasks were to be performed:

Investigating and identifying potential webpage-structure-based features for distinguishing the two types of webpages.
Extracting the features from reliable data sources of phishing and legitimate webpages.
Identifying the most relevant feature set for the prediction task.
Evaluating the relevant feature set with various ML classification algorithms to identify the best-performing algorithm across a wide range of performance metrics.
Performing model explainability analysis to learn the influence of each feature on the model's output.

Tasks Performed

Retrieving sets of active phishing and legitimate webpages from online repositories
Creating a dataset by extracting feature values from the webpages
Data profiling
Data cleaning
Exploratory data analysis
Feature correlation analysis
Automated feature selection
Model evaluation
ROC analysis
Hyperparameter tuning
Model explainability analysis
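
The modelling steps in the list above (feature selection, model evaluation, ROC analysis and hyperparameter tuning) can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic (generated with `make_classification`), the `f_classif` selector with `k=15`, the two candidate classifiers and the parameter grid are all assumptions, and CatBoost is left out so that the sketch depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real dataset: 35 features, binary label (assumption).
X, y = make_classification(
    n_samples=600, n_features=35, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical candidate algorithms to compare.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Model evaluation and ROC analysis: fit each candidate behind the same
# feature-selection step and record the area under the ROC curve.
auc_scores = {}
for name, model in candidates.items():
    pipeline = Pipeline([
        ("select", SelectKBest(f_classif, k=15)),  # automated feature selection
        ("model", model),
    ])
    pipeline.fit(X_train, y_train)
    phishing_proba = pipeline.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, phishing_proba)

# Hyperparameter tuning: small illustrative grid, cross-validated on the training split.
search = GridSearchCV(
    Pipeline([
        ("select", SelectKBest(f_classif, k=15)),
        ("model", GradientBoostingClassifier(random_state=0)),
    ]),
    param_grid={"model__n_estimators": [50, 100], "model__max_depth": [2, 3]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

print(auc_scores)
print(search.best_params_)
```

Placing the selector inside the pipeline means it is refitted on each training fold, so the held-out data does not influence which features are kept.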

Dataset

A total of 35 features based on URL and webpage structure, webpage contents and third-party information related to a webpage were identified for the prediction task. We then retrieved 12,691 phishing and 13,494 legitimate webpages that prompt for sensitive information from the PhishTank and Tranco online repositories. From each active webpage, we extracted feature values to form a dataset of 26,115 records. The records were stored in a MySQL database.
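
To give a feel for what URL-based feature extraction can look like, here is a minimal sketch using only the Python standard library. The feature names and the example URL are hypothetical and are not taken from the project's feature list, which is described in the Feature Descriptions.pdf file.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string."""
    parsed = urlparse(url)
    fqdn = parsed.hostname or ""  # fully qualified domain name, lowercased
    return {
        "url_length": len(url),
        "fqdn_length": len(fqdn),
        "num_dots_in_fqdn": fqdn.count("."),
        "num_digits_in_fqdn": sum(ch.isdigit() for ch in fqdn),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


# Hypothetical URL that imitates a bank login page on an unrelated domain.
features = extract_url_features(
    "http://login.example-bank.com.evil123.test/verify?id=1"
)
print(features)
```

Each webpage would yield one such record, and the records together form the rows of the dataset.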

Key Software and Libraries Used

Python
Scikit-learn
NumPy
Pandas
MySQL
GeoIP2 database
Google and Bing search engines
Google Translator
PhishTank
Beautiful Soup
Matplotlib
Seaborn
Category encoders
SHAP

Prediction Result Summary

CatBoost outperformed the other algorithms across most metrics, achieving an accuracy of 98.67%, a false positive rate (FPR) of 0.89% and a false negative rate (FNR) of 1.81%. After hyperparameter tuning, its performance improved to an accuracy of 98.76% and an FNR of 1.58%.
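
For reference, accuracy, FPR and FNR follow directly from confusion-matrix counts. The sketch below assumes phishing is the positive class; the counts are made-up illustrative numbers, not the project's results.

```python
def classification_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive accuracy, FPR and FNR from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),  # legitimate pages wrongly flagged as phishing
        "fnr": fn / (fn + tp),  # phishing pages missed by the classifier
    }


# Illustrative counts only (assumption), with phishing as the positive class.
rates = classification_rates(tp=980, fp=10, tn=990, fn=20)
print(rates)
```

A low FNR matters most for this task, because a missed phishing page exposes the user directly, whereas a false positive only blocks a legitimate page.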

Below is the summary of the performance results and ROC analysis of the evaluated algorithms indicating that CatBoost is the best performer.

The model explainability analysis using SHAP, shown below, ranks the features by their influence on the model output and hence by their importance. FQDNBlacklistCounts was found to be the most important of the selected features, and obfuscationCharFQDN the least important.
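
A global SHAP ranking of this kind is commonly obtained by averaging the absolute SHAP value of each feature over all samples. The sketch below shows that aggregation in plain Python; the matrix of SHAP values and the feature names are made up for illustration, and whether the project used exactly this aggregation is an assumption.

```python
# Hypothetical SHAP values: one row per sample, one column per feature.
shap_values = [
    [0.30, -0.05, 0.10],
    [-0.40, 0.02, 0.20],
    [0.35, -0.01, -0.15],
]
feature_names = ["feature_a", "feature_b", "feature_c"]  # placeholder names


def rank_by_mean_abs(values, names):
    """Rank features by mean absolute SHAP value, most influential first."""
    n_rows = len(values)
    mean_abs = [
        sum(abs(row[j]) for row in values) / n_rows for j in range(len(names))
    ]
    return sorted(zip(names, mean_abs), key=lambda pair: pair[1], reverse=True)


ranking = rank_by_mean_abs(shap_values, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value keeps features whose positive and negative contributions would otherwise cancel out from being ranked as unimportant.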

Folders and files

Proj_images
Dataset_Final.csv
Feature Descriptions.pdf
File 1 - Data_Extraction_Legitimate_Webpages.ipynb
File 2 - Data_Extraction_Phishing_Webpages.ipynb
File 3 - Functions.ipynb
File 4 - SQL_Functions.ipynb
File 5 - Phishing Webage Detection - ML - Codes.ipynb
README.md
