Phishing URL Detection

Objective

A phishing website is a common social engineering method that mimics trustful uniform resource locators (URLs) and webpages. The objective of this project is to train machine learning models and deep neural nets on the dataset created to predict phishing websites. Both phishing and benign URLs of websites are gathered to form a dataset and from them required URL and website content-based features are extracted. The performance level of each model is measures and compared.

Installation

To install the required packages and libraries, run this command in the project directory after Forking and cloning this repository:

pip install -r requirements.txt

Technologies Used

Feature Extraction

The system starts by retrieving URLs to be checked for phishing. These URLs can be collected from user input in the webpage created. Once the URLs are obtained, the system extracts relevant features from the web pages. These features are essential for training and evaluating the machine learning models. Various features were extracted from the URL database based on Domain, HTML and Address bar of the URLs.

Machine Learning Models

Various machine learning models are compared and The machine learning model with high accuracy is selected which predicts whether the URL is a phishing site or not. It provides a probability score or a binary classification (phishing or not phishing) based on the trained model's decision boundary. The system categorize URLs into "phishing" or "legitimate" and the result is finally displayed on the webpage.

Refer Phishingproject.ipynb for more details.

Result

Accuracy of various model used for URL detection

	ML Model	Accuracy	f1_score	Recall	Precision
0	Gradient Boosting Classifier	0.974	0.977	0.994	0.986
1	CatBoost Classifier	0.972	0.975	0.994	0.989
2	Multi-layer Perceptron	0.969	0.973	0.995	0.981
3	Random Forest	0.967	0.971	0.993	0.990
4	Support Vector Machine	0.964	0.968	0.980	0.965
5	Decision Tree	0.960	0.964	0.991	0.993
6	K-Nearest Neighbors	0.956	0.961	0.991	0.989
7	Logistic Regression	0.934	0.941	0.943	0.927
8	Naive Bayes Classifier	0.605	0.454	0.292	0.997

Conclusion

The final take away form this project is to explore various machine learning models, perform Exploratory Data Analysis on phishing dataset and understanding their features.
Creating this notebook helped me to learn a lot about the features affecting the models to detect whether URL is safe or not, also I came to know how to tuned model and how they affect the model performance.
The final conclusion on the Phishing dataset is that the some feature like "HTTTPS", "AnchorURL", "WebsiteTraffic" have more importance to classify URL is phishing URL or not.
Gradient Boosting Classifier currectly classify URL upto 97.4% respective classes and hence reduces the chance of malicious attachments.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
DataFiles		DataFiles
static		static
templates		templates
Phishingproject.ipynb		Phishingproject.ipynb
Procfile		Procfile
Project document.pdf		Project document.pdf
README.md		README.md
app.py		app.py
convert.py		convert.py
feature.py		feature.py
newmodel.pkl		newmodel.pkl
requirements.txt		requirements.txt

asrith-reddy/Phishing-detector

Folders and files

Latest commit

History

Repository files navigation

Phishing URL Detection

Objective

Installation

Technologies Used

Feature Extraction

Machine Learning Models

Refer Phishingproject.ipynb for more details.

Result

Conclusion

About

Topics

Resources

Stars

Watchers

Forks

Languages