Phishing URL Detection
with Machine Learning

This repository contains the code for training a machine learning model for phishing URL detection. The dataset used and the latest model are hosted on Hugging Face.

ℹ️ You can test the model on the demo page here.

Considerations Regarding the Model

The model architecture consists of a TF-IDF vectorizer (character n-grams + word n-grams) for feature extraction and a linear SVM for classification.
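As a rough illustration of this architecture, here is a minimal scikit-learn sketch. It is not the repository's training code: the `build_model` helper, the n-gram ranges, the default SVM settings, and the toy URLs and labels are all assumptions made for the example. The real parameters live in params.yaml.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_model() -> Pipeline:
    """Hypothetical helper: TF-IDF (char + word n-grams) followed by a linear SVM."""
    features = FeatureUnion(
        [
            # Character n-grams capture sub-word patterns such as "paypa1".
            ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
            # Word n-grams capture whole tokens such as "login" or "verify".
            ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ]
    )
    return Pipeline([("tfidf", features), ("svm", LinearSVC())])


# Made-up toy data, only to show the fit/predict flow.
urls = [
    "https://www.wikipedia.org/",
    "https://github.com/login",
    "http://paypa1-secure-login.example.com/verify",
    "http://free-gift-card.example.net/claim?id=123",
]
labels = ["legitimate", "legitimate", "phishing", "phishing"]

model = build_model()
model.fit(urls, labels)
predictions = model.predict(["http://secure-login.example.org/verify"])
print(predictions)
```

Combining both vectorizers in a `FeatureUnion` concatenates their sparse outputs, so the SVM sees character-level and word-level features side by side.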

✅ Lightweight: Easy to embed in your applications, with no need for a remote server to host it.

✅ Fast: Inference runs locally and adds negligible latency to your application.

✅ Works Offline: The use of URL tokens alone enables usage without an internet connection.

On the other hand, it may be less accurate than more complex models or models that use external features.
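To make "URL tokens alone" concrete, the sketch below extracts character n-grams and word-like tokens from a URL string using only the Python standard library, with no network access. The helper names `char_ngrams` and `word_tokens`, the n-gram size, and the tokenization regex are assumptions for illustration; the tokenization applied by the actual vectorizer may differ.

```python
import re


def char_ngrams(url: str, n: int = 3) -> list[str]:
    """Return every contiguous run of n characters in the URL."""
    return [url[i : i + n] for i in range(len(url) - n + 1)]


def word_tokens(url: str) -> list[str]:
    """Split the URL into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", url)


# Made-up example URL; nothing is fetched, only the string is inspected.
url = "http://paypa1-login.example.com/verify"
print(char_ngrams(url)[:4])  # ['htt', 'ttp', 'tp:', 'p:/']
print(word_tokens(url))      # ['http', 'paypa1', 'login', 'example', 'com', 'verify']
```

Because every feature comes from the string itself, no DNS lookup, page download, or WHOIS query is needed at prediction time.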

Reproduce The Model

# 1. Clone the repository
git clone https://github.com/pirocheto/phishing-url-detection.git

# 2. Go inside the project
cd phishing-url-detection

# 3. Install dependencies
poetry install --no-root

# 4. Run the pipeline
dvc repro -s download_data
dvc repro -s train

For more details, see the pipeline in the dvc.yaml file.

Project Structure

live: Artifacts created during pipeline execution
notebooks: Contains the code for the exploration phase
resources: Miscellaneous resources used by scripts
tests: Test files
src: Python scripts
params.yaml: Parameters for the DVC experiment
dvc.yaml: Pipeline to run the experiment and reproduce executions

Main Tools Used in This Project

DVC: Version data and experiments
CML: Post a comment to the pull request showing the metrics and parameters of an experiment
Scikit-Learn: Framework to train the model
Optuna: Find the best hyperparameters for the model
