Phishing URL Detection with Machine Learning

This repository contains the code for training a machine learning model that detects phishing URLs. The dataset and the latest model are hosted on Hugging Face.

ℹ️ You can test the model on the demo page.

Considerations Regarding the Model

The model combines TF-IDF vectorization (character n-grams and word n-grams) with a linear SVM for classification.
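To make the two n-gram families concrete, here is a minimal standard-library sketch of how a URL could be broken into character n-grams and word n-grams before TF-IDF weighting. The helper names (`char_ngrams`, `word_tokens`, `word_ngrams`), the tokenization rule, and the example URL are assumptions made for illustration; the repository's actual vectorizer settings may differ.

```python
import re


def char_ngrams(text, n):
    """Return every contiguous substring of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_tokens(url):
    """Split a URL into lowercase alphanumeric tokens.

    Splitting on non-alphanumeric characters is an assumed rule,
    not necessarily the one used by the project.
    """
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical URL used only for illustration.
url = "http://login.example.com/verify"
tokens = word_tokens(url)

print(tokens)                  # word unigrams
print(word_ngrams(tokens, 2))  # word bigrams
print(char_ngrams("login", 3)) # character trigrams of one token
```

Character n-grams can pick up small spelling tricks inside a token, while word n-grams capture which tokens appear together.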

Lightweight: The model is small and easy to handle, so you can embed it in your applications without a remote server to host it.

Fast: Inference runs locally, so your application incurs no network latency for predictions.

Works Offline: The model uses only tokens from the URL itself, so it works without an internet connection.

On the other hand, it may be less effective than more complex models or models that use external features.
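The architecture described in this section can be approximated in Scikit-Learn as shown below. This is a sketch under stated assumptions, not the repository's training code: the n-gram ranges, the default `LinearSVC` settings, and the four toy URLs with their labels are invented for illustration. The real parameters live in params.yaml and the real data on Hugging Face.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only (0 = legitimate, 1 = phishing).
urls = [
    "https://www.wikipedia.org/",
    "https://github.com/login",
    "http://paypa1-secure-login.example.com/verify/account",
    "http://192.0.2.10/bank/update.php?id=123",
]
labels = [0, 0, 1, 1]

# Two TF-IDF views of the same URL string, concatenated side by side.
# The n-gram ranges are assumed values, not the project's tuned ones.
features = make_union(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
)

# Linear SVM on top of the combined sparse features.
model = make_pipeline(features, LinearSVC())
model.fit(urls, labels)

predictions = model.predict(urls)
print(predictions)
```

Because the whole pipeline takes only the raw URL string as input, prediction needs no network lookups, which is what makes offline, in-process use possible.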

Reproduce The Model

# 1. Clone the repository
git clone https://github.com/pirocheto/phishing-url-detection.git

# 2. Go inside the project
cd phishing-url-detection

# 3. Install dependencies
poetry install --no-root

# 4. Run the pipeline
dvc repro -s download_data
dvc repro -s train

For more details, see the pipeline in the dvc.yaml file.

Project Structure

  • live: Artifacts created during pipeline execution
  • notebooks: Contains the code for the exploration phase
  • ressources: Miscellaneous resources used by scripts
  • tests: Test files
  • src: Python scripts
  • params.yaml: Parameters for the DVC experiment
  • dvc.yaml: Pipeline to run the experiment and reproduce executions

Main Tools Used in This Project

  • DVC: Version data and experiments
  • CML: Post a comment to the pull request showing the metrics and parameters of an experiment
  • Scikit-Learn: Framework to train the model
  • Optuna: Find the best hyperparameters for the model

About

Train a machine learning model for phishing URL detection with MLOps practices.