
Kedro Machine Learning Pipeline 🏯

"This DALL-E generated image, within Japan, Kedro orchestrates the rhythm of renewable insights amidst the choreography of data and predictions."

📘 Introduction

In this project, I challenged myself to transform notebook-based model-training code into a Kedro pipeline. The goal is to create modular, easy-to-train pipelines that follow MLOps best practices and simplify the deployment of ML models. With Kedro, you can execute just one command to train your models and obtain your pickle files, performance figures, and more (yes, just ONE command ✌️). Parameters can be adjusted easily in a YAML file, making it simple to add steps and test different models. Kedro also provides visualization and logging features to keep you informed about everything, and you can build all kinds of pipelines, not only for machine learning but for any data-driven workflow.

For an in-depth understanding of Kedro, consider exploring the official documentation at Kedro's Documentation.

Additionally, I integrated a CI pipeline on GitHub Actions for code-quality checks and functionality assurance, enhancing reliability and maintainability ✅

🎯 Project Goals

The objectives were:

  • Transition to Production: Convert code from Jupyter Notebooks to a production-ready and easily deployable format.
  • Model Integration: Facilitate the straightforward addition of models, along with their performance metrics, into the pipeline.
  • Workflow Optimization: Utilize the Kedro framework to establish reproducible, modular, and scalable data workflows.
  • CI/CD Automation: Implement an automated CI/CD pipeline using GitHub Actions to ensure continuous testing and code quality management.
  • Dockerization: Develop a Dockerized pipeline for ease of use, incorporating Docker volumes for persistent data management.

πŸ› οΈ Preparation & Prototyping in Notebooks

Before building the Kedro pipelines, I prototyped my ideas in Jupyter notebooks. Check the notebooks folder to see how I did it.

🧩 Project Workflow

The core of the project lives in the src directory, with each component neatly arranged as a Kedro pipeline:

  • Data Processing: Standardizes and cleans data in ZIP and CSV formats, preparing it for analysis. 🔍
  • Feature Engineering: Derives new features from the processed data to feed the models. 🛠️
  • Train-Test Split Pipeline: A dedicated pipeline to split the data into training and test sets. 📊
  • Model Training + Model Evaluation: Separate, modular, and independent pipelines for XGBoost, LightGBM, and Random Forest, each capable of training in async mode (a minimal sketch of how such a pipeline is wired follows this list). 🤖
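To make this concrete, here is a minimal, hypothetical sketch of how one of these training pipelines could be wired with Kedro. The dataset names, the parameter key, and the function bodies are illustrative assumptions, not the exact ones used in this repository; the real definitions live under src/pipelines/ and conf/base/.

```python
# Hypothetical sketch of a Kedro training + evaluation pipeline (all names are illustrative).
import numpy as np
import xgboost as xgb
from kedro.pipeline import Pipeline, node, pipeline


def train_xgboost(X_train, y_train, options: dict):
    """Fit an XGBoost regressor using hyperparameters read from the parameters YAML."""
    model = xgb.XGBRegressor(**options.get("model_params", {}))
    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test):
    """Return the RMSE of the fitted model on the held-out test set."""
    preds = model.predict(X_test)
    return float(np.sqrt(np.mean((np.asarray(y_test).ravel() - preds) ** 2)))


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=train_xgboost,
                # "params:xgboost_options" would be defined in a conf/base/parameters_*.yml file
                inputs=["X_train", "y_train", "params:xgboost_options"],
                outputs="xgboost_model",
                name="train_xgboost_node",
            ),
            node(
                func=evaluate_model,
                inputs=["xgboost_model", "X_test", "y_test"],
                outputs="xgboost_rmse",
                name="evaluate_xgboost_node",
            ),
        ]
    )
```

Registering such a create_pipeline function in the project's pipeline registry is what lets a single kedro run execute everything end to end.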

Kedro Visualization

The Kedro Viz tool provides an interactive canvas to visualize and understand the pipeline structure. It illustrates data flow, dependencies, and the orchestration of nodes and pipelines. Here is the visualization of this project:

kedro-pipeline

This tool greatly simplifies understanding how data progresses through the pipeline and what each stage produces. Kedro Viz also lets users inspect data samples, view parameters, analyze figures, and much more, adding transparency and interactivity.

📜 Logging and Monitoring

Logging is integral to understanding and troubleshooting pipelines. This project leverages Kedro's logging capabilities to provide real-time insights into pipeline execution, highlighting progress, warnings, and errors. This GIF demonstrates the use of the kedro run or make run command, showcasing the logging output in action:

Notice how the nodes are executed sequentially, and observe the RMSE outputs during validation for the XGBoost model. Logging in Kedro is highly customizable, allowing for tailored monitoring that meets the user's specific needs.
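For illustration, log lines like these can come from plain Python logging inside a node, which Kedro's logging configuration picks up automatically; the function and variable names below are assumptions for the sake of the example.

```python
# Minimal sketch: standard Python logging inside a Kedro node (names are illustrative).
import logging

import numpy as np

logger = logging.getLogger(__name__)


def validate_model(model, X_val, y_val):
    """Compute and log the validation RMSE so it shows up in the kedro run output."""
    preds = model.predict(X_val)
    rmse = float(np.sqrt(np.mean((np.asarray(y_val).ravel() - preds) ** 2)))
    logger.info("Validation RMSE: %.4f", rmse)
    return rmse
```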

πŸ“ Project Structure

A simplified overview of the Kedro project's structure:

Kedro-Energy-Forecasting/
│
├── conf/                                                # Configuration files for Kedro project
│   ├── base/
│   │   ├── catalog.yml                                  # Data catalog with dataset definitions
│   │   ├── parameters_data_processing_pipeline.yml      # Parameters for data processing
│   │   ├── parameters_feature_engineering_pipeline.yml  # Parameters for feature engineering
│   │   ├── parameters_random_forest_pipeline.yml        # Parameters for Random Forest pipeline
│   │   ├── parameters_lightgbm_training_pipeline.yml    # Parameters for LightGBM pipeline
│   │   ├── parameters_train_test_split_pipeline.yml     # Parameters for train-test split
│   │   └── parameters_xgboost_training_pipeline.yml     # Parameters for XGBoost training
│   └── local/
│
├── data/
│   ├── 01_raw/                                          # Raw, unprocessed datasets
│   ├── 02_processed/                                    # Cleaned and processed data ready for analysis
│   ├── 03_training_data/                                # Train/test datasets used for model training
│   ├── 04_reporting/                                    # Figures and results produced by the pipelines
│   └── 05_model_output/                                 # Trained models (pickle files)
│
├── src/
│   ├── pipelines/
│   │   ├── data_processing_pipeline/                    # Data processing pipeline
│   │   ├── feature_engineering_pipeline/                # Feature engineering pipeline
│   │   ├── random_forest_pipeline/                      # Random Forest pipeline
│   │   ├── lightgbm_training_pipeline/                  # LightGBM pipeline
│   │   ├── train_test_split_pipeline/                   # Train-test split pipeline
│   │   └── xgboost_training_pipeline/                   # XGBoost training pipeline
│   └── energy_forecasting_model/                        # Main module for the forecasting model
│
├── .gitignore                                           # Untracked files to ignore
├── Makefile                                             # Set of tasks to be executed
├── Dockerfile                                           # Instructions for building a Docker image
├── .dockerignore                                        # Files and directories to ignore in Docker builds
├── README.md                                            # Project documentation and setup guide
└── requirements.txt                                     # Project dependencies

🚀 Getting Started

First, clone the repository to get a copy of the code onto your local machine. Before diving into transforming raw data into a trained, pickled machine-learning model, please note:

🔴 Important Preparation Steps

Before you begin, please follow these preliminary steps to ensure a smooth setup:

  • Clear Existing Data Directories: If you plan to run the pipeline, I recommend removing these directories if they exist: data/02_processed, data/03_training_data, data/04_reporting, and data/05_model_output (leave only data/01_raw in the data folder). They will be recreated or updated when the pipeline runs; they are tracked in version control only to give you a glimpse of the expected outputs.

  • Makefile Usage: To utilize the Makefile for running commands, you must have make installed on your system. Follow the instructions in the installation guide to set it up.

Here is an example of the available targets (type make on the command line to list them):

  • Running the Kedro Pipeline:
    • For production environments, initialize your setup by running make prep-doc or pip install -r docker-requirements.txt to install the production dependencies.
    • For a development environment, where you may want to use Kedro Viz, work with Jupyter notebooks, or test everything thoroughly, run make prep-dev or pip install -r dev-requirements.txt to install all the development dependencies.

🌿 Standard Method (Conda / venv)

Adopt this method if you prefer a traditional Python development environment setup using Conda or venv.

  1. Set Up the Environment: Initialize a virtual environment with Conda or venv to isolate and manage your project's dependencies.

  2. Install Dependencies: Inside your virtual environment, execute pip install -r dev-requirements.txt to install the necessary Python libraries.

  3. Run the Kedro Pipeline: Trigger the pipeline processing by running make run or directly with kedro run. This step orchestrates your data transformation and modeling.

  4. Review the Results: Inspect the 04_reporting and 05_model_output directories to assess the performance and outcomes of your models (a sketch of loading a trained model follows this list).

  5. (Optional) Explore with Kedro Viz: To visually explore your pipeline's structure and data flows, initiate Kedro Viz using make viz or kedro viz run.
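As a quick illustration of step 4, a trained model saved under data/05_model_output can be loaded and scored directly. The file name and feature columns below are assumptions; adapt them to the outputs your run actually produces.

```python
# Minimal sketch of loading a trained model pickle and scoring new rows
# (file name and feature columns are illustrative assumptions).
import pickle

import pandas as pd

with open("data/05_model_output/xgboost_model.pkl", "rb") as f:
    model = pickle.load(f)

new_rows = pd.DataFrame({"hour": [14], "dayofweek": [2], "month": [6]})
print(model.predict(new_rows))
```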

🐳 Docker Method

Prefer this method for a containerized approach, ensuring a consistent development environment across different machines. Ensure Docker is operational on your system before you begin.

  1. Build the Docker Image: Construct your Docker image with make build or kedro docker build. This command leverages dev-requirements.txt for environment setup. For advanced configurations, see the Kedro Docker Plugin Documentation.

  2. Run the Pipeline Inside a Container: Execute the pipeline within Docker using make dockerun or kedro docker run. The Kedro-Docker plugin handles volume mappings so data flows seamlessly between your local setup and the container.

  3. Access the Results: Upon completion, the 04_reporting and 05_model_output directories will contain your model's reports and trained files, ready for review.

For additional assistance or to explore more command options, refer to the Makefile or consult kedro --help.

🌌 Next Steps?

With our Kedro Pipeline 🏗 now capable of efficiently transforming raw data 🔄 into trained models 🤖, and a Dockerized environment 🐳 for our code, the next phase involves moving beyond the current repository scope 🚀 to orchestrate data updates automatically using tools like Databricks, Airflow, Azure Data Factory... This progression allows for the seamless integration of fresh data into our models.

Moreover, implementing experiment tracking and versioning with MLflow 📊 or leveraging Kedro Viz's versioning capabilities 📈 will significantly enhance our project's management and reproducibility. These steps are pivotal for maintaining a clean machine learning workflow that not only achieves our goal of simplifying model training processes 🛠 but also ensures our system remains dynamic and scalable with minimal effort.
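For the experiment-tracking direction, here is a minimal sketch of what MLflow logging could look like around a training run. This is not part of the current pipeline; the run name, parameters, and metric value are placeholders.

```python
# Hypothetical sketch of MLflow experiment tracking for a training run
# (not implemented in this repo; names and values are placeholders).
import mlflow

with mlflow.start_run(run_name="xgboost_energy_forecast"):
    mlflow.log_params({"n_estimators": 500, "learning_rate": 0.05})
    # ... train and evaluate the model here ...
    mlflow.log_metric("validation_rmse", 123.45)
    # mlflow.sklearn.log_model(model, "model")  # optionally version the fitted model
```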

🌐 Let's Connect!

You can connect with me on LinkedIn or check out my GitHub repositories:
