
Kedro Machine Learning Pipeline 🏯

"This DALL-E generated image, within Japan, Kedro orchestrates the rhythm of renewable insights amidst the choreography of data and predictions."

📘 Introduction

In this project, I challenged myself to transform notebook-based model-training code into a Kedro pipeline. The goal is to create modular, easy-to-train pipelines that follow MLOps best practices and simplify the deployment of ML models. With Kedro, you can execute just one command to train your models and obtain your pickle files, performance figures, and more (yes, just ONE command ✌️). Parameters can be adjusted easily in a YAML file, making it simple to add steps and test different models. Kedro also provides visualization and logging features to keep you informed about everything, and you can build all kinds of pipelines, not only for machine learning but for any data-driven workflow.

For an in-depth understanding of Kedro, consider exploring the official documentation at Kedro's Documentation.

Additionally, I integrated a CI pipeline on GitHub Actions for code-quality checks and functionality assurance, enhancing reliability and maintainability ✅

🎯 Project Goals

The objectives were:

  • Transition to Production: Convert code from Jupyter Notebooks to a production-ready and easily deployable format.
  • Model Integration: Facilitate the straightforward addition of models, along with their performance metrics, into the pipeline.
  • Workflow Optimization: Utilize the Kedro framework to establish reproducible, modular, and scalable data workflows.
  • CI/CD Automation: Implement an automated CI/CD pipeline using GitHub Actions to ensure continuous testing and code quality management.
  • Dockerization: Develop a Dockerized pipeline for ease of use, incorporating Docker volumes for persistent data management.

πŸ› οΈ Preparation & Prototyping in Notebooks

Before building the Kedro pipelines, I prototyped my ideas in Jupyter notebooks. Check the notebooks folder to see how I did it.

🧩 Project Workflow

The core of the project lives in the src directory, with each component neatly arranged as a Kedro pipeline:

  • Data Processing: Standardizes and cleans data in ZIP and CSV formats, preparing it for analysis. 🔍
  • Feature Engineering: Derives new features from the processed data to feed the models. 🛠️
  • Train-Test Split Pipeline: A dedicated pipeline to split the data into training and test sets. 📊
  • Model Training + Model Evaluation: Separate, modular, and independent pipelines for XGBoost, LightGBM, and Random Forest, each capable of training in async mode (a minimal sketch of how such a pipeline is wired follows this list). 🤖
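To make this concrete, here is a minimal, hypothetical sketch of how one of these training pipelines could be wired with Kedro. The dataset names, the parameter key, and the function bodies are illustrative assumptions, not the exact ones used in this repository; the real definitions live under src/pipelines/ and conf/base/.

```python
# Hypothetical sketch of a Kedro training + evaluation pipeline (all names are illustrative).
import numpy as np
import xgboost as xgb
from kedro.pipeline import Pipeline, node, pipeline


def train_xgboost(X_train, y_train, options: dict):
    """Fit an XGBoost regressor using hyperparameters read from the parameters YAML."""
    model = xgb.XGBRegressor(**options.get("model_params", {}))
    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test):
    """Return the RMSE of the fitted model on the held-out test set."""
    preds = model.predict(X_test)
    return float(np.sqrt(np.mean((np.asarray(y_test).ravel() - preds) ** 2)))


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=train_xgboost,
                # "params:xgboost_options" would be defined in a conf/base/parameters_*.yml file
                inputs=["X_train", "y_train", "params:xgboost_options"],
                outputs="xgboost_model",
                name="train_xgboost_node",
            ),
            node(
                func=evaluate_model,
                inputs=["xgboost_model", "X_test", "y_test"],
                outputs="xgboost_rmse",
                name="evaluate_xgboost_node",
            ),
        ]
    )
```

Registering such a create_pipeline function in the project's pipeline registry is what lets a single kedro run execute everything end to end.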

Kedro Visualization

The Kedro Viz tool provides an interactive canvas to visualize and understand the pipeline structure. It illustrates data flow, dependencies, and the orchestration of nodes and pipelines. Here is the visualization of this project:

kedro-pipeline

This tool greatly simplifies understanding how data progresses through the pipeline and what each stage produces. Kedro Viz also lets users inspect data samples, view parameters, analyze figures, and much more, adding transparency and interactivity.

📜 Logging and Monitoring

Logging is integral to understanding and troubleshooting pipelines. This project leverages Kedro's logging capabilities to provide real-time insights into pipeline execution, highlighting progress, warnings, and errors. This GIF demonstrates the use of the kedro run or make run command, showcasing the logging output in action:

Notice how the nodes are executed sequentially, and observe the RMSE outputs during validation for the XGBoost model. Logging in Kedro is highly customizable, allowing for tailored monitoring that meets the user's specific needs.
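For illustration, log lines like these can come from plain Python logging inside a node, which Kedro's logging configuration picks up automatically; the function and variable names below are assumptions for the sake of the example.

```python
# Minimal sketch: standard Python logging inside a Kedro node (names are illustrative).
import logging

import numpy as np

logger = logging.getLogger(__name__)


def validate_model(model, X_val, y_val):
    """Compute and log the validation RMSE so it shows up in the kedro run output."""
    preds = model.predict(X_val)
    rmse = float(np.sqrt(np.mean((np.asarray(y_val).ravel() - preds) ** 2)))
    logger.info("Validation RMSE: %.4f", rmse)
    return rmse
```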

πŸ“ Project Structure

A simplified overview of the Kedro project's structure:

Kedro-Energy-Forecasting/
│
├── conf/                                                # Configuration files for Kedro project
│   ├── base/
│   │   ├── catalog.yml                                  # Data catalog with dataset definitions
│   │   ├── parameters_data_processing_pipeline.yml      # Parameters for data processing
│   │   ├── parameters_feature_engineering_pipeline.yml  # Parameters for feature engineering
│   │   ├── parameters_random_forest_pipeline.yml        # Parameters for Random Forest pipeline
│   │   ├── parameters_lightgbm_training_pipeline.yml    # Parameters for LightGBM pipeline
│   │   ├── parameters_train_test_split_pipeline.yml     # Parameters for train-test split
│   │   └── parameters_xgboost_training_pipeline.yml     # Parameters for XGBoost training
│   └── local/
│
├── data/
│   ├── 01_raw/                                          # Raw, unprocessed datasets
│   ├── 02_processed/                                    # Cleaned and processed data ready for analysis
│   ├── 03_training_data/                                # Train/test datasets used for model training
│   ├── 04_reporting/                                    # Figures and results produced by the pipelines
│   └── 05_model_output/                                 # Trained models (pickle files)
│
├── src/
│   ├── pipelines/
│   │   ├── data_processing_pipeline/                    # Data processing pipeline
│   │   ├── feature_engineering_pipeline/                # Feature engineering pipeline
│   │   ├── random_forest_pipeline/                      # Random Forest pipeline
│   │   ├── lightgbm_training_pipeline/                  # LightGBM pipeline
│   │   ├── train_test_split_pipeline/                   # Train-test split pipeline
│   │   └── xgboost_training_pipeline/                   # XGBoost training pipeline
│   └── energy_forecasting_model/                        # Main module for the forecasting model
│
├── .gitignore                                           # Untracked files to ignore
├── Makefile                                             # Set of tasks to be executed
├── Dockerfile                                           # Instructions for building a Docker image
├── .dockerignore                                        # Files and directories to ignore in Docker builds
├── README.md                                            # Project documentation and setup guide
└── requirements.txt                                     # Project dependencies

🚀 Getting Started

First, clone the repository to get a copy of the code onto your local machine. Before diving into transforming raw data into a trained, pickled machine-learning model, please note:

🔴 Important Preparation Steps

Before you begin, please follow these preliminary steps to ensure a smooth setup:

  • Clear Existing Data Directories: If you plan to run the pipeline, I recommend removing these directories if they exist: data/02_processed, data/03_training_data, data/04_reporting, and data/05_model_output (leave only data/01_raw in the data folder). They will be recreated or updated when the pipeline runs; they are tracked in version control only to give you a glimpse of the expected outputs.

  • Makefile Usage: To utilize the Makefile for running commands, you must have make installed on your system. Follow the instructions in the installation guide to set it up.

Here is an example of the available targets (type make on the command line to list them):

  • Running the Kedro Pipeline:
    • For production environments, initialize your setup by running make prep-doc or pip install -r docker-requirements.txt to install the production dependencies.
    • For a development environment, where you may want to use Kedro Viz, work with Jupyter notebooks, or test everything thoroughly, run make prep-dev or pip install -r dev-requirements.txt to install all the development dependencies.

🌿 Standard Method (Conda / venv)

Adopt this method if you prefer a traditional Python development environment setup using Conda or venv.

  1. Set Up the Environment: Initialize a virtual environment with Conda or venv to isolate and manage your project's dependencies.

  2. Install Dependencies: Inside your virtual environment, execute pip install -r dev-requirements.txt to install the necessary Python libraries.

  3. Run the Kedro Pipeline: Trigger the pipeline processing by running make run or directly with kedro run. This step orchestrates your data transformation and modeling.

  4. Review the Results: Inspect the 04_reporting and 05_model_output directories to assess the performance and outcomes of your models (a sketch of loading a trained model follows this list).

  5. (Optional) Explore with Kedro Viz: To visually explore your pipeline's structure and data flows, initiate Kedro Viz using make viz or kedro viz run.
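As a quick illustration of step 4, a trained model saved under data/05_model_output can be loaded and scored directly. The file name and feature columns below are assumptions; adapt them to the outputs your run actually produces.

```python
# Minimal sketch of loading a trained model pickle and scoring new rows
# (file name and feature columns are illustrative assumptions).
import pickle

import pandas as pd

with open("data/05_model_output/xgboost_model.pkl", "rb") as f:
    model = pickle.load(f)

new_rows = pd.DataFrame({"hour": [14], "dayofweek": [2], "month": [6]})
print(model.predict(new_rows))
```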

🐳 Docker Method

Prefer this method for a containerized approach, ensuring a consistent development environment across different machines. Ensure Docker is operational on your system before you begin.

  1. Build the Docker Image: Construct your Docker image with make build or kedro docker build. This command leverages dev-requirements.txt for environment setup. For advanced configurations, see the Kedro Docker Plugin Documentation.

  2. Run the Pipeline Inside a Container: Execute the pipeline within Docker using make dockerun or kedro docker run. The Kedro-Docker plugin handles volume mappings so data flows seamlessly between your local setup and the container.

  3. Access the Results: Upon completion, the 04_reporting and 05_model_output directories will contain your model's reports and trained files, ready for review.

For additional assistance or to explore more command options, refer to the Makefile or consult kedro --help.

🌌 Next Steps?

With our Kedro Pipeline 🏗 now capable of efficiently transforming raw data 🔄 into trained models 🤖, and a Dockerized environment 🐳 for our code, the next phase involves moving beyond the current repository scope 🚀 to orchestrate data updates automatically using tools like Databricks, Airflow, Azure Data Factory... This progression allows for the seamless integration of fresh data into our models.

Moreover, implementing experiment tracking and versioning with MLflow 📊 or leveraging Kedro Viz's versioning capabilities 📈 will significantly enhance our project's management and reproducibility. These steps are pivotal for maintaining a clean machine learning workflow that not only achieves our goal of simplifying model training processes 🛠 but also ensures our system remains dynamic and scalable with minimal effort.
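For the experiment-tracking direction, here is a minimal sketch of what MLflow logging could look like around a training run. This is not part of the current pipeline; the run name, parameters, and metric value are placeholders.

```python
# Hypothetical sketch of MLflow experiment tracking for a training run
# (not implemented in this repo; names and values are placeholders).
import mlflow

with mlflow.start_run(run_name="xgboost_energy_forecast"):
    mlflow.log_params({"n_estimators": 500, "learning_rate": 0.05})
    # ... train and evaluate the model here ...
    mlflow.log_metric("validation_rmse", 123.45)
    # mlflow.sklearn.log_model(model, "model")  # optionally version the fitted model
```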

🌐 Let's Connect!

You can connect with me on LinkedIn or check out my GitHub repositories:
