Build an ML Pipeline for Short-Term Rental Prices in NYC

Links

Introduction

This project builds an end-to-end, reproducible machine learning pipeline that predicts the prices of short-term rental properties from their features. The pipeline is designed so that its components can be run independently of one another.

The project employs MLOps tools and best practices. The pipeline components use Weights & Biases for experiment tracking and for artifact storage and versioning, MLflow for orchestration, and Hydra for configuration management.

NOTE: The model in this project is only a baseline, since the focus here is on the MLOps aspect of the analysis.

Usage

To run the pipeline directly from this GitHub repo (without cloning), use the following command:

> mlflow run https://github.com/alturkim/build-ml-pipeline-for-short-term-rental-prices.git -v 1.0.2

Alternatively, you can clone the repo locally and use the following commands to interact with the pipeline.

To run the entire pipeline, use the following command:

> mlflow run .

To run a specific step, e.g. basic_cleaning, use the following command:

> mlflow run . -P steps=basic_cleaning

To run multiple steps together, e.g. the download and the basic_cleaning steps, use the following command:

> mlflow run . -P steps=download,basic_cleaning
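One way an entry point could interpret the steps parameter is sketched below. This is only an illustration: the function name resolve_steps, the shortened DEFAULT_STEPS list, and the "all" default are assumptions, not code taken from this repository's main.py.

```python
from typing import List

# Hypothetical, shortened list of default steps; the real pipeline has more.
DEFAULT_STEPS = ["download", "basic_cleaning"]


def resolve_steps(steps_param: str = "all") -> List[str]:
    """Turn the comma-separated value of -P steps=... into a list of step names."""
    if steps_param == "all":  # assumed default meaning "run every default step"
        return list(DEFAULT_STEPS)
    # Split on commas, trim whitespace, and drop empty entries.
    return [name.strip() for name in steps_param.split(",") if name.strip()]


print(resolve_steps("download,basic_cleaning"))
print(resolve_steps("test_regression_model"))
```

Keeping the default list separate from the parsing logic makes it easy to leave a step out of the default run while still allowing it to be requested by name.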

To override any parameter in the configuration file, use the hydra_options parameter. The following command sets modeling.random_forest.n_estimators to 10 and etl.min_price to 50:

> mlflow run . \
  -P steps=download,basic_cleaning \
  -P hydra_options="modeling.random_forest.n_estimators=10 etl.min_price=50"
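To make the effect of these dotted overrides concrete, the sketch below applies the same override string to a nested dictionary. It only mimics the outcome and is not Hydra's own parser, which supports a richer syntax. The helper apply_overrides and the starting values in config are hypothetical placeholders.

```python
from typing import Any, Dict


def apply_overrides(config: Dict[str, Any], overrides: str) -> Dict[str, Any]:
    """Apply space-separated 'dotted.key=value' overrides to a nested dict in place."""
    for item in overrides.split():
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Walk down the nesting, creating levels that do not exist yet.
            node = node.setdefault(part, {})
        # Keep integers as integers; leave everything else as a string.
        node[leaf] = int(raw_value) if raw_value.lstrip("-").isdigit() else raw_value
    return config


# Placeholder values, not the repository's actual configuration.
config = {
    "modeling": {"random_forest": {"n_estimators": 100}},
    "etl": {"min_price": 10},
}
apply_overrides(config, "modeling.random_forest.n_estimators=10 etl.min_price=50")
print(config)
```

Each dotted key names a path through the nested configuration, so a single string can change values in unrelated sections without editing the configuration file.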

Model Testing

Running the entire pipeline as indicated above does NOT execute the testing step that evaluates the model on test data. You need to request it explicitly through the steps parameter, after promoting the trained model to production with the prod tag:

> mlflow run . -P steps=test_regression_model

Data Testing

Deterministic and non-deterministic tests are used to verify the fitness of the data. Deterministic tests include, among others, checking the size of the dataset and the range of the dependent variable (price). Non-deterministic tests include using KL divergence to compare the distribution of any new data against the reference dataset that was used to train the initial model.
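The two kinds of checks can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's test suite: the function names, the price bounds, the bin count, the smoothing constant, the threshold, and the toy data are all made up for the example.

```python
import math
from typing import List, Sequence


def price_in_range(prices: Sequence[float], min_price: float, max_price: float) -> bool:
    """Deterministic check: every price lies within [min_price, max_price]."""
    return all(min_price <= p <= max_price for p in prices)


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Relative frequencies over equal-width bins, smoothed so no bin is exactly zero."""
    counts = [1e-9] * bins  # assumed smoothing constant to avoid log(0)
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions defined over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def similar_distribution(sample, reference, lo, hi, bins=10, threshold=0.2) -> bool:
    """Statistical check: the sample's histogram stays close to the reference's."""
    p = histogram(sample, lo, hi, bins)
    q = histogram(reference, lo, hi, bins)
    return kl_divergence(p, q) < threshold


# Toy price data, invented for the example.
reference = [50, 60, 70, 80, 90, 100, 110, 120]
sample = [55, 65, 75, 85, 95, 105, 115, 125]
shifted = [150, 160, 170, 180, 190, 150, 160, 170]

print(price_in_range(sample, 10, 350))
print(similar_distribution(sample, reference, 0, 200))
print(similar_distribution(shifted, reference, 0, 200))
```

Both histograms must share the same bin edges, otherwise the per-bin ratios in the KL divergence are not comparable. The threshold decides how much drift is tolerated and would normally be set in the configuration file.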

Pipeline Visualization

This link provides an interactive version of this visualization.

License
