Tutorial for implementing data validation in data science pipelines

NatanMish/data_validation

Data Validation for Data Science

This repository contains the data, code, and Jupyter notebooks for the Data Validation for Data Science tutorial. The tutorial has three sections, one for each step of the production data science model life cycle:

  1. Database management (using Great Expectations)
  2. Training pipeline (using Pandera)
  3. Model serving (using Pydantic)
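
As a taste of the model-serving step, here is a minimal Pydantic sketch (assuming Pydantic is installed). The field names are a few real columns from the House Prices dataset, picked purely for illustration; the tutorial notebooks build fuller models.

```python
from pydantic import BaseModel, ValidationError

# Illustrative request model; LotArea, YearBuilt and OverallQual are a small
# subset of the dataset's columns, chosen as an example.
class HouseFeatures(BaseModel):
    LotArea: int
    YearBuilt: int
    OverallQual: int

# A well-formed prediction request parses cleanly...
ok = HouseFeatures(LotArea=8450, YearBuilt=2003, OverallQual=7)

# ...while a malformed one is rejected before it ever reaches the model.
try:
    HouseFeatures(LotArea="not a number", YearBuilt=2003, OverallQual=7)
    rejected = False
except ValidationError:
    rejected = True
```

Catching bad inputs at the serving boundary like this means the model code itself never has to defend against type errors.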

Each section comes with a notebook in which there are explanations, code snippets and exercises.

If you would like to see me run through these notebooks from PyData London 2022, you can navigate to this YouTube video: Data Validation for Data Science | PyData London 2022

Data

The dataset used in this tutorial is taken from the House Prices prediction competition on Kaggle. Two CSV files are located in the data folder: train.csv and test.csv.

Instructions

To follow the notebooks and exercises there are two options:

  1. Use your own Python environment with Jupyter installed. Launch it with the `jupyter notebook` command, select the notebook you want from the notebooks folder, and follow the instructions inside. To run all of the tools with their full feature sets, Python 3.8 or later is recommended.
  2. Use Google Colaboratory, with no local installation needed. Click the link to the repository's GitHub page, choose one of the notebooks in the notebooks folder, and from the interactive view click the link to open it in Colab.
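
For option 1, a local setup might look like the following sketch. The package names are assumptions (the repository may pin specific versions); adjust to match any requirements file it provides.

```shell
# Check the interpreter first: 3.8+ is recommended for full tool support.
python3 --version

# Create and activate an isolated environment for the tutorial.
python3 -m venv .venv
source .venv/bin/activate

# Install Jupyter plus the three validation libraries the sections use.
pip install jupyter great_expectations pandera pydantic

# Start Jupyter, then open a notebook from the notebooks/ folder.
jupyter notebook
```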