POC-Great-Expectations

This repository provides a practical demonstration of using Great Expectations in an end-to-end pipeline with Airflow and dbt.

This project is an adaptation of the official Great Expectations tutorial from their GitHub repository. The original tutorial targeted an older version of GX, so this project has been updated to version 0.16.8, the latest release at the time of writing.

Overview

The pipeline loads data from files into a database and then transforms it. Airflow orchestrates the pipeline, and dbt handles the transformation ("T") step of ELT. Specifically, this tutorial directory contains the following (a minimal sketch of a validation task follows the list):

  • airflow - A folder containing the Airflow DAG file for this data pipeline.
  • data - A folder containing two datasets used in the tutorial.
  • dbt - A folder with the dbt project structure.
  • great_expectations - A folder containing the Great Expectations configuration files.
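To give a sense of how validation fits into the orchestration, below is a minimal, illustrative sketch of an Airflow task that runs a Great Expectations checkpoint through the GX 0.16 Python API. This is not the repository's actual pipeline.py; the checkpoint name, project path, and task id are placeholders.

    from datetime import datetime

    import great_expectations as gx
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder path; in the real pipeline this is set via PROJECT_ROOT_PATH.
    PROJECT_ROOT_PATH = "/path/to/poc-great-expectations"

    def validate_source_data():
        # Load the Data Context stored in the great_expectations/ folder.
        context = gx.get_context(
            context_root_dir=f"{PROJECT_ROOT_PATH}/great_expectations"
        )
        # Run a checkpoint (the name here is a placeholder) and fail the task
        # if any expectation in the suite is not met.
        result = context.run_checkpoint(checkpoint_name="source_data_checkpoint")
        if not result.success:
            raise ValueError("Great Expectations validation failed")

    with DAG(
        dag_id="pipeline_with_gx",
        start_date=datetime(2023, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="validate_source_data",
            python_callable=validate_source_data,
        )

Note that the dependency list also includes the airflow-provider-great-expectations package, which offers a dedicated GreatExpectationsOperator as an alternative to wrapping the GX API in a PythonOperator.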

Instructions

To use this repository, follow these steps:

  1. Install the required dependencies by running the following commands:

    pip install great_expectations==0.16.8
    pip install sqlalchemy==1.4.16
    pip install apache-airflow==2.6.1
    pip install psycopg2==2.9.6
    pip install airflow-provider-great-expectations==0.1.1
  2. Update the following variables in the pipeline.py file (illustrative values are shown after these instructions):

    • DATABASE_URL: Set this to your database connection string.
    • PROJECT_ROOT_PATH: Set this to the root directory of this repository on your local machine.
  3. Create a config_variables.yml file inside the uncommitted folder and store your PostgreSQL database credentials in it. Use the following template, replacing <username> and <password> with your actual database username and password, and adjusting the other fields if necessary:

    my_postgres_db_yaml_creds:
      drivername: postgresql
      username: <username>
      password: <password>
      host: localhost
      database: tutorials_db
      port: '5432'
  4. Once everything is set up, you can run the entire DAG or individual tasks within it.

    • To run the DAG, use the following command:

      airflow dags test pipeline_with_gx
    • To run a specific task within the DAG, use the following command:

      airflow tasks test pipeline_with_gx <task_name>

    Replace <task_name> with the name of the specific task you want to run.
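As a reference for step 2, the two variables in pipeline.py are plain Python assignments. The values below are illustrative only, assuming a local PostgreSQL instance and the tutorials_db database from the credentials template above; adjust them to your environment:

    # Example values only; replace with your own connection string and path.
    DATABASE_URL = "postgresql://<username>:<password>@localhost:5432/tutorials_db"
    PROJECT_ROOT_PATH = "/home/<user>/poc-great-expectations"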

For more detailed instructions on how to use this repository, please refer to the blog post.
