This project, "Snowflake Internal Stage Data Pipeline," builds a robust data ingestion pipeline from a MySQL database to Snowflake using Snowflake's Internal Stage, orchestrated with Apache Airflow. The pipeline performs incremental loads, improving efficiency and reducing data transfer volumes, and a metadata-driven approach streamlines and automates the ingestion process.
- Incremental Data Loading: Efficiently transfers only new or updated records from MySQL to Snowflake, minimizing data transfer and processing time.
- Metadata-Driven Pipeline: Leverages metadata to dynamically configure and execute data ingestion tasks, reducing the need for hard-coded configurations.
- Airflow Orchestration: Utilizes Apache Airflow to manage workflow orchestration, scheduling, and monitoring, ensuring reliable execution of data loading processes.
- Snowflake Integration: Employs Snowflake's Internal Stage for secure and scalable data staging before ingestion into the target tables.
- MySQL: Source database for extracting data.
- Snowflake: Target cloud data warehouse for analytics.
- Apache Airflow: Workflow orchestration tool to manage the data pipeline.
- Python: The primary programming language for scripting and automation.
Two tables are ingested into Snowflake, each with a dedicated DAG:
- amazone_books
- amazonebook_reviews
```sql
CREATE TABLE amazone_books (
    book_id       INT NOT NULL AUTO_INCREMENT,
    book_title    TEXT,
    book_amount   FLOAT,
    book_author   TEXT,
    book_rating   FLOAT,
    book_link     TEXT,
    business_date DATE DEFAULT (CURRENT_DATE),
    PRIMARY KEY (book_id)
);

CREATE TABLE amazonebook_reviews (
    book_id        INT NOT NULL,
    reviewer_name  TEXT,
    rating         FLOAT,
    review_title   TEXT,
    review_content TEXT,
    reviewed_on    DATE,
    business_date  DATE DEFAULT (CURRENT_DATE)
);
```
Incremental loading requires primary keys on the source tables. The respective primary keys are:
- amazone_books
- book_id
- amazonebook_reviews
- book_id
- reviewer_name
- business_date
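The watermark-based extraction can be sketched as a simple query builder: only rows whose watermark column (here `business_date`) is newer than the last successfully loaded value are pulled from MySQL. The function name and parameters below are illustrative, not taken from the project code.

```python
def build_incremental_query(table: str, watermark_col: str, last_value: str) -> str:
    """Build a SELECT that fetches only rows newer than the last loaded watermark.

    `last_value` is the high-water mark recorded after the previous run
    (e.g. the max business_date already present in Snowflake).
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > '{last_value}' "
        f"ORDER BY {watermark_col}"
    )

# Example: pull only books added since the last run on 2024-01-01.
sql = build_incremental_query("amazone_books", "business_date", "2024-01-01")
```

In a production pipeline the watermark value would be read from a metadata table rather than hard-coded, and the query would use bound parameters instead of string interpolation.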
Note: the source data comes from another project. For details on how the source data is generated, refer to AmazonBooks_DataPipeline.
The Airflow DAG IDs for the respective tables are:
- amazone_books
- Snowflake_InternalStage_amazone_books_Dag
- amazonebook_reviews
- Snowflake_InternalStage_amazonebook_review_Dag
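Each DAG stages an extracted file in a Snowflake internal stage with `PUT`, then loads it with `COPY INTO`. The sketch below builds those two statements; it assumes a table stage (`@%table`) and a headered CSV export, which may differ from the project's actual stage and file format.

```python
def stage_and_load_commands(table: str, local_csv: str) -> tuple[str, str]:
    """Build the PUT and COPY INTO statements for a Snowflake table stage."""
    stage = f"@%{table}"  # table stage, one form of Snowflake internal stage
    put_cmd = f"PUT file://{local_csv} {stage} OVERWRITE = TRUE"
    copy_cmd = (
        f"COPY INTO {table} FROM {stage} "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
    return put_cmd, copy_cmd

put_cmd, copy_cmd = stage_and_load_commands("amazone_books", "/tmp/amazone_books.csv")
```

These strings would be executed through the Snowflake connector (or Airflow's Snowflake operator) as two sequential tasks in the DAG.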
- Workflow Design: A DAG (Directed Acyclic Graph) in Airflow defines the sequence of tasks for data extraction, staging, and loading.
- Incremental Loading: The pipeline identifies new or updated records in MySQL using timestamps or sequence IDs.
- Metadata Management: Metadata definitions (i.e., config) guide the extraction and loading processes, allowing for flexibility and scalability.
- Monitoring and Logging: Airflow provides comprehensive monitoring and logging capabilities, facilitating troubleshooting and performance optimization.
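The metadata-driven design can be sketched as a per-table config that drives SQL generation, so adding a table means adding a config entry rather than new code. The config structure and helper below are hypothetical, assembled from the primary keys listed above; the project's actual metadata format may differ.

```python
# Hypothetical per-table metadata; the real project config may differ.
TABLE_CONFIG = {
    "amazone_books": {
        "primary_keys": ["book_id"],
        "watermark": "business_date",
    },
    "amazonebook_reviews": {
        "primary_keys": ["book_id", "reviewer_name", "business_date"],
        "watermark": "business_date",
    },
}

def merge_condition(table: str) -> str:
    """Build the ON clause of a Snowflake MERGE from the configured primary keys."""
    keys = TABLE_CONFIG[table]["primary_keys"]
    return " AND ".join(f"tgt.{k} = src.{k}" for k in keys)
```

A MERGE keyed on these conditions lets the load upsert staged rows, so re-running a window is idempotent.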
- Data Quality Checks: Implement additional data validation and quality checks within the pipeline.
- Advanced Scheduling: Enhance scheduling capabilities to support more complex dependencies and triggers.
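One possible shape for the planned data quality checks is a validation step between extraction and staging that rejects rows with NULL primary-key columns, since the incremental/MERGE logic depends on those keys. This is a sketch of a future enhancement, not existing project code.

```python
def check_no_null_keys(rows: list[dict], key_cols: list[str]) -> list[dict]:
    """Return the rows whose primary-key columns contain NULL/None values.

    An empty result means the batch is safe to stage; otherwise the
    offending rows can be logged and the Airflow task failed.
    """
    return [row for row in rows if any(row.get(col) is None for col in key_cols)]

bad = check_no_null_keys(
    [{"book_id": 1, "book_title": "A"}, {"book_id": None, "book_title": "B"}],
    ["book_id"],
)
```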
- Incremental Load: Medium
- Internal Stage: Snowflake Docs
- Power of Metadata-Driven ETL Frameworks: LinkedIn