Building Data Lakehouse

This project builds a data lakehouse that enables organizations to store, manage, and analyze large datasets in a cost-effective, secure, and scalable manner. It provides a centralized repository for all data, so users can easily access and query it through a unified interface.

MinIO provides distributed object storage for the data, Delta Lake provides ACID-compliant transactions for managing it, Spark enables distributed computing for analytics, Presto provides fast SQL queries, and the Hive Metastore provides a unified catalog. Together, these components let organizations quickly access and analyze their data and make better data-driven decisions.
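As a rough illustration of how these components fit together, the sketch below shows a Spark session configured to reach MinIO through the S3A connector, use Delta Lake as the table format, and register tables in the Hive Metastore. The endpoint, credentials, and service host names are assumptions about a typical docker-compose setup, not values taken from this repository.

# Minimal sketch: wiring Spark to MinIO (S3A), Delta Lake, and the Hive Metastore
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse")
    # MinIO object storage via S3A (endpoint and credentials are assumptions)
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # Delta Lake as the table format
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Hive Metastore as the shared catalog (host name is an assumption)
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)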

This project also aims to create an Extract, Load, and Transform (ELT) pipeline that ingests data from a Postgres database into the lakehouse. The pipeline uses Apache Spark to extract the data from Postgres, load it into the lakehouse, and then transform it into the desired format. Once the data is loaded, it is available for downstream analytics and reporting.
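As a minimal sketch of what one pipeline step might look like, the snippet below reads a table from Postgres over JDBC and writes it to the datalake bucket as a Delta table. The JDBC URL, credentials, and table name are hypothetical placeholders; the actual logic lives in postgres_to_s3.py and clean_data.py, and a Spark session configured as in the sketch above is assumed.

# Extract: read a table from Postgres over JDBC (URL, credentials, and table name are assumptions)
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/CarParts")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "parts")
    .option("user", "postgres")
    .option("password", "postgres")
    .load()
)

# Load: store it in the lakehouse as a raw Delta table on MinIO
raw.write.format("delta").mode("overwrite").save("s3a://datalake/raw/parts")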

Architecture

(Architecture diagram)

Setup

  • First, build the Spark and Presto Docker images
docker build -t presto:0.272.1 ./Dockerfiles/presto
docker build -t cluster-apache-spark:3.1.1 Dockerfiles/spark
  • Start the services with Docker Compose
docker-compose up
  • Create a bucket in MinIO to store the data (name it datalake)

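One way to create the bucket is with a recent MinIO client (mc); the endpoint and credentials below are assumptions about the local setup:

mc alias set lake http://localhost:9000 ACCESS_KEY SECRET_KEY
mc mb lake/datalake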
  • Create a Postgres database (name it CarParts and use the CarParts.sql file to create the tables)

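For example, with psql (the host and user are assumptions about the local Postgres container):

psql -h localhost -U postgres -c 'CREATE DATABASE "CarParts";'
psql -h localhost -U postgres -d CarParts -f CarParts.sql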
  • Install the JAR files needed for the Spark project

docker exec -it master bash /opt/workspace/dependencies/packages_installer.sh 
  • Run the first script, which extracts the data from Postgres and loads it into the lakehouse
docker exec -it master spark-submit --master spark://master:7077 \
        --deploy-mode cluster \
        --executor-memory 5G \
        --executor-cores 8 \
        /opt/workspace/postgres_to_s3.py
  • Run the second script, which cleans and transforms the loaded data
docker exec -it master spark-submit --master spark://master:7077 \
        --deploy-mode cluster \
        --executor-memory 5G \
        --executor-cores 8 \
        /opt/workspace/clean_data.py
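
Once both jobs have finished, the resulting tables can be queried from Presto through the Hive catalog. Assuming the Presto CLI (often named presto or presto-cli) is on the PATH, the server address, catalog, schema, and table name below are placeholders for illustration:

presto --server localhost:8080 --catalog hive --schema default \
       --execute "SELECT COUNT(*) FROM parts"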


Built With

  • Spark
  • MinIO
  • PostgreSQL
  • Hive Metastore
  • Presto
  • Delta Lake

Author

Youssef EL ASERY

🤝 Support

Contributions, issues, and feature requests are welcome!

Give a ⭐️ if you like this project!
