This repo contains my experimental projects on Data Engineering.
Updated Mar 6, 2023 - Python
Automate Apache Spark on Hadoop with Airflow in the cloud
Project files from my 2023 Data Engineering Nanodegree.
We build an ETL pipeline using Airflow that does the following: downloads data from an AWS S3 bucket; runs a Spark/Spark SQL job on the downloaded data, producing a cleaned-up dataset of orders that missed their delivery deadline; and uploads the cleaned-up dataset back to the same S3 bucket, in a folder primed for higher-level analytics.
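The three steps described above (download, transform, upload) can be sketched in plain Python. This is only an illustrative stand-in, not the repository's code: a dict plays the role of the S3 bucket, a list filter plays the role of the Spark SQL job, and the column names `order_id`, `promised_date`, and `delivered_date` are assumptions, since the description does not give the schema. In a real pipeline each step would presumably be an Airflow task.

```python
import csv
import io
from datetime import date

# Hypothetical column names; the project description does not state the schema.
ORDER_ID, PROMISED, DELIVERED = "order_id", "promised_date", "delivered_date"


def find_late_orders(rows):
    """Keep rows delivered after the promised date.

    Rows with a missing or malformed date are dropped, which stands in
    for the "clean-up" part of the transform.
    """
    late = []
    for row in rows:
        try:
            promised = date.fromisoformat(row[PROMISED])
            delivered = date.fromisoformat(row[DELIVERED])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows that cannot be parsed
        if delivered > promised:
            late.append(row)
    return late


def run_pipeline(storage, source_key, target_key):
    """Download -> transform -> upload, using a dict as a stand-in bucket."""
    # 1. "Download": read the raw CSV text stored under source_key.
    rows = list(csv.DictReader(io.StringIO(storage[source_key])))

    # 2. Transform: plain-Python stand-in for the Spark/Spark SQL job.
    late = find_late_orders(rows)

    # 3. "Upload": write the cleaned CSV back under a different key.
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=[ORDER_ID, PROMISED, DELIVERED],
        extrasaction="ignore",  # ignore any columns outside the assumed schema
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(late)
    storage[target_key] = out.getvalue()
    return len(late)
```

Calling `run_pipeline(bucket, "raw/orders.csv", "analytics/late_orders.csv")` on a dict that holds the raw CSV text writes the filtered CSV under the second key and returns the number of late orders. Both key names here are made up for illustration.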
Keywords: Python, Airflow, AWS, S3, Redshift, ETL
Udacity project within the Data Engineer Nanodegree
Akka hands-on for the Distributed Data Management course at the Hasso-Plattner-Institute
This repository contains infrastructure code for the Wizeline Data Engineering Bootcamp (DEB) 2023. It is one of two repositories for the DEB. The other (deb-application) houses the application code.
Collection of POC/dev data infrastructure. | #SE
Project performing data modeling, data engineering, and data analysis on a corporation's employee data
🔴 📕 Our repository of Jupyter notebooks that serve as blog posts.
Design an employee database using SQL
SQL analyses of a corporation's employee database. UT Austin Bootcamp homework assignment.
Simplified blueprints for building data pipelines with Google Cloud Storage (GCS).
Use supervised machine learning to analyze key performance indicators of a player's strengths and weaknesses. The process involved gathering data from an API, cleaning it, storing it in SQL and CSV files, and training multiple machine learning models, such as Random Forest and logistic regression classifiers.
Repository for testing data build tool (dbt)