Code for the blog post Data Engineering Project for Beginners.
This data engineering project includes the following:
- Airflow: To schedule and orchestrate DAGs.
- Postgres: To store Airflow's metadata (which you can view via the Airflow UI) and to hold a schema representing upstream databases.
- DuckDB: To act as our warehouse.
- Quarto with Plotly: To convert code in markdown format to HTML files that can be embedded in your app or served as is.
- minio: To provide an S3-compatible open source storage system.
- Apache Spark: To process our data, specifically to run a classification algorithm.
Start a Codespace, run make up, wait until it's ready, and click the link in the Ports tab to open the Airflow UI.
Note: Make sure to turn off your Codespace when you are done; you only have a limited amount of free Codespaces usage.
- git
- Docker with at least 4GB of RAM and Docker Compose v1.27.0 or later
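Before running make up, you can sanity-check that the prerequisites are on your PATH. This is a minimal sketch; the check_cmd helper is ours for illustration, not part of the repo:

```shell
# Illustrative helper (not part of the project): report whether a tool is installed.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_cmd git
check_cmd docker
check_cmd make
# Docker Compose v2 uses the "docker compose" subcommand form:
docker compose version 2>/dev/null || echo "docker compose: missing"
```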
Clone the repo and run the make up command as shown here:
```shell
git clone https://github.com/josephmachado/beginner_de_project.git
cd beginner_de_project
make up
make ci # run checks and tests
# To stop your containers, run make down
```
Open Airflow at localhost:8080 and sign in with airflow as both the username and password. Switch on the user_analytics_dag and it will start running.
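If you prefer the CLI to the UI, a DAG can also be unpaused with Airflow's airflow dags unpause command run inside a container. The service name airflow-scheduler below is an assumption about the compose file, so this sketch only prints the command (a dry run) for you to review before executing:

```shell
# Dry-run sketch: print (not execute) the command to unpause a DAG via the CLI.
# "airflow-scheduler" is an assumed docker compose service name; check your compose file.
unpause_dag() {
  echo "docker compose exec airflow-scheduler airflow dags unpause $1"
}

unpause_dag user_analytics_dag
```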
On completion, you can see the rendered dashboard HTML at ./dags/scripts/dashboard/dashboard.html.
Read this post for information on setting up CI/CD, IaC (Terraform), make commands, and automated testing.