A data engineering project that demonstrates pipelines, built for the Data Engineering Zoomcamp.
This project presents statistics on the history of programming languages.
Some of the questions it answers:
How many programming languages appeared per year?
How many GitHub repos use each programming language?
How many programming languages died per year?
What is the ratio of dead to alive programming languages?
The dataset contains information on over 4000 programming languages. It includes facts about each language, such as the year it was created and its rank, along with other parameters you will discover as you explore the dataset.
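The questions above can be sketched in plain Python on a handful of made-up rows. This is only an illustration, not the project's pipeline code: the field names `appeared` and `last_activity`, the placeholder language names, and the rule that a language counts as dead when its last activity is before `CUTOFF_YEAR` are all assumptions. Check the real column names in the dataset before adapting it.

```python
from collections import Counter

# Hypothetical sample rows. The field names "appeared" and "last_activity"
# are assumptions, not confirmed column names from the dataset.
languages = [
    {"title": "lang_a", "appeared": 1990, "last_activity": 2023},
    {"title": "lang_b", "appeared": 1990, "last_activity": 2005},
    {"title": "lang_c", "appeared": 1995, "last_activity": 2005},
    {"title": "lang_d", "appeared": 2010, "last_activity": 2022},
]

# Assumed definition: a language is "dead" if its last recorded
# activity is earlier than this year.
CUTOFF_YEAR = 2020


def appeared_per_year(rows):
    """Count how many languages first appeared in each year."""
    return Counter(row["appeared"] for row in rows)


def died_per_year(rows, cutoff=CUTOFF_YEAR):
    """Count dead languages by the year of their last activity."""
    return Counter(
        row["last_activity"] for row in rows if row["last_activity"] < cutoff
    )


def dead_alive_ratio(rows, cutoff=CUTOFF_YEAR):
    """Return dead / alive; infinity if no language is alive."""
    dead = sum(1 for row in rows if row["last_activity"] < cutoff)
    alive = len(rows) - dead
    return dead / alive if alive else float("inf")


if __name__ == "__main__":
    print(appeared_per_year(languages))
    print(died_per_year(languages))
    print(dead_alive_ratio(languages))
```

In the project itself, aggregations like these would more likely live in the Spark or dbt layer; the sketch only shows the shape of each calculation.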
Python, Apache Airflow, Apache Spark with PySpark, dbt, Google Cloud Platform, Metabase
git clone https://github.com/civispro/de_zoomcamp_project.git
cd de_zoomcamp_project
Set up the connection to Google Cloud and apply Terraform:
gcloud auth application-default login
terraform init
terraform plan
terraform apply
conda install setuptools
export AIRFLOW_HOME=/de_zoomcamp_project/airflow
cd
nano .bashrc
# add the following line to .bashrc so the variable persists across sessions:
export AIRFLOW_HOME=/de_zoomcamp_project/airflow
cd /de_zoomcamp_project/airflow
pip install 'apache-airflow==2.5.1' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.9.txt"
airflow db init
airflow users create \
--username airflow \
--firstname airflow \
--lastname airflow \
--role Admin \
--email li@li.ru
airflow webserver -p 8080
airflow scheduler
Sign up for dbt Cloud and use the dbt folder from this repo:
dbt build
docker run -d -p 3000:3000 --name metabase metabase/metabase