🔢 History of programming languages

Data Engineering project to demonstrate pipelines for Data Engineering Zoomcamp.

Problem

This project presents some interesting statistics about the history of programming languages.

Some of the questions answered:

How many programming languages appeared each year?
How many GitHub repositories use each programming language?
How many programming languages died each year?
What is the ratio of dead to living programming languages?

Dataset

The dataset contains information on more than 4,000 programming languages, including facts such as the year each language was created, its rank, and other attributes you will discover as you explore the dataset.

Kaggle
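
As a taste of the kind of analysis the pipeline supports, here is a minimal PySpark sketch that counts how many languages appeared per year. The file name and the "appeared" column name are assumptions about the Kaggle CSV schema, not verified against the actual dataset.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("languages_per_year").getOrCreate()

# "programming_languages.csv" and the "appeared" column are assumed names;
# substitute the real file and column from the Kaggle dataset.
languages = spark.read.csv("programming_languages.csv", header=True, inferSchema=True)

# Count how many languages first appeared in each year.
per_year = (
    languages
    .where(F.col("appeared").isNotNull())
    .groupBy("appeared")
    .agg(F.count("*").alias("languages_appeared"))
    .orderBy("appeared")
)

per_year.show(20)

The other questions above follow the same group-by pattern with a different grouping column or filter.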

Dashboard

(Dashboard screenshot)

Tech Stack

Python, Apache Airflow, Apache Spark with PySpark, dbt, Google Cloud Platform, Metabase

Installation

Clone the repo to your computer:

git clone https://github.com/civispro/de_zoomcamp_project.git
cd de_zoomcamp_project

Set up the connection to Google Cloud and apply the Terraform configuration:

gcloud auth application-default login
terraform init
terraform plan
terraform apply
conda install setuptools  
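
To sanity-check that Terraform actually created the storage bucket, a quick probe with the google-cloud-storage client works; it reuses the application-default credentials from the gcloud login above. The bucket name below is a hypothetical placeholder, not the name defined in this repo's Terraform files.

from google.cloud import storage

# Uses the application-default credentials created by
# `gcloud auth application-default login`.
client = storage.Client()

# "de-zoomcamp-project-bucket" is a placeholder; substitute the bucket
# name from your terraform apply output.
bucket = client.bucket("de-zoomcamp-project-bucket")
print("bucket exists:", bucket.exists())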

Install Apache Airflow

Set AIRFLOW_HOME for the current shell, then add the same export line to your ~/.bashrc so it persists across sessions:

export AIRFLOW_HOME=/de_zoomcamp_project/airflow

Install Airflow with the version-pinned constraints file, initialize the metadata database, and create an admin user:

cd /de_zoomcamp_project/airflow
pip install 'apache-airflow==2.5.1' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.9.txt"
airflow db init
airflow users create \
    --username airflow \
    --firstname airflow \
    --lastname airflow \
    --role Admin \
    --email li@li.ru

Start the webserver and the scheduler (in separate terminals):

airflow webserver -p 8080
airflow scheduler

Put the GCP credentials file "cred.json" into the folder named airflow.
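
For orientation, here is a minimal sketch of what a DAG in the airflow folder might look like: download the dataset, then upload it to GCS. Every name in it (the DAG id, the download URL, the bucket, the file paths) is an illustrative assumption rather than the actual DAG shipped in this repo, and it requires the apache-airflow-providers-google package.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import (
    LocalFilesystemToGCSOperator,
)

with DAG(
    dag_id="languages_to_gcs",       # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@once",
    catchup=False,
) as dag:
    # Download the dataset; the URL is a placeholder for the real source.
    download = BashOperator(
        task_id="download_dataset",
        bash_command="curl -sSL -o /tmp/languages.csv https://example.com/languages.csv",
    )

    # Upload the file to the bucket created by Terraform (name is a placeholder).
    upload = LocalFilesystemToGCSOperator(
        task_id="upload_to_gcs",
        src="/tmp/languages.csv",
        dst="raw/languages.csv",
        bucket="de-zoomcamp-project-bucket",
        gcp_conn_id="google_cloud_default",
    )

    download >> upload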

Sign up for dbt Cloud and use the dbt folder from this repo:

dbt build

Run Metabase in Docker to visualize the data:

docker run -d -p 3000:3000 --name metabase metabase/metabase
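
Metabase will then be available at http://localhost:3000; when adding a database connection there, its BigQuery driver accepts a service-account JSON key, presumably the same cred.json used above.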

