
Correlation between Covid-19 and the Economy

This project provides ready-to-analyze datasets for Covid-19 and related aspects of daily life such as stock indexes, unemployment rates, job openings, and business activities. We implement a full-stack solution that includes: (1) a front-end dashboard built with Flask + jQuery, (2) an ETL pipeline built with Apache Spark + Python, with Apache Airflow scheduling the ETL tasks, and (3) MySQL as the back-end DBMS.

By integrating Covid-19 data sources with other related data sources, this project helps users answer deeper analytics questions such as:

  • Which stock indexes still performed well during the pandemic?
  • Which occupation group had the highest/lowest unemployment rate during the pandemic?
  • Which occupations still had high job openings despite the pandemic?
  • Did the peak of the unemployment rate and the peak of new Covid-19 cases in the US happen at the same time?
  • By what percentage did travel time decrease during the pandemic?

The overall architecture is:

(architecture diagram)

The expected results from the Flask app look like:

(dashboard screenshots)

Updated: Added a storytelling slide deck that walks you through the process of building this project.

Quick Start

Clone the repository

$ git clone https://github.com/trdtnguyen/sb-capstone1
$ cd sb-capstone1

Running on your local machine:

Step 1: Create and activate a virtual environment

$ virtualenv -p /usr/bin/python3.8 venv
$ source venv/bin/activate

Note: To avoid Python version errors caused by a mismatch between the Python versions of the Spark driver and Spark workers, follow the notes below:

  • Ensure the Python used by Spark has the same version as the one in your virtual environment.
  • Ensure PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON point to the same Python binary (see the sketch below).
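
One way to satisfy both notes is to pin the driver and worker interpreters to the virtualenv's Python before the SparkSession is created. A minimal sketch, not the project's exact setup:

import os
import sys

from pyspark.sql import SparkSession

# sys.executable points at the Python running inside the activated venv
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = (
    SparkSession.builder
    .master("local[*]")      # standalone Spark on a single machine
    .appName("covid19-etl")
    .getOrCreate()
)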

Step 2: Install Python packages

(venv) $ pip3 install -r requirements.txt

Step 3: Set up the database

This project uses MySQL as the back-end database. Change config.cnf to match your MySQL settings:

[DATABASE]
MYSQL_DATABASE=covid19cor
AIRFLOW_DATABASE=airflowdb
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_ROOT_PASSWORD=rootpassword
MYSQL_USER=myusername
MYSQL_PASSWORD=mysqlpassword

Create database and tables:

$ mysqladmin -u root -p create covid19cor
$ mysql -u root -p covid19cor < sql/create_table.sql
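
For reference, a minimal sketch of how Python code can pick up the [DATABASE] settings above and open a SQLAlchemy connection (the pymysql driver is an assumption, not necessarily what the project uses):

from configparser import ConfigParser

from sqlalchemy import create_engine, text

config = ConfigParser()
config.read("config.cnf")
db = config["DATABASE"]

engine = create_engine(
    f"mysql+pymysql://{db['MYSQL_USER']}:{db['MYSQL_PASSWORD']}"
    f"@{db['MYSQL_HOST']}:{db['MYSQL_PORT']}/{db['MYSQL_DATABASE']}"
)

# Quick connectivity check against the newly created database
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())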

Step 4: Run the ETL manually

Before running the ETL pipeline, ensure your Spark driver and workers are started and configured properly.

We test the project using standalone Spark on a single machine.

  • Single ETL load: run etl_test.py without arguments to load all data from COVID_OLDEST_DATE (Jan 2020) to the current date.
$ python tasks/etl_test.py
  • Incremental ETL load: run etl_test.py with an end date (yyyy-mm-dd) to simulate a daily ETL task (a sketch of the argument handling follows these commands).
$ python tasks/etl_test.py 2021-02-20
$ python tasks/etl_test.py 2021-02-21
$ python tasks/etl_test.py 2021-02-22
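
The optional date argument can be handled along these lines. This is a sketch only; run_etl and the exact value of COVID_OLDEST_DATE are placeholders, not the project's actual code:

import sys
from datetime import date, datetime

COVID_OLDEST_DATE = date(2020, 1, 1)   # illustrative constant

def run_etl(start_date: date, end_date: date) -> None:
    # Placeholder for the real pipeline: extract -> transform -> load
    print(f"Running ETL from {start_date} to {end_date}")

def main() -> None:
    if len(sys.argv) > 1:
        # Incremental load: the caller passes the end date as yyyy-mm-dd
        end_date = datetime.strptime(sys.argv[1], "%Y-%m-%d").date()
    else:
        # Full load: everything from COVID_OLDEST_DATE to today
        end_date = date.today()
    run_etl(start_date=COVID_OLDEST_DATE, end_date=end_date)

if __name__ == "__main__":
    main()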

Step 5: Configure and start the front-end Flask app

Change Flask's configuration in the config.cnf file to match your settings:

[API]
API_HOST=<your_ip> # such as 192.168.0.2
API_PORT=8082

Start the Flask web server:

$ python flaskapp/app.py
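
A rough sketch of how app.py can read the [API] section above and bind the server (the /health route is illustrative, not the project's actual API):

from configparser import ConfigParser

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Illustrative route; the project's actual endpoints serve the dashboard
    return jsonify(status="ok")

if __name__ == "__main__":
    # Allow inline comments such as "# such as 192.168.0.2" in config.cnf
    config = ConfigParser(inline_comment_prefixes=("#",))
    config.read("config.cnf")
    api = config["API"]
    app.run(host=api["API_HOST"], port=int(api["API_PORT"]))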

Step 6: Access the dashboard

After the Flask app has started, open your browser at http://<API_HOST>:<API_PORT>, for example http://192.168.0.2:8082/.

Running in a docker container:

$ docker-compose build && docker-compose up

In the current version, docker-compose up covers steps 1 - 3 of the local setup.

Steps 4 - 6 can be run the same way inside the docker container:

$ docker exec -it airflow bash

Techniques

This section lists common data engineering techniques and general programming techniques used in this project.

ETL

  • Extract data from sources using various methods such as pandas' DataReader, HTML page parsing, and APIs.
  • Manipulate Spark DataFrames and pandas DataFrames to load, clean, and transform data from raw tables into dimension and fact tables:
    • Add/remove columns to/from a DataFrame
    • Transpose a DataFrame's columns into rows
    • Aggregate data from daily frequency to monthly frequency
    • Handle null values (filling nulls with the previous non-null value) using Window functions; see the sketch after this list.
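
For the null-handling item, a minimal sketch of a forward fill with a Window function (column names are illustrative, not the project's schema):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("ffill-demo").getOrCreate()

df = spark.createDataFrame(
    [("2021-02-20", 10.0), ("2021-02-21", None), ("2021-02-22", 12.5)],
    ["report_date", "index_value"],
)

# Carry the last non-null value forward in date order
w = Window.orderBy("report_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_filled = df.withColumn("index_value", F.last("index_value", ignorenulls=True).over(w))
df_filled.show()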

Data pipeline scheduling

  • Use Apache Airflow to schedule ETL tasks (see the DAG sketch below).
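
A minimal sketch of such a DAG, assuming Airflow 2.x (the dag_id, schedule, and script path are illustrative, not the project's actual DAG):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="covid19_daily_etl",          # illustrative dag_id
    start_date=datetime(2021, 2, 20),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        # {{ ds }} is Airflow's execution date in yyyy-mm-dd format, matching
        # the incremental-load argument of etl_test.py; the path is assumed
        bash_command="python /opt/airflow/tasks/etl_test.py {{ ds }}",
    )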

Database Design, SQL

  • MySQL serves as the back-end RDBMS for the data warehouse, which uses a star schema.
  • The ETL tasks connect to it through the JDBC connector (for PySpark) and SQLAlchemy (for pandas); a sketch of both paths follows this list.
  • Data transforms in the ETL tasks combine SQL and PySpark's APIs.
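
The two access paths can be sketched as follows. The table name fact_covid19_us_daily, the credentials, the pymysql driver, and the row values are placeholders for illustration only:

import pandas as pd
from pyspark.sql import SparkSession
from sqlalchemy import create_engine

spark = SparkSession.builder.appName("db-demo").getOrCreate()

jdbc_url = "jdbc:mysql://localhost:3306/covid19cor"
props = {"user": "myusername", "password": "mysqlpassword", "driver": "com.mysql.cj.jdbc.Driver"}

# PySpark: append a transformed DataFrame into a fact table (requires the
# MySQL Connector/J jar on the Spark classpath)
fact_df = spark.createDataFrame([("2021-02-20", 77000)], ["report_date", "new_cases"])
fact_df.write.jdbc(url=jdbc_url, table="fact_covid19_us_daily", mode="append", properties=props)

# pandas: read the same table back through SQLAlchemy
engine = create_engine("mysql+pymysql://myusername:mysqlpassword@localhost:3306/covid19cor")
df = pd.read_sql("SELECT * FROM fact_covid19_us_daily", engine)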

Front-end techniques: Flask, jQuery, Chart.js, HTML/javascript/CSS

  • The project supports various front-end interfaces such as a RESTful API using Flask, Tableau, and Jupyter notebooks.
  • Draw charts using Chart.js + jQuery; a sketch of a Flask endpoint that feeds such a chart follows this list.
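
A sketch of a REST endpoint the Chart.js + jQuery front end could call for time-series data (the route, table name, columns, and credentials are illustrative, not the project's actual API or schema):

import pandas as pd
from flask import Flask, jsonify
from sqlalchemy import create_engine

app = Flask(__name__)
engine = create_engine("mysql+pymysql://myusername:mysqlpassword@localhost:3306/covid19cor")

@app.route("/api/covid_vs_stock")
def covid_vs_stock():
    df = pd.read_sql(
        "SELECT report_date, new_cases, close_price FROM fact_covid_stock ORDER BY report_date",
        engine,
    )
    # Chart.js consumes parallel arrays of labels and data points
    return jsonify(
        labels=df["report_date"].astype(str).tolist(),
        new_cases=df["new_cases"].tolist(),
        close_price=df["close_price"].tolist(),
    )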

Docker, docker-compose

  • Package Flask, Airflow, Spark, and MySQL with Docker and docker-compose.
  • The app container waits for the database container to be ready (see the sketch below).
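
One common way to implement the wait is a small retry loop against the database. A sketch only; the docker-compose service name "mysql" and the credentials are assumptions, and the project may instead rely on a wait-for script or healthchecks:

import time

from sqlalchemy import create_engine, text

def wait_for_db(url: str, retries: int = 30, delay: float = 2.0) -> None:
    engine = create_engine(url)
    for attempt in range(1, retries + 1):
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
            print("Database is ready")
            return
        except Exception:
            print(f"Database not ready (attempt {attempt}/{retries}), retrying...")
            time.sleep(delay)
    raise RuntimeError("Database did not become ready in time")

if __name__ == "__main__":
    # "mysql" is assumed to be the docker-compose service name for the database
    wait_for_db("mysql+pymysql://myusername:mysqlpassword@mysql:3306/covid19cor")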

Python programming

  • Object-oriented programming in Python.
  • Load configuration values through the config.cnf file.
  • Use the singleton design pattern so a GlobalUtil class serves as a shared utility across the project's other classes (see the sketch after this list).
  • Logging.
  • Unit tests using pytest.
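
A minimal sketch of the singleton idea behind GlobalUtil (only the class name comes from the description above; the attributes are illustrative):

from configparser import ConfigParser

class GlobalUtil:
    _instance = None

    def __new__(cls):
        # Create the shared instance on first use and return it ever after
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.config = ConfigParser()
            cls._instance.config.read("config.cnf")
        return cls._instance

# Every caller gets the same object, and therefore the same parsed config
assert GlobalUtil() is GlobalUtil()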

Project details: