
Azure_Covid19_Analysis

This project analyses the number of COVID-19 cases and deaths in 2020 and transforms the population data for further use. The Azure components used here are:

  1. Azure Data Lake Storage Gen 2
  2. Azure Blob Storage
  3. Azure Data Factory
  4. Azure Databricks
  5. Azure SQL Database
  6. Azure Service Principal

Extract

Pipeline for Ingesting ECDC data

  1. Implement a ForEach loop that fetches each file listed in ecdc_file_list.json from https://github.com/SharadChoudhury/Azure_Covid19_Analysis/raw/ecdc/main (a standalone sketch of this step follows the list).
  2. Store the ingested files in the raw/ecdc folder in ADLS.
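In ADF this is a ForEach activity wrapping a Copy activity, but the same logic can be sketched in plain Python. This is a minimal, hypothetical version: the storage account name, container layout, and the structure of ecdc_file_list.json are assumptions, not taken from the repository.

```python
# Hypothetical sketch of the ECDC ingestion loop outside ADF.
# STORAGE_ACCOUNT and the JSON layout below are assumptions.
import json
import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

BASE_URL = "https://github.com/SharadChoudhury/Azure_Covid19_Analysis/raw/ecdc/main"
STORAGE_ACCOUNT = "mycoviddatalake"  # assumption: your ADLS Gen2 account name

service = DataLakeServiceClient(
    account_url=f"https://{STORAGE_ACCOUNT}.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("raw")

# ecdc_file_list.json is assumed to look like:
# [{"sourceFileName": "cases_deaths.csv", "sinkFileName": "cases_deaths.csv"}, ...]
with open("ecdc_file_list.json") as f:
    file_list = json.load(f)

for entry in file_list:
    resp = requests.get(f"{BASE_URL}/{entry['sourceFileName']}", timeout=60)
    resp.raise_for_status()
    # Land each file in the raw/ecdc folder, overwriting on re-runs
    fs.get_file_client(f"ecdc/{entry['sinkFileName']}").upload_data(
        resp.content, overwrite=True
    )
```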

Pipeline for Ingesting Population data from Blob Storage to Data Lake

  1. Store the raw population file under population_raw/BLOB in the Azure Blob container.
  2. Implement a pipeline that first checks whether the raw file exists in the Blob container, then fetches its metadata and, if the column count matches the required count, copies the file to ADLS (see the sketch after this list).
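The ADF version uses Validation, Get Metadata, If Condition, and Copy activities. Below is a minimal Python sketch of the same existence-plus-column-count check; the account name, file name, gzipped-TSV format, and EXPECTED_COLUMNS value are all assumptions.

```python
# Sketch of the validation logic, assuming a gzipped TSV population file.
import gzip
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT = "mycovidstorage"   # assumption
EXPECTED_COLUMNS = 13        # assumption: the required column count

cred = DefaultAzureCredential()
blob_service = BlobServiceClient(
    f"https://{ACCOUNT}.blob.core.windows.net", credential=cred
)
blob_client = blob_service.get_blob_client(
    "population_raw", "BLOB/population_by_age.tsv.gz"
)

# 1) Proceed only if the raw blob exists (ADF: Validation / Get Metadata)
if blob_client.exists():
    data = blob_client.download_blob().readall()
    header = gzip.decompress(data).decode().splitlines()[0]
    # 2) Copy to ADLS only when the column count matches (ADF: If Condition + Copy)
    if len(header.split("\t")) == EXPECTED_COLUMNS:
        adls = DataLakeServiceClient(
            f"https://{ACCOUNT}.dfs.core.windows.net", credential=cred
        )
        adls.get_file_system_client("raw").get_file_client(
            "population/population_by_age.tsv.gz"
        ).upload_data(data, overwrite=True)
```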

Transform

Pipeline for Processing Cases and Deaths data and storing the processed file in Data lake

  1. Create a dataflow that processes the Cases and Deaths file per the transformation requirements (captured in a screenshot in the repository) and writes the processed sink file to ADLS; a PySpark sketch of the equivalent logic follows.
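The exact requirements live in the screenshot, so the steps below are assumptions based on the usual shape of the ECDC cases/deaths feed: filter to Europe, pivot the indicator column into separate cases and deaths columns, and write the result out. Paths and column names are illustrative.

```python
# PySpark sketch of the cases-and-deaths dataflow; steps are assumed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, first

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv(
    "abfss://raw@mycoviddatalake.dfs.core.windows.net/ecdc/cases_deaths.csv",
    header=True, inferSchema=True,
)

processed = (
    raw.filter(col("continent") == "Europe")
       # One row per country/date, with cases and deaths as columns
       .groupBy("country", "country_code", "population", "date")
       .pivot("indicator", ["confirmed cases", "deaths"])
       .agg(first("daily_count"))
       .withColumnRenamed("confirmed cases", "cases_count")
       .withColumnRenamed("deaths", "deaths_count")
)

processed.write.mode("overwrite").csv(
    "abfss://processed@mycoviddatalake.dfs.core.windows.net/ecdc/cases_deaths",
    header=True,
)
```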

Pipeline for Processing Hospital admissions data and storing the processed file in Data lake

  1. Create a dataflow that processes the Hospital Admissions file per its transformation requirements (also captured in a repository screenshot) and writes the processed sink file to ADLS; a sketch follows.
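Again a sketch only, since the requirements screenshot is not reproduced here. The assumed steps, typical for this dataset, are selecting the daily indicators and pivoting the indicator column into one column per metric; all names are illustrative.

```python
# PySpark sketch of the hospital-admissions dataflow; steps are assumed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, first

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv(
    "abfss://raw@mycoviddatalake.dfs.core.windows.net/ecdc/hospital_admissions.csv",
    header=True, inferSchema=True,
)

daily = (
    raw.filter(col("indicator").isin("Daily hospital occupancy",
                                     "Daily ICU occupancy"))
       # One row per country/date, one column per occupancy metric
       .groupBy("country", "date")
       .pivot("indicator")
       .agg(first("value"))
)

daily.write.mode("overwrite").csv(
    "abfss://processed@mycoviddatalake.dfs.core.windows.net/ecdc/hospital_admissions_daily",
    header=True,
)
```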

Create a pipeline for transforming the population data using a Databricks PySpark notebook

  1. Create an ADF pipeline that runs the Databricks notebook for the population file transformation and stores the processed file in ADLS (a notebook sketch follows).
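A sketch of what the notebook might do, assuming the Eurostat-style population-by-age TSV whose first column packs the age-group indicator and country code together (e.g. "PC_Y0_14,AT") and whose remaining columns are years. The paths, target year, and pivot are assumptions.

```python
# PySpark sketch of the population transformation notebook; assumptions noted above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, split, trim

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv(
    "abfss://raw@mycoviddatalake.dfs.core.windows.net/population/population_by_age.tsv.gz",
    sep="\t", header=True,
)

first_col = raw.columns[0]   # the packed "indicator,country" column
year_col = "2019"            # assumption: the year column to keep

transformed = (
    raw.withColumn("age_group", split(col(first_col), ",")[0])
       .withColumn("country_code", split(col(first_col), ",")[1])
       # Strip footnote flags (e.g. "p", "e") that Eurostat appends to values
       .withColumn("percentage",
                   regexp_replace(trim(col(year_col)), "[a-z]", "").cast("float"))
       .select("country_code", "age_group", "percentage")
       # One row per country, one column per age group
       .groupBy("country_code")
       .pivot("age_group")
       .sum("percentage")
)

transformed.write.mode("overwrite").csv(
    "abfss://processed@mycoviddatalake.dfs.core.windows.net/population",
    header=True,
)
```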

Create a master pipeline that runs both child pipelines: (1) ingesting the population data and (2) transforming it with Databricks

  1. This pipeline should be triggered by a storage event when the blob for the raw population file is created (see the trigger sketch below).
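The trigger is usually created in ADF Studio, but it can also be wired up with the azure-mgmt-datafactory SDK. This is a hedged sketch; the subscription, resource group, factory, pipeline, and storage account names are all placeholders.

```python
# Sketch: create a blob-created storage event trigger for the master pipeline.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger, PipelineReference, TriggerPipelineReference, TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    # Fire only for blobs landing under the raw population path (assumed layout)
    blob_path_begins_with="/population_raw/blobs/BLOB/",
    scope=("/subscriptions/<subscription-id>/resourceGroups/covid-rg/"
           "providers/Microsoft.Storage/storageAccounts/mycovidstorage"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="pl_process_population")
    )],
)

client.triggers.create_or_update(
    "covid-rg", "covid-adf", "tr_population_blob_created",
    TriggerResource(properties=trigger),
)
```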

Load

Create schemas for Cases and Deaths table, Hospital Admissions table in Azure SQL Database

  1. Run SQL scripts against the Azure SQL Database to create the table schemas (an example of executing them from Python follows).
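The actual DDL scripts live in the repository; this is only a sketch of running one such script from Python with pyodbc, with an assumed column list for the Cases and Deaths table.

```python
# Sketch: create the Cases and Deaths table in Azure SQL Database.
# Server, database, credentials, and columns are assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=mycovidsql.database.windows.net;DATABASE=covid_db;"
    "UID=sqladmin;PWD=<password>"
)

ddl = """
IF OBJECT_ID('cases_and_deaths') IS NULL
CREATE TABLE cases_and_deaths (
    country        VARCHAR(100),
    country_code   VARCHAR(3),
    population     BIGINT,
    cases_count    BIGINT,
    deaths_count   BIGINT,
    reported_date  DATE
);
"""
with conn:              # commits on success
    conn.execute(ddl)
```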

Load the processed Cases and Deaths file and Hospital Admissions file into SQL

  1. Create pipelines with a Copy activity that copy the data from the processed Cases and Deaths and Hospital Admissions files into their respective tables in the SQL database (an equivalent standalone sketch follows).
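In ADF this is a Copy activity from the ADLS sink files to the SQL tables. An equivalent standalone sketch with pyodbc, assuming the table created above and a local copy of the processed CSV:

```python
# Sketch: bulk-insert the processed Cases and Deaths CSV into SQL.
# File name, table name, and column order are assumptions.
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=mycovidsql.database.windows.net;DATABASE=covid_db;"
    "UID=sqladmin;PWD=<password>"
)

with open("cases_deaths_processed.csv", newline="") as f, conn:
    rows = list(csv.reader(f))[1:]   # skip the header row
    cur = conn.cursor()
    cur.fast_executemany = True      # batch the inserts for speed
    # String values rely on SQL Server's implicit conversion to BIGINT/DATE
    cur.executemany(
        "INSERT INTO cases_and_deaths "
        "(country, country_code, population, cases_count, deaths_count, reported_date) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
```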

About

Covid ETL Project using Azure Data Engineering Stack
