
Incident Data Warehouse

The project builds a data warehouse from scratch for incidents reported in San Francisco. There are two pipelines in this project:

  • full_load_pipeline: Loads all historical data into the data warehouse for the initial load.
  • incremental_load_pipeline: Runs daily, loading new data into the data warehouse and managing changes in data by implementing SCD Types 1 and 2.

Tech Stack

  • Staging area: Amazon S3
  • Data warehouse: Amazon Redshift
  • Visualization: Power BI
  • Orchestration: Apache Airflow with Docker
  • Processing: Python, SQL

Architecture

The architecture of this project is presented as follows:

[Architecture diagram]

  • Data is sourced from the Socrata API and ingested into the raw zone of the staging area hosted on S3 (a minimal ingestion sketch follows this list).
  • Raw data is cleansed and standardized before moving to the cleansed zone.
  • Cleansed data is transformed into the data warehouse's data model and loaded into the stage zone, where it is ready to be moved into the data warehouse.
  • Reports are created in Power BI from the data in the data warehouse.
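
For illustration, here is a minimal sketch of the first step (Socrata API to the S3 raw zone) using the sodapy and boto3 libraries. The dataset ID, bucket name, and key prefix are assumptions for illustration, not the project's actual configuration.

```python
# Minimal sketch: pull one day's incident records from the Socrata API and
# land them as JSON in the raw zone of the S3 staging area.
# Dataset ID, bucket, and prefix are assumed values for illustration.
import json
from datetime import date

import boto3
from sodapy import Socrata

SOCRATA_DOMAIN = "data.sfgov.org"      # SF open data portal
DATASET_ID = "wg3w-h783"               # assumed incident dataset ID
RAW_BUCKET = "incident-dwh-staging"    # assumed bucket name
RAW_PREFIX = "raw/incidents"           # raw zone prefix


def ingest_to_raw_zone(run_date: date, page_size: int = 50_000) -> str:
    """Fetch incidents reported on run_date and write them to the raw zone."""
    client = Socrata(SOCRATA_DOMAIN, app_token=None)  # anonymous, rate-limited
    records = client.get(
        DATASET_ID,
        where=f"incident_date = '{run_date.isoformat()}'",
        limit=page_size,
    )
    key = f"{RAW_PREFIX}/{run_date.isoformat()}.json"
    boto3.client("s3").put_object(
        Bucket=RAW_BUCKET,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
    )
    return key
```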

Data Model

  • The data warehouse schema follows a star schema design.
  • A bridge table is used to manage the many-to-many relationship between incidents and incident categories.
  • To manage changes in data, SCD Type 1 is applied to all tables, and SCD Type 2 is applied to the dim_category and dim_intersection tables.
  • These SCD types are implemented using SQL (a sketch of the Type 2 logic follows this list).
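
As a hedged illustration, the SCD Type 2 logic for dim_category might be expressed as SQL and executed from Python via an Airflow PostgresHook (Redshift speaks the Postgres protocol). The column names and the stage_category staging table are assumptions, not the project's actual schema.

```python
# Sketch of an SCD Type 2 update for dim_category, run from Python.
# Column names and the stage_category table are assumed for illustration.
from airflow.providers.postgres.hooks.postgres import PostgresHook

SCD2_DIM_CATEGORY = """
-- 1) Expire the current row for every category whose attributes changed.
UPDATE dim_category
SET is_current = FALSE,
    valid_to   = CURRENT_DATE
FROM stage_category s
WHERE dim_category.category_id = s.category_id
  AND dim_category.is_current = TRUE
  AND dim_category.category_name <> s.category_name;

-- 2) Insert a new current row for every changed or brand-new category.
INSERT INTO dim_category (category_id, category_name, valid_from, valid_to, is_current)
SELECT s.category_id, s.category_name, CURRENT_DATE, NULL, TRUE
FROM stage_category s
LEFT JOIN dim_category d
  ON d.category_id = s.category_id AND d.is_current = TRUE
WHERE d.category_id IS NULL;
"""


def apply_scd2_dim_category(redshift_conn_id: str = "redshift_default") -> None:
    """Run the two-step SCD Type 2 logic against the warehouse."""
    PostgresHook(postgres_conn_id=redshift_conn_id).run(SCD2_DIM_CATEGORY)
```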

ETL Pipeline

  • To handle the repetitive jobs of ingesting data (from source to staging area) and loading it into the data warehouse, I built two custom Airflow operators: Socrata_to_S3 and S3_to_Redshift (a sketch of the ingestion operator is shown after the diagrams below).
  • Full load pipeline architecture

[Full load pipeline diagram]

  • Full load pipeline architecture (zoom in)

[Full load pipeline diagram (zoom in)]

  • Incremental load pipeline architecture

[Incremental load pipeline diagram]

  • Incremental load pipeline architecture (zoom in)

[Incremental load pipeline diagram (zoom in)]
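
To give a sense of how such a custom operator can look, here is a minimal sketch in the spirit of Socrata_to_S3. The constructor parameters, templated fields, and hook usage are assumptions for illustration rather than the project's exact implementation.

```python
# Minimal sketch of a custom operator in the spirit of Socrata_to_S3.
# Parameter names and templated fields are assumed for illustration.
import json

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from sodapy import Socrata


class SocrataToS3Operator(BaseOperator):
    """Pull records from a Socrata dataset and write them to S3."""

    # Allow Jinja templating, e.g. where_clause="incident_date = '{{ ds }}'"
    template_fields = ("where_clause", "s3_key")

    def __init__(self, *, socrata_domain, dataset_id, where_clause,
                 s3_bucket, s3_key, aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.socrata_domain = socrata_domain
        self.dataset_id = dataset_id
        self.where_clause = where_clause
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        client = Socrata(self.socrata_domain, app_token=None)
        records = client.get(self.dataset_id, where=self.where_clause, limit=50_000)
        S3Hook(aws_conn_id=self.aws_conn_id).load_string(
            string_data=json.dumps(records),
            key=self.s3_key,
            bucket_name=self.s3_bucket,
            replace=True,
        )
        self.log.info("Wrote %d records to s3://%s/%s",
                      len(records), self.s3_bucket, self.s3_key)
```

With a templated where_clause, the incremental pipeline could, for example, pass the run date ("{{ ds }}") so each daily run only pulls that day's records.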

Visualization

Some dashboards created from the data in the data warehouse:

  • Total report for the year 2022

[Yearly report dashboard]

  • Daily report

[Daily report dashboard]

Achievements in learning

Python

  • Use Python to process data, especially datetime datatypes; the most-used libraries are Pandas and NumPy (a small sketch follows this list).
  • Implement OOP in the project.
  • Structure the project files.
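
As a small illustration of the datetime handling, here is a Pandas sketch; the column names are assumptions.

```python
# Small sketch of datetime handling with Pandas; column names are assumed.
import pandas as pd


def standardize_datetimes(df: pd.DataFrame) -> pd.DataFrame:
    # The Socrata API returns ISO-8601 timestamp strings; parse them.
    df["incident_datetime"] = pd.to_datetime(df["incident_datetime"], errors="coerce")
    # Derive the parts that feed the date dimension.
    df["incident_date"] = df["incident_datetime"].dt.date
    df["incident_hour"] = df["incident_datetime"].dt.hour
    df["incident_day_of_week"] = df["incident_datetime"].dt.day_name()
    return df
```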

Apache Airflow

  • Write custom operators for repetitive jobs (code)
  • Use runtime variables and templated variables
  • Group tasks that belong to each stage of the pipeline for a more readable DAG (see the sketch after this list)
  • Connect to cloud services through Hooks
  • Secure connections and other sensitive values using the Variable and Connection features
  • Run Airflow using Docker
  • Understand the components of Airflow
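
A small sketch of how task grouping and template variables fit together in a DAG; the DAG id, task ids, and commands are assumptions for illustration.

```python
# Small sketch of TaskGroups and Jinja-templated values in an Airflow DAG.
# DAG id, task ids, and commands are assumed for illustration.
import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="incremental_load_sketch",
    schedule_interval="@daily",
    start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    with TaskGroup(group_id="ingest") as ingest:
        pull_incidents = BashOperator(
            task_id="pull_incidents",
            # "{{ ds }}" is a template variable rendered to the run's logical date.
            bash_command="echo 'ingesting incidents for {{ ds }}'",
        )

    with TaskGroup(group_id="load_to_warehouse") as load:
        copy_to_redshift = BashOperator(
            task_id="copy_to_redshift",
            bash_command="echo 'copying staged data into Redshift'",
        )

    ingest >> load
```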

AWS

  • Use S3 as the staging area
  • Use Redshift as the data warehouse (a COPY sketch follows this list)
  • Understand how Redshift works
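
A minimal sketch of loading a staged file from S3 into Redshift with a COPY command issued from Python; the table name, bucket, and IAM role ARN are placeholders, not the project's actual values.

```python
# Minimal sketch of an S3-to-Redshift load using COPY, issued from Python.
# Table, bucket, prefix, and IAM role are placeholder values.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def copy_stage_to_redshift(table: str, s3_key: str,
                           redshift_conn_id: str = "redshift_default") -> None:
    """COPY a staged CSV file from S3 into the given Redshift table."""
    copy_sql = f"""
        COPY {table}
        FROM 's3://incident-dwh-staging/{s3_key}'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1
        TIMEFORMAT 'auto';
    """
    PostgresHook(postgres_conn_id=redshift_conn_id).run(copy_sql)
```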

Visualization

  • Connect to Redshift and create dashboards