# Fire Incidents Analysis

Data Engineering Zoomcamp Project

This repository contains my project for the completion of the Data Engineering Zoomcamp by DataTalks.Club.


## Problem Description

A fire incident is defined as an incident involving smoke, heat, and flames. The purpose of this simple project was to analyze a fire incidents dataset from the USA.

Below is a high-level overview of the steps involved:

  • Download the CSV data from the source.
  • Upload the data to the Data Lake, converting the CSV into the more efficient Parquet format (sketched below).
  • Load the data from the Data Lake into the Data Warehouse with some transformations and data quality checks.
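
As a rough illustration of the download and Parquet conversion steps, here is a minimal sketch. The URL and file names are placeholders rather than the project's actual values, and pandas with a Parquet engine such as pyarrow is assumed:

```python
import pandas as pd  # assumes pandas plus a Parquet engine such as pyarrow

# Placeholder URL: the real source is the fire incidents CSV export
csv_url = "https://example.com/fire_incidents.csv"

# Read the raw CSV and rewrite it in the more compact Parquet format
df = pd.read_csv(csv_url, low_memory=False)
df.to_parquet("fire_incidents.parquet", compression="snappy")

print(f"Wrote {len(df):,} rows to fire_incidents.parquet")
```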

## Dataset

The chosen dataset was the fire incidents data in the US from 1-Jan-2003 to 20-Mar-2023.

It includes a summary of each (non-medical) incident to which the SF Fire Department responded. Each incident record includes the incident number, the battalion that responded to the incident, the incident date, and the timestamps of the alarm, arrival, and closure of the incident, among other fields.

It is available for download as a CSV file and for online consultation, where a data dictionary is also provided. As of 20-Mar-2023, this dataset is updated daily.

## Technologies Used

The following tools have been used for this project:

  • Infrastructure as code (IaC): Terraform
  • Workflow orchestration: Prefect
  • Data Lake: Google Cloud Storage (GCS)
  • Data Warehouse: BigQuery
  • Transformations: dbt
  • Visualization: Looker Studio

## Steps for Project Reproduction

Clone this repo to start with.

### Step 1: Creation of a Google Cloud Platform (GCP) account

### Step 2: Setup of GCP

  • Create a new GCP project. Attention: the Project ID is important.
  • Create a VM instance (till the 23rd step).
  • Go to IAM & Admin > Service accounts > Create service account, provide a service account name, and grant the roles Viewer, BigQuery Admin, Storage Admin, and Storage Object Admin.
  • Download the service account key locally and rename it to google_credentials.json.
  • Store it in your home folder at $HOME/.google/credentials/ for easier access.
  • Set and export GOOGLE_APPLICATION_CREDENTIALS using `export GOOGLE_APPLICATION_CREDENTIALS=<path/to/your/service-account-authkeys>.json`
  • Activate the service account using `gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS`
  • Add the above two lines at the end of .bashrc so that you don't need to export and activate every time (a quick way to verify the credentials is sketched after this list).
  • Activate the following APIs:
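
Before moving on, a quick way to verify that the credentials are picked up is to list the buckets visible to the service account. This is only a sketch and assumes the google-cloud-storage Python package is installed (not part of the original instructions):

```python
from google.cloud import storage  # assumes: pip install google-cloud-storage

# The client reads GOOGLE_APPLICATION_CREDENTIALS from the environment,
# so the export/activate lines above must already be in effect.
client = storage.Client()
print([bucket.name for bucket in client.list_buckets()])
```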

### Step 3: Creation of a GCP Infrastructure

  • Install Terraform
  • Change default variables project, region, BQ_DATASET in variables.tf (the file contains descriptions explaining these variables)
  • Run the following commands from the terraform directory in bash:

```bash
# Initialize state file (.tfstate)
terraform init

# Check changes to new infra plan
terraform plan

# Create new infra
terraform apply
```
  • Confirm in GCP console that the infrastructure was correctly created.

### Step 4: Creation of a Conda environment and orchestration using Prefect flows

Execution

1. Create a new Conda environment and install the packages listed in prefect_ingest_data/requirements.txt.

```bash
conda create -n <env_name> python=3.9
conda activate <env_name>
pip install -r prefect_ingest_data/requirements.txt
```

2. Register the Prefect blocks and start Orion:

```bash
prefect block register -m prefect_gcp
prefect orion start
```
  • Navigate to the Prefect dashboard at http://127.0.0.1:4200 --> go to the Blocks menu --> add a GCS Bucket block and provide the inputs below.
    • Block name: <your-GCS-bucket-block-name>
    • Bucket name: <your-bucket-name-created-by-terraform>
    • GCP credentials: click Add --> this opens the create form for a GCP Credentials block; provide the inputs below.
      • Block name: <your-GCP-credentials-block-name>
      • Service account info: copy-paste the contents of the service account JSON file.
      • Click Create.
    • GCP credentials: click Add --> select the <your-GCP-credentials-block-name> created above.
    • The generated code needs to replace the block-loading snippets in the web-to-gcs-parent.py and gcs_to_bq_parent Python files (a sketch of how the bucket block is then used inside a flow follows this list):

```python
from prefect_gcp.cloud_storage import GcsBucket
gcp_cloud_storage_bucket_block = GcsBucket.load("<your-gcp-bucket-block-name>")

from prefect_gcp import GcpCredentials
gcp_credentials_block = GcpCredentials.load("<your-gcs-cred-block-name>")
```
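
For orientation, here is a minimal sketch of how the loaded bucket block might be used inside a flow; the flow name, file name, and block name are illustrative placeholders, not the repository's exact code:

```python
from prefect import flow
from prefect_gcp.cloud_storage import GcsBucket


@flow
def upload_parquet_to_gcs(local_path: str = "fire_incidents.parquet") -> None:
    """Push a locally written Parquet file into the bucket configured by the block."""
    gcs_block = GcsBucket.load("<your-gcp-bucket-block-name>")  # placeholder block name
    gcs_block.upload_from_path(from_path=local_path, to_path=local_path)
```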

3. Change the directory to prefect_ingest_data and run the commands below to deploy the flows in Prefect:

```bash
prefect deployment build ./ingest_data_to_bucket.py:ingest_data_to_bucket -n "Ingest data from web to gcs bucket" --cron "0 0 1 * *" -a

prefect deployment build ./ingest_data_to_bq.py:ingest_data_to_bq -n "Ingest data from gcs bucket to bigquery" --cron "0 1 1 * *" -a
```

4. Navigate to the Prefect deployments dashboard and check for the recently created deployments.

5. Run the deployments from the Prefect dashboard.

The above deployments download the CSV data from the web portal, store it in the GCS bucket as a .parquet file, and then write the data into Google BigQuery.
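
Conceptually, the second deployment does something along these lines. This is only a hedged sketch: the block, dataset, table, and project names are placeholders, and it assumes pandas-gbq is available for the BigQuery write:

```python
import pandas as pd
from prefect import flow
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket


@flow
def gcs_to_bq(gcs_path: str = "fire_incidents.parquet") -> None:
    """Pull the Parquet file from the bucket and append it to a BigQuery table."""
    gcs_block = GcsBucket.load("<your-gcp-bucket-block-name>")                 # placeholder
    gcp_credentials_block = GcpCredentials.load("<your-gcs-cred-block-name>")  # placeholder

    # Download the Parquet file written by the first deployment
    gcs_block.get_directory(from_path=gcs_path, local_path=".")
    df = pd.read_parquet(gcs_path)

    # Append to BigQuery via pandas-gbq using the credentials block
    df.to_gbq(
        destination_table="<dataset>.<table>",  # placeholder dataset.table
        project_id="<your-project-id>",         # placeholder project id
        credentials=gcp_credentials_block.get_credentials_from_service_account(),
        if_exists="append",
    )
```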

### Step 5: Transformations using dbt

  • Navigate to dbt Cloud and create a new project referring to this repository, with the project subdirectory set to /dbt.
  • Select the BigQuery connection and provide the service-account.json file for authentication.
  • Under the dbt development menu, edit dbt_project.yml to update the name and models.
  • Add macros/extract_month_with_name.sql, models/core/schema.yml, models/core/stg_fire_accident.sql, and packages.yml.
  • Run the commands below to execute the transformations:

```bash
dbt deps
dbt build
```
  • The above will create the dbt models and the final tables. Note: the transformations made were the selection of certain columns and the creation of new ones (time differences, Month, Year). Tables smaller than about 1 GB do not show significant improvement from partitioning and clustering; doing so on a small table could even increase cost, due to the additional metadata reads and maintenance these features require, and the amount of data a query processes is the same with or without clustering.

As of 20-Mar-2023, the dataset has a size of ~260 MB, so I only performed transformations such as adding new variables, without partitioning or clustering. For example, creating a clustered copy of the table:

```sql
CREATE OR REPLACE TABLE `de-project-2023.fire_accident_raw_data.fire_data_sanfrancisco_clustered`
CLUSTER BY Battalion AS
SELECT * FROM `de-project-2023.fire_accident_raw_data.fire_data_sanfrancisco`;
```

This makes the query process the same amount of data as performing it on the non-clustered table.

### Step 6: Development of a visualization using Looker Studio

  • Connect the final BigQuery dataset from above inside Looker Studio and start creating the dashboard with insights.

## Dashboard

Below is the final dashboard

Dashboard_pdf_link


Dashboard_From_Jan_2021_to_Mar_2023

A special thank you to DataTalks.Club for providing this incredible course! Also, thank you to the amazing Slack community!
