This repository contains my project for the completion of Data Engineering Zoomcamp by DataTalks.Club.
A fire incident is defined as an incident involving smoke, heat, and flames. The purpose of this simple project was to analyze the fire incidents dataset from USA
Below is the high level overview of the steps involved:
- Download csv data from Source.
- Upload the data to Data Lake by transforming csv into efficient parquet format.
- Loading the data from Data Lake to Data Warehouse with some transformations and data quality checks.
The chosen dataset was the fire incidents data in the US from 1-Jan-2003 to 20-Mar-2023.
It includes a summary of each (non-medical) incident to which the SF Fire Department responded. Each incident record includes, the incident number, the battalion which responded to the incident, the incident date, the timestamp of alarm, arrival and closure of the incident, among others.
It is available for download as a csv file and for consultation where it is also provided a data dictionary. As of 20-Mar-2023, this dataset is updated daily.
Below tools have been used for this project:
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Prefect
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: BigQuery
- Transformations: dbt
- Visualization: Looker Studio
Clone this repo to start with.
Creation of a Google Cloud Platform (GCP) account.
- Creation of new GCP project. Attention: The Project ID is important.
- Create VM instance till 23rd step
- Go to
IAM & Admin > Service accounts > Create service account
, provide a service account name and grant the rolesViewer
,BigQuery Admin
,Storage Admin
,Storage Object Admin
. - Download lservice account key locally, rename it to
google_credentials.json
. - Store it in your home folder
$HOME/.google/credentials/
for easier access. - Set and export the GOOGLE_APPLICATION_CREDENTIALS using
export GOOGLE_APPLICATION_CREDENTIALS=<path/to/your/service-account-authkeys>.json
- Activate the service account using
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
- Add the above two lines at the end of
.bashrc
so that we don't need to export and activate every time. - Activate the following API's:
- Install Terraform
- Change default variables
project
,region
,BQ_DATASET
invariables.tf
(the file contains descriptions explaining these variables) - Run the following commands from terraform directory on bash:
# Initialize state file (.tfstate)
terraform init
# Check changes to new infra plan
terraform plan
# Create new infra
terraform apply
- Confirm in GCP console that the infrastructure was correctly created.
1. Create a new Conda environment and install packages listed in prefect_ingest_data/requirements.txt
.
conda create -n <env_name> python=3.9
conda activate <env_name>
pip install -r prefect_ingest_data/requirements.txt
2. Register Prefect blocks and start Orion
prefect block register-m prefect_gcp
prefect orion start
- Navigate to prefect dashboard
http://127.0.0.1:4200
--> go to blocks menu --> addGCS Bucket
and provide below inputs.- Block name :
<your-GCS-bucket-block-name>
- Bucket name:
<your-bucket-name-created-by-terraform>
- GCP credentials: Click on Add --> It opens up create block of GCP credentials , provide input below.
- Block name :
<your-GCP-credentials-block-name>
- Service Account info: copy paste the json file data in the service account info.
- Clicck on create.
- Block name :
- GCP credentials: Click on Add --> Select the above created
<your-GCP-credentials-block-name>
- Code generated needs to be replaced in the
web-to-gcs-parent.py
andgcs_to_bq_parent
python files.from prefect_gcp.cloud_storage import GcsBucket gcp_cloud_storage_bucket_block = GcsBucket.load("<your-gcp-bucket-block-name") from prefect_gcp import GcpCredentials gcp_credentials_block = GcpCredentials.load("<your-gcs-cred-block-name>")
- Block name :
3. Change the directory to prefect_ingest_data and run below command to deploy code in prefect
prefect deployment build ./ingest_data_to_bucket.py:ingest_data_to_bucket -n "Ingest data from web to gcs bucket" --cron "0 0 1 * *" -a
prefect deployement build ./ingest_data_to_bq.py:ingest_data_to_bq -n "Ingest data from gcs bucket to bigquery" --cron "0 1 1 * *" -a
4. Navigate to prefect deployment dashboard and check for the recently created deployments. Image
5. Run deployments from prefect dashboard
The above deployments download csv data from web portal and stores it into GCS bucket as .parquet file and then writes data into Google BigQuery.
- Navigate to dbt cloud and create a new project by referring to this repository. Under the project subfolder update
/dbt
- Select the BigQuery connection and update
service-account.json
file for the authentication. - Under dbt development menu, edit the
dbt-project.yml
to update thename
andmodels
. - add macros/extract_month_with_name.sql, models/core/schema.yml, models/core/stg_fire_accident.sql, packages.yml
- Run below commands to execute the transformations:
Image
dbt deps dbt build
- The above will create dbt models and final tables Note: The transformations made were the selection of certain columns and creation of new ones (time differences, Month, Year). It is known that tables with less than 1 GB don't show significant improvement with partitioning and clustering; doing so in a small table could even lead to increased cost due to the additional metadata reads and maintenance needed for these features (or) the processing data clustered and with out clustered is same for small data
As of 20-Mar-2023, the dataset has a size of ~ 260 MB, thus I only performed transformations such as adding new variables, and not partitioning and clustering.
CREATE OR REPLACE TABLE `de-project-2023.fire_accident_raw_data.fire_data_sanfrancisco_clustered`
Cluster BY
Battalion AS
SELECT * FROM de-project-2023.fire_accident_raw_data.fire_data_sanfrancisco;
...makes the query same data... image
same as performing it on the not clustered table. image
- Connect the final BigQuery dataset from above inside lookerstudio and start creating the dashboard with insights.
Below is the final dashboard
A special thank you to DataTalks.Club for providing this incredible course! Also, thank you to the amazing slack community!