This project analyzes the ordering, invoicing, and sales processes at a hotel using this dataset. By delving into customers' meal choices, order values, and conversion rates, it offers insights into consumer behavior and engagement that can support strategic, data-driven decision-making.
The project adopts a batch-processing approach and brings together cloud-based infrastructure, data ingestion pipelines, workflow orchestration, a data lake, a data warehouse, data transformations, and dashboarding.
The project is developed in the cloud using scalable infrastructure provided by Google Cloud Platform. Infrastructure as Code (IaC) tools such as Terraform are utilized to provision and manage the cloud resources efficiently.
Data ingestion involves batch processing, where data is collected, processed, and uploaded to the data lake periodically and subsequently to the data warehouse. This ensures that the latest information on customers' meal choices, order values, and sales conversions is readily available for analysis.
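As a rough illustration of the ingestion step, the sketch below uploads one batch file to a Cloud Storage bucket using the `google-cloud-storage` Python client. The file, bucket, and blob names are hypothetical placeholders, not the project's actual values:

```python
from google.cloud import storage  # pip install google-cloud-storage


def upload_batch_to_data_lake(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload a single batch file from local disk into the GCS data lake."""
    client = storage.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)


# Hypothetical names for illustration only; the real bucket is created by Terraform.
upload_batch_to_data_lake("order_leads.csv", "de-project-datalake", "raw/order_leads.csv")
```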
An end-to-end pipeline is orchestrated using Mage to automate data workflows. This pipeline efficiently manages multiple steps in a Directed Acyclic Graph (DAG), ensuring seamless execution and reliability of the data processing tasks.
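To give a feel for what one DAG step looks like, here is a minimal Mage data-loader block in the style of Mage's generated templates. The source URL and block name are hypothetical placeholders:

```python
import io

import pandas as pd
import requests

# Mage provides this decorator at runtime; the guard mirrors Mage's
# generated block templates.
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_orders_from_source(*args, **kwargs) -> pd.DataFrame:
    """First step of the DAG: fetch a raw CSV from the source system."""
    url = 'https://example.com/order_leads.csv'  # placeholder source URL
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))
```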
In this project, Google Cloud Storage is used as the data lake where the data is initially stored after ingestion from the source. Google BigQuery is used as the data warehouse and for storing and optimizing structured data for analysis. Tables in BigQuery are partitioned and clustered to ensure efficient query performance, enabling quick retrieval of insights for strategic decision-making.
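For concreteness, the sketch below shows how a partitioned and clustered table can be created with the `google-cloud-bigquery` client. The project, dataset, and column names are hypothetical stand-ins, not the project's real schema:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hypothetical orders table: partitioning by order date and clustering by meal
# means queries that filter on those columns scan far less data.
table = bigquery.Table(
    "my-gcp-project.de_project.orders",
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("meal", "STRING"),
        bigquery.SchemaField("order_value", "NUMERIC"),
        bigquery.SchemaField("order_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="order_date")  # daily partitions
table.clustering_fields = ["meal"]
client.create_table(table, exists_ok=True)
```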
Data transformations are performed using dbt. The transformation logic is defined and executed seamlessly within the pipeline, ensuring accurate analysis of consumer behavior trends and patterns.
Finally, a dashboard is created in Looker Studio to visualize key insights derived from the processed data. It comprises tiles that highlight customer actions, habits, and engagement with the hotel.
## Prerequisites

- Set up a Google Cloud Platform account and provision a virtual machine.
- Create a GCP project and set up a service account and authentication as per these instructions.
- Set up Terraform in both your local environment and the virtual machine. Check out the Terraform installation instructions here.
## Infrastructure provisioning with Terraform
- Clone this repository with `git clone https://github.com/skihumba/data-engineering-project.git` and change directory to the `data-engineering-project` folder.
- Create a folder named `.keys` in the `1_terraform` folder.
- Rename your GCP service account key (obtained from the second point in the prerequisites) to `key.json` and paste it in the `.keys` folder. (If your GCP service account key is in a different location, specify its location in the `variables.tf` file.)
- Open a terminal, `cd` to the `1_terraform` directory, and run the following commands to set up the project infrastructure, i.e. the Google Cloud Storage bucket and the BigQuery dataset:
```bash
terraform init
terraform plan
terraform apply
```
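When you no longer need the infrastructure, running `terraform destroy` from the same directory tears down the provisioned resources so they stop incurring costs.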
## Orchestration with Mage
- Change directory to the `de-project` folder that is in the `2_mage` directory.
- Create a file named `.env` with the following content: `PROJECT_NAME=de-project` (the name of the Mage project).
- Create a folder named `.keys` in the `de-project` folder.
- Copy your renamed (`key.json`) GCP service account key into the `.keys` folder.
- Run the `docker-compose up` command in the `de-project` folder to start the Mage container.
- Open Mage by going to `localhost:6789` in your browser.
- In Mage, go to `Files` and edit the `oi_conf.yml` file. Set `GOOGLE_APPLICATION_CREDENTIALS` to point to the `key.json` file in the `.keys` folder created earlier.
- Run the pipelines `order_leads_source_to_gcs`, `sales_team_sourde_to_gcs`, and `invoices_source_to_gcs` to load data from the source into Google Cloud Storage.
- Run the pipelines `order_leads_gcs_to_bq`, `sales_team_gcs_to_bq`, and `invoices_gcs_to_bq` to move the data from the data lake (Google Cloud Storage) into the data warehouse (BigQuery); a sketch of a typical exporter block follows this list.
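For reference, the exporter step of a `*_gcs_to_bq` pipeline typically follows Mage's standard BigQuery exporter template, roughly like the sketch below. The table ID is a hypothetical placeholder, and the config path assumes the credentials file configured above:

```python
from os import path

from pandas import DataFrame

from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from mage_ai.settings.repo import get_repo_path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df: DataFrame, **kwargs) -> None:
    """Write the DataFrame produced by the previous block into the warehouse."""
    config_path = path.join(get_repo_path(), 'oi_conf.yml')  # this project's io config
    table_id = 'my-gcp-project.de_project.order_leads'  # hypothetical table ID
    BigQuery.with_config(ConfigFileLoader(config_path, 'default')).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table on each batch run
    )
```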
## Transformations with dbt
- For the transformations, ensure that you have dbt Cloud set up. You can follow these instructions to set up dbt Cloud.
- Import the `2_dbt` folder into your project and run the `dbt run` command to execute the models as per the transformations specified.
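By default, `dbt run` builds every model in the project; `dbt run --select <model_name>` limits execution to a subset, which is handy while iterating on a single transformation.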