This project analyzes the ordering, invoicing, and sales processes at a hotel using this dataset. By delving into customers' meal choices, order values, and conversion rates, it offers insights into consumer behavior and engagement that can support strategic, data-driven decision-making.
The project adopts a batch-processing approach and brings together cloud-based infrastructure, data ingestion pipelines, workflow orchestration, a data lake, a data warehouse, data transformations, and dashboarding.
The project is developed in the cloud using scalable infrastructure provided by Google Cloud Platform. Infrastructure as Code (IaC) tools such as Terraform are utilized to provision and manage the cloud resources efficiently.
Data ingestion involves batch processing, where data is collected, processed, and uploaded to the data lake periodically and subsequently to the data warehouse. This ensures that the latest information on customers' meal choices, order values, and sales conversions is readily available for analysis.
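As a rough illustration of the ingestion step, the sketch below uploads one batch file to a Cloud Storage bucket using the `google-cloud-storage` Python client. The file, bucket, and blob names are hypothetical placeholders, not the project's actual values:

```python
from google.cloud import storage  # pip install google-cloud-storage


def upload_batch_to_data_lake(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload a single batch file from local disk into the GCS data lake."""
    client = storage.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)


# Hypothetical names for illustration only; the real bucket is created by Terraform.
upload_batch_to_data_lake("order_leads.csv", "de-project-datalake", "raw/order_leads.csv")
```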
An end-to-end pipeline is orchestrated using Mage to automate data workflows. This pipeline efficiently manages multiple steps in a Directed Acyclic Graph (DAG), ensuring seamless execution and reliability of the data processing tasks.
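To give a feel for what one DAG step looks like, here is a minimal Mage data-loader block in the style of Mage's generated templates. The source URL and block name are hypothetical placeholders:

```python
import io

import pandas as pd
import requests

# Mage provides this decorator at runtime; the guard mirrors Mage's
# generated block templates.
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_orders_from_source(*args, **kwargs) -> pd.DataFrame:
    """First step of the DAG: fetch a raw CSV from the source system."""
    url = 'https://example.com/order_leads.csv'  # placeholder source URL
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))
```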
In this project, Google Cloud Storage is used as the data lake where the data is initially stored after ingestion from the source. Google BigQuery is used as the data warehouse and for storing and optimizing structured data for analysis. Tables in BigQuery are partitioned and clustered to ensure efficient query performance, enabling quick retrieval of insights for strategic decision-making.
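For concreteness, the sketch below shows how a partitioned and clustered table can be created with the `google-cloud-bigquery` client. The project, dataset, and column names are hypothetical stand-ins, not the project's real schema:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hypothetical orders table: partitioning by order date and clustering by meal
# means queries that filter on those columns scan far less data.
table = bigquery.Table(
    "my-gcp-project.de_project.orders",
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("meal", "STRING"),
        bigquery.SchemaField("order_value", "NUMERIC"),
        bigquery.SchemaField("order_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="order_date")  # daily partitions
table.clustering_fields = ["meal"]
client.create_table(table, exists_ok=True)
```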
Data transformations are performed using dbt. The transformation logic is defined and executed seamlessly within the pipeline, ensuring accurate analysis of consumer behavior trends and patterns.
Finally, a dashboard is created in Looker Studio to visualize key insights derived from the processed data. It comprises tiles that highlight customer actions, habits, and engagement with the hotel.
## Prerequisites

- Set up a Google Cloud Platform account and provision a virtual machine.
- Create a GCP project and set up a service account and authentication as per these instructions.
- Set up Terraform in both your local environment and the virtual machine. Check out the Terraform installation instructions here.
## Infrastructure provisioning with Terraform
- Clone this repository with `git clone https://github.com/skihumba/data-engineering-project.git` and change directory to the `data-engineering-project` folder.
- Create a folder named `.keys` in the `1_terraform` folder.
- Rename your GCP service account key (obtained from the second point in the prerequisites) to `key.json` and paste it in the `.keys` folder. (If your GCP service account key is in a different location, specify its location in the `variables.tf` file.)
- Open a terminal, `cd` to the `1_terraform` directory, and run the following commands to set up the project infrastructure, i.e. the Google Cloud Storage bucket and the BigQuery dataset:
```bash
terraform init
terraform plan
terraform apply
```
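When you no longer need the infrastructure, running `terraform destroy` from the same directory tears down the provisioned resources so they stop incurring costs.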
## Orchestration with Mage
- Change directory to the `de-project` folder that is in the `2_mage` directory.
- Create a file named `.env` with the following content: `PROJECT_NAME=de-project` (the name of the Mage project).
- Create a folder named `.keys` in the `de-project` folder.
- Copy your renamed (`key.json`) GCP service account key into the `.keys` folder.
- Run the `docker-compose up` command in the `de-project` folder to start the Mage container.
- Open Mage by going to `localhost:6789` in your browser.
- In Mage, go to `Files` and edit the `oi_conf.yml` file. Set `GOOGLE_APPLICATION_CREDENTIALS` to point to the `key.json` file in the `.keys` folder created earlier.
- Run the pipelines `order_leads_source_to_gcs`, `sales_team_sourde_to_gcs`, and `invoices_source_to_gcs` to load data from the source into Google Cloud Storage.
- Run the pipelines `order_leads_gcs_to_bq`, `sales_team_gcs_to_bq`, and `invoices_gcs_to_bq` to move the data from the data lake (Google Cloud Storage) into the data warehouse (BigQuery); a sketch of a typical exporter block follows this list.
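For reference, the exporter step of a `*_gcs_to_bq` pipeline typically follows Mage's standard BigQuery exporter template, roughly like the sketch below. The table ID is a hypothetical placeholder, and the config path assumes the credentials file configured above:

```python
from os import path

from pandas import DataFrame

from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from mage_ai.settings.repo import get_repo_path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df: DataFrame, **kwargs) -> None:
    """Write the DataFrame produced by the previous block into the warehouse."""
    config_path = path.join(get_repo_path(), 'oi_conf.yml')  # this project's io config
    table_id = 'my-gcp-project.de_project.order_leads'  # hypothetical table ID
    BigQuery.with_config(ConfigFileLoader(config_path, 'default')).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table on each batch run
    )
```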
## Transformations with dbt
- For the transformations, ensure that you have dbt Cloud set up. You can follow these instructions to set up dbt Cloud.
- Import the `2_dbt` folder into your project and run the `dbt run` command to execute the models as per the transformations specified.
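By default, `dbt run` builds every model in the project; `dbt run --select <model_name>` limits execution to a subset, which is handy while iterating on a single transformation.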