
TPCDS-Inventory-Optimization-and-Sales-Performance-Analysis

Capstone project completed as part of Analytical Engineering - Diploma in Data Engineering (WeCloud Data Academy)

1. Project Overview

Data is ingested from several data sources and loaded into the Snowflake data warehouse. In the warehouse, after a few data transformation steps, the data is made ready for BI usage. The BI tool Metabase then connects to the data warehouse to generate dashboards and reports.


2. About Data

1. Data Background

This project uses the TPCDS dataset, a well-known benchmark dataset for database testing. The business background of the dataset is retail sales. The data contains sales records from the website and catalog channels, as well as the inventory level of each item in each warehouse. In addition, there are 15 dimension tables that contain information about customers, warehouses, items, etc. The dataset is not stored in one place; instead, it is split into two parts:

RDS: All the tables except the inventory table are stored in a Postgres database on AWS RDS. The tables are refreshed every day and updated with the newest sales data, so in order to get the latest data you need to run the ETL process every day.

S3 Bucket: The single inventory table is stored in an S3 bucket; every day a new file containing the newest data is dumped into the bucket. Be aware that the inventory table usually only records inventory at the end of each week, so you will typically see only one entry per item per warehouse each week (explore the data in your RAW schema in Snowflake to confirm). You still need to ingest the inventory file from the S3 bucket every day.

2. Tables in the Dataset

In this sheet, you can see there are several tables related to customers. These tables' schemas are arranged horizontally, which means that when doing ETL you should consider integrating all of them into a single customer dimension table, as sketched below.

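For illustration, a customer dimension of this kind might be built with a join like the following. The table and column names follow the standard TPC-DS schema; the exact set of columns kept is an assumption, so adjust it to whatever lands in your RAW schema.

```sql
-- Hypothetical sketch of a customer dimension joining the TPC-DS customer tables.
-- Column selection is illustrative; adapt to the tables in your RAW schema.
create or replace table dim_customer as
select
    c.c_customer_sk,
    c.c_customer_id,
    c.c_first_name,
    c.c_last_name,
    ca.ca_city,
    ca.ca_state,
    ca.ca_country,
    cd.cd_gender,
    cd.cd_marital_status,
    cd.cd_education_status,
    hd.hd_buy_potential,
    ib.ib_lower_bound,
    ib.ib_upper_bound
from customer c
left join customer_address       ca on c.c_current_addr_sk  = ca.ca_address_sk
left join customer_demographics  cd on c.c_current_cdemo_sk = cd.cd_demo_sk
left join household_demographics hd on c.c_current_hdemo_sk = hd.hd_demo_sk
left join income_band            ib on hd.hd_income_band_sk = ib.ib_income_band_sk;
```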

3. Business Requirements

3.1. Snowflake Data Warehouse Requirements

In order to meet the BI requirements, we need to create a data model (new tables) in the data warehouse. These are the requirements for the new tables:

Some raw tables, such as the several customer-related tables, need to be joined into one table. We also need to create a new fact table (you will learn what a fact table is) with several new metrics:

sum_qty_wk: the sum of sales_quantity for the week
sum_amt_wk: the sum of sales_amount for the week
sum_profit_wk: the sum of net_profit for the week
avg_qty_dy: the average daily sales_quantity for the week (= sum_qty_wk / 7)
inv_on_hand_qty_wk: the item's inventory on hand at the end of the week across all warehouses (= the inventory on hand on that weekend)
wks_sply: weeks of supply, an estimated metric of how many weeks the on-hand inventory can cover sales (= inv_on_hand_qty_wk / sum_qty_wk)
low_stock_flg_wk: low-stock weekly flag; if during the week avg_qty_dy > 0 and avg_qty_dy > inv_on_hand_qty_wk, then the flag for that week is True
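Below is a minimal Snowflake SQL sketch of how this weekly fact table could be computed. The source tables daily_aggregated_sales and weekly_inventory, and all of their column names, are illustrative assumptions rather than part of the original project spec.

```sql
-- Hypothetical weekly fact table; source table and column names are assumptions.
create or replace table weekly_sales_inventory as
select
    s.warehouse_sk,
    s.item_sk,
    s.sold_wk_num,
    sum(s.daily_qty)        as sum_qty_wk,
    sum(s.daily_sales_amt)  as sum_amt_wk,
    sum(s.daily_net_profit) as sum_profit_wk,
    sum(s.daily_qty) / 7    as avg_qty_dy,
    max(i.inv_qty_on_hand)  as inv_on_hand_qty_wk,
    max(i.inv_qty_on_hand) / nullif(sum(s.daily_qty), 0) as wks_sply,
    -- flag is True when average daily demand is positive and exceeds on-hand inventory
    case
        when sum(s.daily_qty) / 7 > 0
         and sum(s.daily_qty) / 7 > max(i.inv_qty_on_hand)
        then true else false
    end as low_stock_flg_wk
from daily_aggregated_sales s
join weekly_inventory i
  on  s.item_sk      = i.item_sk
  and s.warehouse_sk = i.warehouse_sk
  and s.sold_wk_num  = i.wk_num
group by s.warehouse_sk, s.item_sk, s.sold_wk_num;
```

Here max(inv_qty_on_hand) picks up the single end-of-week inventory snapshot per item and warehouse, and nullif guards against division by zero for items with no sales in the week.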

3.2. Metabase Requirements

Our final goal is to create dashboards and reports in the BI tool Metabase. It is important to understand the requirements for the Metabase dashboards and reports:

Identify the highest and lowest performing items of the week by analyzing sales amounts and quantities.
Display items with low supply levels for each week.
Detect items experiencing low stock levels, along with their corresponding week and warehouse numbers, where the low-stock flag is marked as "True".
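As an illustration, the Metabase questions behind these reports could be simple queries over the weekly fact table sketched earlier; the table name, columns, and week number below are assumptions.

```sql
-- Hypothetical Metabase question: top 10 items by weekly sales amount,
-- built on the weekly_sales_inventory table sketched above.
select item_sk, sold_wk_num, sum_amt_wk, sum_qty_wk
from weekly_sales_inventory
where sold_wk_num = 202401          -- illustrative week number
order by sum_amt_wk desc
limit 10;

-- Hypothetical low-stock report: flagged items with their week and warehouse numbers.
select item_sk, warehouse_sk, sold_wk_num, inv_on_hand_qty_wk, avg_qty_dy
from weekly_sales_inventory
where low_stock_flg_wk = true
order by sold_wk_num, warehouse_sk;
```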

4. Project Infrastructure

In this project, all the infrastructure is built in the cloud.

Servers: We need to create several servers on the AWS cloud.
Tools: On these servers we will install several tools: Airbyte, the tool for data ingestion; DBT, the tool that manages data transformation in the data warehouse; and Metabase, the BI tool used to build the dashboards.
Cloud Data Warehouse: We will use Snowflake, a cloud data warehouse, to store the data and perform the data transformations; we need to create an account on it.
AWS Lambda: A serverless compute service that we will use to ingest data from AWS storage (S3).

5. Part One: Data Ingestion

The first part of the project is data ingestion. In this part, you have two data sources to connect: the Postgres database and the AWS S3 bucket.

You will use the tool Airbyte to connect to the schema "raw_st" of the Postgres database on AWS RDS and transfer all of its tables to the Snowflake data warehouse.

You will also use an AWS Lambda function to connect to the AWS S3 bucket and transfer the file called "inventory.csv" from the bucket into the Snowflake data warehouse.
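One common pattern (an assumption here, not a statement of how this project's Lambda is actually implemented) is for the Lambda function to issue a Snowflake COPY INTO statement against an external stage that points at the bucket. The stage, file format, and target table names below are illustrative placeholders.

```sql
-- Hypothetical Snowflake statements the Lambda could run; the stage, file format,
-- and target table names are assumptions, not part of the original project.
create file format if not exists csv_format
  type = csv field_optionally_enclosed_by = '"' skip_header = 1;

create stage if not exists inventory_stage
  url = 's3://<your-bucket>/'                                   -- placeholder bucket
  credentials = (aws_key_id = '<key>' aws_secret_key = '<secret>')
  file_format = (format_name = 'csv_format');

copy into raw.inventory                                         -- assumed target table in the RAW schema
from @inventory_stage/inventory.csv
file_format = (format_name = 'csv_format')
on_error = 'abort_statement';
```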

6. Part Two: Data Transformation

The second part of the project is data transformation. The transformation process happens in the Snowflake data warehouse. In this phase, you will transform the tables from their original structure into the tables we want. During this process we need to create the data model, build the ETL scripts, and schedule the data-loading process. All of this is done in a tool called DBT.
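As a rough sketch of how the weekly load could be expressed in DBT (the model name, referenced staging model, and incremental strategy are assumptions, not the project's actual files):

```sql
-- models/marts/fact_weekly_sales_inventory.sql
-- Hypothetical dbt model layout; model and staging names are assumptions.
{{ config(materialized='incremental', unique_key=['item_sk', 'warehouse_sk', 'sold_wk_num']) }}

select *
from {{ ref('int_weekly_sales_inventory') }}   -- assumed intermediate model holding the weekly metrics

{% if is_incremental() %}
-- on scheduled runs, only load weeks newer than what is already in the table
where sold_wk_num > (select max(sold_wk_num) from {{ this }})
{% endif %}
```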

7. Part Three: Data Analysis

In the last part, you will connect the Snowflake data warehouse with the BI tool Metabase and display the dashboards and reports in Metabase based on the data in Snowflake. After this step, the entire project is finished.
