Micro Data Warehouse

This project is an ETL (Extract, Transform, Load) pipeline for a micro data warehouse, built with PySpark and SQLite. It extracts data from SQLite tables, transforms it, and loads the transformed data back into SQLite tables. It also integrates with Metabase for data exploration and visualization.

Project Overview

The MicroDataWarehouse project is designed to showcase an end-to-end ETL process for a small-scale data warehouse. The project consists of the following components:

  1. Data Generation: A Python script (data.py) that creates an SQLite database and populates it with sample data for sales, customers, and employees.

  2. ETL Pipeline: A PySpark script (etl.py) that extracts data from the SQLite database, performs various transformations on the data (e.g., calculating total revenue, finding top-selling products, and computing sales metrics by category), and then loads the transformed data back into the SQLite database.

  3. Execution Script: A Bash script (run_etl.sh) that sets up the environment, executes the data generation and ETL pipeline scripts, and starts the Metabase server.

  4. Data Exploration: The Metabase server is used to explore and visualize the transformed data stored in the SQLite database.

Project Modules

1. data.py

This Python script creates an SQLite database named microdatawarehouse.db and populates it with sample data for the following tables (a brief sketch follows the list):

  • sales: Contains product information, including product ID, name, category, price, and quantity.
  • customers: Contains customer details, including customer ID, name, region, and total purchases.
  • employees: Contains employee information, including employee ID, name, department, and salary.
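
For illustration, here is a minimal sketch of such a generator script. The schemas and sample rows below are assumptions inferred from the column descriptions above, not the actual contents of data.py:

    import sqlite3

    # Connect; this creates microdatawarehouse.db if it does not already exist
    conn = sqlite3.connect("microdatawarehouse.db")
    cur = conn.cursor()

    # Illustrative schemas inferred from the column lists above
    cur.execute("""CREATE TABLE IF NOT EXISTS sales (
        product_id INTEGER PRIMARY KEY, product_name TEXT,
        category TEXT, price REAL, quantity INTEGER)""")
    cur.execute("""CREATE TABLE IF NOT EXISTS customers (
        customer_id INTEGER PRIMARY KEY, name TEXT,
        region TEXT, total_purchases REAL)""")
    cur.execute("""CREATE TABLE IF NOT EXISTS employees (
        employee_id INTEGER PRIMARY KEY, name TEXT,
        department TEXT, salary REAL)""")

    # A few sample rows; the real script populates more data
    cur.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?, ?, ?)", [
        (1, "Laptop", "Electronics", 999.99, 12),
        (2, "Desk Chair", "Furniture", 149.50, 30),
    ])
    conn.commit()
    conn.close()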

2. etl.py

This Python script implements the ETL pipeline using PySpark. It performs the following tasks (a condensed sketch follows the list):

  • Extract: Reads data from the sales, customers, and employees tables in the SQLite database.
  • Transform:
    • Calculates the total revenue from sales.
    • Finds the top 5 selling products based on quantity sold.
    • Calculates the average salary by department.
    • Identifies the top 3 high-value customers based on total purchases.
    • Calculates various sales metrics by product category (total quantity, average price, maximum price, minimum price, and number of products).
  • Load: Writes the transformed data to new tables in the SQLite database (total_revenue, top_selling_products, avg_salary_by_dept, high_value_customers, sales_metrics_by_category).
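
For illustration, here is a condensed sketch of this extract-transform-load flow using PySpark's JDBC reader and writer. The JAR filename and column names are assumptions, and only two of the transformations listed above are shown:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # The sqlite-jdbc JAR must be on the Spark classpath (filename is illustrative)
    spark = (SparkSession.builder
             .appName("MicroDataWarehouse ETL")
             .config("spark.jars", "sqlite-jdbc.jar")
             .getOrCreate())

    url = "jdbc:sqlite:microdatawarehouse.db"
    props = {"driver": "org.sqlite.JDBC"}

    # Extract: read the sales table from SQLite over JDBC
    sales = spark.read.jdbc(url=url, table="sales", properties=props)

    # Transform: total revenue and top 5 selling products (column names assumed)
    total_revenue = sales.agg(
        F.sum(F.col("price") * F.col("quantity")).alias("total_revenue"))
    top_selling = sales.orderBy(F.col("quantity").desc()).limit(5)

    # Load: write the results back to new SQLite tables
    total_revenue.write.jdbc(url=url, table="total_revenue",
                             mode="overwrite", properties=props)
    top_selling.write.jdbc(url=url, table="top_selling_products",
                           mode="overwrite", properties=props)

    spark.stop()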

3. run_etl.sh

This Bash script automates the execution of the ETL process and starts the Metabase server. It performs the following steps (a minimal sketch follows the list):

  • Executes the data.py script to create the SQLite database and populate it with sample data.
  • Executes the etl.py script to perform the ETL pipeline.
  • Starts the Metabase server.
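
For illustration, a minimal sketch of what such a script might contain; the virtual environment path and Metabase JAR filename are assumptions:

    #!/usr/bin/env bash
    set -e  # stop on the first error

    # Activate the project virtual environment (path assumed)
    source microdatawarehouse_venv/bin/activate

    # Step 1: generate the SQLite database with sample data
    python data.py

    # Step 2: run the PySpark ETL pipeline
    python etl.py

    # Step 3: start the Metabase server (JAR filename assumed)
    java -jar metabase.jar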

Running the Example

To run the MicroDataWarehouse project, follow these steps:

  1. Clone the repository or download the project files:

    git clone https://github.com/prakash-aryan/MicroDataWarehouse.git
    cd MicroDataWarehouse

  2. Download the required JAR files and place both in the project directory:
  • Metabase JAR file: download the latest version from the Metabase website
  • SQLite JDBC driver JAR file: download the latest version from the sqlite-jdbc project's releases page

  3. Create a virtual environment and activate it:

    python3 -m venv microdatawarehouse_venv
    source microdatawarehouse_venv/bin/activate

  4. Install the required Python dependencies:

    pip install pyspark==3.3.1 "py4j>=0.10.9.5,<0.10.9.6" findspark

  5. Make the run_etl.sh script executable and run it:

    chmod +x run_etl.sh
    ./run_etl.sh

  6. Access the Metabase server at http://localhost:3000.

You can now explore and visualize the transformed data using the Metabase web interface.
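
If you want to sanity-check the results outside Metabase, the output tables listed in the Load step above can be inspected with Python's built-in sqlite3 module:

    import sqlite3

    conn = sqlite3.connect("microdatawarehouse.db")
    # Output tables produced by the Load step of etl.py
    for table in ("total_revenue", "top_selling_products", "avg_salary_by_dept",
                  "high_value_customers", "sales_metrics_by_category"):
        rows = conn.execute(f"SELECT * FROM {table} LIMIT 3").fetchall()
        print(table, rows)
    conn.close()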

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.

License

This project is licensed under the MIT License.
