This repository provides a real-time sales data pipeline that captures data changes from multiple MySQL databases using Debezium connectors, processes them with Spark, stores the results in a MySQL database, and visualizes them through Metabase. Together, these components form an end-to-end solution for streaming data analysis and visualization.
Follow these steps to set up and run the Real-Time Sales Data Pipeline:
```shell
git clone https://github.com/saadkh1/Real-Time_Sales_Data_Pipeline_Kafa_Debezium_Spark_MySQL_Metabase
cd Real-Time_Sales_Data_Pipeline_Kafa_Debezium_Spark_MySQL_Metabase
```
- Windows:

  ```shell
  run.bat
  ```
This script uses Docker Compose to start all the necessary containers, including Kafka, Debezium, Spark, MySQL, and Metabase. It also applies the configurations Debezium needs to capture data changes from the MySQL databases.
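Debezium connectors are typically registered by POSTing a JSON configuration to the Kafka Connect REST API. The sketch below shows what registering a connector for one state database might look like; every hostname, credential, topic prefix, and table name here is a placeholder assumption, not the repository's actual settings (which live in its setup scripts).

```python
import json
from urllib import request

# Hypothetical connector config for one state database; all values are
# placeholders -- the real settings come from the repository's scripts.
connector_config = {
    "name": "jendouba-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-jendouba",   # assumed service name
        "database.port": "3306",
        "database.user": "debezium",             # placeholder credentials
        "database.password": "dbz",
        "database.server.id": "184054",
        "topic.prefix": "jendouba",              # assumed Kafka topic prefix
        "table.include.list": "sales.orders",    # assumed schema.table
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.jendouba",
    },
}

def register_connector(connect_url: str, config: dict) -> request.Request:
    """Build the POST request that registers a connector with Kafka Connect."""
    body = json.dumps(config).encode("utf-8")
    return request.Request(
        f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Kafka Connect conventionally listens on port 8083.
req = register_connector("http://localhost:8083", connector_config)
```

Sending one such request per state database (Jendouba, Beja, Kef) gives each its own connector and Kafka topic stream.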
- The `api-mysql` service generates synthetic sales data and inserts it into MySQL databases for different states (Jendouba, Beja, Kef).
- Debezium connectors capture data changes from the MySQL databases and stream them to Kafka topics.
- A Spark job continuously reads data from the Kafka topics.
- The Spark job applies the required transformations to the incoming change events.
- The processed data is saved to the MySQL database (`manager_table`).
- Metabase dashboards are created to visualize the real-time sales data stored in `manager_table`.
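The data-generation step above can be pictured with a small sketch. The field names, value ranges, and the shape of a "sale" below are invented for illustration and need not match what the `api-mysql` service actually produces:

```python
import random
from datetime import datetime, timezone

# One MySQL database per state, as in the pipeline description.
STATES = ["Jendouba", "Beja", "Kef"]

def make_sale(state: str) -> dict:
    """Return one synthetic sales row (illustrative fields only)."""
    return {
        "state": state,
        "product_id": random.randint(1, 50),
        "quantity": random.randint(1, 10),
        "unit_price": round(random.uniform(1.0, 100.0), 2),
        "sold_at": datetime.now(timezone.utc).isoformat(),
    }

# A real generator would INSERT each row into that state's MySQL database;
# here we just build a small batch in memory.
rows = [make_sale(random.choice(STATES)) for _ in range(5)]
```

Once such rows land in MySQL, Debezium picks them up from the binlog with no extra work on the generator's side.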
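Each change event Debezium publishes to Kafka wraps the row in an envelope (before/after images plus source metadata), so the Spark job's transformation step essentially unwraps that envelope and keeps the post-change row. A minimal, framework-free sketch of that extraction; the envelope fields (`payload`, `op`, `after`, `source`) follow Debezium's documented event format, but the row columns are invented:

```python
import json

# Simplified Debezium change event for an INSERT (op == "c");
# the row columns here are illustrative only.
event = json.dumps({
    "payload": {
        "op": "c",
        "before": None,
        "after": {"product_id": 7, "quantity": 3, "unit_price": 19.5},
        "source": {"db": "jendouba_sales", "table": "orders"},
    }
})

def extract_after(raw: str):
    """Return the post-change row image, or None for delete events."""
    payload = json.loads(raw)["payload"]
    return payload.get("after")

row = extract_after(event)
```

In the actual pipeline this unwrapping happens inside the Spark job (e.g. with `from_json` on the Kafka message value) before the result is written to `manager_table`.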
For more details and an end-to-end stream pipeline project, contact me via email at saadkhemiri123@gmail.com.