
This project builds a real-time data streaming pipeline, covering each phase from data ingestion through processing to storage. It uses a stack of tools and technologies including Apache Airflow, Python, Apache Kafka, Apache ZooKeeper, Apache Spark, and Cassandra, all containerized with Docker.


📊 Data Pipeline Project with Airflow, Kafka, Spark, and Cassandra

This project implements an automated data engineering workflow that collects, processes, and stores randomly generated user data. It uses Apache Airflow for orchestration, Apache Kafka for data stream handling, Apache Spark for data processing, and Cassandra for persistent storage.

🚀 Quick Start

Prerequisites

  • Docker
  • Docker Compose

Setup

1. Clone the repository

git clone https://github.com/pablogzalez/Realtime-Data-Streaming.git
cd Realtime-Data-Streaming

2. Start the services

Use Docker Compose to build and start the necessary services (Airflow, Kafka, Spark, Cassandra).

docker-compose up -d

3. Execution

  • Apache Airflow: Access the Airflow UI at http://localhost:8080 and trigger the user_automation DAG.
  • Verify execution: Check the logs in Airflow to ensure data is being processed and stored correctly.
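For reference, a record on the Kafka topic might look like the sketch below. This is a hedged illustration only: the field names, the `users_created` topic, and the `generate_user` helper are assumptions for the example, not taken from the repository, and the real `user_automation` DAG may shape its data differently.

```python
import json
import random
import uuid


def generate_user():
    """Build a random user record of the kind the DAG might publish
    to Kafka (field names are illustrative, not the repo's schema)."""
    first_names = ["Ana", "Luis", "Marta", "Pablo"]
    last_names = ["Garcia", "Lopez", "Santos", "Diaz"]
    return {
        "id": str(uuid.uuid4()),
        "first_name": random.choice(first_names),
        "last_name": random.choice(last_names),
        "age": random.randint(18, 80),
    }


def serialize(record):
    """Kafka messages travel as bytes, so encode the record as UTF-8 JSON."""
    return json.dumps(record).encode("utf-8")


if __name__ == "__main__":
    # With kafka-python installed, the record could be sent like this:
    # from kafka import KafkaProducer
    # producer = KafkaProducer(bootstrap_servers="localhost:9092")
    # producer.send("users_created", serialize(generate_user()))
    print(serialize(generate_user()))
```

Checking the Airflow task logs for records of this shape is a quick way to confirm the producer side of the pipeline is working before looking at Spark or Cassandra.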

📋 Architecture

This project follows a data flow architecture involving the following components:

  • Apache Airflow: Orchestrates the workflow of data collection, processing, and storage.
  • Apache Kafka: Acts as a messaging system to handle real-time data.
  • Apache Spark: Processes the real-time data read from Kafka.
  • Cassandra: Stores the processed data for future queries and analysis.
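To make the Spark-to-Cassandra hop concrete, the sketch below shows the per-record work a streaming job would perform: decode a Kafka message value and render it as a CQL insert. The keyspace and table names (`spark_streams.created_users`) and the columns are hypothetical; a real job would use PySpark Structured Streaming with the spark-cassandra-connector, which handles binding and batching, rather than hand-built CQL strings.

```python
import json


def parse_message(value: bytes) -> dict:
    """Decode a Kafka message value (UTF-8 JSON) into a row dict."""
    return json.loads(value.decode("utf-8"))


def to_cql_insert(row: dict, keyspace: str = "spark_streams",
                  table: str = "created_users") -> str:
    """Render a row as a CQL INSERT statement (illustrative only;
    keyspace/table names here are assumptions, not the repo's)."""
    columns = ", ".join(row)
    values = ", ".join(
        f"'{v}'" if isinstance(v, str) else str(v) for v in row.values()
    )
    return f"INSERT INTO {keyspace}.{table} ({columns}) VALUES ({values});"


if __name__ == "__main__":
    msg = b'{"id": "abc", "first_name": "Ana", "age": 30}'
    print(to_cql_insert(parse_message(msg)))
```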

🛠 Technologies Used

  • Apache Airflow
  • Apache Kafka
  • Apache Spark
  • Cassandra
  • Docker
