ArmanShakeri/pysparkMetrics

Summary

This is a PySpark pipeline that consumes messages from Kafka and inserts the processed records into a Delta table.

In most data processing pipelines, the source is Apache Kafka, and I needed to monitor the Kafka consumption status and its lag in an external monitoring system. Therefore, I created this project using the following technologies:

  • Spark Structured Streaming: a scalable and fault-tolerant stream processing engine
  • Kafka: the message broker and source of the pipeline
  • MinIO: distributed object storage for storing the processed data
  • Delta Lake: an open-source storage framework that enables building a Lakehouse architecture with compute engines like Spark
  • Prometheus, Prometheus Pushgateway, and Grafana for the monitoring system
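
The core of the pipeline looks roughly like the sketch below. The topic name, MinIO endpoint, credentials, and bucket paths are illustrative placeholders, not the exact values used in this repo:

```python
from pyspark.sql import SparkSession

# Build a SparkSession with Delta Lake enabled and MinIO wired in through the S3A connector
spark = (
    SparkSession.builder
    .appName("pysparkMetrics")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")      # placeholder endpoint
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")           # placeholder credentials
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Consume messages from Kafka
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")                                     # placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Write the processed records to a Delta table stored in MinIO
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://bucket/checkpoints/events")
    .start("s3a://bucket/delta/events")
)
query.awaitTermination()
```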

Faker

If Faker is enabled, fake data is generated in the background at a rate controlled by Faker_num_threads and Faker_sleep_s.
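
A minimal sketch of how such a generator could look. The kafka-python and Faker packages, the topic name, and the way Faker_num_threads / Faker_sleep_s are passed in are assumptions for illustration, not the exact code in this repo:

```python
import json
import threading
import time

from faker import Faker
from kafka import KafkaProducer

fake = Faker()
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def produce_fake(sleep_s: float) -> None:
    # Each thread produces one fake record per iteration, then sleeps
    while True:
        producer.send("events", {"name": fake.name(), "address": fake.address()})
        time.sleep(sleep_s)

def start_faker(num_threads: int, sleep_s: float) -> None:
    # Faker_num_threads controls parallelism; Faker_sleep_s controls the per-thread rate
    for _ in range(num_threads):
        threading.Thread(target=produce_fake, args=(sleep_s,), daemon=True).start()

start_faker(num_threads=2, sleep_s=0.5)
```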

Metrics

Metric extraction is implemented in metrics.py and can be extended to other sources.
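
As an example of the idea, the sketch below pushes a few fields of a streaming query's progress to the Pushgateway with prometheus_client. The function name, metric names, and Pushgateway address are illustrative and not the exact contents of metrics.py:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_progress_metrics(progress: dict, job: str = "pyspark_metrics") -> None:
    # progress is the dict form of a StreamingQueryProgress
    # (e.g. json.loads(event.progress.json) inside a listener)
    registry = CollectorRegistry()
    Gauge("spark_num_input_rows", "Rows in the last micro-batch",
          registry=registry).set(progress.get("numInputRows", 0) or 0)
    Gauge("spark_input_rows_per_second", "Input rate",
          registry=registry).set(progress.get("inputRowsPerSecond", 0) or 0)
    Gauge("spark_processed_rows_per_second", "Processing rate",
          registry=registry).set(progress.get("processedRowsPerSecond", 0) or 0)
    # Forward everything to the Pushgateway so Prometheus can scrape it
    push_to_gateway("pushgateway:9091", job=job, registry=registry)
```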

Test

I used the following Docker images to test the code:
  • bitnami/kafka
  • minio/minio
  • prom/prometheus
  • prom/pushgateway
  • grafana/grafana

My Spark version was 3.4.1 with Delta Lake 2.4. StreamingQueryListener is a new class in PySpark 3.4.0: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.html
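
A listener that forwards each micro-batch's progress to a metric push helper could look like the sketch below. It builds on the earlier sketches: PushgatewayListener and push_progress_metrics are hypothetical names, not the classes defined in this repo:

```python
import json

from pyspark.sql.streaming import StreamingQueryListener

class PushgatewayListener(StreamingQueryListener):
    # Available from PySpark 3.4.0; called by Spark on streaming query lifecycle events
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Forward the micro-batch progress to the hypothetical helper sketched above
        push_progress_metrics(json.loads(event.progress.json))

    def onQueryTerminated(self, event):
        pass

# Register on the SparkSession built earlier, before starting the query
spark.streams.addListener(PushgatewayListener())
```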
