ArmanShakeri/pysparkMetrics

Summary

This is a PySpark pipeline that consumes messages from Kafka and inserts the processed records into a Delta table.

In most data processing pipelines, the source is Apache Kafka, and I needed to monitor the Kafka consumption status and its lag in an external monitoring system. Therefore, I created this project using the following technologies:

  • Spark Structured Streaming: a scalable and fault-tolerant stream processing engine
  • Kafka: the message broker and source of the pipeline
  • MinIO: distributed object storage for storing the processed data
  • Delta Lake: an open-source storage framework that enables building a Lakehouse architecture with compute engines like Spark
  • Prometheus, Prometheus Pushgateway, and Grafana for the monitoring system
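
The core of the pipeline looks roughly like the sketch below. The topic name, MinIO endpoint, credentials, and bucket paths are illustrative placeholders, not the exact values used in this repo:

```python
from pyspark.sql import SparkSession

# Build a SparkSession with Delta Lake enabled and MinIO wired in through the S3A connector
spark = (
    SparkSession.builder
    .appName("pysparkMetrics")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")      # placeholder endpoint
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")           # placeholder credentials
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Consume messages from Kafka
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")                                     # placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Write the processed records to a Delta table stored in MinIO
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://bucket/checkpoints/events")
    .start("s3a://bucket/delta/events")
)
query.awaitTermination()
```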

Faker

If Faker is enabled, fake data is generated in the background at a rate controlled by Faker_num_threads and Faker_sleep_s.
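
A minimal sketch of how such a generator could look. The kafka-python and Faker packages, the topic name, and the way Faker_num_threads / Faker_sleep_s are passed in are assumptions for illustration, not the exact code in this repo:

```python
import json
import threading
import time

from faker import Faker
from kafka import KafkaProducer

fake = Faker()
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def produce_fake(sleep_s: float) -> None:
    # Each thread produces one fake record per iteration, then sleeps
    while True:
        producer.send("events", {"name": fake.name(), "address": fake.address()})
        time.sleep(sleep_s)

def start_faker(num_threads: int, sleep_s: float) -> None:
    # Faker_num_threads controls parallelism; Faker_sleep_s controls the per-thread rate
    for _ in range(num_threads):
        threading.Thread(target=produce_fake, args=(sleep_s,), daemon=True).start()

start_faker(num_threads=2, sleep_s=0.5)
```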

Metrics

Metric extraction is implemented in metrics.py and can be extended to other sources.
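
As an example of the idea, the sketch below pushes a few fields of a streaming query's progress to the Pushgateway with prometheus_client. The function name, metric names, and Pushgateway address are illustrative and not the exact contents of metrics.py:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_progress_metrics(progress: dict, job: str = "pyspark_metrics") -> None:
    # progress is the dict form of a StreamingQueryProgress
    # (e.g. json.loads(event.progress.json) inside a listener)
    registry = CollectorRegistry()
    Gauge("spark_num_input_rows", "Rows in the last micro-batch",
          registry=registry).set(progress.get("numInputRows", 0) or 0)
    Gauge("spark_input_rows_per_second", "Input rate",
          registry=registry).set(progress.get("inputRowsPerSecond", 0) or 0)
    Gauge("spark_processed_rows_per_second", "Processing rate",
          registry=registry).set(progress.get("processedRowsPerSecond", 0) or 0)
    # Forward everything to the Pushgateway so Prometheus can scrape it
    push_to_gateway("pushgateway:9091", job=job, registry=registry)
```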

Test

I used the following Docker images to test the code:
  • bitnami/kafka
  • minio/minio
  • prom/prometheus
  • prom/pushgateway
  • grafana/grafana

My Spark version was 3.4.1 with Delta Lake 2.4. StreamingQueryListener is a new class in PySpark 3.4.0: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.html
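
A listener that forwards each micro-batch's progress to a metric push helper could look like the sketch below. It builds on the earlier sketches: PushgatewayListener and push_progress_metrics are hypothetical names, not the classes defined in this repo:

```python
import json

from pyspark.sql.streaming import StreamingQueryListener

class PushgatewayListener(StreamingQueryListener):
    # Available from PySpark 3.4.0; called by Spark on streaming query lifecycle events
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Forward the micro-batch progress to the hypothetical helper sketched above
        push_progress_metrics(json.loads(event.progress.json))

    def onQueryTerminated(self, event):
        pass

# Register on the SparkSession built earlier, before starting the query
spark.streams.addListener(PushgatewayListener())
```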
