
TODO: Please see this task board on GitHub.

Streaming data-ops

Sample repo for understanding Spark structured streaming data-ops with Databricks and Azure IoT tools

Motivation (problem statement)

After reading some public documents, I realized there are few resources that describe the following:

  • How to write unit tests for streaming data in a local Spark environment.
  • How to automate a CI/CD pipeline with Databricks.

To help developers keep their code quality high through testing and pipelines, I want to share how to achieve both.

Architecture

[Architecture diagram]

How to run the app locally

  1. If you are new to developing inside a container, please read this document and set up your environment by referring to Getting started.
  2. Clone and open the repository inside the container by following this document.
  3. Set environment variables with your Event Hub (IoT Hub) information:
export EVENTHUB_CONNECTION_STRING="{Your event hub connection string}"
export EVENTHUB_CONSUMER_GROUP="{Your consumer group name}"
  4. Run pyspark --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.16 < stream_app.py in the Visual Studio Code terminal to execute structured streaming. It shows telemetry in the console.
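For orientation, the sketch below shows the shape of such an app: it reads the two environment variables, builds an Event Hubs source with the azure-eventhubs-spark connector, and writes the telemetry to the console. The option names follow the connector's documentation, but this is a sketch, not the exact contents of stream_app.py.

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-dataops").getOrCreate()

# Recent versions of the connector expect the connection string to be encrypted.
encrypt = spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt
eh_conf = {
    "eventhubs.connectionString": encrypt(os.environ["EVENTHUB_CONNECTION_STRING"]),
    "eventhubs.consumerGroup": os.environ["EVENTHUB_CONSUMER_GROUP"],
}

stream = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The Event Hubs body column is binary; cast it to a string to see the telemetry.
query = (
    stream.selectExpr("CAST(body AS STRING) AS body")
    .writeStream
    .outputMode("append")
    .format("console")
    .start()
)
query.awaitTermination()
```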

Environment variable example

| Name | Example | IoT Hub Built-in endpoints name |
| --- | --- | --- |
| EVENTHUB_CONNECTION_STRING | Endpoint=sb://xxx.servicebus.windows.net/;SharedAccessKeyName=xxxxx;SharedAccessKey=xxx;EntityPath=xxxx | Event Hub-compatible endpoint |
| EVENTHUB_CONSUMER_GROUP | Consumer group name which you created (default is $Default) | Consumer Groups |
  • xxx is used to mask secret values.
  • Refer to the third column to find the corresponding connection setting under the Azure IoT Hub built-in endpoints.

How to run tests locally

This repo uses pytest for unit testing. To run the unit tests, type pytest in the root folder; the test results appear in the terminal.
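pytest picks up the tests as long as a local SparkSession is available to them. A typical way to provide one is a session-scoped fixture in conftest.py, sketched below; the fixture name and settings are illustrative, not necessarily what this repo uses.

```python
# conftest.py — provide a local SparkSession to all tests.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[2]")  # run Spark locally with two worker threads
        .appName("streaming-dataops-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```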

Set up and run a CI/CD pipeline with Azure Databricks and Azure DevOps

Please see this document.

Utilize the pyot library from a Databricks notebook

In the notebook, the app fetches secrets from Azure Key Vault, so you need to set that up first.

  1. Save your EVENTHUB_CONNECTION_STRING value (the Event Hub-compatible endpoint in IoT Hub) as an iot-connection-string secret in Azure Key Vault. Please refer to this document.
  2. Set up an Azure Key Vault-backed scope in your Azure Databricks workspace, using key-vault-secrets as the scope name. Please refer to this document.
  3. Import ProcessStreaming.py under the notebooks folder into your Databricks workspace and run it on a cluster.
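Once the scope exists, the notebook can read the secret through the Databricks secrets utility. A minimal sketch, where the scope and key names match steps 1 and 2 above:

```python
# Inside a Databricks notebook: dbutils is provided by the runtime.
# Read the IoT Hub connection string from the Key Vault-backed scope.
connection_string = dbutils.secrets.get(
    scope="key-vault-secrets",
    key="iot-connection-string",
)
```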

Reference

DataOps strategy

To understand the concepts behind this repo, please read the following repos and blog posts.

Spark structured streaming and Azure Event Hubs

Unit testing with Spark structured streaming

We have two options for testing streaming data: reading a stored file as a stream, or using MemoryStream. Because I can easily generate JSON files from real streaming data, I chose the first option. A sketch of this approach follows.
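The sketch below assumes the spark fixture from the conftest.py sketch above; the schema, file contents, and table name are illustrative, not taken from this repo. Note that the memory sink here only collects output for assertions; it is unrelated to the MemoryStream source mentioned as the second option.

```python
from pyspark.sql.types import DoubleType, StringType, StructField, StructType


def test_reads_stored_json_as_stream(spark, tmp_path):
    # Streaming sources require an explicit schema; these fields are illustrative.
    schema = StructType([
        StructField("deviceId", StringType()),
        StructField("temperature", DoubleType()),
    ])

    # Write one JSON record, standing in for a file captured from the real stream.
    (tmp_path / "telemetry.json").write_text(
        '{"deviceId": "device-1", "temperature": 21.5}'
    )

    # Read the folder as a stream, just like the production source would be read.
    stream = spark.readStream.schema(schema).json(str(tmp_path))

    # Drain the stream into an in-memory table so the result can be asserted on.
    query = (
        stream.writeStream
        .format("memory")
        .queryName("telemetry_test")
        .outputMode("append")
        .start()
    )
    query.processAllAvailable()
    query.stop()

    rows = spark.sql("SELECT * FROM telemetry_test").collect()
    assert rows[0]["deviceId"] == "device-1"
```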

Setup development environment
