This document describes how to automate testing, building, and deploying a Python package to Azure Databricks with Azure DevOps and Azure Key Vault. You'll create a CI/CD pipeline on Azure DevOps and use Key Vault to store and fetch secrets such as the Azure Databricks hostname and access tokens.
To make the PySpark CI/CD flow with Databricks easier to understand, let's walk through the manual steps first.
Because Databricks accepts a whl package rather than raw project files, we need to build the package with setuptools. If you are new to packaging, please see Packaging Python Projects in the reference section.
- Run `python setup.py sdist bdist_wheel` to build the package
Upload the whl package to Databricks and install it on a specific cluster
- Set up the Databricks CLI
- Run `dbfs mkdirs dbfs:/FileStore/whls` to make a library folder in DBFS (Databricks File System)
- Run `dbfs cp "{Your local code path}/streaming-dataops/dist/pyot-0.0.1-py3-none-any.whl" dbfs:/FileStore/whls` to upload the package to DBFS
- Run `databricks clusters list` to see cluster ids
- Run `databricks clusters start --cluster-id {Your cluster id}` to start your cluster
- Install your own Python library by running `databricks libraries install --cluster-id {Your cluster id} --whl "dbfs:/FileStore/whls/pyot-0.0.1-py3-none-any.whl"`
- Install the Event Hub connector by running `databricks libraries install --cluster-id {Your cluster id} --maven-coordinates com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.16`
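The manual steps above can be strung together into one script. The sketch below makes some assumptions: the cluster id is a placeholder, the wheel is expected in `./dist`, and each command is echoed first and only executed when its CLI tool is actually installed, so you can dry-run it locally without a workspace.

```shell
#!/bin/sh
# Sketch of the manual deployment steps above.
# Assumptions: placeholder cluster id; wheel built into ./dist by setup.py.
# In a real script you would also add `set -e` to stop on the first failure.

WHL="dist/pyot-0.0.1-py3-none-any.whl"
CLUSTER_ID="${CLUSTER_ID:-0000-000000-example}"   # placeholder cluster id

run() {
  # Echo each command; execute it only if the CLI tool exists,
  # so this sketch can be dry-run without a Databricks workspace.
  echo "+ $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

run dbfs mkdirs dbfs:/FileStore/whls        # library folder in DBFS
run dbfs cp "$WHL" dbfs:/FileStore/whls     # upload the wheel
run databricks clusters start --cluster-id "$CLUSTER_ID"
run databricks libraries install --cluster-id "$CLUSTER_ID" \
  --whl "dbfs:/FileStore/whls/pyot-0.0.1-py3-none-any.whl"
run databricks libraries install --cluster-id "$CLUSTER_ID" \
  --maven-coordinates com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.16
```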
- Have an Azure Databricks workspace initialized, with at least one cluster created (DBR 6.5/Spark 2.4.5)
- Have a basic understanding of Azure Pipelines. You can learn about it here.
- Be able to access secrets from Azure Key Vault. You can learn how in this tutorial.
- Fork or clone this repository, so that you can update the YAML pipeline with your own information.
- Get a Databricks personal access token.
- Save the following required secrets in Azure Key Vault (see the table below)
- Set up an Azure Pipeline with `devops/ci-python-package.yml` and run the pipeline. You can refer to this page to learn how to set up a pipeline.
- After step 6 in that document, select "Existing Azure Pipelines YAML file" in the [Configure] tab; you can then pick the YAML file from the [Path] dropdown menu in the pop-up window to use our pipeline.
- In the [Review] tab, you can set up the Key Vault configuration. Click "Settings" just above the AzureKeyVault@1 task, so that you can set your Azure subscription and Key Vault name.
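For orientation, the Key Vault step inside such a pipeline typically looks like the fragment below. This is a hedged sketch, not the contents of `devops/ci-python-package.yml`: the service connection and vault names are placeholders, while the `AzureKeyVault@1` task and the secret names come from this document.

```yaml
# Sketch only -- not the actual devops/ci-python-package.yml.
# 'my-service-connection' and 'my-keyvault' are placeholders.
steps:
  - task: AzureKeyVault@1
    inputs:
      azureSubscription: 'my-service-connection'
      KeyVaultName: 'my-keyvault'
      SecretsFilter: 'databricks-host,databricks-token,databricks-cluster-id'
  # Fetched secrets become pipeline variables, usable as env vars:
  - script: databricks clusters list
    env:
      DATABRICKS_HOST: $(databricks-host)
      DATABRICKS_TOKEN: $(databricks-token)
```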
Secret name | Secret value |
---|---|
databricks-host | {Your Databricks host name}, such as https://adb-xxxx.azuredatabricks.net |
databricks-token | {Your Databricks personal access token} |
databricks-cluster-id | {Your cluster id} |
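If you prefer the Azure CLI to the portal for saving these secrets, `az keyvault secret set` is the real command for it; in the sketch below the vault name and all secret values are placeholders, and each command is echoed and only executed when `az` is installed.

```shell
#!/bin/sh
# Store the required pipeline secrets in Azure Key Vault.
# 'my-keyvault' and all secret values below are placeholders.
VAULT="my-keyvault"

store() {
  # Echo the command; run it only if the Azure CLI is available.
  echo "+ az keyvault secret set --vault-name $VAULT --name $1"
  if command -v az >/dev/null 2>&1; then
    az keyvault secret set --vault-name "$VAULT" --name "$1" --value "$2" >/dev/null
  fi
}

store databricks-host       "https://adb-xxxx.azuredatabricks.net"
store databricks-token      "{Your Databricks personal access token}"
store databricks-cluster-id "{Your cluster id}"
```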
The Databricks CLI normally requires an interactive initial setup via `databricks configure --token`, but interaction isn't possible in a pipeline. CLI 0.8.0 and above supports the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables, so we use those to achieve a non-interactive setup.
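The non-interactive setup can be sketched as follows; the host and token values here are placeholders that the pipeline would instead pull from the Key Vault secrets.

```shell
#!/bin/sh
# Non-interactive Databricks CLI auth: CLI >= 0.8.0 reads these variables,
# so no `databricks configure --token` prompt is needed.
# Placeholder values; in the pipeline they come from Key Vault secrets.
export DATABRICKS_HOST="https://adb-xxxx.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi-example-token"

# Fail fast if either variable is empty (guards against a
# misconfigured secret-fetching step).
if [ -z "$DATABRICKS_HOST" ] || [ -z "$DATABRICKS_TOKEN" ]; then
  echo "ERROR: DATABRICKS_HOST and DATABRICKS_TOKEN must be set" >&2
  exit 1
fi
echo "Databricks CLI configured for $DATABRICKS_HOST"
```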