This document describes how to automate testing, building, and deploying a Python package to Azure Databricks with Azure DevOps and Azure Key Vault. You'll create a CI/CD pipeline on Azure DevOps and use Key Vault to store and fetch secrets such as the Azure Databricks hostname and access tokens.
To make the PySpark CI/CD flow with Databricks easier to understand, let's walk through the manual steps first.
Because Databricks accepts a whl package rather than raw project files, we need to build the package with setuptools. If you are new to packaging, please see Packaging Python Projects in the reference section.
- Run `python setup.py sdist bdist_wheel` to build the package
Upload the whl package to Databricks and install it on a specific cluster
- Set up the Databricks CLI
- Run `dbfs mkdirs dbfs:/FileStore/whls` to make a library folder in DBFS (Databricks File System)
- Run `dbfs cp "{Your local code path}/streaming-dataops/dist/pyot-0.0.1-py3-none-any.whl" dbfs:/FileStore/whls` to upload the package to DBFS
- Run `databricks clusters list` to see cluster ids
- Run `databricks clusters start --cluster-id {Your cluster id}` to start your cluster
- Install your own Python library by running `databricks libraries install --cluster-id {Your cluster id} --whl "dbfs:/FileStore/whls/pyot-0.0.1-py3-none-any.whl"`
- Install the Event Hub connector by running `databricks libraries install --cluster-id {Your cluster id} --maven-coordinates com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.16`
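The manual steps above can be strung together into one script. The sketch below makes some assumptions: the cluster id is a placeholder, the wheel is expected in `./dist`, and each command is echoed first and only executed when its CLI tool is actually installed, so you can dry-run it locally without a workspace.

```shell
#!/bin/sh
# Sketch of the manual deployment steps above.
# Assumptions: placeholder cluster id; wheel built into ./dist by setup.py.
# In a real script you would also add `set -e` to stop on the first failure.

WHL="dist/pyot-0.0.1-py3-none-any.whl"
CLUSTER_ID="${CLUSTER_ID:-0000-000000-example}"   # placeholder cluster id

run() {
  # Echo each command; execute it only if the CLI tool exists,
  # so this sketch can be dry-run without a Databricks workspace.
  echo "+ $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

run dbfs mkdirs dbfs:/FileStore/whls        # library folder in DBFS
run dbfs cp "$WHL" dbfs:/FileStore/whls     # upload the wheel
run databricks clusters start --cluster-id "$CLUSTER_ID"
run databricks libraries install --cluster-id "$CLUSTER_ID" \
  --whl "dbfs:/FileStore/whls/pyot-0.0.1-py3-none-any.whl"
run databricks libraries install --cluster-id "$CLUSTER_ID" \
  --maven-coordinates com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.16
```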
- Have an Azure Databricks workspace initialized, with at least one cluster created (DBR 6.5/Spark 2.4.5)
- Have a basic understanding of Azure Pipelines. You can learn about it here.
- Be able to access secrets from Azure Key Vault. You can learn how in this tutorial.
- Fork or clone this repository, so that you can update the YAML pipeline with your own information.
- Get a Databricks personal access token.
- Save the following required secrets in Azure Key Vault (see the table below)
- Set up an Azure Pipeline with `devops/ci-python-package.yml` and run the pipeline. You can refer to this page to learn how to set up a pipeline.
- After step 6 in that document, select "Existing Azure Pipelines YAML file" in the [Configure] tab; you can then pick the YAML file from the [Path] dropdown menu in the pop-up window to use our pipeline.
- In the [Review] tab, you can set up the Key Vault configuration. Click "Settings" just above the AzureKeyVault@1 task, so that you can set your Azure subscription and Key Vault name.
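For orientation, the Key Vault step inside such a pipeline typically looks like the fragment below. This is a hedged sketch, not the contents of `devops/ci-python-package.yml`: the service connection and vault names are placeholders, while the `AzureKeyVault@1` task and the secret names come from this document.

```yaml
# Sketch only -- not the actual devops/ci-python-package.yml.
# 'my-service-connection' and 'my-keyvault' are placeholders.
steps:
  - task: AzureKeyVault@1
    inputs:
      azureSubscription: 'my-service-connection'
      KeyVaultName: 'my-keyvault'
      SecretsFilter: 'databricks-host,databricks-token,databricks-cluster-id'
  # Fetched secrets become pipeline variables, usable as env vars:
  - script: databricks clusters list
    env:
      DATABRICKS_HOST: $(databricks-host)
      DATABRICKS_TOKEN: $(databricks-token)
```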
Secret name | Secret value |
---|---|
databricks-host | {Your Databricks host name}, such as https://adb-xxxx.azuredatabricks.net |
databricks-token | {Your Databricks personal access token} |
databricks-cluster-id | {Your cluster id} |
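If you prefer the Azure CLI to the portal for saving these secrets, `az keyvault secret set` is the real command for it; in the sketch below the vault name and all secret values are placeholders, and each command is echoed and only executed when `az` is installed.

```shell
#!/bin/sh
# Store the required pipeline secrets in Azure Key Vault.
# 'my-keyvault' and all secret values below are placeholders.
VAULT="my-keyvault"

store() {
  # Echo the command; run it only if the Azure CLI is available.
  echo "+ az keyvault secret set --vault-name $VAULT --name $1"
  if command -v az >/dev/null 2>&1; then
    az keyvault secret set --vault-name "$VAULT" --name "$1" --value "$2" >/dev/null
  fi
}

store databricks-host       "https://adb-xxxx.azuredatabricks.net"
store databricks-token      "{Your Databricks personal access token}"
store databricks-cluster-id "{Your cluster id}"
```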
The Databricks CLI normally requires an interactive initial setup via `databricks configure --token`, but interaction isn't possible in a pipeline. CLI 0.8.0 and above supports the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables, so we use those to achieve a non-interactive setup.
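The non-interactive setup can be sketched as follows; the host and token values here are placeholders that the pipeline would instead pull from the Key Vault secrets.

```shell
#!/bin/sh
# Non-interactive Databricks CLI auth: CLI >= 0.8.0 reads these variables,
# so no `databricks configure --token` prompt is needed.
# Placeholder values; in the pipeline they come from Key Vault secrets.
export DATABRICKS_HOST="https://adb-xxxx.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi-example-token"

# Fail fast if either variable is empty (guards against a
# misconfigured secret-fetching step).
if [ -z "$DATABRICKS_HOST" ] || [ -z "$DATABRICKS_TOKEN" ]; then
  echo "ERROR: DATABRICKS_HOST and DATABRICKS_TOKEN must be set" >&2
  exit 1
fi
echo "Databricks CLI configured for $DATABRICKS_HOST"
```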