Azure Databricks clusters include a number of Python, R, Java, and Scala libraries that are pre-installed as part of the Databricks Runtime. See the Databricks Runtime release notes for your cluster's runtime version for the list of installed libraries. Despite this long list of pre-installed libraries, you may still encounter cases where you need to add a third-party library or locally-built code to one or more cluster execution environments. The easiest way to do this in Azure Databricks is to create a new library. Libraries can be written in Python, Java, Scala, and R, and you can create and manage them using the UI, the Databricks CLI, or the Libraries API. We will show examples of each of these options in the sections below.
When you create a library, you can choose the destination for the library within the Workspace, just like you do when creating a notebook or folder. If you want the library to be shared by all users of your Workspace, create the library within the Shared folder. This is also true for notebooks and dashboards that you create within the workspace.
On the other hand, you may want the library to be available only to your user account or to another specific user. To do this, simply create it within the appropriate user folder instead.
Additional things to keep in mind about libraries are:
- Libraries are immutable. They can only be created and deleted.
- To completely delete a library from a cluster you must restart the cluster.
- Azure Databricks stores libraries that you upload in the FileStore.
- After you attach a library to a cluster, to use the library you must reattach any notebooks using the cluster.
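Cluster attachment can also be inspected programmatically. As a rough sketch (the workspace URL and cluster ID below are placeholders, not values from this article), the Libraries API's cluster-status endpoint reports each library attached to a cluster and its install status:

```python
# Placeholder values -- substitute your own workspace URL and cluster ID.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
cluster_id = "0123-456789-abc123"

# The Libraries API exposes a cluster-status endpoint that reports every
# library attached to a cluster along with its install status.
status_endpoint = f"{workspace_url}/api/2.0/libraries/cluster-status"
params = {"cluster_id": cluster_id}

# A real call would supply a personal access token, e.g.:
# import requests
# resp = requests.get(status_endpoint, params=params,
#                     headers={"Authorization": f"Bearer {token}"})
# print(resp.json())
print(status_endpoint)
```

The same endpoint is what backs the library detail screens shown in the UI steps that follow.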
In the Business Intelligence and Data Visualization article, we referenced the third-party d3a Maven package but did not go over how to create a library that adds the package to the cluster's execution environment. Follow the steps below to add the d3a package in a new shared library:
1. Go to the Workspace folder, right-click the Shared folder, and select Create -> Library.
2. In the New Library form, select Maven Coordinates as the source, then select Search Spark Packages and Maven Central. Alternatively, if you know the exact Maven coordinate, enter it in the Coordinate field. Maven coordinates are in the form groupId:artifactId:version; for example, graphframes:graphframes:0.5.0-spark2.1-s_2.11.
3. Your search results should appear within the Search Packages dialog. Note: sometimes you will need to reenter your search in the search box at the top of this dialog. The select list to the right allows you to narrow your search to Spark Packages or Maven Central; in this case, select Spark Packages. The Releases select list allows you to select the package release that is compatible with your Spark version. Select the latest release, then click + Select under the Options column.
4. After you select the package, the Search Packages dialog closes, and you should see the graphframes coordinate listed, based on your selection. The Advanced Options allow you to specify an alternative repository URL from which to obtain the package, such as https://oss.sonatype.org/content/repositories. The Excludes field enables you to exclude specific dependencies from the selected package by providing the groupId and the artifactId of the dependencies you want to exclude; for example, log4j:log4j. Select Create Library.
5. The library details are displayed after the new library is created. Here you can view its artifacts, including dependencies, delete the library, and select the clusters to which the library should be attached. Check the Attach automatically to all clusters checkbox to attach this library to all existing clusters and any new clusters created in the future.
If any notebooks were attached to the cluster before you attached the library, you must detach and then reattach them to the cluster before they can access the library.
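The same Maven library can be attached through the Libraries API instead of the UI. The sketch below only builds the request body for the install endpoint; the cluster ID is a placeholder, and the actual POST call is left out:

```python
import json

# Placeholder cluster ID -- substitute your own.
cluster_id = "0123-456789-abc123"

# Request body for POST /api/2.0/libraries/install. A Maven library is
# specified by its groupId:artifactId:version coordinate; the optional
# "exclusions" list mirrors the Excludes field in the UI.
payload = {
    "cluster_id": cluster_id,
    "libraries": [{
        "maven": {
            "coordinates": "graphframes:graphframes:0.5.0-spark2.1-s_2.11",
            "exclusions": ["log4j:log4j"],
        }
    }],
}
print(json.dumps(payload, indent=2))
```

The library must still be reattached to running notebooks as described above, regardless of how it was installed.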
Now that the required d3a Maven package has been attached to the cluster through the new library, it can be used within a notebook as follows:
import d3a._
graphs.force(
height = 800,
width = 1200,
clicks = sql("select src, dst as dest, count(1) as count from departureDelays_geo where delay <= 0 group by src, dst").as[Edge])
The import command is able to locate the package because its files have been uploaded and attached to the cluster. Remember to detach and reattach the notebook to the cluster if you added the library after the notebook was attached.
If you want to install a Java or Scala JAR, also referred to as a local library, follow these steps in the New Library form (which appears starting with step 2 above):
1. In the Source drop-down list, select Upload Java/Scala JAR.
2. Enter a library name.
3. Click and drag your JAR to the JAR File text box.
4. Select Create Library. The library detail screen will display.
5. In the Attach column, select the clusters to attach the library to, or select Attach automatically to all clusters.
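A JAR that is already in DBFS can likewise be attached through the Libraries API. This sketch only builds the request body for the install endpoint; the cluster ID and JAR path are placeholders (uploaded JARs land in the FileStore, as noted earlier in this article):

```python
import json

# Placeholder cluster ID and DBFS path -- substitute your own.
payload = {
    "cluster_id": "0123-456789-abc123",
    "libraries": [{"jar": "dbfs:/FileStore/jars/my-library.jar"}],
}
print(json.dumps(payload, indent=2))
```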
You can also install a PyPI package or upload a Python Egg. To do so, follow these steps in the New Library form:
1. In the Source drop-down list, select Upload Python Egg or PyPI.
2. If installing a PyPI package, enter a PyPI package name and select Install Library. The library detail screen will display. Note: PyPI has a specific format for installing specific versions of libraries; for example, to install a specific version of simplejson, enter simplejson==3.15.0.
3. If installing a Python Egg:
   - Enter a library name.
   - Click and drag the egg, and optionally the documentation egg, to the Egg File box.
   - Select Create Library. The library detail screen will display.
4. In the Attach column, select the clusters to attach the library to, or select Attach automatically to all clusters.
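PyPI libraries can also be attached through the Libraries API. The sketch below only builds the request body; the cluster ID is a placeholder, and the package string uses the same pinned-version format as the simplejson example above:

```python
import json

# Placeholder cluster ID -- substitute your own. The package string
# uses PyPI's pinned-version format (name==version).
payload = {
    "cluster_id": "0123-456789-abc123",
    "libraries": [{"pypi": {"package": "simplejson==3.15.0"}}],
}
print(json.dumps(payload, indent=2))
```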
R has a rich ecosystem of packages available through the Comprehensive R Archive Network, or CRAN. To install CRAN libraries that you can use on your Azure Databricks clusters, follow these steps in the New Library form:
1. In the Source drop-down list, select R Library.
2. In the Install from drop-down list, CRAN-like Repository is the only option and is selected by default. This option covers both CRAN and Bioconductor repositories.
3. In the Repository field, enter the CRAN repository URL.
4. In the Package field, enter the name of the package.
5. Select Create Library. The library detail screen will display.
6. In the Attach column, select the clusters to attach the library to, or select Attach automatically to all clusters.
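CRAN libraries have their own spec in the Libraries API as well. This sketch only builds the request body; the cluster ID is a placeholder, "forecast" is an arbitrary example package name, and the repo value mirrors the Repository field in the UI:

```python
import json

# Placeholder cluster ID; "forecast" is an arbitrary example package,
# and "repo" corresponds to the Repository field in the UI.
payload = {
    "cluster_id": "0123-456789-abc123",
    "libraries": [{"cran": {"package": "forecast",
                            "repo": "https://cran.us.r-project.org"}}],
}
print(json.dumps(payload, indent=2))
```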
Some libraries require lower-level configuration and cannot be installed using the methods described in this article. To install these libraries, you can write a custom UNIX shell script that runs at cluster creation time, using cluster node initialization scripts. An initialization script is a shell script that runs during startup for each new cluster, before the Spark driver or worker JVM starts.
Important
To install Python packages, use the Azure Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install .
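As a sketch of that note (the script name and package below are arbitrary examples, not values from this article), an init script that installs a pinned PyPI package through the Databricks pip binary could be composed in a notebook like this:

```python
# Compose the init script text. The package and script name are
# arbitrary examples; the pip path comes from the note above.
script = "\n".join([
    "#!/bin/bash",
    "/databricks/python/bin/pip install simplejson==3.15.0",
])

# In a notebook you would then write it to DBFS, e.g.:
# dbutils.fs.put("dbfs:/databricks/init/pip-install.sh", script, True)
print(script)
```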
If you want to install libraries for all clusters, use a global init script. You do this by storing the scripts in the dbfs:/databricks/init/ directory.
Conversely, if you only want to install libraries on a specific cluster, use cluster-specific init scripts. These scripts also reside within the dbfs:/databricks/init/ directory, but under a subdirectory with the same name as the cluster. For instance, if your cluster name is lab, then init scripts for that cluster would be stored within dbfs:/databricks/init/lab. You must create the directory if it does not already exist.
As you can see from the file paths above, which are prefixed with dbfs:, all initialization scripts are created and managed from the Databricks File System (DBFS).
Things of note: any change to an init script requires a cluster restart to take effect. Also, if you are using cluster-specific init scripts, avoid spaces in your cluster names, because the names are used in the script and output paths.
To install a library using a global init script, perform the following steps from a notebook:
1. Create dbfs:/databricks/init/ if it doesn't exist:

dbutils.fs.mkdirs("dbfs:/databricks/init/")

You can display a list of existing global init scripts with the following:

display(dbutils.fs.ls("dbfs:/databricks/init/"))
2. Create the init script for your library. In this case, we are installing the PostgreSQL JDBC driver:

dbutils.fs.put("/databricks/init/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar http://central.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
wget --quiet -O /mnt/jars/driver-daemon/postgresql-42.2.2.jar http://central.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
Every time a cluster launches, it will execute this script.
To install a library for a specific cluster, perform the following steps from a notebook:
1. Create dbfs:/databricks/init/ if it doesn't exist:

dbutils.fs.mkdirs("dbfs:/databricks/init/")
2. Configure a cluster name variable. This should be the name of the cluster you want to initialize with this script:

clusterName = "lab"
3. Create a directory named lab (or your cluster name):

dbutils.fs.mkdirs("dbfs:/databricks/init/%s/" % clusterName)
4. Create the init script for your library:

dbutils.fs.put("/databricks/init/lab/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar http://central.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
wget --quiet -O /mnt/jars/driver-daemon/postgresql-42.2.2.jar http://central.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
Notice that the only difference between this init script and the global one is that the cluster-specific script includes the cluster name (in this case, lab) in the file path.

5. Check to make sure the cluster-specific init script exists:

display(dbutils.fs.ls("dbfs:/databricks/init/%s/postgresql-install.sh" % clusterName))
The output should look similar to the following: