The Apache Spark engine, the underlying platform for Azure Databricks, helps you ingest, manipulate, and display raw data with incredible speed. Increasingly, business leaders recognize the importance of data-driven decision making, and how having a data-oriented mindset can provide them a competitive advantage. The fact that you are using Azure Databricks to begin with is likely the result of this realization, by you or a decision maker within your organization.
Business intelligence, often referred to as BI, is simply a process for collecting and analyzing data with the goal of extracting actionable insights from that data. These insights inform business decisions backed by data analysis, most often communicated through visualizations.
A data visualization is a snapshot of the data, either static or interactive, usually in the form of a chart, map, or graph that helps you process the underlying raw data in a much more rapid and natural way. Our brains are wired to process visual information: it is often claimed that the large majority of the information our brains process is visual, and that we interpret images far faster than text. Whatever the exact figures, this is why analysts and business decision makers rely so heavily on visuals. They are an effective and efficient way to process and share raw data.
Here's an example we will use to drive home the point. Below is a subset of income tax data extracted from flat text files and stored in a Spark DataFrame. There are millions of rows of data, broken down by state.
state | zipcode | single_returns | joint_returns | numdep | total_income_amount | taxable_interest_amount | net_capital_gains | biz_net_income |
---|---|---|---|---|---|---|---|---|
AL | 0 | 488030 | 122290 | 571240 | 11444868 | 77952 | 23583 | 824487 |
AL | 0 | 195840 | 155230 | 383240 | 17810952 | 81216 | 54639 | 252768 |
AL | 0 | 72710 | 146880 | 189340 | 16070153 | 80627 | 84137 | 259836 |
AL | 0 | 24860 | 126480 | 134370 | 14288572 | 71086 | 105947 | 214668 |
AL | 0 | 16930 | 168170 | 177800 | 26053920 | 149150 | 404166 | 567439 |
AL | 0 | 3530 | 42190 | 48270 | 20752068 | 271416 | 1569967 | 822565 |
AL | 3500 | 950 | 260 | 710 | 19851 | 183 | 4 | 1657 |
AL | 3500 | 590 | 410 | 860 | 49338 | 172 | 54 | 788 |
AL | 3500 | 290 | 490 | 620 | 56170 | 185 | 139 | 584 |
AL | 3500 | 90 | 490 | 530 | 52977 | 89 | 173 | 339 |
AL | 3500 | 40 | 460 | 450 | 64329 | 205 | 709 | 1720 |
AL | 3500 | 0 | 40 | 30 | 15359 | 130 | 0 | 0 |
AL | 3500 | 800 | 190 | 860 | 19011 | 135 | -1 | 840 |
AL | 3500 | 360 | 300 | 850 | 36996 | 87 | 42 | -193 |
AL | 3500 | 140 | 230 | 340 | 28664 | 95 | 22 | 233 |
AL | 3500 | 50 | 170 | 200 | 19583 | 38 | 62 | -48 |
AL | 3500 | 0 | 160 | 170 | 25136 | 67 | 48 | 84 |
AL | 3500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
AL | 3500 | 250 | 110 | 200 | 5515 | 26 | 0 | 543 |
AL | 3500 | 110 | 150 | 230 | 11906 | 32 | 0 | 188 |
AL | 3500 | 40 | 130 | 160 | 11854 | 17 | 6 | 49 |
AL | 3500 | 20 | 110 | 130 | 10911 | 6 | 0 | -33 |
AL | 3500 | 0 | 130 | 140 | 18399 | 14 | 0 | 0 |
AL | 3500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
AL | 3500 | 2720 | 620 | 2360 | 52779 | 260 | 18 | 5331 |
AL | 3500 | 1150 | 830 | 2100 | 98951 | 306 | 218 | 3335 |
AL | 3500 | 530 | 1050 | 1610 | 116810 | 261 | 382 | 2197 |
For context, here is the data source:
val taxes2013 = spark
.read.format("csv")
.option("header", "true")
.load("dbfs:/databricks-datasets/data.gov/irs_zip_code_data/data-001/2013_soi_zipcode_agi.csv")
What is the quickest way to determine, from this data, the average income for each state? Let's display this same data on a map that shades each state according to its average income. At a glance, it is easy to determine which states have a higher income and to see the overall spread of income across the United States.
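To make the aggregation concrete, here is a minimal plain-Python sketch of the same per-state average that Spark's `groupBy("state").avg("total_income_amount")` produces. The AL income values come from the sample table above; the GA rows are made up for illustration.

```python
from collections import defaultdict

# (state, total_income_amount) pairs. The AL values are from the sample
# table above; the GA values are hypothetical, for illustration only.
rows = [
    ("AL", 11444868), ("AL", 17810952), ("AL", 16070153),
    ("GA", 9000000), ("GA", 11000000),
]

# Group total_income_amount by state and average, mirroring
# cleanedTaxes.groupBy("state").avg("total_income_amount") in Spark.
totals = defaultdict(lambda: [0, 0])  # state -> [running sum, row count]
for state, income in rows:
    totals[state][0] += income
    totals[state][1] += 1

avg_income = {state: s / n for state, (s, n) in totals.items()}
print(avg_income["GA"])  # 10000000.0
```

The resulting `state -> average` mapping is exactly the shape the map visualization consumes: one numeric value per state to drive the shading.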
This map visualization is one of the default options that come with Azure Databricks out of the box. All notebooks, regardless of their language, support Databricks visualizations through the display
function, which can render multiple data types. As opposed to the handful of basic visualizations that other notebook engines provide, Azure Databricks includes several out of the box that you would traditionally need an external library such as matplotlib
to obtain. However, if you wish to use external libraries to augment the default visualizations, you are free to do so.
These visualizations are suitable for many situations, but sometimes you need to pull in a 3rd-party library or package. Here's an example of referencing the d3a
package. You can find out more about using 3rd-party and custom packages in Creating and Using 3rd-party and Local Libraries. This example also uses the 3rd-party graphframes
package to provide DataFrame-based graphs.
// On-time and Early Arrivals
import d3a._
graphs.force(
height = 800,
width = 1200,
clicks = sql("select src, dst as dest, count(1) as count from departureDelays_geo where delay <= 0 group by src, dst").as[Edge])
The output of the code above is an interactive visualization displayed right in the notebook. You can hover over a circle on the map to view the vertices and edges that represent the relationship between origin and destination airports.
Azure Databricks gives you the ability to create dashboards that make it easy to publish your visualizations into a presentation format. This capability allows you to hide the underlying code so viewers can just focus on the visualizations and labeled data.
For our example, we will continue using the IRS Tax Statistics dataset that we used above. You can follow along, because the files are already available via the /databricks-datasets
directory, which is a repository of public, Azure Databricks-hosted datasets available on all Azure Databricks accounts.
-
First, create a new Scala notebook in the Azure Databricks dashboard.
-
In the first cell, add the following code snippet to display a title using markdown format:
%md # Income Tax Analysis Dashboard
-
In the next cell, create a new DataFrame with the tax data you will use for the dashboard.
val taxes2013 = spark
.read.format("csv")
.option("header", "true")
.load("dbfs:/databricks-datasets/data.gov/irs_zip_code_data/data-001/2013_soi_zipcode_agi.csv")
-
In a new cell, create a temporary table from the DataFrame:
taxes2013.createOrReplaceTempView("taxes2013")
-
If you look at the contents of the data, you will notice that it contains a large number of columns, many of which have unfamiliar names. For the purposes of the dashboard, select a subset of the columns into a new table, and truncate the zip codes from five digits to four so that each row covers a broader geographic area:
%sql
DROP TABLE IF EXISTS cleaned_taxes;

CREATE TABLE cleaned_taxes AS
SELECT
  state,
  int(zipcode / 10) as zipcode,
  int(mars1) as single_returns,
  int(mars2) as joint_returns,
  int(numdep) as numdep,
  double(A02650) as total_income_amount,
  double(A00300) as taxable_interest_amount,
  double(a01000) as net_capital_gains,
  double(a00900) as biz_net_income
FROM taxes2013
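The int(zipcode / 10) expression simply drops the last digit via integer division, so roughly ten adjacent zip codes collapse into one bucket. A quick plain-Python sketch of the same idea:

```python
# Truncate a 5-digit zip code to 4 digits by integer division,
# mirroring int(zipcode / 10) in the SQL cell above. Adjacent codes
# that differ only in the last digit land in the same bucket.
def truncate_zip(zipcode: int) -> int:
    return zipcode // 10

print(truncate_zip(35004))  # 3500
print(truncate_zip(35005))  # 3500 -- same bucket as 35004
```

This is why the sample table above shows four-digit values such as 3500 in its zipcode column.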
-
Paste the following markdown into a new cell as a sub-header for the diagram that follows:
%md ## Average income tax by state
-
Now create a new DataFrame from the new table, and change the display visualization to "Map":
val cleanedTaxes = spark.table("cleaned_taxes")
display(cleanedTaxes.groupBy("state").avg("total_income_amount"))
Your output should look as follows:
Select Plot options... and make sure it is configured as follows:
-
Paste the following markdown into a new cell as a sub-header for the diagram that follows:
%md ## Zip codes with lowest total capital gains
-
To display the set of zip codes with the lowest total capital gains, enter the following in a new cell and change the visualization to the Bar chart:
%sql
SELECT zipcode, SUM(net_capital_gains) AS cap_gains
FROM cleaned_taxes
WHERE NOT (zipcode = 0000 OR zipcode = 9999)
GROUP BY zipcode
ORDER BY cap_gains ASC
LIMIT 10
Your output should look as follows:
-
Paste the following markdown into a new cell as a sub-header for the diagram that follows:
%md ## Total capital gains and business net income by zip code
-
Finally, create a combo metric that represents the total capital gains and business net income by zip code. This is weighted very strongly by capital gains as we can see in the plot:
%sql
SELECT zipcode,
  SUM(biz_net_income) as business_net_income,
  SUM(net_capital_gains) as capital_gains,
  SUM(net_capital_gains) + SUM(biz_net_income) as capital_and_business_income
FROM cleaned_taxes
WHERE NOT (zipcode = 0000 OR zipcode = 9999)
GROUP BY zipcode
ORDER BY capital_and_business_income DESC
LIMIT 50
Change the visualization to the Bar chart and set the following plot options:
The outcome should look similar to the following:
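Outside of Databricks, the same combined-metric aggregation can be sketched against an in-memory SQLite table. The rows here are made up for illustration, with zip codes 0 and 9999 standing in for the catch-all codes the query filters out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cleaned_taxes (zipcode INT, biz_net_income REAL, net_capital_gains REAL)"
)
# Made-up rows for illustration; 0 and 9999 are the catch-all zip codes
# excluded by the WHERE clause below.
conn.executemany(
    "INSERT INTO cleaned_taxes VALUES (?, ?, ?)",
    [(3500, 1657, 4), (3500, 788, 54), (9410, 5000, 20000), (0, 99, 99), (9999, 99, 99)],
)

rows = conn.execute("""
    SELECT zipcode,
           SUM(biz_net_income) AS business_net_income,
           SUM(net_capital_gains) AS capital_gains,
           SUM(net_capital_gains) + SUM(biz_net_income) AS capital_and_business_income
    FROM cleaned_taxes
    WHERE NOT (zipcode = 0 OR zipcode = 9999)
    GROUP BY zipcode
    ORDER BY capital_and_business_income DESC
    LIMIT 50
""").fetchall()

print(rows[0])  # (9410, 5000.0, 20000.0, 25000.0)
```

Each result row carries both components plus their sum, which is what lets the bar chart stack business income and capital gains against the combined total.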
-
To create your dashboard, select View from the top menu, then + New Dashboard.
-
A settings pane for the dashboard will slide out from the right. Type a name for the dashboard, set the width and layout options, then select Present Dashboard.
-
Presentation mode displays your dashboard in a full window view. The charts are still interactive, in that you can hover over their elements to view data labels and other information. You now have a business intelligence dashboard with great visualizations of your data!
While Azure Databricks is built for collaboration, making it easy for data scientists, engineers, architects, and developers to share and work with data, there are times when you need to go outside the environment to give others access to your data. A good example is business leaders who want to view an executive summary, or business analysts who want to create ad hoc reports but do not have the ability or desire to write code. In other cases, you want visualizations that you can embed in a web page and that refresh automatically as new data becomes available. In all of these cases, an external BI system makes a lot of sense. One option is Power BI, Microsoft's suite of business analytics tools. With it, you can create individual reports from datasets that draw on Azure Databricks tables, as well as static or live (auto-refreshing) dashboards, all of which can be consumed on the web or across mobile devices. Users can create their own personalized dashboards, and you can embed any dashboard or report into a web application of your choosing.
What you gain from using a BI system such as Power BI is a way for your users to work with the data you provide through Azure Databricks without ever needing to log in to the workspace. You can perform all of the data wrangling and processing within your Databricks notebooks, store the data in tables, then connect those tables directly to Power BI. Within Power BI, you have the option to copy data from those tables and store it within Power BI, or to link to the tables dynamically so that Azure Databricks provides the data in real time.
Here is a walkthrough on how to connect Azure Databricks tables to Power BI and build some powerful visualizations without the end user writing any code.
In this scenario, a travel agency has created a machine learning model in Azure Databricks, using the Spark Machine Learning Library (MLlib). They regularly batch process scheduled flight data, joined with weather predictions, by cleaning and transforming the combined data sets and scoring the output with a trained machine learning model. The last step in the data pipeline is to create a global table that summarizes the scored flight delay predictions. They compare this data to historical flight delay data as well, and regularly retrain their model. In this case, they want to analyze scheduled flights for their customers and visualize the likelihood of flight delays, as well as the flights that are most often delayed given the time of year and weather conditions.
Here is a screenshot of the data they want to use in Power BI:
The next step is to save this query output to a global table.
summary = spark.sql("""
  SELECT OriginAirportCode, Month, DayofMonth, CRSDepHour,
    SUM(prediction) AS NumDelays,
    CONCAT(Latitude, ',', Longitude) AS OriginLatLong
  FROM scoredflights s
  INNER JOIN airport_code_location_lookup_clean a
    ON s.OriginAirportCode = a.Airport
  GROUP BY OriginAirportCode, OriginLatLong, Month, DayofMonth, CRSDepHour
  HAVING SUM(prediction) > 1
  ORDER BY NumDelays DESC""")
summary.write.mode("overwrite").saveAsTable("flight_delays_summary")
Before you begin, you must first obtain the JDBC connection string to your Azure Databricks cluster.
-
In Azure Databricks, go to Clusters and select your cluster.
-
On the cluster edit page, scroll down and select the JDBC/ODBC tab.
-
On the JDBC/ODBC tab, copy and save the JDBC URL.
-
Construct the JDBC server address that you will use when you set up your Spark cluster connection in Power BI Desktop.
-
Take the JDBC URL that you copied and saved in step 3 and do the following:
-
Replace jdbc:hive2 with https.
-
Remove everything in the path between the port number and sql, so that the address runs directly from the port into the path segment that begins with sql.
- In our example, the server address would be:
https://eastus.azuredatabricks.net:443/sql/protocolv1/o/1707858429329790/0614-124738-doubt405 or https://eastus.azuredatabricks.net:443/sql/protocolv1/o/1707858429329790/lab (if you choose the aliased version)
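The two edits above can be sketched as a small string transformation. The JDBC URL shape used here is a hypothetical illustration of the common Databricks form (host and port, then connection properties including an httpPath that begins with sql/); your actual URL may differ:

```python
import re

def jdbc_to_server_address(jdbc_url: str) -> str:
    """Turn a Databricks JDBC URL into the HTTPS server address Power BI
    expects: swap jdbc:hive2 for https and keep only the host:port plus
    the path segment that begins with sql."""
    # Hypothetical URL shape, for illustration:
    # jdbc:hive2://<host>:<port>/default;transportMode=http;ssl=true;httpPath=sql/...
    m = re.match(r"jdbc:hive2://([^/;]+)/.*?httpPath=(sql/\S+)", jdbc_url)
    if not m:
        raise ValueError("unrecognized JDBC URL")
    host_port, http_path = m.groups()
    return f"https://{host_port}/{http_path}"

example = ("jdbc:hive2://eastus.azuredatabricks.net:443/default;"
           "transportMode=http;ssl=true;"
           "httpPath=sql/protocolv1/o/1707858429329790/0614-124738-doubt405")
print(jdbc_to_server_address(example))
# https://eastus.azuredatabricks.net:443/sql/protocolv1/o/1707858429329790/0614-124738-doubt405
```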
-
Install and launch Power BI Desktop.
-
When Power BI Desktop opens, you will need to enter your personal information, or Sign in if you already have an account.
-
Select Other from the left, and select Spark (Beta) from the list of available data sources.
-
Select Connect.
-
You will receive a prompt warning you that the Spark connector is still in preview. Select Continue.
-
On the next screen, you will be prompted for your Spark cluster information.
-
Paste the JDBC connection string you constructed a few steps ago into the Server field.
-
Select the HTTP protocol.
-
Select DirectQuery for the Data Connectivity mode, and select OK. This option will offload query tasks to the Azure Databricks Spark cluster, providing near-real time querying.
-
Enter your credentials on the next screen as follows.
a. User name: token
b. Password: Create a new personal access token. Paste the new token here.
-
Select Connect.
-
In the Navigator dialog, check the box next to flight_delays_summary, and select Load.
-
It will take several minutes for the data to load into the Power BI Desktop client.
-
Once the data finishes loading, you will see the fields appear on the far right of the Power BI Desktop client window.
-
From the Visualizations area, next to Fields, select the Globe icon to add a Map visualization to the report design surface.
-
With the Map visualization still selected, drag the OriginLatLong field to the Location field under Visualizations. Next, drag the NumDelays field to the Size field under Visualizations.
-
You should now see a map that looks similar to the following (resize and zoom on your map if necessary):
-
Unselect the Map visualization by clicking on the white space next to the map in the report area.
-
From the Visualizations area, select the Stacked Column Chart icon to add a bar chart visual to the report's design surface.
-
With the Stacked Column Chart still selected, drag the DayofMonth field and drop it into the Axis field located under Visualizations.
-
Next, drag the NumDelays field over, and drop it into the Value field.
-
Grab the corner of the new Stacked Column Chart visual on the report design surface, and drag it out to make it as wide as the bottom of your report design surface. It should look something like the following.
-
Unselect the Stacked Column Chart visual by clicking on the white space next to the map on the design surface.
-
From the Visualizations area, select the Treemap icon to add this visualization to the report.
-
With the Treemap visualization selected, drag the OriginAirportCode field into the Group field under Visualizations.
-
Next, drag the NumDelays field over, and drop it into the Values field.
-
Grab the corner of the Treemap visual on the report design surface, and expand it to fill the area between the map and the right edge of the design surface. The report should now look similar to the following.
-
You can cross filter any of the visualizations on the report by clicking on one of the other visuals within the report, as shown below. (This may take a few seconds to change, as the data is loaded.)
-
You can save the report by clicking Save from the File menu, and entering a name and location for the file.
-
Once this report is saved, you can import it into the Power BI online service, so others can access the data and the reports can be embedded elsewhere.
This guide walked you through using Azure Databricks dashboards and Power BI as business intelligence tools. Refer to the Azure Databricks documentation to learn how to use other business intelligence tools, such as Tableau, Alteryx, Datadog, Looker, and SQL Workbench/J.