
Business intelligence and data visualization

The Apache Spark engine, the underlying platform for Azure Databricks, helps you ingest, manipulate, and display raw data with incredible speed. Business leaders increasingly recognize the importance of data-driven decision making, and the competitive advantage that a data-oriented mindset can provide. The fact that you are using Azure Databricks at all is likely the result of this realization, whether by you or by a decision maker within your organization.

What is business intelligence, and why should I care?

Business intelligence, often referred to as BI, is a process for collecting and analyzing data with the goal of extracting actionable insights. These insights inform business decisions backed by data analysis, and that analysis is most often communicated through visualizations.

What are data visualizations in this context?

Data visualization is a visual snapshot of data that can be either static or interactive. It usually takes the form of a chart, map, or graph that helps you process the underlying raw data much more rapidly and naturally. Our brains are wired to process visual information: studies suggest that roughly 90% of the information our brains process and store is visual, because the brain interprets visual data far more quickly and efficiently than other forms, such as text. Often-cited studies even claim that we process images 60,000 times faster than text. This is why analysts and business decision makers rely so heavily on visuals: they are the most effective and efficient way to process and share raw data.

Here's an example we will use to drive home the point. Below is a subset of income tax data extracted from flat text files and stored in a Spark DataFrame. There are millions of rows of data, broken down by state.

| state | zipcode | single_returns | joint_returns | numdep | total_income_amount | taxable_interest_amount | net_capital_gains | biz_net_income |
|-------|---------|----------------|---------------|--------|---------------------|-------------------------|-------------------|----------------|
| AL | 0 | 488030 | 122290 | 571240 | 11444868 | 77952 | 23583 | 824487 |
| AL | 0 | 195840 | 155230 | 383240 | 17810952 | 81216 | 54639 | 252768 |
| AL | 0 | 72710 | 146880 | 189340 | 16070153 | 80627 | 84137 | 259836 |
| AL | 0 | 24860 | 126480 | 134370 | 14288572 | 71086 | 105947 | 214668 |
| AL | 0 | 16930 | 168170 | 177800 | 26053920 | 149150 | 404166 | 567439 |
| AL | 0 | 3530 | 42190 | 48270 | 20752068 | 271416 | 1569967 | 822565 |
| AL | 3500 | 950 | 260 | 710 | 19851 | 183 | 4 | 1657 |
| AL | 3500 | 590 | 410 | 860 | 49338 | 172 | 54 | 788 |
| AL | 3500 | 290 | 490 | 620 | 56170 | 185 | 139 | 584 |
| AL | 3500 | 90 | 490 | 530 | 52977 | 89 | 173 | 339 |
| AL | 3500 | 40 | 460 | 450 | 64329 | 205 | 709 | 1720 |
| AL | 3500 | 0 | 40 | 30 | 15359 | 130 | 0 | 0 |
| AL | 3500 | 800 | 190 | 860 | 19011 | 135 | -1 | 840 |
| AL | 3500 | 360 | 300 | 850 | 36996 | 87 | 42 | -193 |
| AL | 3500 | 140 | 230 | 340 | 28664 | 95 | 22 | 233 |
| AL | 3500 | 50 | 170 | 200 | 19583 | 38 | 62 | -48 |
| AL | 3500 | 0 | 160 | 170 | 25136 | 67 | 48 | 84 |
| AL | 3500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AL | 3500 | 250 | 110 | 200 | 5515 | 26 | 0 | 543 |
| AL | 3500 | 110 | 150 | 230 | 11906 | 32 | 0 | 188 |
| AL | 3500 | 40 | 130 | 160 | 11854 | 17 | 6 | 49 |
| AL | 3500 | 20 | 110 | 130 | 10911 | 6 | 0 | -33 |
| AL | 3500 | 0 | 130 | 140 | 18399 | 14 | 0 | 0 |
| AL | 3500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AL | 3500 | 2720 | 620 | 2360 | 52779 | 260 | 18 | 5331 |
| AL | 3500 | 1150 | 830 | 2100 | 98951 | 306 | 218 | 3335 |
| AL | 3500 | 530 | 1050 | 1610 | 116810 | 261 | 382 | 2197 |

For context, here is the data source:

// Read the 2013 IRS zip code income data (CSV with a header row) into a DataFrame
val taxes2013 = spark
  .read.format("csv")
  .option("header", "true")
  .load("dbfs:/databricks-datasets/data.gov/irs_zip_code_data/data-001/2013_soi_zipcode_agi.csv")

What is the quickest way to determine, from this data, the average income for each state? Let's display the same data on a map with each state shaded according to its average income. At a glance, it is easy to see which states have a higher income and the overall spread of income across the United States.

Screenshot of a map of the United States, with each state shaded to indicate the average income range

This map visualization is one of the default options that come with Azure Databricks out of the box. All notebooks, regardless of their language, support Databricks visualizations through the display function, which can render multiple data types. Whereas other notebook engines provide only a handful of basic visualizations, Azure Databricks includes several out of the box that would traditionally require an external library such as matplotlib. If you wish to augment the defaults with external libraries, you are free to do so.
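As a minimal sketch (assuming a DataFrame named cleanedTaxes with the state and total_income_amount columns pictured in the table above; a table like this is created later in this article), the following cell produces the shaded map once you pick Map from the chart selector beneath the output:

import org.apache.spark.sql.functions.avg

// Assumption: cleanedTaxes holds the columns shown in the table above
display(cleanedTaxes.groupBy("state").agg(avg("total_income_amount")))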

These visualizations are suitable for many situations, but sometimes you need to pull in a 3rd-party library or package. Here's an example of referencing the d3a package. You can find out more about using 3rd-party and custom packages in Creating and Using 3rd-party and Local Libraries. This example also uses the 3rd-party graphframes package to provide DataFrame-based graphs.

// On-time and Early Arrivals
import d3a._

graphs.force(
  height = 800,
  width = 1200,
  clicks = sql("select src, dst as dest, count(1) as count from departureDelays_geo where delay <= 0 group by src, dst").as[Edge])

The output of the code above is an interactive visualization displayed right in the notebook. You can hover over a circle on the map to view the vertices and edges that represent the relationship between origin and destination airports.

Screenshot of an interactive map displaying the relationships between origin and destination airports, using the d3a package

Business intelligence and Azure Databricks

Azure Databricks gives you the ability to create dashboards that make it easy to publish your visualizations into a presentation format. This capability allows you to hide the underlying code so viewers can just focus on the visualizations and labeled data.

For our example, we will continue using the IRS Tax Statistics dataset from above. You can follow along, because the files are already available in the /databricks-datasets directory, a repository of public, Azure Databricks-hosted datasets available in every Azure Databricks workspace.
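As a quick sketch, you can browse the dataset's folder directly from a notebook cell using the built-in dbutils file system utilities:

// List the IRS dataset files that ship with every workspace
display(dbutils.fs.ls("/databricks-datasets/data.gov/irs_zip_code_data/data-001"))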

  1. First, create a new Scala notebook in the Azure Databricks dashboard.

  2. In the first cell, add the following code snippet to display a title using markdown format:

    %md # Income Tax Analysis Dashboard
    
  3. In the next cell, create a new DataFrame with the tax data you will use for the dashboard.

    val taxes2013 = spark
      .read.format("csv")
      .option("header", "true")
      .load("dbfs:/databricks-datasets/data.gov/irs_zip_code_data/data-001/2013_soi_zipcode_agi.csv")
  4. In a new cell, create a temporary table from the DataFrame:

    taxes2013.createOrReplaceTempView("taxes2013")
  5. If you look at the contents of the data, you will notice that it contains a large number of columns, many of which have unfamiliar names. For the purposes of the dashboard, select a subset of the columns into a new table, and shorten the zip codes from five digits to four to widen the geographic area each row represents:

    %sql
    DROP TABLE IF EXISTS cleaned_taxes;
    
    CREATE TABLE cleaned_taxes AS
    SELECT state, int(zipcode / 10) as zipcode,
      int(mars1) as single_returns,
      int(mars2) as joint_returns,
      int(numdep) as numdep,
      double(A02650) as total_income_amount,
      double(A00300) as taxable_interest_amount,
      double(a01000) as net_capital_gains,
      double(a00900) as biz_net_income
    FROM taxes2013
  6. Paste the following markdown into a new cell as a sub-header for the diagram that follows:

    %md ## Average income tax by state
    
  7. Now create a new DataFrame from the new table, and change the display visualization to "Map":

    val cleanedTaxes = spark.table("cleaned_taxes")
    display(cleanedTaxes.groupBy("state").avg("total_income_amount"))

    Your output should look as follows:

    Screenshot of the map visualization

    Select Plot options... and make sure it is configured as follows:

    Custom plot settings

  8. Paste the following markdown into a new cell as a sub-header for the diagram that follows:

    %md ## Zip codes with lowest total capital gains
    
  9. To display the set of zip codes with the lowest total capital gains, enter the following in a new cell and change the visualization to the Bar chart:

    %sql
    SELECT zipcode, SUM(net_capital_gains) AS cap_gains
    FROM cleaned_taxes
      WHERE NOT (zipcode = 0000 OR zipcode = 9999)
    GROUP BY zipcode
    ORDER BY cap_gains ASC
    LIMIT 10

    Your output should look as follows:

    Screenshot of the chart visualization

  10. Paste the following markdown into a new cell as a sub-header for the diagram that follows:

    %md ## Total capital gains and business net income by zip code
    
  11. Finally, create a combo metric that represents the total capital gains and business net income by zip code. As the plot shows, this metric is weighted very strongly by capital gains:

    %sql
    SELECT zipcode,
      SUM(biz_net_income) as business_net_income,
      SUM(net_capital_gains) as capital_gains,
      SUM(net_capital_gains) + SUM(biz_net_income) as capital_and_business_income
    FROM cleaned_taxes
      WHERE NOT (zipcode = 0000 OR zipcode = 9999)
    GROUP BY zipcode
    ORDER BY capital_and_business_income DESC
    LIMIT 50

    Change the visualization to the Bar chart and set the following plot options:

    Screenshot showing the plot options for the bar chart

    The outcome should look similar to the following:

    Screenshot showing the resulting bar chart

  12. To create your dashboard, select View from the top menu, then + New Dashboard.

    Select View on top of the page, then select New Dashboard

  13. A settings pane for the dashboard will slide out from the right. Type a name for the dashboard, set the width and layout options, then select Present Dashboard.

    Supply a name for the dashboard, select the width and layout, then select Present Dashboard

  14. Presentation mode displays your dashboard in a full window view. The charts are still interactive, in that you can hover over their elements to view data labels and other information. You now have a business intelligence dashboard with great visualizations of your data!

    Dashboard in presentation mode

Using external BI tools and services with Azure Databricks

While Azure Databricks is built for collaboration, making it easy for data scientists, engineers, architects, and developers to share and work with data, there are times when you need to give people outside the environment access to your data. A good example is business leaders who want to view an executive summary, or business analysts who want to create ad-hoc reports but lack the ability or desire to write code. In other cases, you want visualizations that you can embed in a web page and that automatically refresh as new data becomes available. In all of these cases, an external BI system makes a lot of sense.

One option is Power BI, Microsoft's premier suite of business analytics tools. With it, you can create individual reports from datasets that draw from Azure Databricks tables, as well as static or live (auto-refreshing) dashboards, all of which can be consumed on the web or across mobile devices. Users can create their own personalized dashboards, and you can embed any dashboard or report into a web application of your choosing.

What you gain from a BI system such as Power BI is a way for your users to work with the data you provide through Azure Databricks without ever logging in to the workspace. You can perform all of the data wrangling and processing within your Databricks notebooks, store the results in tables, then connect those tables directly to Power BI. Within Power BI, you can either copy the data from those tables into Power BI's own storage, or link to the tables dynamically so that Azure Databricks serves the data in real time.
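As a minimal sketch of that last notebook step (the table name here is hypothetical, and cleanedTaxes is the DataFrame created earlier), persisting a DataFrame as a managed table makes it queryable by external BI tools over JDBC/ODBC:

// Persist the DataFrame as a managed table visible to external BI tools
// ("cleaned_taxes_for_bi" is a hypothetical table name)
cleanedTaxes.write.mode("overwrite").saveAsTable("cleaned_taxes_for_bi")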

Here is a walkthrough on how to connect Azure Databricks tables to Power BI and build some powerful visualizations without the end user writing any code.

Scenario

In this scenario, a travel agency has created a machine learning model in Azure Databricks using the Spark Machine Learning library (MLlib). They regularly batch process scheduled flight data joined with weather predictions, cleaning and transforming the combined data sets and scoring the output with a trained machine learning model. The last step in the data pipeline is to create a global table that summarizes the scored flight delay predictions. They also compare this data to historical flight delay data and regularly retrain their model. In this case, they want to analyze scheduled flights for their customers and visualize the likelihood of flight delays, as well as the flights that are most often delayed given the time of year and weather conditions.
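The scoring step of such a pipeline might look like the following sketch, where the model path and the flightsWithWeather DataFrame are assumptions based on the scenario above; the scoredflights view is the one queried in the snippet further below:

import org.apache.spark.ml.PipelineModel

// Assumptions: the trained model was saved to this path, and flightsWithWeather
// is the cleaned, joined flight and weather DataFrame described above
val model = PipelineModel.load("dbfs:/models/flight-delay-model")
val scored = model.transform(flightsWithWeather)
scored.createOrReplaceTempView("scoredflights")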

Here is a screenshot of the data they want to use in Power BI:

Screenshot of the raw flight delay data

The next step is to save this query output to a global table.

summary = spark.sql("""
  SELECT OriginAirportCode, Month, DayofMonth, CRSDepHour,
    SUM(prediction) AS NumDelays,
    CONCAT(Latitude, ',', Longitude) AS OriginLatLong
  FROM scoredflights s
  INNER JOIN airport_code_location_lookup_clean a
    ON s.OriginAirportCode = a.Airport
  GROUP BY OriginAirportCode, OriginLatLong, Month, DayofMonth, CRSDepHour
  HAVING SUM(prediction) > 1
  ORDER BY NumDelays DESC""")

summary.write.mode("overwrite").saveAsTable("flight_delays_summary")

Obtaining the JDBC connection string to your Azure Databricks cluster

Before you begin, you must first obtain the JDBC connection string to your Azure Databricks cluster.

  1. In Azure Databricks, go to Clusters and select your cluster.

  2. On the cluster edit page, scroll down and select the JDBC/ODBC tab.

    Select the JDBC/ODBC tab

  3. On the JDBC/ODBC tab, copy and save the JDBC URL.

    • Construct the JDBC server address that you will use when you set up your Spark cluster connection in Power BI Desktop.

    • Take the JDBC URL that you copied and saved in step 3 and do the following:

    • Replace jdbc:hive2 with https.

    • Remove everything in the path between the port number and sql, retaining the components indicated by the boxes in the image below.

    Select the parts to create the Power BI connection string

    • In our example, the server address would be:

    https://eastus.azuredatabricks.net:443/sql/protocolv1/o/1707858429329790/0614-124738-doubt405 or https://eastus.azuredatabricks.net:443/sql/protocolv1/o/1707858429329790/lab (if you choose the aliased version)

Connect to Azure Databricks using Power BI Desktop

  1. Install and launch Power BI Desktop.

  2. When Power BI Desktop opens, you will need to enter your personal information, or Sign in if you already have an account.

    The Power BI Desktop Welcome page displays.

  3. Select Get data on the screen that is displayed next.

    On the Power BI Desktop Sign in page, in the left pane, Get data is selected.

  4. Select Other from the left, and select Spark (Beta) from the list of available data sources.

    In the left pane of the Get Data page, Other is selected. In the right pane, Spark (Beta) is selected.

  5. Select Connect.

  6. You will receive a prompt warning you that the Spark connector is still in preview. Select Continue.

    A warning reminds you that the app is still under development.

  7. On the next screen, you will be prompted for your Spark cluster information.

  8. Paste the server address you constructed from the JDBC URL a few steps ago into the Server field.

  9. Select the HTTP protocol.

  10. Select DirectQuery for the Data Connectivity mode, and select OK. This option offloads query tasks to the Azure Databricks Spark cluster, providing near-real-time querying.

    Configure your connection to the Spark cluster

  11. Enter your credentials on the next screen as follows.

    a. User name: token

    b. Password: Create a new personal access token. Paste the new token here.

    Enter "token" for the user name and paste user token into the password field

  12. Select Connect.

  13. In the Navigator dialog, check the box next to flight_delays_summary, and select Load.

    In the Navigator dialog box, in the left pane under Display Options, the check box for flight_delays_summary is selected. In the right pane, the table of flight delays summary information displays.

  14. It will take several minutes for the data to load into the Power BI Desktop client.

Create Power BI report

  1. Once the data finishes loading, you will see the fields appear on the far right of the Power BI Desktop client window.

    Power BI Desktop fields

  2. From the Visualizations area, next to Fields, select the Globe icon to add a Map visualization to the report design surface.

    On the Power BI Desktop Visualizations palette, the globe icon is selected.

  3. With the Map visualization still selected, drag the OriginLatLong field to the Location field under Visualizations. Next, drag the NumDelays field to the Size field under Visualizations.

    In the Fields column, the check boxes for NumDelays and OriginLatLong are selected. An arrow points from OriginLatLong in the Fields column, to OriginLatLong in the Visualization's Location field. A second arrow points from NumDelays in the Fields column, to NumDelays in the Visualization's Size field.

  4. You should now see a map that looks similar to the following (resize and zoom on your map if necessary):

    On the Report design surface, a Map of the United States displays with varying-sized dots over different cities.

  5. Unselect the Map visualization by clicking on the white space next to the map in the report area.

  6. From the Visualizations area, select the Stacked Column Chart icon to add a bar chart visual to the report's design surface.

    The stacked column chart icon is selected on the Visualizations palette.

  7. With the Stacked Column Chart still selected, drag the DayofMonth field and drop it into the Axis field located under Visualizations.

  8. Next, drag the NumDelays field over, and drop it into the Value field.

    In the Fields column, the check boxes for NumDelays and DayofMonth are selected. An arrow points from DayofMonth in the Fields column, to DayofMonth in the Visualization's Axis field. A second arrow points from NumDelays in the Fields column, to NumDelays in the Visualization's Value field.

  9. Grab the corner of the new Stacked Column Chart visual on the report design surface, and drag it out to make it as wide as the bottom of your report design surface. It should look something like the following.

    On the Report Design Surface, under the map of the United States with dots, a stacked bar chart displays.

  10. Unselect the Stacked Column Chart visual by clicking on the white space next to it on the design surface.

  11. From the Visualizations area, select the Treemap icon to add this visualization to the report.

    On the Visualizations palette, the Treemap icon is selected.

  12. With the Treemap visualization selected, drag the OriginAirportCode field into the Group field under Visualizations.

  13. Next, drag the NumDelays field over, and drop it into the Values field.

    In the Fields column, the check boxes for NumDelays and OriginAirportCode are selected. An arrow points from NumDelays in the Fields column, to NumDelays in the Visualization's Values field. A second arrow points from OriginAirportCode in the Fields column, to OriginAirportCode in the Visualization's Group field.

  14. Grab the corner of the Treemap visual on the report design surface, and expand it to fill the area between the map and the right edge of the design surface. The report should now look similar to the following.

    The Report design surface now displays the map of the United States with dots, a stacked bar chart, and a treemap.

  15. You can cross filter any of the visualizations on the report by clicking on one of the other visuals within the report, as shown below. (This may take a few seconds to change, as the data is loaded.)

    The map on the Report design surface is now zoomed in on the northeast section of the United States, and the only dot on the map is on Chicago. In the treemap, all cities except ORD are grayed out. In the stacked bar graph, each bar is now divided into a darker and a lighter color, with the darker color representing the airport.

  16. You can save the report by selecting Save from the File menu and entering a name and location for the file.

  17. Once this report is saved, you can import it into the Power BI online service, so others can access the data and the reports can be embedded elsewhere.

Using other BI tools

This guide walked you through using Azure Databricks dashboards and Power BI as business intelligence tools. Refer to the Azure Databricks documentation to learn how to use other business intelligence tools, such as Tableau, Alteryx, Datadog, Looker, and SQL Workbench/J.