Rich Hagarty edited this page Nov 10, 2017 · 9 revisions

Short Name

Using Spark SQL to access NoSQL HBase Tables

Short Description

Learn how to use Spark SQL and the HSpark connector package to create and query data tables that reside in HBase region servers

Offering Type

Cognitive

Introduction

This journey is intended to give application developers familiar with SQL the ability to access NoSQL HBase data tables using SQL commands. You will quickly learn how to create and query the data tables by using Apache Spark SQL and the HSpark connector package.

Authors

By Bo Meng and Rich Hagarty

Code

Demo

  • N/A

Video

Overview

Apache Spark is an open source big data processing engine that is built for speed, ease of use, and sophisticated analytics.

Apache HBase is a popular open source, NoSQL distributed database that runs on top of the Hadoop Distributed File System (HDFS). It is well suited for fast read/write operations on large datasets, offering high throughput and low input/output latency. Like Spark, HBase is built for fast processing of large amounts of data. The combination of Spark and HBase has become a very popular solution for handling big data applications.

But, unlike relational and traditional databases, HBase lacks support for SQL scripting, data types, and similar features, and instead requires the Java API to achieve equivalent functionality. That is not a good option if you want to manage and access your data with SQL.

A solution to this problem is HSpark, which connects to Spark and enables Spark SQL commands to be executed against an HBase data store.

This journey is intended to give application developers familiar with SQL the ability to access HBase data tables using those same SQL commands. You will quickly learn how to create and query the data tables by using Apache Spark SQL and the HSpark connector package. This allows you to take advantage of the significant performance gains of HBase without having to learn the Java APIs traditionally required to access HBase data tables.

HSpark provides a new approach to supporting HBase. It leverages the unified big data processing engine of Spark, while also providing native SQL access to HBase data tables.
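To make this concrete, the query below is an illustrative sketch of the kind of Spark SQL that HSpark makes possible against an HBase-backed table. The table and column names are hypothetical, and the exact HSpark dialect (table properties, key-column mappings) is covered later in the setup steps and may differ in detail.

```sql
-- Illustrative only: "sales" and its columns are hypothetical names.
-- With HSpark, a query like this runs directly against data stored
-- in HBase, with no Java client code required.
SELECT product, SUM(amount) AS total
FROM sales
GROUP BY product
ORDER BY total DESC;
```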

When you have completed this journey, you will understand how to:

  • Install and configure Apache Spark and the HSpark connector.
  • Create metadata for tables in Apache HBase.
  • Write Spark SQL queries to retrieve HBase data for analysis.

Flow

  1. Set up the environment (Apache Spark, Apache HBase, and HSpark).
  2. Create the tables using HSpark.
  3. Load the data into the tables.
  4. Query the data using HSpark Shell.
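Steps 2 through 4 can be sketched in SQL as follows. This is a hypothetical outline only: the table name, columns, and any data-loading syntax are assumptions for illustration, not the connector's verbatim API.

```sql
-- Hypothetical sketch of Flow steps 2-4; names and exact HSpark
-- DDL syntax are assumptions, not verbatim connector commands.

-- 2. Create a table (HSpark maps SQL columns onto HBase storage).
CREATE TABLE sales (id INTEGER, product STRING, amount INTEGER);

-- 3. Load data into the table (HSpark supports loading from
--    delimited files; the exact statement is shown in the setup docs).

-- 4. Query the data from the HSpark Shell.
SELECT product, SUM(amount) FROM sales GROUP BY product;
```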

Included components

  • Apache Spark: An open-source, fast and general-purpose cluster computing system.
  • Apache HBase: A distributed key/value data store built to run on top of HDFS.
  • HSpark: Provides access to HBase using Spark SQL.

Featured technologies

  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
  • Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.

Blog

Links