Skip to content
Brian Femiano edited this page May 4, 2013 · 29 revisions

Build

Before you can build this storage handler, checkout and build Accumulo from the latest source. svn co https://svn.apache.org/repos/asf/accumulo/trunk/ then mvn clean install to get 1.6.0 installed in your local repo. This will have to do until Accumulo 1.5+ is hosted in maven central. Alternatively you could install Accumulo 1.5 in your local m2 repository and change the storage handler pom to use those dependencies.

Hive Syntax

It's important to have all the Required Accumulo jars on the Hive AUX_JARS path. Also make sure you are running at least Accumulo 1.5 and the latest Hive 0.10 to avoid Thrift incompatibilities.

Use the CREATE EXTERNAL syntax to define a link to the Accumulo table in the Hive metastore.

CREATE EXTERNAL TABLE footable ...

You can define as may Hive metastore entries for the same Accumulo table as you like. So long as the names are distinct in Hive. (footable, footable2, etc.)

The below HQL defines a Hive table named hivetable1 that maps to an Accumulo table 'my_accumulo_table'.

CREATE EXTERNAL TABLE hivetable1(name STRING, age INT, rowid STRING)
 STORED BY 'org.apache.accumulo.storagehandler.AccumuloStorageHandler'
 WITH SERDEPROPERTIES ('accumulo.columns.mapping' = 'cf|name,cf|age,rowID',
       'accumulo.table.name' = 'my_accumulo_table');

The properties accumulo.columns.mapping and accumulo.table.name are required for creating external mappings. Otherwise you'll get an error during table creation.

name and age map to Accumulo column-family/qualifiers cf|name, and cf|age respectively. Column-family,qualifier pairs must be pipe separated. Each pair must appear in a comma-separated string in accumulo.columns.mapping.

The example also has one additional Hive column to display the rowID, which must be listed explicity as 'rowID' in the property accumulo.columns.mapping. The number of hive columns defined must equal the number of columy-family|qual pairs + 1 for the rowID (if present). rowID is not required to be mapped.

The order in which column-family|qualifier pairs and the rowID appear in accumulo.columns.mapping dictates which Hive column they will be mapped to. It's very important the Hive column types match the value byte [] types stored in Accumulo. For example, for a Hive column with type INT mapped to cf|f1 in Accumulo, the byte [] value stored at the key should represent a 4 byte integer, otherwise the display and filtering will not behavior properly. More flexible support for type hints is a work in progress.

From here you can issue normal SELECT statements and get results to the Hive shell.

SELECT * from hivetable1;

SELECT name,age from hivetable1 WHERE name = 'brian';

Hive rows map to Accumulo key/value pairs on a per-rowID basis. The latest visible version of each mapped qualifier is displayed. If no user-visible key containing a given qualifier exists for a rowID, null is displayed for that column of that row in Hive.

In other words, a Hive row consists of a flatted view of the mapped column-family/qualifier pairs on file for a given rowID in the table. The value returned from Accumulo corresponds to the most recent timestamp the user is authorized to see.

For all operations over Accumulo-mapped Hive tables including CREATE EXTERNAL, you must supply -hiveconf variables for the important Accumulo connection properties (instance id, user, pass, zoohosts).

One easy way is to configure a shell script that passes all the properties directly to the Hive client at launch.

hive -hiveconf accumulo.instance.id=test \
      -hiveconf accumulo.user.name=brian \
      -hiveconf accumulo.user.pass=password \
      -hiveconf accumulo.zookeepers=localhost:2181 \
      -f my_hive_script.sql

For actual usage references, see query_acled.sql and query_acled.sh

See How Iterator Predicate pushdown works for a better understanding of how Iterators and Ranges are programmatically derived from WHERE clauses.