bfemiano/accumulo-hive-storage-manager

Query data stored in Accumulo tables directly with HiveQL.

Pertains to issue: https://issues.apache.org/jira/browse/ACCUMULO-143

Currently does not work with Hadoop 2.0/CDH4.

ACLED examples:

Prerequisites: $ACCUMULO_HOME/bin, $HADOOP_HOME/bin, and $HIVE_HOME/bin must be on your PATH, and either wget or curl must be installed.

The query examples use a cleaned-up version of the structured ACLED Nigeria dataset (http://www.acleddata.com/).

  1. Navigate to src/test/hql/acled and run ingest.sh. The script creates and loads both the Hive table 'acled_nigeria' and the Accumulo table 'acled'. The ETL and data for both processes run standalone from the ingest directory.

  2. See query_acled.sql for a CREATE EXTERNAL TABLE example, the required aux jars, and several sample queries that use both the Hive and Accumulo tables. The number of Hive columns in the table definition must equal the number of entries in accumulo.column.mapping.

  3. Run query_acled.sh to see the different query results. Make sure to configure the -hiveconf variables for your local Accumulo instance.
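
The column-count rule in step 2 can be illustrated with a small check. This is only a sketch, not code from this repository: it assumes accumulo.column.mapping is a comma-separated list with one entry per Hive column, and the column names and the `cf|qualifier` entry syntax used below are hypothetical placeholders.

```python
def check_mapping(hive_columns, column_mapping):
    """Pair each Hive column with its mapping entry.

    Assumption: column_mapping is a comma-separated string with one
    entry per Hive column. Entries are treated as opaque strings.
    Raises ValueError when the two counts differ.
    """
    entries = [e.strip() for e in column_mapping.split(",") if e.strip()]
    if len(entries) != len(hive_columns):
        raise ValueError(
            "table defines %d Hive columns but the mapping has %d entries"
            % (len(hive_columns), len(entries))
        )
    return list(zip(hive_columns, entries))


# Hypothetical column names and mapping entries, for illustration only.
pairs = check_mapping(["year", "fatalities"], "cf|year, cf|fatalities")
for hive_column, entry in pairs:
    print(hive_column, "<-", entry)
```

Running a check like this before issuing CREATE EXTERNAL TABLE would catch a mismatched mapping early, rather than at query time.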

Known limitations:

  • Requires Hive 0.10 and Accumulo 1.5+, both of which use Thrift 0.9. Other versions cause binary incompatibilities.
  • Requires Hadoop 1.0/0.20.2x/CDH3.
  • Supported Hive column types are limited to int, double, string, and bigint.
  • Hive column type mapping assumes value type consistency for the same qualifier across different rows. For example, r1/cf/q/v cannot hold an int while r2/cf/q/v is a double.
  • The Hive column types must match the Accumulo value types. For example, an Accumulo value holding integer bytes should be mapped to a Hive column of type int.
  • Does not yet support INSERT.
  • Iterator pushdown only works on WHERE clauses consisting of purely conjunctive predicates. This is a known Hive limitation with the IndexPredicateAnalyzer.
  • The 'Like' CompareOpt is not considered decomposable by the predicate analyzer, because the Hive UDFLike does not extend GenericUDF.
  • Iterator pushdown applies only to the operators <, >, =, >=, <=, !=.
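
The pushdown rules in the list above can be restated as a toy model. The sketch below is not Hive's IndexPredicateAnalyzer and not this project's code. It assumes a WHERE clause is represented as nested tuples (a made-up representation), and it reports whether the clause is purely conjunctive and uses only the listed operators. The column names are hypothetical.

```python
# Operators for which iterator pushdown is described as available.
PUSHDOWN_OPS = {"<", ">", "=", ">=", "<=", "!="}


def is_pushdown_eligible(node):
    """Return True if the predicate tree is eligible under the stated rules.

    Assumed node shapes (hypothetical, for illustration only):
      ("and", child, child, ...)   conjunction
      ("or", child, child, ...)    disjunction
      ("cmp", op, column, value)   single comparison
    """
    kind = node[0]
    if kind == "and":
        # A conjunction qualifies only if every conjunct qualifies.
        return all(is_pushdown_eligible(child) for child in node[1:])
    if kind == "cmp":
        _, op, _column, _value = node
        return op in PUSHDOWN_OPS
    # Disjunctions and anything unrecognized are not purely conjunctive.
    return False


conjunctive = ("and", ("cmp", ">", "fatalities", 5), ("cmp", "=", "year", 2010))
disjunctive = ("or", ("cmp", ">", "fatalities", 5), ("cmp", "=", "year", 2010))
like_clause = ("cmp", "like", "notes", "%riot%")

print(is_pushdown_eligible(conjunctive))
print(is_pushdown_eligible(disjunctive))
print(is_pushdown_eligible(like_clause))
```

Under these rules, a query that is not eligible presumably still runs, with the filtering done by Hive instead of by Accumulo iterators.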

Future enhancements:

  • Allow INSERT for field serialization to Accumulo. OutputFormat exists but is not wired to Serde or tested.
  • Serde property for setting fixed timestamp during mutations.
  • Allow per-qualifier type hints in the serde property, similar to the latest build of the HBase StorageHandler.
  • Support for the remaining Hive primitive column types.
  • Support for complex value types (Struct, Map, Array, Union).
  • Allow custom Authorizations to be supplied from an external source.

Usage

Licensed AS-IS under Apache License 2.0

About

Working commits for Hive connector to Accumulo. This will eventually be checked directly into Accumulo.
