Overview


A Hive Input/Output Library in Java.

The Hive project is built to be used through command-line HQL.
As such, it does not have a good interface for connecting to it directly from code.
This project fixes that problem: it adds support for input/output with Hive directly from code.

Currently we have:

  1. Simple APIs to read and write a table in a single process.
  2. Hadoop compatible Input/OutputFormat that can interact with Hive tables.
  3. hivetail executable for dumping data from a Hive table to stdout.
  4. Examples of HiveIO usage for building applications.

Building

This project uses Maven for its build.
To build the code yourself, clone this repository and run mvn clean install.

Using

To use this library in your project with Maven, add this to your pom.xml:

<dependency>
    <groupId>com.facebook.hiveio</groupId>
    <artifactId>hive-io-exp-core</artifactId>
    <version>0.10</version>
</dependency>

You can also browse the released builds and download the jars directly.

Reference

The project's Maven site is generated with each release.
API information is available in the JavaDocs.

1. Simple API

If you just want to quickly read or write a Hive table from a Java process, these are for you.

The input API is simply:

public class HiveInput {
  public static Iterable<HiveReadableRecord> readTable(HiveInputDescription inputDesc);
}

And the output API is:

public class HiveOutput {
  public static void writeTable(
        HiveOutputDescription outputDesc,
        Iterable<HiveWritableRecord> records);
}

For example usages of these, take a look at the cmdline apps and the tests in the code.
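
As a rough sketch of how the two calls fit together (the setter names on the description objects and the table names are assumptions here, not the exact API; consult the JavaDocs for the real methods), reading a table and writing records back might look like:

// Sketch only: setDbName/setTableName and the table names are assumptions;
// see the JavaDocs for the actual HiveInputDescription/HiveOutputDescription API.
HiveInputDescription inputDesc = new HiveInputDescription();
inputDesc.setDbName("my_database");
inputDesc.setTableName("my_input_table");

for (HiveReadableRecord record : HiveInput.readTable(inputDesc)) {
  process(record);  // hypothetical per-record handler
}

HiveOutputDescription outputDesc = new HiveOutputDescription();
outputDesc.setDbName("my_database");
outputDesc.setTableName("my_output_table");

// records is an Iterable<HiveWritableRecord> built by your application
HiveOutput.writeTable(outputDesc, records);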

2. Hadoop Compatible API

Design

The input and output classes have a notion of profiles. A profile is tagged by a name and describes the input or output you wish to connect to. Profiles are serialized into the Hadoop Configuration so that they can be passed around easily. MapReduce takes classes you configure it with and instantiates them using reflection on many hosts.

This reflection-based model makes it hard to configure the classes beyond choosing which ones to use. The approach we have chosen is to have the user create input/output descriptions and register them under a profile. This writes them to the Configuration, which MapReduce ships to every worker. See the example code mentioned below for more information.
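
As a sketch of that flow, the driver builds a description, registers it under a profile name so it gets serialized into the Configuration, and each worker recovers it by that name after MapReduce instantiates the class by reflection. The helper and setter names below are assumptions for illustration; the example code shows the real calls.

// Driver side -- hypothetical helper and setter names, for illustration only.
HiveInputDescription inputDesc = new HiveInputDescription();
inputDesc.setTableName("my_table");                                     // assumed setter

Configuration conf = new Configuration();
HiveApiInputFormat.setProfileInputDesc(conf, inputDesc, "my_profile");  // assumed helper
// This conf is then handed to the MapReduce job that is submitted.

// Worker side -- MapReduce creates the input/output classes with no arguments
// by reflection, so the only thing they can rely on is what was written to the
// Configuration under the profile name above.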

Input

The Hive IO library conforms to Hadoop's InputFormat API. The class that implements this API is HiveApiInputFormat. MapReduce creates the InputFormat using reflection. It will call getSplits() to generate splits on the Hive tables to read. Each split is then sent to a worker, which calls createRecordReader(split). This creates a RecordReaderImpl that is used to read each record.
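
A minimal driver-side sketch of wiring this into a Hadoop job follows. HiveInputJobSketch and MyMapper are hypothetical names, the import path is assumed from the com.facebook.hiveio groupId, and the profile setup described above is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.facebook.hiveio.input.HiveApiInputFormat;

public class HiveInputJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Profile setup (see the Design section above) would go here before submission.
    Job job = Job.getInstance(conf, "hive-io input example");
    job.setJarByClass(HiveInputJobSketch.class);
    // MapReduce instantiates the input format by reflection, calls getSplits()
    // once, then createRecordReader(split) on the worker handling each split.
    job.setInputFormatClass(HiveApiInputFormat.class);
    job.setMapperClass(MyMapper.class);  // MyMapper: a hypothetical mapper defined elsewhere
    job.waitForCompletion(true);
  }
}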

Output

The Hive IO library conforms to Hadoop's OutputFormat API. The class that implements this API is HiveApiOutputFormat.
MapReduce creates the OutputFormat using reflection.
It will call checkOutputSpecs() to verify that the output can be written.
Then it will call createRecordWriter() on each map task host. These record writers will get passed each record to write.
Finally it will call getOutputCommitter() and finalize the output.
Because MapReduce creates the OutputFormat by reflection, to make use of this library you will need to extend HiveApiOutputFormat (and potentially HiveApiRecordWriter).
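
A minimal sketch of that pattern (MyHiveOutputFormat is a hypothetical subclass name, the import path is assumed from the com.facebook.hiveio groupId, and whether any methods must be overridden depends on your job; the example mapreduce code below shows a complete working version):

import com.facebook.hiveio.output.HiveApiOutputFormat;

// Hypothetical no-argument subclass so MapReduce can instantiate it by reflection;
// per-job specialization (for example, fixing a profile name) would go here.
public class MyHiveOutputFormat extends HiveApiOutputFormat {
}

The driver then registers it, and MapReduce drives the lifecycle described above:

// checkOutputSpecs() runs first, createRecordWriter() runs on each map task host,
// and getOutputCommitter() finalizes the output.
job.setOutputFormatClass(MyHiveOutputFormat.class);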

The best way to get started with output is to have a look at the example mapreduce code.
This is a simple example of using the library that should be used as a starting point for projects.

Benchmarks

Using Hive IO we were able to read from Hive at 140 MB/s. For comparison, using the hive command line on the same partition reads at around 35 MB/s. This benchmark is an evolving piece of work and we still have performance tweaks to make. Code for the benchmark is located here.

3. HiveTail

This library comes with a Hive reader, called hivetail, that can be used to stream through Hive data.
It reads from a Hive table / partition and dumps rows to stdout.
It is also a good starting point for implementing input with this library.
The code is located here.
To use it, first build the code or download the jar from Maven Central. Then on your machine run: java -jar hive-io-exp-cmdline-0.10-jar-with-dependencies.jar help tail

4. Examples

The cmdline module contains a suite of command line tools using HiveIO.
The jar-with-dependencies that it builds is completely standalone.
It includes Hive jars and everything else it needs, so it can be run directly on the command line.
Moreover, it does not use MapReduce, but rather executes completely in a single JVM.

This package contains:

  1. A tool that writes to a predefined Hive table with multiple threads.
  2. The tailer, see hivetail above.
  3. Input benchmark, see above.

Additionally there is a mapreduce module that contains a suite of tools for MapReduce jobs with HiveIO.
Currently it contains an example MR job that writes to a Hive table.

Real World Examples

Giraph uses Hive IO to read and write graphs from Hive. It is another good example that can be used for starter code.
It is a more complicated use case than the examples mentioned above as it converts records to graph vertex and edge objects.
It has scaled to terabytes of data and is running in production at large companies (e.g. Facebook).
The code is here. Specifically you should look at vertex-input, edge-input, and output.
Note especially how the vertex and edge inputs use different profiles to allow reading from multiple Hive tables.
