parquet-io-java

This project provides a library that reads Parquet files into Java objects.

Installation

Add this library as a dependency to your project's pom.xml file.

<dependencies>
    <dependency>
        <groupId>com.exasol</groupId>
        <artifactId>parquet-io-java</artifactId>
        <version>LATEST VERSION</version>
    </dependency>
</dependencies>

Replace LATEST VERSION with the most recent release version of the library.

Usage

The following example reads every row of a Parquet file and prints the row's values.

final Path path = new Path("/data/parquet/part-0000.parquet");
final Configuration conf = new Configuration();
try (final ParquetReader<Row> reader = RowParquetReader
        .builder(HadoopInputFile.fromPath(path, conf)).build()) {
    Row row = reader.read();
    while (row != null) {
        List<Object> values = row.getValues();
        System.out.println(values);
        row = reader.read();
    }
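} catch (final IOException exception) {
    // Do not swallow read failures; rethrow with the file path for context.
    throw new UncheckedIOException("Failed to read Parquet file: " + path, exception);
}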

Data Type Mapping

The following table shows how each Parquet data type, together with its optional logical type, is mapped to a Java data type.

| Parquet Data Type    | Parquet Logical Type | Java Data Type |
|----------------------|----------------------|----------------|
| boolean              |                      | Boolean        |
| int32                |                      | Integer        |
| int32                | date                 | Date           |
| int32                | decimal(p, s)        | BigDecimal     |
| int64                |                      | Long           |
| int64                | timestamp_millis     | Timestamp      |
| int64                | timestamp_micros     | Timestamp      |
| int64                | decimal(p, s)        | BigDecimal     |
| float                |                      | Float          |
| double               |                      | Double         |
| binary               |                      | String         |
| binary               | utf8                 | String         |
| binary               | decimal(p, s)        | BigDecimal     |
| fixed_len_byte_array |                      | String         |
| fixed_len_byte_array | decimal(p, s)        | BigDecimal     |
| fixed_len_byte_array | uuid                 | UUID           |
| int96                |                      | Timestamp      |
| group                |                      | Map            |
| group                | LIST                 | List           |
| group                | MAP                  | Map            |
| group                | REPEATED             | List           |
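
Because the row values come back as a List<Object>, calling code usually has to branch on the runtime type of each value. The following is a minimal sketch of such a dispatch, following the mapping above. The class ValueDescriber and its describe method are hypothetical names used for illustration and are not part of the library. The sketch also assumes that Date and Timestamp refer to java.sql.Date and java.sql.Timestamp, which the mapping does not state explicitly.

```java
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.Timestamp;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper for illustration; not part of parquet-io-java.
final class ValueDescriber {

    // Returns a short label for a single value taken from a row's value list.
    static String describe(final Object value) {
        if (value == null) {
            return "null";
        } else if (value instanceof BigDecimal) {
            return "decimal " + ((BigDecimal) value).toPlainString();
        } else if (value instanceof Timestamp) { // assumption: java.sql.Timestamp
            return "timestamp " + value;
        } else if (value instanceof Date) { // assumption: java.sql.Date
            return "date " + value;
        } else if (value instanceof UUID) {
            return "uuid " + value;
        } else if (value instanceof List) {
            return "list of " + ((List<?>) value).size() + " element(s)";
        } else if (value instanceof Map) {
            return "map with " + ((Map<?, ?>) value).size() + " entry(ies)";
        } else {
            // Boolean, Integer, Long, Float, Double and String need no special handling.
            return value.getClass().getSimpleName() + " " + value;
        }
    }
}
```

Calling ValueDescriber.describe(value) for each element of row.getValues() gives a quick way to check which Java types a given file actually produces.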

Parquet Repeated Types

A Parquet schema can mark a single field or a group of fields as repeated. parquet-io-java (PIOJ) reads such repeated types into a Java List.

For example, given the following Parquet schemas:

message parquet_schema {
  repeated binary name (UTF8);
}

message parquet_schema {
  repeated group person {
    required binary name (UTF8);
  }
}

PIOJ reads both of these schemas into the same Java list, for example ["John", "Jane"]. A repeated group that contains a single field is read as a list of that field's values.
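
In application code, the repeated field arrives as one entry of the row's value list, typed as Object. A minimal sketch of converting it into a typed list could look like this. The class RepeatedNames and the method toNames are hypothetical names, and the values list is built by hand as a stand-in for the result of row.getValues().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of parquet-io-java.
final class RepeatedNames {

    // Converts the value of a repeated string field into a typed list.
    // Returns an empty list if the value is not a List.
    static List<String> toNames(final Object fieldValue) {
        final List<String> names = new ArrayList<>();
        if (fieldValue instanceof List) {
            for (final Object element : (List<?>) fieldValue) {
                names.add((String) element);
            }
        }
        return names;
    }

    public static void main(final String[] args) {
        // Stand-in for row.getValues() on a row whose only column is the repeated "name" field.
        final List<Object> values = List.of(List.of("John", "Jane"));
        System.out.println(toNames(values.get(0))); // prints [John, Jane]
    }
}
```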

A repeated group with multiple fields, on the other hand, is read as a list of maps.

message parquet_schema {
  repeated group person {
    required binary name (UTF8);
    optional int32 age;
  }
}

PIOJ reads it into a list of person maps:

[ Map("name" -> "John", "age" -> 24), Map("name" -> "Jane", "age" -> 22) ]
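
Consuming such a list of maps can be sketched as follows. The class PersonReader and the method formatPeople are hypothetical names. The sketch assumes that the map keys are the field names as String values and that the optional age field is missing or null when it is not set. Neither assumption is confirmed by the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of parquet-io-java.
final class PersonReader {

    // Formats each person map as "name (age)", or just "name" when age is absent.
    static List<String> formatPeople(final List<Map<String, Object>> people) {
        final List<String> lines = new ArrayList<>();
        for (final Map<String, Object> person : people) {
            final String name = (String) person.get("name");
            final Object age = person.get("age"); // optional field, may be null
            lines.add(age == null ? name : name + " (" + age + ")");
        }
        return lines;
    }

    public static void main(final String[] args) {
        // Hand-built stand-in for the list of person maps shown above.
        final Map<String, Object> john = new LinkedHashMap<>();
        john.put("name", "John");
        john.put("age", 24);
        final Map<String, Object> jane = new LinkedHashMap<>();
        jane.put("name", "Jane");
        jane.put("age", 22);
        System.out.println(formatPeople(List.of(john, jane))); // prints [John (24), Jane (22)]
    }
}
```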

Information for Users

Information for Developers