Hadoop ETL UDFs

Deprecation Warning

⚠️ This project is no longer maintained.

As an alternative, please check out our cloud-storage-extension project. It allows you to import formatted data (Avro, ORC, Parquet) from the Hadoop Distributed File System (HDFS) or cloud storage systems (S3, Azure Blob Storage, or GCS).

For Kerberos-based access to a Hadoop cluster, you can use our Hive Virtual Schema or Impala Virtual Schema projects.

Overview

Hadoop ETL UDFs are the main way to transfer data between Exasol and Hadoop (HCatalog tables on HDFS). The SQL syntax for calling the UDFs is similar to that of Exasol's native IMPORT and EXPORT commands, but with additional UDF parameters for specifying the required and optional Hadoop properties.

A brief overview of features includes support for:

  • HCatalog metadata (e.g., table location, columns, partitions)
  • Multiple file formats (e.g., Parquet, ORC, RCFile)
  • HDFS HA
  • Partitions
  • Parallelization

For a more detailed description of the features, please refer to the IMPORT and EXPORT sections below.

Getting Started

Before you can start using the Hadoop ETL UDFs, you have to deploy the UDFs in your Exasol database. Please follow the step-by-step deployment guide.

Using the UDFs

After deploying the UDFs, you can begin using them to transfer data to and from Hadoop.

IMPORT

The IMPORT UDFs load data into Exasol from Hadoop (HCatalog tables on HDFS). To import data, you just need to execute the SQL statement IMPORT INTO ... FROM SCRIPT ETL.IMPORT_HCAT_TABLE WITH ... with the appropriate parameters. This calls the ETL.IMPORT_HCAT_TABLE UDF, which was previously created during deployment.

For example, run the following statements to create a target table and import data into it.

CREATE TABLE sample_07 (code VARCHAR(1000), description VARCHAR(1000), total_emp INT, salary INT);

IMPORT INTO sample_07
FROM SCRIPT ETL.IMPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'sample_07'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs';
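
You can also pass optional parameters, for example to import only certain partitions or to control the degree of parallelism. The following is a sketch only: the parameter names PARTITIONS and PARALLELISM and their value formats are assumptions here, so please check the IMPORT details for the exact names and syntax.

IMPORT INTO sample_07
FROM SCRIPT ETL.IMPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'sample_07'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs'
 PARTITIONS      = 'part1=2015-01-01'  -- assumed parameter: import only matching partitions
 PARALLELISM     = 'nproc()';          -- assumed parameter: number of parallel UDF instances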

Please see the IMPORT details for a full description.

EXPORT

Note: This functionality is available in Exasol starting with version 6.0.3.

The EXPORT UDFs load data from Exasol into Hadoop (HCatalog tables on HDFS). To export data, you just need to execute the SQL statement EXPORT ... INTO SCRIPT ETL.EXPORT_HCAT_TABLE WITH ... with the appropriate parameters. This calls the ETL.EXPORT_HCAT_TABLE UDF, which was previously created during deployment.

For example, run the following statements to create a table and export its data.

CREATE TABLE TABLE1 (COL1 SMALLINT, COL2 INT, COL3 VARCHAR(50));

EXPORT TABLE1
INTO SCRIPT ETL.EXPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'test_table'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs';
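
You can also export the result of a query instead of a whole table. A minimal sketch, assuming the script target accepts the same subquery form as Exasol's native EXPORT command:

EXPORT (SELECT COL1, COL2, COL3 FROM TABLE1 WHERE COL1 > 0)
INTO SCRIPT ETL.EXPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'test_table'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs';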

Please see the EXPORT details for a full description.

Frequent Issues

  • If you cannot connect to certain parts of Hadoop, it is a good idea to test DNS hostname resolution and TCP/IP connectivity to all Hadoop hosts and ports (HCatalog, HDFS, and Kerberos servers, if used). You can use the Python script in solution 325 for this. Note that the script is designed for testing HTTP connections, so you can ignore the HTTP check failures.

  • Google Dataproc integration issues.

  • Hive NULL values are imported as the string \N. For now, you can post-process the data after the import to convert them into proper NULL values, as shown in the sketch below.
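
A minimal post-processing sketch, using the sample_07 table and description column from the IMPORT example above (adapt it to your own tables and columns):

UPDATE sample_07
SET description = NULL
WHERE description = '\N';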