Hive Dataset as external table with HDFS Dataset #461

Open
kchen0x opened this issue Jan 30, 2017 · 4 comments

@kchen0x commented Jan 30, 2017

I created a dataset on HDFS with a schema and a partition strategy:

kite-dataset create dataset:hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw --schema sensorRecord.avsc --partition-by partition.json

and use Gobblin to continuously ingest data from Kafka to HDFS. The partition strategy (partition.json) looks like:

[
  {"type": "identity", "source": "src", "name": "source"},
  {"type": "year",     "source": "timestamp"},
  {"type": "month",    "source": "timestamp"},
  {"type": "day",      "source": "timestamp"},
  {"type": "hour",     "source": "timestamp"}
]

This part works well.
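With this strategy, Kite lays the data out as Hive-style key=value directories under the dataset root. A hypothetical listing (the source value, timestamps, and file names here are made up):

/user/pnda/PNDA_datasets/datasets/kafka/depa_raw/source=sensor01/year=2017/month=01/day=30/hour=09/xxx.avro
/user/pnda/PNDA_datasets/datasets/kafka/depa_raw/source=sensor01/year=2017/month=01/day=30/hour=10/xxx.avro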

Then I try to use Hive to query this data, so I create a new Hive dataset as an external table by assigning the --location parameter:

kite-dataset create depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw

After that I can find the table default/depa_raw and its data in Hive.

But one thing is wrong: as data keeps coming from Kafka into HDFS, new partition directories appear on HDFS, but no corresponding partitions are created automatically in the Hive table. This means I can't see the newly arrived data in Hive.

So what can I do to solve this problem? (I just want to see newly arriving data in Hive.)

  • I tried kite-dataset delete depa_raw, intending to recreate the external Hive table, but all the data on HDFS was gone after the command.
  • I tried kite-dataset update depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw but nothing happened.
@mkwhitacre (Contributor)

It is not a great solution but you can repair[1] the table with:

MSCK REPAIR TABLE

[1] - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
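For example, from the Hive shell (table name taken from above):

MSCK REPAIR TABLE depa_raw;

or as a one-shot command that could be scheduled (e.g. from cron) until something better is in place:

hive -e 'MSCK REPAIR TABLE depa_raw;'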

@kchen0x (Author) commented Jan 31, 2017

@mkwhitacre Thank you very much, it really solved my problem.

I have one more question: I use Gobblin to set up a MapReduce job that consumes data from Kafka and writes it to a Kite dataset. But when I try to write directly to dataset:hive:depa_raw with

Datasets.load(datasetURI)

the MapReduce job always fails without a specific exception. Only when I set datasetURI="dataset:hdfs://<ip>:<port>/path/to/depa_raw" does it work correctly.

That is why I created a new Hive dataset:

kite-dataset create depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw

So what could be the possible reason for this problem?
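For illustration, a minimal sketch of the two load calls I am comparing (assuming the Kite SDK on the classpath and Avro GenericRecord entities; this is how I understand the two URI forms, not my exact job code):

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.Datasets;

public class LoadExample {
  public static void main(String[] args) {
    // Works: the dataset is resolved directly from its HDFS location,
    // so only the HDFS configuration is needed.
    Dataset<GenericRecord> byPath = Datasets.load(
        "dataset:hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw",
        GenericRecord.class);

    // Fails in my MapReduce job: resolving a hive: URI goes through the
    // Hive Metastore, so it likely needs the metastore configuration
    // (e.g. hive-site.xml) and the Hive jars available to the task.
    Dataset<GenericRecord> byHive = Datasets.load(
        "dataset:hive:depa_raw", GenericRecord.class);
  }
}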

@mkwhitacre (Contributor)

Does it also fail when you do: "dataset:hdfs://nameservice/path/to/depa_raw"?

Without a specific exception it is harder to diagnose, but I'm guessing your config is not being populated with the Hive Metastore configuration, or the Hive jars are missing from your classpath.
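If the metastore configuration is the issue, you could also try pointing the URI at the metastore host explicitly instead of relying on hive-site.xml being picked up (if I recall the Kite URI forms correctly; 9083 is the default metastore port and the host below is a placeholder):

dataset:hive://10.0.1.63:9083/default/depa_raw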

@kchen0x (Author) commented Jan 31, 2017

No matter what format I use, as long as it is dataset:hdfs it works, but dataset:hive does not. I have two datasets here: the HDFS dataset and the Hive external table over the same location.

All the configuration and the class I've used are here:

https://github.com/quentin-chen/gobblin-pnda
