Hive Dataset as external table with HDFS Dataset #461

Open
kchen0x opened this issue Jan 30, 2017 · 4 comments

@kchen0x commented Jan 30, 2017

I created a dataset on HDFS with a schema and a partition strategy:

kite-dataset create dataset:hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw --schema sensorRecord.avsc --partition-by partition.json

and use Gobblin to continuously ingest data from Kafka to HDFS. The partition strategy (partition.json) looks like:

[
  {"type": "identity", "source": "src", "name": "source"},
  {"type": "year",     "source": "timestamp"},
  {"type": "month",    "source": "timestamp"},
  {"type": "day",      "source": "timestamp"},
  {"type": "hour",     "source": "timestamp"}
]

This part works well.
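With this strategy, Kite lays the data out as Hive-style key=value directories under the dataset root. A hypothetical listing (the source value, timestamps, and file names here are made up):

/user/pnda/PNDA_datasets/datasets/kafka/depa_raw/source=sensor01/year=2017/month=01/day=30/hour=09/xxx.avro
/user/pnda/PNDA_datasets/datasets/kafka/depa_raw/source=sensor01/year=2017/month=01/day=30/hour=10/xxx.avro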

Then I try to use Hive to query this data, so I create a new Hive dataset as an external table by assigning the --location parameter:

kite-dataset create depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw

After that I can find the table default/depa_raw and its data in Hive.

But one thing is wrong: as data keeps coming from Kafka into HDFS, new partition directories appear on HDFS, but no corresponding partitions are created automatically in the Hive table. This means I can't see the newly arrived data in Hive.

So what can I do to solve this problem? (I just want to see newly arriving data in Hive.)

  • I tried kite-dataset delete depa_raw, intending to recreate the external Hive table, but all the data on HDFS was gone after the command.
  • I tried kite-dataset update depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw but nothing happened.
@mkwhitacre (Contributor)

It is not a great solution but you can repair[1] the table with:

MSCK REPAIR TABLE

[1] - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
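For example, from the Hive shell (table name taken from above):

MSCK REPAIR TABLE depa_raw;

or as a one-shot command that could be scheduled (e.g. from cron) until something better is in place:

hive -e 'MSCK REPAIR TABLE depa_raw;'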

@kchen0x (Author) commented Jan 31, 2017

@mkwhitacre Thank you very much, it really solved my problem.

I have one more question: I use Gobblin to set up a MapReduce job that consumes data from Kafka and writes it to a Kite dataset. But when I try to write directly to dataset:hive:depa_raw with

Datasets.load(datasetURI)

the MapReduce job always fails without a specific exception. Only when I set datasetURI="dataset:hdfs://<ip>:<port>/path/to/depa_raw" does it work correctly.

That is why I created a new Hive dataset:

kite-dataset create depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw

So what could be the possible reason for this problem?
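For illustration, a minimal sketch of the two load calls I am comparing (assuming the Kite SDK on the classpath and Avro GenericRecord entities; this is how I understand the two URI forms, not my exact job code):

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.Datasets;

public class LoadExample {
  public static void main(String[] args) {
    // Works: the dataset is resolved directly from its HDFS location,
    // so only the HDFS configuration is needed.
    Dataset<GenericRecord> byPath = Datasets.load(
        "dataset:hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw",
        GenericRecord.class);

    // Fails in my MapReduce job: resolving a hive: URI goes through the
    // Hive Metastore, so it likely needs the metastore configuration
    // (e.g. hive-site.xml) and the Hive jars available to the task.
    Dataset<GenericRecord> byHive = Datasets.load(
        "dataset:hive:depa_raw", GenericRecord.class);
  }
}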

@mkwhitacre (Contributor)

Does it also fail when you do: "dataset:hdfs://nameservice/path/to/depa_raw"?

Without a specific exception it is harder to diagnose, but I'm guessing your config is not being populated with the Hive Metastore configuration, or the Hive jars are missing from your classpath.
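If the metastore configuration is the issue, you could also try pointing the URI at the metastore host explicitly instead of relying on hive-site.xml being picked up (if I recall the Kite URI forms correctly; 9083 is the default metastore port and the host below is a placeholder):

dataset:hive://10.0.1.63:9083/default/depa_raw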

@kchen0x (Author) commented Jan 31, 2017

No matter what format I use, as long as it is dataset:hdfs it works, but dataset:hive does not. I have two datasets here: the HDFS dataset and the Hive external table over the same location.

All the configuration and the class I've used are here:

https://github.com/quentin-chen/gobblin-pnda
