Possible Stocator config issue #227

Open
desimonemike123 opened this issue Dec 13, 2019 · 2 comments

Comments

@desimonemike123

We're hitting a Stocator configuration issue in an HDP 2.6.5 cluster (which ships with Spark 2.3.x and HDFS, YARN, and MapReduce2 2.7.x). Per the Stocator docs, we built stocator-1.0.35-IBM-SDK.jar, configured IBM COS buckets, and, using spark-submit with the --jars option, were able to read and write to the buckets without issue. Within the Spark program we set the required keys (fs.cos.serviceName.iam.api.key, etc.) on _jsc.hadoopConfiguration(). We also use Jupyter notebooks; to enable the notebook environment we installed the Stocator jar into the .../hdp/2.6.5.0-292/spark2/jars and .../hdp/2.6.5.0-292/hadoop/lib directories across the cluster and were able to read/write to the buckets without issue as well.
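
For reference, this is roughly how we set the keys in the Spark program. It's a minimal sketch: the endpoint URL and API key below are placeholders, and "mab" stands in for our service name.

```python
# Minimal sketch of the per-job COS configuration we pass to Stocator.
# "mab" is our service name; the endpoint and API key are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cos-smoke-test").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Register the cos:// scheme with Stocator (classes ship in stocator-1.0.35-IBM-SDK.jar).
hconf.set("fs.stocator.scheme.list", "cos")
hconf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
hconf.set("fs.stocator.cos.scheme", "cos")

# Per-service endpoint and credentials (fs.cos.<serviceName>.* keys).
hconf.set("fs.cos.mab.endpoint", "https://s3.us-south.cloud-object-storage.appdomain.cloud")
hconf.set("fs.cos.mab.iam.api.key", "<API_KEY>")

# Reads and writes work fine this way.
df = spark.read.csv("cos://mab-ancillary.mab/anc_table/")
df.show()
```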

We typically define external Hive tables over our HDFS data, and this is where we are encountering issues with Stocator and COS. We determined that the Stocator jar also needed to be installed under .../hdp/2.6.5.0-292/hive/lib. We couldn't find a way to dynamically pass the required keys to Hive (using Beeline or Spark) to create the table successfully. Note that the table definition now has a Location parameter of the form "cos://bucketName.serviceName/dir". We found that if we added all the required fs.cos keys to our cluster's core-site.xml, we could then create the table in Hive. Is there a way to pass the keys in dynamically? Having them present in core-site.xml presents security issues.

The new external table definition in Hive looks correct. Where we're currently stuck is retrieving data from the table. Whether we retrieve using spark.sql (select * from HiveTableName ...) or from the Beeline CLI, we get an error that leads me to believe we're missing some configuration; detailed stack trace info is below. As you can see, Stocator does appear to list the files in the bucket directory without issue, but we then hit the error -- java.net.UnknownHostException: mab-ancillary.mab. Here mab-ancillary is the bucket name and mab is the service name, so we believe we're missing a configuration step. Note that, for 'fun', we created a bogus DNS entry with host 'mab-ancillary.mab' and the IP address of our COS endpoint, and retrieval then does in fact work. Any help would be appreciated.
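
For completeness, the table definition looks roughly like the following (the schema and delimiter are illustrative; only the LOCATION form matches our setup). We run the DDL through spark.sql with Hive support enabled, but the Beeline equivalent is the same statement:

```python
# Illustrative DDL for the external Hive table over COS; column names/types
# are made up, but the LOCATION follows the cos://bucketName.serviceName/dir form.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE anc_table (
        col1 STRING,
        col2 INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'cos://mab-ancillary.mab/anc_table'
""")

# CREATE succeeds once the fs.cos keys are in core-site.xml, but this read
# fails with java.net.UnknownHostException: mab-ancillary.mab
spark.sql("SELECT * FROM anc_table").show()
```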

2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1202)) - isStocatorOrigin: for anc_table/
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1209)) - isStocatorOrigin: found cached for stocator origin for anc_table. Status true
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1202)) - isStocatorOrigin: for anc_table/
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1209)) - isStocatorOrigin: found cached for stocator origin for anc_table. Status true
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:createFileStatus(696)) - createFileStatus: found exact file: fake directory cos://mab-ancillary.mab/anc_table/_SUCCESS
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1202)) - isStocatorOrigin: for anc_table/
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1209)) - isStocatorOrigin: found cached for stocator origin for anc_table. Status true
2019-12-12 19:52:50,073 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:createFileStatus(699)) - createFileStatus: found exact file: normal file cos://mab-ancillary.mab/anc_table/part-00000-c2be6382-7064-4d49-a819-6fbaf75d29b1-c000-attempt_20191209164602_0001_m_000000_0.csv
2019-12-12 19:52:50,073 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:createFileStatus(699)) - createFileStatus: found exact file: normal file cos://mab-ancillary.mab/anc_table/part-00001-c2be6382-7064-4d49-a819-6fbaf75d29b1-c000-attempt_20191209164602_0001_m_000001_0.csv
2019-12-12 19:52:50,073 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: fs.ObjectStoreFileSystem (ObjectStoreFileSystem.java:listStatus(395)) - listStatus: cos://mab-ancillary.mab/anc_table completed. return 2 results
2019-12-12 19:52:50,079 INFO  [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: session.HiveSessionImpl (HiveSessionImpl.java:releaseBeforeOpLock(366)) - We are resetting the hadoop caller context for thread HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice
2019-12-12 19:52:50,079 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: security.UserGroupInformation (UserGroupInformation.java:doAs(1873)) - PrivilegedActionException as:ambari-server (auth:PROXY) via hive/hive.aice.svc.cluster.local@SL.CLOUD9.IBM.COM (auth:KERBEROS) cause:org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.IllegalArgumentException: java.net.UnknownHostException: mab-ancillary.mab
2019-12-12 19:52:50,079 WARN  [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(718)) - Error fetching results:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.IllegalArgumentException: java.net.UnknownHostException: mab-ancillary.mab
at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:416)
at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:243)
at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:793)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:6

@gilv
Contributor

gilv commented Dec 15, 2019

@desimonemike123 Based on the log, it seems you are using Hive, right? If that's the case, Stocator doesn't support Hive flows.

@desimonemike123
Author

Thanks for the fast reply, Gil; I appreciate it. Yes, we currently use Hive to store schema and partition metadata for our externally defined tables on HDFS, and I was trying to follow that same convention for data that resides on IBM COS. I was reading the documentation for IBM's Analytics Engine, which states that it utilizes Stocator as its connector to COS and also provides a sample of defining Hive tables over COS (https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-working-with-hive). I therefore assumed that Stocator supports Hive. I haven't stood up the Analytics Engine service yet, but I assume either that its dev team added the Hive support or that I'll encounter a similar issue there. It's my understanding that Spark takes advantage of the partition metadata stored in Hive when the table is queried (avoiding an up-front discovery of all partitions/sub-directories for the given data set). This is one of the reasons I'm trying to support external Hive tables located on both HDFS and COS within our HDP cluster.
