
Issue with Glue to Hive through Amazon S3 Objects #72

Open
jainanuj07 opened this issue Jul 24, 2020 · 2 comments

@jainanuj07

I am trying to migrate the AWS Glue Data Catalog to the Hive metastore of an EMR cluster, using a MySQL 8 RDS instance as the Hive metastore backend. Since MySQL 8 connections are not supported in Glue, I am migrating via Amazon S3 objects.

I am able to migrate from Glue to S3 objects successfully, but when I run the Spark job for the Hive metastore migration from S3 (to-metastore mode), it fails with:

java.sql.BatchUpdateException: Duplicate entry 'events_db' for key 'DBS.UNIQUE_DATABASE'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.cj.util.Util.handleNewInstance(Util.java:192)
at com.mysql.cj.util.Util.getInstance(Util.java:167)
at com.mysql.cj.util.Util.getInstance(Util.java:174)
at com.mysql.cj.jdbc.exceptions.SQLError.createBatchUpdateException(SQLError.java:224)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchSerially(ClientPreparedStatement.java:853)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchInternal(ClientPreparedStatement.java:435)
at com.mysql.cj.jdbc.StatementImpl.executeBatch(StatementImpl.java:796)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:672)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
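
(For context: this duplicate-key error means a database named events_db already exists in the target metastore's DBS table, which enforces a unique constraint on database names, so the job's JDBC batch insert collides with it. A minimal way to confirm this against the RDS instance, assuming the standard Hive metastore schema and that the mysql client is available; the endpoint, username, and schema name "metastore" below are placeholders that depend on your setup:

# Inspect the Hive metastore directly. DBS and TBLS are standard
# Hive metastore schema tables; the schema name varies per install.
mysql -h <rds-endpoint> -u <jdbc-username> -p -e "
  SELECT DB_ID, NAME, DB_LOCATION_URI FROM metastore.DBS WHERE NAME = 'events_db';
  SELECT TBL_ID, TBL_NAME FROM metastore.TBLS
   WHERE DB_ID = (SELECT DB_ID FROM metastore.DBS WHERE NAME = 'events_db');"

If rows come back, the to-metastore run is colliding with entries left behind by an earlier run or by the cluster's own initialization.)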

I am running the Spark job on my EMR cluster like this:
spark-submit --deploy-mode cluster hive_metastore_migration.py --mode 'to-metastore' --jdbc-url <> --jdbc-username <> --jdbc-password <> --input_path 's3://'

Is there any workaround to resolve this problem?
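
(One possible workaround, assuming the conflicting database holds nothing that needs to be kept: drop it through Hive on the EMR cluster before re-running the to-metastore job, so the script can re-insert it cleanly. The migration script issues plain inserts and does not appear to be idempotent across re-runs. A sketch:

# Drop the colliding database via Hive so the metastore stays consistent;
# CASCADE also removes any tables under it. Destructive; use with care.
hive -e "DROP DATABASE IF EXISTS events_db CASCADE;"

If the version of hive_metastore_migration.py in use supports database/table prefix options, check its --help output; prefixing could side-step the name collision without deleting anything.)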

@AshishInamdar02

Hi,

I followed the steps mentioned, but whenever I run the Glue job, it fails with the error below:

Invalid input: Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/g-c8df90907da0771653b3edb778243f5a38e73e14-2295083367131776400/Scripts/hive_metastore_migration.py does not exist
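
(That FileNotFoundException usually indicates the Glue job was created without a valid script location, so Glue staged a temporary path under /tmp that contains no script. A hedged sketch of creating the job with the AWS CLI, assuming hive_metastore_migration.py has been uploaded to S3; the bucket, role, and job names below are placeholders:

# Create the migration job pointing at the script in S3.
# ScriptLocation must reference the uploaded copy of hive_metastore_migration.py.
aws glue create-job \
  --name hive-metastore-migration \
  --role MyGlueServiceRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/hive_metastore_migration.py"}'

Verifying that the ScriptLocation in the existing job definition actually points at the script would be the first thing to check.)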

@AshishInamdar02

@jainanuj07, can you please help with the issues I am facing?
