
Merge SQL failing with ParseException #70

Closed
srinikvv opened this issue Jun 22, 2020 · 32 comments · May be fixed by #81
@srinikvv

srinikvv commented Jun 22, 2020

Hi Team,
I am trying to perform MERGE on Hive ACID tables using Qubole Spark-ACID, but am facing the errors below:

I created an assembly jar from the latest code on master and tried to execute a MERGE statement using spark.sql from spark-shell:

/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --jars /sandbox/spark-acid-assembly-0.5.0.jar --conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension --conf "spark.hadoop.yarn.timeline-service.enabled=false"
scala> spark.sql("MERGE INTO TARGET_TABLE AS T  USING SOURCE_TABLE AS S ON t.key_col= s.key_col WHEN MATCHED THEN UPDATE SET                    col_1 = s.col_1, col_2 = s.col_2 WHEN NOT MATCHED THEN INSERT VALUES ( s.key_col, s.col_1, s.col_2)")

org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'MERGE' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)

== SQL ==
MERGE INTO TARGET_TABLE AS T  USING SOURCE_TABLE AS S ON t.key_col= s.key_col WHEN MATCHED THEN UPDATE SET                    col_1 = s.col_1, col_2 = s.col_2 WHEN NOT MATCHED THEN INSERT VALUES ( s.key_col, s.col_1, s.col_2)
^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
  at com.qubole.spark.datasources.hiveacid.sql.execution.SparkAcidSqlParser.parsePlan(SparkAcidSqlParser.scala:56)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
  ... 49 elided
@amoghmargoor
Collaborator

@srinikvv how was this created: /sandbox/spark-acid-assembly-0.5.0.jar ?

@amoghmargoor amoghmargoor self-assigned this Jun 22, 2020
@srinikvv
Author

@amoghmargoor

@srinikvv how was this created: /sandbox/spark-acid-assembly-0.5.0.jar ?

I built this assembly jar with the latest code from the master branch.

@amoghmargoor
Collaborator

amoghmargoor commented Jun 22, 2020

@srinikvv Can you check with this jar once: https://drive.google.com/file/d/1sqsFzUtyrWvXfE7g_Q8brNHqMnb14Gvv/view?usp=sharing ? And please also share your jar if the one I provided works.

@srinikvv
Author

@amoghmargoor Tried the jar provided and still see the same issue.

@amoghmargoor
Collaborator

amoghmargoor commented Jun 22, 2020

@srinikvv this is working fine; I rechecked. This would happen only if the SQL extension is not getting added on your end. It is difficult for me to figure out why it would not get added, but can you check that angle?
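For example, a quick sanity check from spark-shell (a minimal sketch using standard Spark/JVM APIs; the expected values are based on the config and class names mentioned in this thread) would be to confirm the extension conf is set and to see which jar the spark-acid parser class is actually loaded from:

scala> spark.conf.get("spark.sql.extensions")
// expected: com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension

scala> Class.forName("com.qubole.spark.datasources.hiveacid.sql.execution.SparkAcidSqlParser")
     |   .getProtectionDomain.getCodeSource.getLocation
// expected: the path of the spark-acid assembly jar passed via --jars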

@srinikvv
Author

srinikvv commented Jun 23, 2020

@amoghmargoor I checked; UPDATE is working fine.
Also, the error stack trace in the MERGE case indicates an exception at:

Hence I believe this is not an issue with the SQL extension not getting added. Please check and advise.

@amoghmargoor
Collaborator

@srinikvv Can you check what token we get here:

and is that function returning true or false?

@amoghmargoor
Collaborator

@srinikvv the stack trace you printed does not correspond to the current code either:

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
  at com.qubole.spark.datasources.hiveacid.sql.execution.SparkAcidSqlParser.parsePlan(SparkAcidSqlParser.scala:56)

Line 56 doesn't have a function call to parsePlan. Please recheck your jars.

@srinikvv
Author

@srinikvv the stack trace you printed does not correspond to the current code either:

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
  at com.qubole.spark.datasources.hiveacid.sql.execution.SparkAcidSqlParser.parsePlan(SparkAcidSqlParser.scala:56)

Line 56 doesn't have a function call to parsePlan. Please recheck your jars.

@amoghmargoor below is from the master branch; SparkAcidSqlParser.scala:56 is part of the parsePlan function. Am I missing anything?
[screenshot: SparkAcidSqlParser.scala from the master branch]

@amoghmargoor
Collaborator

amoghmargoor commented Jun 23, 2020

@srinikvv
I think you are reading the stack trace wrong. If you look carefully at the stack trace, it is saying:

SparkSession.sql(SparkSession.scala:642) calls SparkAcidSqlParser.parsePlan, i.e. SparkAcidSqlParser.parsePlan(SparkAcidSqlParser.scala:56). SparkAcidSqlParser.parsePlan(SparkAcidSqlParser.scala:56) then calls AbstractSqlParser.parsePlan, which calls SparkSqlParser.parse, and so on. So according to the stack trace, line 56 of SparkAcidSqlParser.scala should have a function call to parsePlan, but instead it has a throw statement, which means you are still using old code.

@amoghmargoor
Collaborator

Hey @srinikvv ... were you able to fix your issue?

@srinikvv
Author

srinikvv commented Jun 29, 2020

@amoghmargoor your suspicion is correct: spark-shell was using a cached/previous version of the spark-acid-assembly jar. I was able to test the latest version using pySpark and the conf "spark.driver.userClassPathFirst=true". However, I see the error below while trying to activate the extension:

>>> spark.sparkContext._jvm.com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension() \
...             .apply(spark._jsparkSession.extensions())
********HiveAcidAutoConvertExtension***********
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.apply.
: java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.spark.sql.SparkSessionExtensions.injectResolutionRule(Lscala/Function1;)V" the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class, com/qubole/spark/hiveacid/HiveAcidAutoConvertExtension, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/apache/spark/sql/SparkSessionExtensions, have different Class objects for the type scala/Function1 used in the signature
        at com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension.apply(HiveAcidAutoConvert.scala:72)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)

@amoghmargoor
Collaborator

oh ... something like this just works for me:
bin/pyspark --master local[1] --jars /Users/amoghm/src/apache/spark-acid/target/scala-2.11/spark-acid-assembly-0.5.0.jar --conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension

@srinikvv
Author

srinikvv commented Jul 3, 2020

@amoghmargoor I was finally able to get this working after downloading the Hadoop 2.8.2 binaries and setting SPARK_DIST_CLASSPATH to reference these libraries.

However, the MERGE syntax only works with ACID tables that are not bucketed; I see the exception below when it is used against bucketed ACID tables:

java.lang.RuntimeException: Unsupported operation type - MERGE for Bucketed table stage.tmp_ren_data_set
        at com.qubole.spark.hiveacid.HiveAcidErrors$.unsupportedOperationTypeBucketedTable(HiveAcidErrors.scala:53)
        at com.qubole.spark.hiveacid.HiveAcidOperationDelegate.checkForSupport(AcidOperationDelegate.scala:471)

As per the documentation on the Apache Hive Confluence, all ACID tables must be bucketed; hence a MERGE statement in Spark-Acid without support for bucketed tables is not practically usable.

Do you have any plans in the near future to support MERGE on bucketed tables?

@amoghmargoor
Collaborator

Hi @srinikvv ,
Good to hear you could fix the issue. Is this something we can add to an FAQ or troubleshooting guide to help others?

Btw, regarding bucketed tables: Hive ACID does not require them anymore. That was a restriction of the earlier ACID implementation which was changed, so it no longer holds. We create non-bucketed Hive ACID tables all the time internally at Qubole. If you are using Hive 3.1 onwards you should be good. We did not add bucketed table support because bucketing hashes are not the same across engines.
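For reference, with Hive 3.x a full-ACID table needs no bucketing clause at all; only the transactional table property is required. A minimal illustrative DDL (table and column names are made up), run from Hive/beeline:

CREATE TABLE target_table (
  key_col INT,
  col_1   STRING,
  col_2   STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');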

@srinikvv
Author

srinikvv commented Jul 6, 2020

@amoghmargoor Yes, this can be added to the FAQ or troubleshooting guide. I believe the issue occurs when using Spark binaries compiled against Hadoop libraries older than 2.8.2 (I was using spark-2.4.3-bin-hadoop2.7). As a workaround we downloaded the Hadoop 2.8.2 libraries and set SPARK_DIST_CLASSPATH to reference the new Hadoop libraries as below before running the spark-submit command:
export SPARK_DIST_CLASSPATH=$(<path_to_hadoop_2.8.2_bin>/hadoop classpath)

A better approach may be to build Spark 2.4.3 against the Hadoop 2.8.2 binaries. I am currently trying this and will let you know the result.
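If it helps, one possible way to do that is the standard Spark distribution build with an overridden Hadoop version (a sketch only; the profile list is an assumption and should be adjusted to your environment), run from a Spark 2.4.3 source checkout:

./dev/make-distribution.sh --name hadoop-2.8.2 --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.8.2 -Phive -Phive-thriftserver -Pyarn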

I also tried to perform MERGE on non-bucketed ACID tables and was facing org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException for a specific table; below is the stack trace:

20/07/06 07:58:26 INFO MergeImpl: MERGE Clause 1: UPDATE being executed
20/07/06 07:58:26 INFO TableWriter: Write Operation being performed to table booking_f: UPDATE
20/07/06 07:59:16 WARN TaskSetManager: Lost task 50.0 in stage 10.0 (TID 1753, hdop-stg-wrk2.vmware.com, executor 3): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): /warehouse/tablespace/managed/hive/base.db/booking_f/delete_delta_0000011_0000011_0000/bucket_02359 path already being created, clientname=DFSClient_NONMAPREDUCE_1168652467_87
        at org.apache.hadoop.ipc.Client.call(Client.java:1475)
        at org.apache.hadoop.ipc.Client.call(Client.java:1412)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy15.create(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy16.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
        at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
        at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
        at com.qubole.shaded.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
        at com.qubole.shaded.orc.impl.WriterImpl.<init>(WriterImpl.java:177)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.WriterImpl.<init>(WriterImpl.java:94)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcFile.createWriter(OrcFile.java:334)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.addSplitUpdateEvent(OrcRecordUpdater.java:456)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.update(OrcRecordUpdater.java:498)
        at com.qubole.spark.hiveacid.writer.hive.HiveAcidFullAcidWriter.process(HiveAcidWriter.scala:275)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1$$anonfun$apply$1.apply(TableWriter.scala:146)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1$$anonfun$apply$1.apply(TableWriter.scala:146)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1.apply(TableWriter.scala:146)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1.apply(TableWriter.scala:138)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
This looks similar to issue #41. However, I realized that the MERGE statement is failing while trying to create a delete_delta folder. Can you assist me here?

@srinikvv
Author

srinikvv commented Jul 6, 2020

@amoghmargoor is there a plan to modify the plugin to work on Spark 3?
We want to explore some features of Spark 3, and it also comes pre-built with Hadoop 3.2, which would help avoid the earlier issue out of the box.
FYI: I already tried initializing the spark-acid plugin with Spark 3, but it fails with the error below:
Py4JJavaError: An error occurred while calling o28.sessionState.
: java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
        at com.qubole.spark.datasources.hiveacid.sql.execution.SparkAcidSqlParser.<init>(SparkAcidSqlParser.scala:20)
        at com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension$$anonfun$apply$3.apply(HiveAcidAutoConvert.scala:74)
        at com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension$$anonfun$apply$3.apply(HiveAcidAutoConvert.scala:73)
        at org.apache.spark.sql.SparkSessionExtensions.$anonfun$buildParser$1(SparkSessionExtensions.scala:205)
        at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
        at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
        at org.apache.spark.sql.SparkSessionExtensions.buildParser(SparkSessionExtensions.scala:204)
        at org.apache.spark.sql.internal.BaseSessionStateBuilder.sqlParser$lzycompute(BaseSessionStateBuilder.scala:129)
        at org.apache.spark.sql.internal.BaseSessionStateBuilder.sqlParser(BaseSessionStateBuilder.scala:128)
        at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:329)
        at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
        at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:157)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:155)
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:152)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)

@srinikvv
Author

srinikvv commented Jul 6, 2020

I also wanted to report another error I observed while trying to perform MERGE using a Spark table (an RDD-based table created using df.createOrReplaceTempView) as the source.

An error occurred while calling o82.sql.
: java.lang.UnsupportedOperationException: Update Dataframe doesn't have expected schema.
Provided: StructField(rowId,StructType(StructField(writeId,LongType,true), StructField(bucketId,IntegerType,true), StructField(rowId,LongType,true)),true),StructField(booking_id,LongType,true),StructField(edw_create_date,TimestampType,true),StructField(edw_update_date,TimestampType,false)
Expected: StructField(rowId,StructType(StructField(writeId,LongType,true), StructField(bucketId,IntegerType,true), StructField(rowId,LongType,true)),true),StructField(booking_id,LongType,true),StructField(edw_create_date,TimestampType,true),StructField(edw_update_date,TimestampType,true)
        at com.qubole.spark.hiveacid.HiveAcidOperationDelegate.mergeUpdate(AcidOperationDelegate.scala:403)
        at com.qubole.spark.hiveacid.merge.MergeImpl$$anonfun$runMergeOperations$2.apply(MergeImpl.scala:218)
        at com.qubole.spark.hiveacid.merge.MergeImpl$$anonfun$runMergeOperations$2.apply(MergeImpl.scala:210)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
        at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
        at com.qubole.spark.hiveacid.merge.MergeImpl.runMergeOperations(MergeImpl.scala:210)
        at com.qubole.spark.hiveacid.merge.MergeImpl.run(MergeImpl.scala:196)
        at com.qubole.spark.hiveacid.HiveAcidOperationDelegate.merge(AcidOperationDelegate.scala:444)
        at com.qubole.spark.hiveacid.HiveAcidTable$$anonfun$merge$1.apply(HiveAcidTable.scala:199)
        at com.qubole.spark.hiveacid.HiveAcidTable$$anonfun$merge$1.apply(HiveAcidTable.scala:199)
        at com.qubole.spark.hiveacid.HiveTxnWrapper.inTxnRetry$1(HiveAcidTable.scala:330)
        at com.qubole.spark.hiveacid.HiveTxnWrapper.inTxn(HiveAcidTable.scala:355)
        at com.qubole.spark.hiveacid.HiveAcidTable.merge(HiveAcidTable.scala:198)
        at com.qubole.spark.hiveacid.datasource.HiveAcidRelation.merge(HiveAcidRelation.scala:98)
        at com.qubole.spark.datasources.hiveacid.sql.catalyst.plans.command.MergeCommand.run(MergeCommand.scala:51)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)
The error suggests that the schemas of the source and the target do not match. However, the only discrepancy is that the column edw_update_date is nullable in the target but is populated with current_timestamp() in the source, which makes it a non-nullable column in the source schema. This should not cause an error while performing the merge.
@amoghmargoor please let me know if you need me to open another issue for this.

@amoghmargoor
Collaborator

amoghmargoor commented Jul 7, 2020

@srinikvv W.r.t. org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException, is this being thrown on task retries, similar to issue #43? That issue has been fixed now.

Regarding Spark 3, we are yet to start work on it. My guess is we might start looking at it at the end of July.

Regarding the type issue, yes, writing a non-nullable column into a nullable one should be allowed; we will take a look at it. You can get unblocked by explicitly assigning your source DF a schema that matches the target schema.
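A minimal sketch of that workaround from spark-shell (sourceDf is a placeholder for your source DataFrame, and edw_update_date is the column from the error above; adapt the names to your tables):

import org.apache.spark.sql.types.StructType

// Rebuild the source DataFrame with edw_update_date marked nullable so its schema matches the target.
val nullableSchema = StructType(sourceDf.schema.map { f =>
  if (f.name == "edw_update_date") f.copy(nullable = true) else f
})
val alignedSource = spark.createDataFrame(sourceDf.rdd, nullableSchema)
alignedSource.createOrReplaceTempView("SOURCE_TABLE")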

@amoghmargoor
Collaborator

Hey @srinikvv, how are things? Were you able to get MERGE working?

@srinikvv
Author

srinikvv commented Jul 15, 2020

@srinikvv W.r.t. org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException, is this being thrown on task retries, similar to issue #43? That issue has been fixed now.

Regarding Spark 3, we are yet to start work on it. My guess is we might start looking at it at the end of July.

Regarding the type issue, yes, writing a non-nullable column into a nullable one should be allowed; we will take a look at it. You can get unblocked by explicitly assigning your source DF a schema that matches the target schema.

@amoghmargoor I built a fresh jar from the master branch code on 14-Jul and retested the failing MERGE statement; I still get this error, and only for a specific table. Below are the steps I follow:

  1. Perform MERGE on an empty table with the full dataset --> completes successfully (approx 90M records)
  2. Wait until compaction is successful on this table
  3. Perform MERGE using only the changed data (13K records). This fails during the UPDATE step with the error message below.
    I believe multiple tasks are trying to write to the same file; I am not sure how to avoid this error.
    Note: the same MERGE statement runs fine from Hive.
20/07/15 14:55:08 INFO MergeImpl: MERGE requires right outer join between Target and Source.
20/07/15 14:55:08 INFO MergeImpl: MERGE Clause 1: UPDATE being executed
20/07/15 14:55:08 INFO TableWriter: Write Operation being performed to table service_contract_f: UPDATE
20/07/15 14:55:12 WARN TaskSetManager: Lost task 132.0 in stage 45.0 (TID 2473, hdop-stg-wrk12.vmware.com, executor 9): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): /warehouse/tablespace/managed/hive/base.db/service_contract_f/delete_delta_0000027_0000027_0000/bucket_00004 path already being created, clientname=DFSClient_NONMAPREDUCE_-1263939954_80
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1493)
        at org.apache.hadoop.ipc.Client.call(Client.java:1439)
        at org.apache.hadoop.ipc.Client.call(Client.java:1349)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy15.create(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy16.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:269)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1274)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1216)
        at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:477)
        at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:474)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:474)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:415)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
        at com.qubole.shaded.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
        at com.qubole.shaded.orc.impl.WriterImpl.<init>(WriterImpl.java:177)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.WriterImpl.<init>(WriterImpl.java:94)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcFile.createWriter(OrcFile.java:334)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.addSplitUpdateEvent(OrcRecordUpdater.java:456)
        at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.update(OrcRecordUpdater.java:498)
        at com.qubole.spark.hiveacid.writer.hive.HiveAcidFullAcidWriter.process(HiveAcidWriter.scala:292)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1$$anonfun$apply$1.apply(TableWriter.scala:146)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1$$anonfun$apply$1.apply(TableWriter.scala:146)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.foreach(WholeStageCodegenExec.scala:634)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1.apply(TableWriter.scala:146)
        at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1.apply(TableWriter.scala:138)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Please check this and let me know if I can do anything to fix this issue.

@srinikvv
Author

srinikvv commented Jul 16, 2020

@amoghmargoor When we create ACID tables without bucketing, we see a lot of unevenly sized files underneath the HDFS storage. Please let me know if we can do a Zoom session so I can show you this issue.

@amoghmargoor
Collaborator

@srinikvv that would be great. @sourabh912 and I are in the PST timezone and would like to join the call. Send some time slots that would work for you guys. Also feel free to join the group spark-acid+subscribe@googlegroups.com.

@srinikvv
Author

@amoghmargoor we can meet on 16 July, 9:00 PM to 10:00 PM IST, if that works for you guys.

@amoghmargoor
Collaborator

Hey @srinikvv, I missed the message above. This timing should work for me on Friday (but I guess it would be Friday night for you guys). Otherwise I am also available for the call on Monday. Let me know if that would be fine.

@srinikvv
Author

@amoghmargoor let's meet today. Please use the Zoom link below:
Topic: Veera Venkata Rao's Zoom Meeting
Time: Jul 17, 2020 09:00 PM Mumbai, Kolkata, New Delhi

Join Zoom Meeting
https://VMware.zoom.us/j/98812494211?pwd=cjhqRktESTJ0L0l3elJxaVJ1YVMxUT09
Meeting ID: 988 1249 4211
Password: 494932

@amoghmargoor
Collaborator

amoghmargoor commented Jul 17, 2020 via email

@srinikvv
Author

srinikvv commented Jul 20, 2020

@amoghmargoor Appreciate you guys taking the time to understand/debug the issue.
As discussed, please let us know the steps to perform the update using an API. Also let us know if you have the debug jar or branch created with the appropriate logger statements to help fix the issue.

@amoghmargoor
Collaborator

amoghmargoor commented Jul 20, 2020

@srinikvv I think the existing update API may not be able to support your use case. Can you try running after compiling the jars from here: https://github.com/amoghmargoor/spark-acid/pull/new/issue-70 ? I have added a few logs. Run it with just 2 executors and provide me the logs for the driver and both executors after the failure. I may follow up with a few more such iterations. You can mail me the logs at amoghm@qubole.com if you don't want to attach them here.
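For example, something along these lines (a sketch only; the jar path and application id are placeholders, and --num-executors assumes a YARN deployment):

bin/spark-shell --master yarn --num-executors 2 \
  --jars /path/to/spark-acid-assembly-0.5.0.jar \
  --conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension

# after the failure, collect the driver and executor logs
yarn logs -applicationId <application_id> > merge-debug.log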

Another question we had: was speculative execution also enabled? Thanks.

@srinikvv
Author

@amoghmargoor shared the logs via email. Regarding speculative execution, we are executing jobs with the default value for spark.speculation (which is false for Spark 2.4.3).

@amoghmargoor
Collaborator

@srinikvv We have figured out why this could be happening and I have added the fix here: https://github.com/amoghmargoor/spark-acid/pull/new/issue-70. Can you recreate the jar from the branch above and check whether it fixes your issue?

@sourabh912
Contributor

@srinikvv: Thanks again for reporting the issue. As part of this we identified a few issues (issue #83 and issue #93) and fixed them.
I am closing this one. Please feel free to reopen it or open a new issue if you face any problem.
