
[BUG] orc_write_test.py::test_write_round_trip_corner failed with DATAGEN_SEED=1715517863 #10806

Closed
sameerz opened this issue May 13, 2024 · 2 comments
Labels: bug (Something isn't working)


sameerz commented May 13, 2024

Describe the bug
Two cases of orc_write_test.py::test_write_round_trip_corner failed:

  1. test_write_round_trip_corner[native-SetValues(MapType(StringType(), StringType(), True),[{}, None, {'A': ''}, {'B': None}])]
  2. test_write_round_trip_corner[hive-SetValues(MapType(StringType(), StringType(), True),[{}, None, {'A': ''}, {'B': None}])]
Details
[2024-05-12T14:16:03.380Z] _ test_write_round_trip_corner[native-SetValues(MapType(StringType(), StringType(), True),[{}, None, {'A': ''}, {'B': None}])] _
[2024-05-12T14:16:03.380Z] [gw2] linux -- Python 3.9.19 /opt/conda/bin/python
[2024-05-12T14:16:03.380Z] 
[2024-05-12T14:16:03.380Z] spark_tmp_path = '/tmp/pyspark_tests//it-test-340-213-151-km8zv-r0k7f-gw2-1812-1083524667/'
[2024-05-12T14:16:03.380Z] orc_gen = SetValues(MapType(StringType(), StringType(), True),[{}, None, {'A': ''}, {'B': None}])
[2024-05-12T14:16:03.380Z] orc_impl = 'native'
[2024-05-12T14:16:03.380Z] 
[2024-05-12T14:16:03.380Z]     @pytest.mark.parametrize('orc_gen', orc_write_odd_empty_strings_gens_sample, ids=idfn)
[2024-05-12T14:16:03.381Z]     @pytest.mark.parametrize('orc_impl', ["native", "hive"])
[2024-05-12T14:16:03.381Z]     def test_write_round_trip_corner(spark_tmp_path, orc_gen, orc_impl):
[2024-05-12T14:16:03.381Z]         gen_list = [('_c0', orc_gen)]
[2024-05-12T14:16:03.381Z]         data_path = spark_tmp_path + '/ORC_DATA'
[2024-05-12T14:16:03.381Z] >       assert_gpu_and_cpu_writes_are_equal_collect(
[2024-05-12T14:16:03.381Z]                 lambda spark, path: gen_df(spark, gen_list, 128000, num_slices=1).write.orc(path),
[2024-05-12T14:16:03.381Z]                 lambda spark, path: spark.read.orc(path),
[2024-05-12T14:16:03.381Z]                 data_path,
[2024-05-12T14:16:03.381Z]                 conf={'spark.sql.orc.impl': orc_impl, 'spark.rapids.sql.format.orc.write.enabled': True})
[2024-05-12T14:16:03.381Z] 
[2024-05-12T14:16:03.381Z] ../../src/main/python/orc_write_test.py:99: 
[2024-05-12T14:16:03.381Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-05-12T14:16:03.381Z] ../../src/main/python/asserts.py:285: in assert_gpu_and_cpu_writes_are_equal_collect
[2024-05-12T14:16:03.381Z]     _assert_gpu_and_cpu_writes_are_equal(write_func, read_func, base_path, 'COLLECT', conf=conf)
[2024-05-12T14:16:03.381Z] ../../src/main/python/asserts.py:272: in _assert_gpu_and_cpu_writes_are_equal
[2024-05-12T14:16:03.381Z]     from_gpu = with_cpu_session(gpu_bring_back, conf=conf)
[2024-05-12T14:16:03.381Z] ../../src/main/python/spark_session.py:147: in with_cpu_session
[2024-05-12T14:16:03.381Z]     return with_spark_session(func, conf=copy)
[2024-05-12T14:16:03.381Z] /opt/conda/lib/python3.9/contextlib.py:79: in inner
[2024-05-12T14:16:03.381Z]     return func(*args, **kwds)
[2024-05-12T14:16:03.381Z] ../../src/main/python/spark_session.py:131: in with_spark_session
[2024-05-12T14:16:03.381Z]     ret = func(_spark)
[2024-05-12T14:16:03.381Z] ../../src/main/python/asserts.py:205: in <lambda>
[2024-05-12T14:16:03.381Z]     bring_back = lambda spark: limit_func(spark).collect()
[2024-05-12T14:16:03.381Z] ../../../spark-3.4.0-bin-hadoop3-scala2.13/python/pyspark/sql/dataframe.py:1216: in collect
[2024-05-12T14:16:03.381Z]     sock_info = self._jdf.collectToPython()
[2024-05-12T14:16:03.381Z] /home/jenkins/agent/workspace/jenkins-rapids_integration-scala213-dev-github-151-3.4.0/jars/spark-3.4.0-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322: in __call__
[2024-05-12T14:16:03.381Z]     return_value = get_return_value(
[2024-05-12T14:16:03.381Z] ../../../spark-3.4.0-bin-hadoop3-scala2.13/python/pyspark/errors/exceptions/captured.py:169: in deco
[2024-05-12T14:16:03.381Z]     return f(*a, **kw)
[2024-05-12T14:16:03.381Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-05-12T14:16:03.381Z] 
[2024-05-12T14:16:03.381Z] answer = 'xro1529653'
[2024-05-12T14:16:03.381Z] gateway_client = 
[2024-05-12T14:16:03.381Z] target_id = 'o1529652', name = 'collectToPython'
[2024-05-12T14:16:03.381Z] 
[2024-05-12T14:16:03.381Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2024-05-12T14:16:03.381Z]         """Converts an answer received from the Java gateway into a Python object.
[2024-05-12T14:16:03.381Z]     
[2024-05-12T14:16:03.381Z]         For example, string representation of integers are converted to Python
[2024-05-12T14:16:03.381Z]         integer, string representation of objects are converted to JavaObject
[2024-05-12T14:16:03.381Z]         instances, etc.
[2024-05-12T14:16:03.381Z]     
[2024-05-12T14:16:03.381Z]         :param answer: the string returned by the Java gateway
[2024-05-12T14:16:03.381Z]         :param gateway_client: the gateway client used to communicate with the Java
[2024-05-12T14:16:03.381Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2024-05-12T14:16:03.381Z]             list, map)
[2024-05-12T14:16:03.381Z]         :param target_id: the name of the object from which the answer comes from
[2024-05-12T14:16:03.381Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2024-05-12T14:16:03.381Z]         :param name: the name of the member from which the answer comes from
[2024-05-12T14:16:03.381Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2024-05-12T14:16:03.381Z]         """
[2024-05-12T14:16:03.381Z]         if is_error(answer)[0]:
[2024-05-12T14:16:03.381Z]             if len(answer) > 1:
[2024-05-12T14:16:03.381Z]                 type = answer[1]
[2024-05-12T14:16:03.381Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2024-05-12T14:16:03.381Z]                 if answer[1] == REFERENCE_TYPE:
[2024-05-12T14:16:03.381Z] >                   raise Py4JJavaError(
[2024-05-12T14:16:03.381Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2024-05-12T14:16:03.381Z]                         format(target_id, ".", name), value)
[2024-05-12T14:16:03.381Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o1529652.collectToPython.
[2024-05-12T14:16:03.381Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23550.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23550.0 (TID 78185) (100.103.204.21 executor 0): java.io.IOException: Error reading file: file:/tmp/pyspark_tests/it-test-340-213-151-km8zv-r0k7f-gw2-1812-1083524667/ORC_DATA/GPU/part-00000-302369e3-93aa-4290-82a2-f949443489f1-c000.snappy.orc
[2024-05-12T14:16:03.381Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1450)
[2024-05-12T14:16:03.381Z] E                   	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:207)
[2024-05-12T14:16:03.381Z] E                   	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:100)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:139)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[2024-05-12T14:16:03.382Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2024-05-12T14:16:03.382Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2024-05-12T14:16:03.382Z] E                   	at java.base/java.lang.Thread.run(Thread.java:840)
[2024-05-12T14:16:03.382Z] E                   Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 3 kind LENGTH position: 341 length: 341 range: 0 offset: 341 limit: 341 range 0 = 83086 to 83427 uncompressed: 512 to 512
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:60)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:329)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:379)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1984)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2022)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2120)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1963)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:2888)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1425)
[2024-05-12T14:16:03.382Z] E                   	... 24 more
[2024-05-12T14:16:03.382Z] E                   
[2024-05-12T14:16:03.382Z] E                   Driver stacktrace:
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
[2024-05-12T14:16:03.382Z] E                   	at scala.collection.immutable.List.foreach(List.scala:333)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
[2024-05-12T14:16:03.382Z] E                   	at scala.Option.foreach(Option.scala:437)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2328)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1019)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2024-05-12T14:16:03.382Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1018)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3997)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4167)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4165)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3994)
[2024-05-12T14:16:03.383Z] E                   	at jdk.internal.reflect.GeneratedMethodAccessor96.invoke(Unknown Source)
[2024-05-12T14:16:03.383Z] E                   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2024-05-12T14:16:03.383Z] E                   	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
[2024-05-12T14:16:03.383Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2024-05-12T14:16:03.383Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
[2024-05-12T14:16:03.383Z] E                   	at py4j.Gateway.invoke(Gateway.java:282)
[2024-05-12T14:16:03.383Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2024-05-12T14:16:03.383Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2024-05-12T14:16:03.383Z] E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
[2024-05-12T14:16:03.383Z] E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
[2024-05-12T14:16:03.383Z] E                   	at java.base/java.lang.Thread.run(Thread.java:840)
[2024-05-12T14:16:03.383Z] E                   Caused by: java.io.IOException: Error reading file: file:/tmp/pyspark_tests/it-test-340-213-151-km8zv-r0k7f-gw2-1812-1083524667/ORC_DATA/GPU/part-00000-302369e3-93aa-4290-82a2-f949443489f1-c000.snappy.orc
[2024-05-12T14:16:03.383Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1450)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:207)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:100)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:139)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[2024-05-12T14:16:03.383Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2024-05-12T14:16:03.383Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2024-05-12T14:16:03.383Z] E                   	... 1 more
[2024-05-12T14:16:03.383Z] E                   Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 3 kind LENGTH position: 341 length: 341 range: 0 offset: 341 limit: 341 range 0 = 83086 to 83427 uncompressed: 512 to 512
[2024-05-12T14:16:03.383Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:60)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:329)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:379)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1984)
[2024-05-12T14:16:03.383Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2022)
[2024-05-12T14:16:03.384Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2120)
[2024-05-12T14:16:03.384Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1963)
[2024-05-12T14:16:03.384Z] E                   	at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:2888)
[2024-05-12T14:16:03.384Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
[2024-05-12T14:16:03.384Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
[2024-05-12T14:16:03.384Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
[2024-05-12T14:16:03.384Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1425)
[2024-05-12T14:16:03.384Z] E                   	... 24 more
[2024-05-12T14:16:03.384Z] 
[2024-05-12T14:16:03.384Z] /home/jenkins/agent/workspace/jenkins-rapids_integration-scala213-dev-github-151-3.4.0/jars/spark-3.4.0-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326: Py4JJavaError
[2024-05-12T14:16:03.384Z] ----------------------------- Captured stdout call -----------------------------
[2024-05-12T14:16:03.384Z] ### CPU RUN ###
[2024-05-12T14:16:03.384Z] ### GPU RUN ###
[2024-05-12T14:16:03.384Z] ### WRITE: GPU TOOK 0.3831362724304199 CPU TOOK 0.46770763397216797 ###
[2024-05-12T14:16:03.384Z] _ test_write_round_trip_corner[hive-SetValues(MapType(StringType(), StringType(), True),[{}, None, {'A': ''}, {'B': None}])] _
[2024-05-12T14:16:03.384Z] [gw2] linux -- Python 3.9.19 /opt/conda/bin/python
[2024-05-12T14:16:03.384Z] 
[2024-05-12T14:16:03.384Z] spark_tmp_path = '/tmp/pyspark_tests//it-test-340-213-151-km8zv-r0k7f-gw2-1812-294764446/'
[2024-05-12T14:16:03.384Z] orc_gen = SetValues(MapType(StringType(), StringType(), True),[{}, None, {'A': ''}, {'B': None}])
[2024-05-12T14:16:03.384Z] orc_impl = 'hive'
[2024-05-12T14:16:03.384Z] 
[2024-05-12T14:16:03.384Z]     @pytest.mark.parametrize('orc_gen', orc_write_odd_empty_strings_gens_sample, ids=idfn)
[2024-05-12T14:16:03.384Z]     @pytest.mark.parametrize('orc_impl', ["native", "hive"])
[2024-05-12T14:16:03.384Z]     def test_write_round_trip_corner(spark_tmp_path, orc_gen, orc_impl):
[2024-05-12T14:16:03.384Z]         gen_list = [('_c0', orc_gen)]
[2024-05-12T14:16:03.384Z]         data_path = spark_tmp_path + '/ORC_DATA'
[2024-05-12T14:16:03.384Z] >       assert_gpu_and_cpu_writes_are_equal_collect(
[2024-05-12T14:16:03.384Z]                 lambda spark, path: gen_df(spark, gen_list, 128000, num_slices=1).write.orc(path),
[2024-05-12T14:16:03.384Z]                 lambda spark, path: spark.read.orc(path),
[2024-05-12T14:16:03.384Z]                 data_path,
[2024-05-12T14:16:03.384Z]                 conf={'spark.sql.orc.impl': orc_impl, 'spark.rapids.sql.format.orc.write.enabled': True})
[2024-05-12T14:16:03.384Z] 
[2024-05-12T14:16:03.384Z] ../../src/main/python/orc_write_test.py:99: 
[2024-05-12T14:16:03.384Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-05-12T14:16:03.384Z] ../../src/main/python/asserts.py:285: in assert_gpu_and_cpu_writes_are_equal_collect
[2024-05-12T14:16:03.384Z]     _assert_gpu_and_cpu_writes_are_equal(write_func, read_func, base_path, 'COLLECT', conf=conf)
[2024-05-12T14:16:03.384Z] ../../src/main/python/asserts.py:272: in _assert_gpu_and_cpu_writes_are_equal
[2024-05-12T14:16:03.384Z]     from_gpu = with_cpu_session(gpu_bring_back, conf=conf)
[2024-05-12T14:16:03.384Z] ../../src/main/python/spark_session.py:147: in with_cpu_session
[2024-05-12T14:16:03.384Z]     return with_spark_session(func, conf=copy)
[2024-05-12T14:16:03.384Z] /opt/conda/lib/python3.9/contextlib.py:79: in inner
[2024-05-12T14:16:03.384Z]     return func(*args, **kwds)
[2024-05-12T14:16:03.384Z] ../../src/main/python/spark_session.py:131: in with_spark_session
[2024-05-12T14:16:03.384Z]     ret = func(_spark)
[2024-05-12T14:16:03.384Z] ../../src/main/python/asserts.py:205: in <lambda>
[2024-05-12T14:16:03.384Z]     bring_back = lambda spark: limit_func(spark).collect()
[2024-05-12T14:16:03.384Z] ../../../spark-3.4.0-bin-hadoop3-scala2.13/python/pyspark/sql/dataframe.py:1216: in collect
[2024-05-12T14:16:03.384Z]     sock_info = self._jdf.collectToPython()
[2024-05-12T14:16:03.384Z] /home/jenkins/agent/workspace/jenkins-rapids_integration-scala213-dev-github-151-3.4.0/jars/spark-3.4.0-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322: in __call__
[2024-05-12T14:16:03.384Z]     return_value = get_return_value(
[2024-05-12T14:16:03.384Z] ../../../spark-3.4.0-bin-hadoop3-scala2.13/python/pyspark/errors/exceptions/captured.py:169: in deco
[2024-05-12T14:16:03.384Z]     return f(*a, **kw)
[2024-05-12T14:16:03.384Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-05-12T14:16:03.384Z] 
[2024-05-12T14:16:03.384Z] answer = 'xro1533235'
[2024-05-12T14:16:03.384Z] gateway_client = 
[2024-05-12T14:16:03.384Z] target_id = 'o1533234', name = 'collectToPython'
[2024-05-12T14:16:03.384Z] 
[2024-05-12T14:16:03.384Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2024-05-12T14:16:03.384Z]         """Converts an answer received from the Java gateway into a Python object.
[2024-05-12T14:16:03.384Z]     
[2024-05-12T14:16:03.384Z]         For example, string representation of integers are converted to Python
[2024-05-12T14:16:03.384Z]         integer, string representation of objects are converted to JavaObject
[2024-05-12T14:16:03.384Z]         instances, etc.
[2024-05-12T14:16:03.384Z]     
[2024-05-12T14:16:03.384Z]         :param answer: the string returned by the Java gateway
[2024-05-12T14:16:03.384Z]         :param gateway_client: the gateway client used to communicate with the Java
[2024-05-12T14:16:03.384Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2024-05-12T14:16:03.384Z]             list, map)
[2024-05-12T14:16:03.384Z]         :param target_id: the name of the object from which the answer comes from
[2024-05-12T14:16:03.384Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2024-05-12T14:16:03.384Z]         :param name: the name of the member from which the answer comes from
[2024-05-12T14:16:03.384Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2024-05-12T14:16:03.384Z]         """
[2024-05-12T14:16:03.384Z]         if is_error(answer)[0]:
[2024-05-12T14:16:03.384Z]             if len(answer) > 1:
[2024-05-12T14:16:03.385Z]                 type = answer[1]
[2024-05-12T14:16:03.385Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2024-05-12T14:16:03.385Z]                 if answer[1] == REFERENCE_TYPE:
[2024-05-12T14:16:03.385Z] >                   raise Py4JJavaError(
[2024-05-12T14:16:03.385Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2024-05-12T14:16:03.385Z]                         format(target_id, ".", name), value)
[2024-05-12T14:16:03.385Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o1533234.collectToPython.
[2024-05-12T14:16:03.385Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23600.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23600.0 (TID 78235) (100.103.204.21 executor 0): java.io.IOException: Error reading file: file:/tmp/pyspark_tests/it-test-340-213-151-km8zv-r0k7f-gw2-1812-294764446/ORC_DATA/GPU/part-00000-1075a7c6-a73f-47be-8317-a6aafaed34f6-c000.snappy.orc
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1450)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:93)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.hadoop.hive.ql.io.orc.SparkOrcNewRecordReader.nextKeyValue(SparkOrcNewRecordReader.java:82)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
[2024-05-12T14:16:03.385Z] E                   	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
[2024-05-12T14:16:03.385Z] E                   	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:139)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[2024-05-12T14:16:03.385Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2024-05-12T14:16:03.385Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2024-05-12T14:16:03.385Z] E                   	at java.base/java.lang.Thread.run(Thread.java:840)
[2024-05-12T14:16:03.385Z] E                   Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 3 kind LENGTH position: 341 length: 341 range: 0 offset: 341 limit: 341 range 0 = 83086 to 83427 uncompressed: 512 to 512
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:60)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:329)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:379)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1984)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2022)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2120)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1963)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:2888)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1425)
[2024-05-12T14:16:03.385Z] E                   	... 23 more
[2024-05-12T14:16:03.385Z] E                   
[2024-05-12T14:16:03.385Z] E                   Driver stacktrace:
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
[2024-05-12T14:16:03.385Z] E                   	at scala.collection.immutable.List.foreach(List.scala:333)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
[2024-05-12T14:16:03.385Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
[2024-05-12T14:16:03.385Z] E                   	at scala.Option.foreach(Option.scala:437)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2328)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1019)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1018)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3997)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4167)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4165)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3994)
[2024-05-12T14:16:03.386Z] E                   	at jdk.internal.reflect.GeneratedMethodAccessor96.invoke(Unknown Source)
[2024-05-12T14:16:03.386Z] E                   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2024-05-12T14:16:03.386Z] E                   	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
[2024-05-12T14:16:03.386Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2024-05-12T14:16:03.386Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
[2024-05-12T14:16:03.386Z] E                   	at py4j.Gateway.invoke(Gateway.java:282)
[2024-05-12T14:16:03.386Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2024-05-12T14:16:03.386Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2024-05-12T14:16:03.386Z] E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
[2024-05-12T14:16:03.386Z] E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
[2024-05-12T14:16:03.386Z] E                   	at java.base/java.lang.Thread.run(Thread.java:840)
[2024-05-12T14:16:03.386Z] E                   Caused by: java.io.IOException: Error reading file: file:/tmp/pyspark_tests/it-test-340-213-151-km8zv-r0k7f-gw2-1812-294764446/ORC_DATA/GPU/part-00000-1075a7c6-a73f-47be-8317-a6aafaed34f6-c000.snappy.orc
[2024-05-12T14:16:03.386Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1450)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:93)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.hadoop.hive.ql.io.orc.SparkOrcNewRecordReader.nextKeyValue(SparkOrcNewRecordReader.java:82)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
[2024-05-12T14:16:03.386Z] E                   	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
[2024-05-12T14:16:03.386Z] E                   	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:139)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
[2024-05-12T14:16:03.386Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[2024-05-12T14:16:03.386Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2024-05-12T14:16:03.387Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2024-05-12T14:16:03.387Z] E                   	... 1 more
[2024-05-12T14:16:03.387Z] E                   Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 3 kind LENGTH position: 341 length: 341 range: 0 offset: 341 limit: 341 range 0 = 83086 to 83427 uncompressed: 512 to 512
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:60)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:329)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:379)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1984)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2022)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2120)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1963)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:2888)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
[2024-05-12T14:16:03.387Z] E                   	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1425)
[2024-05-12T14:16:03.387Z] E                   	... 23 more
[2024-05-12T14:16:03.387Z] 
[2024-05-12T14:16:03.387Z] /home/jenkins/agent/workspace/jenkins-rapids_integration-scala213-dev-github-151-3.4.0/jars/spark-3.4.0-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326: Py4JJavaError

Steps/Code to reproduce bug
Failed in the rapids_integration-scala213-dev-github CI pipeline with DATAGEN_SEED=1715517863.
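
A minimal sketch for re-running just the failing cases locally, assuming the standard spark-rapids integration-test harness; the script path, the -k filter, and passing the seed through the DATAGEN_SEED environment variable follow the usual integration_tests conventions rather than anything logged in this CI run:

    DATAGEN_SEED=1715517863 ./integration_tests/run_pyspark_from_build.sh \
        -k "test_write_round_trip_corner"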

Expected behavior
Test cases pass

Environment details

Additional context

sameerz added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels May 13, 2024
sameerz changed the title from "[BUG] orc_write_test.py::test_write_round_trip_corner failed" to "[BUG] orc_write_test.py::test_write_round_trip_corner failed with DATAGEN_SEED=1715517863" May 14, 2024
sameerz removed the ? - Needs Triage (Need team to review and classify) label May 14, 2024

ttnghia commented May 15, 2024

I could reproduce the issue with input like this:

+-----------+
|        _c0|
+-----------+
|       null|
|         {}|
|{B -> null}|
|{B -> null}|
|{B -> null}|
|    {A -> }|
|       null|
|{B -> null}|
|{B -> null}|
|    {A -> }|
|       null|
|    {A -> }|
|    {A -> }|
|{B -> null}|
|    {A -> }|
|         {}|
|{B -> null}|
|{B -> null}|
|       null|
|{B -> null}|
+-----------+

The issue is that the Spark CPU reader cannot read files written by cuDF's ORC writer for such input.
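
To make the failure mode concrete, here is a minimal sketch of the round trip, assuming a plugin-enabled spark session for the write and a plain CPU session for the read; the column name _c0, the four corner-case map values, the 128000-row count, and the two configs come from the test above, while the DataFrame construction itself is illustrative:

    from pyspark.sql.types import MapType, StringType, StructField, StructType

    # Cycle the four corner-case values so the single output file contains
    # long runs of empty strings and nulls in the map's child columns.
    values = [{}, None, {'A': ''}, {'B': None}]
    rows = [(values[i % 4],) for i in range(128000)]
    schema = StructType([StructField('_c0', MapType(StringType(), StringType(), True))])

    # GPU session: spark.rapids.sql.format.orc.write.enabled=true
    spark.createDataFrame(rows, schema).coalesce(1).write.orc('/tmp/ORC_DATA')

    # CPU session (plugin disabled): reading the GPU-written file raises the
    # java.io.EOFException shown in the traceback above.
    spark.read.orc('/tmp/ORC_DATA').collect()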


ttnghia commented May 22, 2024

Fixed in cuDF by rapidsai/cudf#15789. I've tested it and can confirm that this issue is resolved.

ttnghia closed this as completed May 22, 2024