Empty DataFrame save should mimic Delta Lake behaviour #306

Open
osopardo1 opened this issue Apr 9, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

@osopardo1
Member

What went wrong?

If we try to save an empty DataFrame with the qbeast format, the following error is thrown:

java.lang.RuntimeException: The DataFrame is empty, why are you trying to index an empty dataset?
  at io.qbeast.spark.index.DoublePassOTreeDataAnalyzer$.analyze(OTreeDataAnalyzer.scala:351)

By contrast, when we do the same with Delta, the library creates a folder at the path containing the first commit information, and no error is raised.
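
For comparison, the equivalent Delta write is sketched below (the /tmp/empty_delta path is illustrative and the delta-spark package is assumed to be on the classpath); it completes without error and leaves a _delta_log folder with the initial commit:

import spark.implicits._ // already in scope in spark-shell; needed in a standalone app
case class T(id: Int)
// Writing an empty Dataset with Delta succeeds: the schema is recorded
// in the first commit of the transaction log instead of raising an error.
spark.emptyDataset[T].write.format("delta").save("/tmp/empty_delta")
// /tmp/empty_delta/_delta_log/00000000000000000000.json now holds the first commit.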

How to reproduce?

1. Code that triggered the bug, or steps to reproduce:

import spark.implicits._ // needed for the Encoder when not running in spark-shell
case class T(id: Int)
spark.emptyDataset[T].write.format("qbeast").option("columnsToIndex", "id").save("/tmp/empty_test")

2. Branch and commit id:

main 6e6b5b4

3. Spark version:

In the spark shell, run spark.version.

3.5.0

4. Hadoop version:

In the spark shell, run org.apache.hadoop.util.VersionInfo.getVersion().

3.3.4

5. How are you running Spark?

Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local machine?

Locally

6. Stack trace:

Trace of the log/error messages.

java.lang.RuntimeException: The DataFrame is empty, why are you trying to index an empty dataset?
  at io.qbeast.spark.index.DoublePassOTreeDataAnalyzer$.analyze(OTreeDataAnalyzer.scala:351)
  at io.qbeast.spark.index.SparkOTreeManager$.index(SparkOTreeManager.scala:89)
  at io.qbeast.spark.index.SparkOTreeManager$.index(SparkOTreeManager.scala:38)
  at io.qbeast.spark.index.SparkOTreeManager$.index(SparkOTreeManager.scala:26)
  at io.qbeast.spark.table.IndexedTableImpl.$anonfun$doWrite$2(IndexedTable.scala:467)
  at io.qbeast.spark.delta.DeltaMetadataWriter.$anonfun$writeWithTransaction$5(DeltaMetadataWriter.scala:113)
  at io.qbeast.spark.delta.DeltaMetadataWriter.$anonfun$writeWithTransaction$5$adapted(DeltaMetadataWriter.scala:108)
  at org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:223)
  at io.qbeast.spark.delta.DeltaMetadataWriter.writeWithTransaction(DeltaMetadataWriter.scala:108)
  at io.qbeast.spark.delta.SparkDeltaMetadataManager$.updateWithTransaction(SparkDeltaMetadataManager.scala:45)
  at io.qbeast.spark.delta.SparkDeltaMetadataManager$.updateWithTransaction(SparkDeltaMetadataManager.scala:31)
  at io.qbeast.spark.table.IndexedTableImpl.doWrite(IndexedTable.scala:466)
  at io.qbeast.spark.table.IndexedTableImpl.$anonfun$write$3(IndexedTable.scala:429)
  at io.qbeast.spark.table.IndexedTableImpl.$anonfun$write$3$adapted(IndexedTable.scala:421)
  at io.qbeast.core.keeper.Keeper.withWrite(Keeper.scala:55)
  at io.qbeast.core.keeper.Keeper.withWrite$(Keeper.scala:52)
  at io.qbeast.core.keeper.LocalKeeper$.withWrite(LocalKeeper.scala:27)
  at io.qbeast.spark.table.IndexedTableImpl.write(IndexedTable.scala:421)
  at io.qbeast.spark.table.IndexedTableImpl.save(IndexedTable.scala:383)
  at io.qbeast.spark.internal.sources.QbeastDataSource.createRelation(QbeastDataSource.scala:125)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
  at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
  at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
  ... 47 elided
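
The expected behaviour, assuming the fix mirrors Delta (a sketch of the desired outcome, not what happens today), would be that the write above succeeds and the resulting empty table can be read back with its schema intact:

// Desired outcome once the issue is resolved (assumption, mirroring Delta):
spark.emptyDataset[T].write.format("qbeast").option("columnsToIndex", "id").save("/tmp/empty_test")
val empty = spark.read.format("qbeast").load("/tmp/empty_test")
empty.printSchema() // root |-- id: integer (nullable = false)
empty.count()       // 0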

osopardo1 added the bug label on Apr 9, 2024
osopardo1 changed the title from "Empty DataFrame should mimic Delta Lake behaviour" to "Empty DataFrame save should mimic Delta Lake behaviour" on Apr 9, 2024