Empty DataFrame save should mimic Delta Lake behaviour #306

Open
osopardo1 opened this issue Apr 9, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

@osopardo1
Member

What went wrong?

If we try to save an empty DataFrame with the qbeast format, the following error is thrown:

java.lang.RuntimeException: The DataFrame is empty, why are you trying to index an empty dataset?
  at io.qbeast.spark.index.DoublePassOTreeDataAnalyzer$.analyze(OTreeDataAnalyzer.scala:351)

By contrast, when we do the same with Delta, the library creates a folder at the path containing the first commit information, and no error is raised.
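
For comparison, the equivalent Delta write is sketched below (the /tmp/empty_delta path is illustrative and the delta-spark package is assumed to be on the classpath); it completes without error and leaves a _delta_log folder with the initial commit:

import spark.implicits._ // already in scope in spark-shell; needed in a standalone app
case class T(id: Int)
// Writing an empty Dataset with Delta succeeds: the schema is recorded
// in the first commit of the transaction log instead of raising an error.
spark.emptyDataset[T].write.format("delta").save("/tmp/empty_delta")
// /tmp/empty_delta/_delta_log/00000000000000000000.json now holds the first commit.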

How to reproduce?

1. Code that triggered the bug, or steps to reproduce:

import spark.implicits._ // needed for the Encoder when not running in spark-shell
case class T(id: Int)
spark.emptyDataset[T].write.format("qbeast").option("columnsToIndex", "id").save("/tmp/empty_test")

2. Branch and commit id:

main 6e6b5b4

3. Spark version:

In the spark shell, run spark.version.

3.5.0

4. Hadoop version:

In the spark shell, run org.apache.hadoop.util.VersionInfo.getVersion().

3.3.4

5. How are you running Spark?

Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local machine?

Locally

6. Stack trace:

Trace of the log/error messages.

java.lang.RuntimeException: The DataFrame is empty, why are you trying to index an empty dataset?
  at io.qbeast.spark.index.DoublePassOTreeDataAnalyzer$.analyze(OTreeDataAnalyzer.scala:351)
  at io.qbeast.spark.index.SparkOTreeManager$.index(SparkOTreeManager.scala:89)
  at io.qbeast.spark.index.SparkOTreeManager$.index(SparkOTreeManager.scala:38)
  at io.qbeast.spark.index.SparkOTreeManager$.index(SparkOTreeManager.scala:26)
  at io.qbeast.spark.table.IndexedTableImpl.$anonfun$doWrite$2(IndexedTable.scala:467)
  at io.qbeast.spark.delta.DeltaMetadataWriter.$anonfun$writeWithTransaction$5(DeltaMetadataWriter.scala:113)
  at io.qbeast.spark.delta.DeltaMetadataWriter.$anonfun$writeWithTransaction$5$adapted(DeltaMetadataWriter.scala:108)
  at org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:223)
  at io.qbeast.spark.delta.DeltaMetadataWriter.writeWithTransaction(DeltaMetadataWriter.scala:108)
  at io.qbeast.spark.delta.SparkDeltaMetadataManager$.updateWithTransaction(SparkDeltaMetadataManager.scala:45)
  at io.qbeast.spark.delta.SparkDeltaMetadataManager$.updateWithTransaction(SparkDeltaMetadataManager.scala:31)
  at io.qbeast.spark.table.IndexedTableImpl.doWrite(IndexedTable.scala:466)
  at io.qbeast.spark.table.IndexedTableImpl.$anonfun$write$3(IndexedTable.scala:429)
  at io.qbeast.spark.table.IndexedTableImpl.$anonfun$write$3$adapted(IndexedTable.scala:421)
  at io.qbeast.core.keeper.Keeper.withWrite(Keeper.scala:55)
  at io.qbeast.core.keeper.Keeper.withWrite$(Keeper.scala:52)
  at io.qbeast.core.keeper.LocalKeeper$.withWrite(LocalKeeper.scala:27)
  at io.qbeast.spark.table.IndexedTableImpl.write(IndexedTable.scala:421)
  at io.qbeast.spark.table.IndexedTableImpl.save(IndexedTable.scala:383)
  at io.qbeast.spark.internal.sources.QbeastDataSource.createRelation(QbeastDataSource.scala:125)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
  at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
  at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
  ... 47 elided
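
The expected behaviour, assuming the fix mirrors Delta (a sketch of the desired outcome, not what happens today), would be that the write above succeeds and the resulting empty table can be read back with its schema intact:

// Desired outcome once the issue is resolved (assumption, mirroring Delta):
spark.emptyDataset[T].write.format("qbeast").option("columnsToIndex", "id").save("/tmp/empty_test")
val empty = spark.read.format("qbeast").load("/tmp/empty_test")
empty.printSchema() // root |-- id: integer (nullable = false)
empty.count()       // 0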

osopardo1 added the bug label on Apr 9, 2024
osopardo1 changed the title from "Empty DataFrame should mimic Delta Lake behaviour" to "Empty DataFrame save should mimic Delta Lake behaviour" on Apr 9, 2024