
sql workload fails when writing the output file -- Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into results-data-gen.csv #196

Open

eddytruyen opened this issue Dec 11, 2020 · 2 comments

@eddytruyen

Spark-Bench version (version number, tag, or git commit hash)

2.3.0
Hadoop is not installed.

Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)

Spark 2.4.6

Scala version on your cluster

Scala version 2.11.12

Your exact configuration file (with system details anonymized for security)

I am running the csv-parquet example without a Spark master, writing to local files on NFS.

spark-bench = {
  spark-submit-config = [{
    // spark-args = {
    //   executor-memory = "2147483648"
    // }
    // conf = {
    //   // Any configuration you need for your setup goes here, like:
    //   // spark.dynamicAllocation.enabled = "true"
    // }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "results-data-gen.csv"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 10000
            cols = 14
            output = "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.csv"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.csv"
            output = "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.parquet"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "file:///opt/bitnami/spark/spark_data/spark-bench-test/results-sql.csv"
        parallel = false
        repeat = 1
        workloads = [
          {
            name = "sql"
            input = ["file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.csv", "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.parquet"]
            query = ["select * from input", "select c0, c22 from input where c0 < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}

Relevant stacktrace

Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/opt/bitnami/spark/spark_data/spark-bench/results-data-gen.csv: `total_runtime`;
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:65)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
        at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:665)
        at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.writeToDisk(SparkFuncs.scala:104)
        at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.run(SuiteKickoff.scala:88)
        at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
        at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially(MultipleSuiteKickoff.scala:38)
        at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:28)
        at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:25)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.run(MultipleSuiteKickoff.scala:25)
        at com.ibm.sparktc.sparkbench.cli.CLIKickoff$.main(CLIKickoff.scala:30)
        at com.ibm.sparktc.sparkbench.cli.CLIKickoff.main(CLIKickoff.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
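
For context: Spark refuses to write any DataFrame whose schema contains duplicate column names, and SchemaUtils.checkColumnNameDuplication (the top frame above) is what raises this AnalysisException. A minimal sketch that reproduces the same failure outside spark-bench (the join below is purely illustrative, not spark-bench's actual code):

import org.apache.spark.sql.SparkSession

object DuplicateColumnRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("duplicate-column-repro")
      .getOrCreate()
    import spark.implicits._

    val a = Seq(("run1", 10L)).toDF("name", "total_runtime")
    val b = Seq(("run1", 20L)).toDF("name", "total_runtime")

    // A join keeps both input schemas, so the result carries two
    // `total_runtime` columns; the CSV writer then fails with
    // "Found duplicate column(s) when inserting into ...", exactly as above.
    val results = a.join(b, a("name") === b("name"))
    results.write.csv("/tmp/duplicate-column-repro.csv")
  }
}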

Description of your problem and any other relevant info

It seems the benchmark output file results-data-gen.csv cannot be written by the sql workload.
The same error also appears in a Spark cluster that uses HDFS instead of NFS.

Note that this error does not appear when running the sql workload separately on already generated CSV and Parquet data.
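
Judging by the stack trace, the failure is in writing the suite's benchmark-output CSV (SparkFuncs.writeToDisk called from SuiteKickoff), not the workload output itself. One possible workaround (a sketch only, assuming the repeated columns can safely be renamed rather than dropped) would be to suffix duplicate column names before the final write:

import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of spark-bench: rename repeated column
// names (e.g. total_runtime, total_runtime -> total_runtime, total_runtime_1)
// so that DataFrameWriter no longer rejects the schema.
def dedupeColumns(df: DataFrame): DataFrame = {
  val seen = scala.collection.mutable.Map.empty[String, Int]
  val renamed = df.columns.map { c =>
    val n = seen.getOrElse(c, 0)
    seen(c) = n + 1
    if (n == 0) c else s"${c}_$n"
  }
  df.toDF(renamed: _*)
}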

@georgemm01

I get the same error. Any updates on this?

  • spark-bench_2.3.0_0.4.0-RELEASE
  • spark 2.4.6 (conda install pyspark==2.4.6)

@rajathc93

Facing the same issue. Is this resolved?
