
Issue running spark_parquet_to_avro.py on Spark 2.2.0 #4

Open
Kommius opened this issue Nov 27, 2017 · 1 comment
Comments

Kommius commented Nov 27, 2017

Hi,

First of all, I wanted to thank you for all the time you have spent developing these tools and sharing them with the community.

I'm a Data Manager working in the Big Data domain, and I recently discovered your tools. I know the one I'm using hasn't been tested on any version later than Spark 2.0.0, which is probably why I'm getting the error below.

As I'm not a developer, I don't know why it's crashing: my guess is that the script is pulling in com.databricks#spark-avro_2.10;2.0.1 instead of com.databricks#spark-avro_2.11;4.0.0 on my machine, which runs Scala code runner version 2.11.6. Again, this is just a guess, but here's the error I'm getting:

```
2017-11-27 14:54:40,321 - spark_parquet_to_avro.py[run:106](2949) - INFO - Spark version detected as 2.2.0
Traceback (most recent call last):
  File "spark_parquet_to_avro.py", line 123, in <module>
    SparkParquetToAvro().main()
  File "pylib/harisekhon/cli.py", line 172, in main
    self.run()
  File "spark_parquet_to_avro.py", line 115, in run
    df.write.format('com.databricks.spark.avro').save(avro_dir)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 595, in save
    self._jwrite.save(path)
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o30.save.
: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider was removed in Spark 2.0. Please check if your library is compatible with Spark 2.0
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:560)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/HadoopFsRelationProvider
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:533)
    ... 29 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 47 more
```
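For anyone hitting the same wall: the ClassNotFoundException above means the loaded spark-avro build was compiled against the old Spark 1.x data source API. One way to confirm which Scala version your Spark build uses, and therefore which spark-avro artifact (_2.10 vs _2.11) it needs, is to query the JVM from PySpark. This is a minimal sketch, not from the original script, assuming a standard PySpark 2.x install:

```python
# Sketch (not from the original script): report the Spark and Scala
# versions of the local build, to pick the matching spark-avro artifact.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()
sc = spark.sparkContext

print("Spark version:", sc.version)
# scala.util.Properties.versionString() is standard Scala, reachable via py4j.
print("Scala version:", sc._jvm.scala.util.Properties.versionString())

spark.stop()
```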
Thanks for any help you might provide :)


Kommius commented Nov 27, 2017

Solved my problem by editing the script spark_parquet_to_avro.py directly and replacing the line:

```python
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.10:2.0.1 %s' \
```

with

```python
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.11:4.0.0 %s' \
```

in order to use the version required for Spark 2.x, as documented at https://github.com/databricks/spark-avro (spark-avro 2.0.1 was built against the Spark 1.x data source API, including HadoopFsRelationProvider, which was removed in Spark 2.0, while 4.0.0 targets Spark 2.x and Scala 2.11).
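For context, here is a minimal sketch of how that fix takes effect. Only the --packages line above comes from the script; the surrounding code, paths, and app name are assumptions for illustration. PYSPARK_SUBMIT_ARGS is read when the JVM is launched, so it must be set before pyspark is imported:

```python
import os

# spark-avro must match the Scala version of the Spark build: prebuilt
# Spark 2.x distributions use Scala 2.11, hence spark-avro_2.11:4.0.0.
# A standalone PYSPARK_SUBMIT_ARGS must end with 'pyspark-shell'; the
# original script instead splices in pre-existing args via its trailing %s.
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages com.databricks:spark-avro_2.11:4.0.0 pyspark-shell'

# Import pyspark only after setting the env var, otherwise the JVM starts
# without the avro package on its classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-avro").getOrCreate()

# Paths are hypothetical; the write format matches the one in the traceback.
df = spark.read.parquet('/tmp/input.parquet')
df.write.format('com.databricks.spark.avro').save('/tmp/output.avro')
```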

Kommius closed this as completed Nov 27, 2017
HariSekhon reopened this Aug 21, 2018
HariSekhon reopened this Mar 14, 2019