
Issue while running cottoncandy in AWS EMR as part of spark job #72

Open
DeepakSahoo-Reflektion opened this issue Nov 5, 2018 · 4 comments

@DeepakSahoo-Reflektion

I am trying to use cottoncandy to upload numpy arrays to S3 from code running inside an AWS EMR Spark job, but I get the errors below:

OSError: [Errno [Errno 13] Permission denied: '/home/.config'] <function subimport at 0x7f87e0167320>: ('cottoncandy',)

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

OSError: [Errno 13] Permission denied: '/home/.config'

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
@anwarnunez
Contributor

Hi, thanks for submitting the bug report!

Unfortunately, I am not familiar with AWS EMR at all. My guess is that the code is being executed by a user who does not have a $HOME. For that reason, cottoncandy is looking for the user configuration in an unusual place (/home/.config) and is unable to find it. The configuration is normally located in $HOME/.config/cottoncandy. Do you know whether AWS EMR submits the Spark jobs under a system user?
There are a couple of ways to solve this issue. The simplest might be to fall back to the default configuration whenever the user configuration cannot be found.

@DeepakSahoo-Reflektion
Author

I fixed that issue by creating the /home/.config directory. But now I have another issue when running inside Spark:
"pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects". It looks like the object can't be serialized.

@anwarnunez
Contributor

Hello,

Great that you solved the issue by creating a /home/.config directory (though it's odd that the path is not /home/$USER/.config).

With respect to the PicklingError you ran into: Is it because the object you're trying to pickle is not serializable? Or is it because the Apache/Spark framework somehow pickles the cottoncandy.interface?

I'm sorry, I don't have access to EMR or Spark, nor am I familiar with them, but I'm happy to help as much as I can! If you find a solution, I'd be happy to include fixes.

@anwarnunez
Contributor

p.s. I found this:
https://stackoverflow.com/questions/40674544/apache-spark-reads-for-s3-cant-pickle-thread-lock-objects

It suggests that the problem is the serialization of the cottoncandy interface itself, and recommends using mapPartitions instead of flatMap.
