My issue is exactly the same as #3284. To expand on that issue, which never received a response, my workflow is as follows.
I have HostA with users and RStudio; this host is used to connect to a cluster on HostB running a Spark standalone deployment.
The standalone deployment is started by the user `spark`, so the Spark processes run as `spark` on the master and worker nodes.
A user can connect to the cluster fine and start the sparklyr application, and data can be loaded from disk as long as the `spark` user has read permission on the data location.
When using the `spark_write_csv` command, Spark first creates a folder to hold the output files with permissions `rwxr-x---`, owned by the user who initiated the app on HostA. The `spark_write_csv` process then fails due to inadequate permissions.
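For reference, a minimal sketch of the workflow that hits the problem (the master URL and the input path are placeholders for my setup; the output path is the one from the error below):

```r
library(sparklyr)

# Connect from RStudio on HostA to the standalone master on HostB
# ("spark://hostb:7077" is a placeholder for the real master URL).
sc <- spark_connect(master = "spark://hostb:7077")

# Reading works, provided the spark user can read the source location
# (the input path here is a placeholder).
airlines <- spark_read_csv(sc, name = "airlines",
                           path = "file:///data/airlines_raw.csv")

# Writing fails: the output directory is created owned by the HostA user
# with rwxr-x---, so the executors running as spark cannot write into it.
spark_write_csv(airlines, path = "file:///work/shared/airlines.csv")
```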
From what I understand, you cannot use `--proxy-user` with `spark-submit` on the standalone deployment, so the solution seems to be to make the owner of the output directory the same as the user that starts the Spark cluster.
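The only config-level mitigation I can think of is along these lines (just a sketch, not a confirmed fix: `fs.permissions.umask-mode` is a standard Hadoop setting that can be passed through `spark.hadoop.*`, but I haven't verified that it changes how the local `file:` output directories are created in this case):

```r
library(sparklyr)

conf <- spark_config()

# Sketch of a possible mitigation: relax the umask Spark/Hadoop applies when
# creating the output and _temporary directories, so that group members
# (e.g. a shared group containing the spark user) can also write into them.
# Untested here; aligning directory ownership may still be the real fix.
conf[["spark.hadoop.fs.permissions.umask-mode"]] <- "002"

sc <- spark_connect(master = "spark://hostb:7077", config = conf)
```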
Error msg:

```
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 15) (10.1.0.175 executor 0): java.io.IOException: Mkdirs failed to create file:/work/shared/airlines.csv/_temporary/0/_temporary/attempt_202309071137551599550320799742752_0006_m_000000_15 (exists=false, cwd=file:/opt/spark/spark-3.4.1-bin-hadoop3/work/app-20230907112814-0015/0)
```
Hi @trhallam, thank you for the clarification on that. The only thing I'm not sure of is whether there is something for me to do in sparklyr to help. It seems like a current limitation of Spark. Is that right?
Tbh, I'm not entirely sure. I'm hoping there is a way I can better manage this, or a setting I can pass to Spark via sparklyr. It may be as you say, that I need to push this issue upstream to Spark itself, but it's not clear to me from the error which part of the Spark code base is causing the issue. My best guess is some of the Java routines which sparklyr calls indirectly.