Spark standalone write permissions failure. #3376

Open
trhallam opened this issue Sep 7, 2023 · 2 comments
trhallam commented Sep 7, 2023

My issue is exactly the same as #3284.

To expand on the previous issue, which received no response, my workflow is as follows.

I have HostA with users and RStudio; this host is used to connect to a cluster on HostB running a Spark standalone deployment.

The standalone deployment is started by the user spark, so the Spark processes run as spark on the master and worker nodes.

A user can connect to the host fine, start the sparklyr application, and load data from disk, as long as the spark user has read permissions on the data location.

When spark_write_csv is called, Spark first creates a folder to hold the output files with permissions rwxr-x---, owned by the user who initiated the app on HostA. The spark user therefore cannot write into that folder, and spark_write_csv fails due to inadequate permissions.
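
For reference, a minimal sketch of the failing call (the master URL is a placeholder; the output path matches the error below):

library(sparklyr)
sc <- spark_connect(master = "spark://hostb:7077")           # placeholder master URL
airlines <- copy_to(sc, nycflights13::airlines, "airlines")  # small demo table
spark_write_csv(airlines, "file:///work/shared/airlines.csv")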

From what I understand, you cannot use --proxy-user with spark-submit on a standalone deployment, so the solution seems to be to make the output directory writable by the user that runs the Spark cluster; a sketch of that idea follows.
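
A possible (untested) workaround along those lines: relax the umask inherited by the driver so the directories it creates are group-writable, and give the shared output location a group that includes the spark user. The group name and paths here are assumptions:

Sys.umask("002")                              # dirs created by the driver become rwxrwxr-x
system2("chgrp", c("spark", "/work/shared"))  # assumes a group 'spark' containing the spark user
system2("chmod", c("g+ws", "/work/shared"))   # setgid so subdirectories inherit the group
sc <- spark_connect(master = "spark://hostb:7077")  # placeholder master URL

Whether Hadoop's local filesystem preserves these permissions when it creates the _temporary directories is something I haven't verified.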

Error msg:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 15) (10.1.0.175 executor 0): java.io.IOException: Mkdirs failed to create file:/work/shared/airlines.csv/_temporary/0/_temporary/attempt_202309071137551599550320799742752_0006_m_000000_15 (exists=false, cwd=file:/opt/spark/spark-3.4.1-bin-hadoop3/work/app-20230907112814-0015/0)

Session Info:

> sparklyr::spark_version(sc)
[1] ‘3.4.1’
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.3  dplyr_1.1.2    sparklyr_1.8.2

loaded via a namespace (and not attached):
 [1] pillar_1.9.0       compiler_4.2.3     dbplyr_2.3.3       base64enc_0.1-3    tools_4.2.3        digest_0.6.33     
 [7] uuid_1.1-1         jsonlite_1.8.7     lifecycle_1.0.3    tibble_3.2.1       gtable_0.3.4       pkgconfig_2.0.3   
[13] rlang_1.1.1        DBI_1.1.3          cli_3.6.1          rstudioapi_0.15.0  yaml_2.3.7         parallel_4.2.3    
[19] withr_2.5.0        httr_1.4.7         generics_0.1.3     vctrs_0.6.3        askpass_1.1        grid_4.2.3        
[25] tidyselect_1.2.0   glue_1.6.2         R6_2.5.1           fansi_1.0.4        tidyr_1.3.0        purrr_1.0.2       
[31] magrittr_2.0.3     scales_1.2.1       ellipsis_0.3.2     nycflights13_1.0.2 colorspace_2.1-0   config_0.3.1      
[37] utf8_1.2.3         openssl_2.1.0      munsell_0.5.0  
@edgararuiz (Collaborator) commented
Hi @trhallam, thank you for the clarification. The only thing I'm not sure of is whether there is something for me to do in sparklyr to help. It seems like a current limitation of Spark. Is that right?

trhallam (Author) commented Sep 9, 2023

To be honest, I'm not entirely sure. I'm hoping there is a way I can better manage this, or a setting I can pass to Spark via sparklyr. It may be, as you say, that I need to push this issue upstream to Spark itself, but it's not clear from the error which part of the Spark code base is causing the issue. My best guess is some of the Java routines that sparklyr calls indirectly.
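
In case it helps, the mechanism I had in mind for passing a setting through sparklyr is spark_config(); spark.hadoop.* entries are forwarded to the Hadoop configuration. Whether fs.permissions.umask-mode has any effect on local file:// writes (rather than HDFS) is untested:

library(sparklyr)
config <- spark_config()
# forwarded into Hadoop's Configuration; may only apply to HDFS, not file://
config[["spark.hadoop.fs.permissions.umask-mode"]] <- "002"
sc <- spark_connect(master = "spark://hostb:7077", config = config)  # placeholder URL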
