
Error generating azbatchenv rds file #374

Open
fermumen opened this issue Mar 26, 2021 · 3 comments

@fermumen

I got a bit of a weird error trying to run some code on Azure Batch that was working correctly with regular doParallel.
This is the job's stderr:

running
'/usr/local/lib/R/bin/R --no-echo --no-restore --no-save --no-environ --no-restore --no-site-file --file=/mnt/batch/tasks/workitems/job20210326153929/job-1/jobpreparation/wd/worker.R --args 10 10 0 pass'
Error in readRDS(paste0(batchJobPreparationDirectory, "/", batchJobEnvironment)) :
error reading from connection
Execution halted

I've downloaded the job.rds from Azure Blob Storage and indeed I can't read it on my computer either. How could I troubleshoot this?
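For reference, this is roughly how I've been inspecting the downloaded file locally (the path is just where I saved it; these checks are a generic sketch, not anything doAzureParallel-specific):

# Local sanity checks on the downloaded job environment file
path <- "job.rds"

# A truncated or failed upload usually shows up as an unexpected file size
file.info(path)$size

# saveRDS() writes gzip-compressed output by default, so a healthy file
# normally starts with the gzip magic bytes 1f 8b
readBin(path, "raw", n = 2)

# Try to deserialize and capture the exact error message
tryCatch(
  invisible(readRDS(path)),
  error = function(e) message("readRDS failed: ", conditionMessage(e))
)

My sessionInfo():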

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] mlflow_1.14.0 AzureBatchUtils_0.1.0 doParallel_1.0.16
[4] iterators_1.0.13 foreach_1.5.1 yardstick_0.0.7
[7] workflows_0.2.1 tune_0.1.3 tidyr_1.1.2
[10] tibble_3.1.0 rsample_0.0.9 recipes_0.1.15
[13] purrr_0.3.4 parsnip_0.1.5 modeldata_0.1.0
[16] infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.4
[19] dials_0.0.9 scales_1.1.1 broom_0.7.5
[22] tidymodels_0.1.2
loaded via a namespace (and not attached):
[1] bitops_1.0-6 lubridate_1.7.9.2 DiceDesign_1.9
[4] httr_1.4.2 tools_4.0.4 backports_1.2.1
[7] utf8_1.1.4 R6_2.5.0 rpart_4.1-15
[10] DBI_1.1.0 colorspace_2.0-0 nnet_7.3-15
[13] withr_2.3.0 tidyselect_1.1.0 processx_3.4.5
[16] curl_4.3 compiler_4.0.4 cli_2.3.1
[19] swagger_3.33.1 forge_0.2.0 askpass_1.1
[22] stringr_1.4.0 digest_0.6.27 ini_0.3.1
[25] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.0
[28] parallelly_1.22.0 lhs_1.1.1 fastmap_1.0.1
[31] rlang_0.4.10 doAzureParallel_0.8.0 rstudioapi_0.13
[34] shiny_1.5.0 generics_0.1.0 hwriter_1.3.2
[37] jsonlite_1.7.2 RCurl_1.98-1.3 magrittr_2.0.1
[40] Matrix_1.3-2 Rcpp_1.0.5 munsell_0.5.0
[43] fansi_0.4.1 GPfit_1.0-8 reticulate_1.18
[46] lifecycle_0.2.0 furrr_0.2.2 stringi_1.5.3
[49] yaml_2.2.1 pROC_1.16.2 snakecase_0.11.0
[52] MASS_7.3-53.1 plyr_1.8.6 grid_4.0.4
[55] listenv_0.8.0 promises_1.1.1 crayon_1.3.4
[58] lattice_0.20-41 splines_4.0.4 zeallot_0.1.0
[61] ps_1.5.0 pillar_1.5.0 ranger_0.12.1
[64] uuid_0.1-4 Rserve_1.8-7 rjson_0.2.20
[67] codetools_0.2-18 glue_1.4.2 rAzureBatch_0.7.0
[70] data.table_1.13.4 vctrs_0.3.5 httpuv_1.5.4
[73] gtable_0.3.0 openssl_1.4.3 future_1.21.0
[76] assertthat_0.2.1 TeachingDemos_2.10 gower_0.2.2
[79] mime_0.9 prodlim_2019.11.13 xtable_1.8-4
[82] later_1.1.0.1 class_7.3-18 survival_3.2-7
[85] timeDate_3043.102 SparkR_3.1.0 lava_1.6.8.1
[88] globals_0.14.0 ellipsis_0.3.1 hwriterPlus_1.0-3
[91] ipred_0.9-9

@fermumen (Author)

I've tried the same code with just a subset of the data (~10%) and it seems to work correctly. Is there a limit on how much data can be uploaded to storage from doAzureParallel?
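As a rough check on the size, something like this should estimate the payload locally before it gets uploaded (the object names are just stand-ins for whatever the loop exports):

# Rough local estimate of the serialized job payload
tmp <- tempfile(fileext = ".rds")
saveRDS(list(resamples = resamples, grid = esc_grid), tmp)
file.size(tmp) / 1024^2   # payload size in MB
unlink(tmp)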

@brnleehng (Collaborator)

Hi @fermumen,

Does the foreach loop finish without any errors?
Also, are you using the error handling option?
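By error handling I mean the .errorhandling option on the foreach() call, e.g. something like:

# .errorhandling controls what foreach does when a task errors:
# "stop" (the default), "remove", or "pass" (return the error object in the results)
results <- foreach(i = 1:10, .errorhandling = "pass") %dopar% {
  sqrt(i)
}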

Thanks,
Brian

@fermumen (Author) commented Apr 7, 2021

Hi, all the jobs finish with errors, but I think it happens in the job preparation stage. I have tried filtering the data frame down to ~60% of its size with different random samples and it works as it should; it's only when I use the full dataset (~900k observations) that it fails. The code I'm running is a tune_grid() call, which uses %dopar% internally:

library(doAzureParallel)
cl <- make_azbatch_cluster("rf_pool3", cran_libraries = c("ranger", "tidymodels"),
                           CPU = 4, tasks_per_node = 1,
                           low_priority_nodes = list(min = 25,
                                                     max = 25))
registerDoAzureParallel(cl)
esc_grid_results <- esc_workflow %>%
  tune_grid(resamples, # %dopar%
            grid = esc_grid,
            control = tune::control_grid(verbose = TRUE,
                                         parallel_over = "everything"))


stopCluster(cl)

Maybe I can try to generate a randomised example for you to reproduce.
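Something along these lines, with synthetic data of a similar size; the outcome, predictors, model spec and grid are placeholders rather than my real workflow:

# Randomised stand-in for the real data (~900k rows, as in the failing run)
library(tidymodels)

set.seed(123)
n <- 900000
synth <- tibble(
  y  = factor(sample(c("a", "b"), n, replace = TRUE)),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = runif(n)
)

resamples <- vfold_cv(synth, v = 5)

# Minimal ranger workflow with two tuned parameters
wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(rand_forest(mtry = tune(), min_n = tune()) %>%
              set_engine("ranger") %>%
              set_mode("classification"))

# Reuse the same Azure Batch cluster registered above
registerDoAzureParallel(cl)
res <- tune_grid(wf, resamples,
                 grid = 5,
                 control = control_grid(verbose = TRUE,
                                        parallel_over = "everything"))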
