doazureparallel failing to load on certain nodes #295
Comments
Hi @ctlamb, are you installing doAzureParallel through the cluster configuration or inside the foreach loop? Thanks,
In the foreach
This is ClusterConfig
I would recommend installing the R packages at the cluster configuration level so you don't need to install them in every job. Also, the job will not start if the cluster's start tasks have failed.
clusterConfig <- list(
"name" = "LambRaster",
"vmSize" = "Standard_D12_v2",
"maxTasksPerNode" = 1,
"poolSize" = list(
"dedicatedNodes" = list(
"min" = 1,
"max" = 200
),
"lowPriorityNodes" = list(
"min" = 0,
"max" = 0
),
"autoscaleFormula" = "QUEUE"
),
"containerImage" = "rocker/geospatial:latest",
"rPackages" = list(
"cran" = list("doParallel", "here", "dismo", "gbm", "snow"),
"github" = list("Azure/doAzureParallel"),
"bioconductor" = list()
),
"commandLine" = list()
)
Move the doAzureParallel package name into the regular .packages vector.
rast.results <- foreach(i = 1:nrow(bp), .packages = c("doParallel", "here", "dismo", "gbm", "snow", "doAzureParallel"), .errorhandling = "pass",
                        .options.azure = list(enableCloudCombine = FALSE,
                                              job = job_name)) %dopar% {
I'll need to see the logs from the job preparation tasks of the batch node. However, getClusterFile does not work for job preparation tasks. I've created a separate issue for this. If you have access to the Azure Batch portal, you can go to:
Thanks,
Thanks @brnleehng, this makes more sense. I used the clusterConfig you posted above (plus some debugging of my own afterwards), but it seems to produce an error, which I can confirm is not present when I run without loading the packages in the clusterConfig.
Hi @ctlamb, it appears that when the cluster config is defined programmatically, the R packages parameter takes a character vector instead of a list. I'll update the docs for clarification.
clusterConfig <- list(
"name" = "LambRaster",
"vmSize" = "Standard_D12_v2",
"maxTasksPerNode" = 1,
"poolSize" = list(
"dedicatedNodes" = list(
"min" = 1,
"max" = 200
),
"lowPriorityNodes" = list(
"min" = 0,
"max" = 0
),
"autoscaleFormula" = "QUEUE"
),
"containerImage" = "rocker/geospatial:latest",
"rPackages" = list(
"cran" = c("doParallel", "here", "dismo", "gbm", "snow"),
"github" = c("Azure/doAzureParallel"),
"bioconductor" = c()
),
"commandLine" = list()
)
Thanks,
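The list-versus-character-vector difference described in this comment can be handled defensively before the config is used. The following is a minimal sketch in base R, not part of doAzureParallel: `normalizeRPackages` is a hypothetical helper name, and it assumes the only problem is that each `rPackages` entry was written with `list()` rather than `c()`. Empty entries become `NULL`, matching the `c()` used for `bioconductor` above.

```r
# Hypothetical helper (not a doAzureParallel function): coerce every
# rPackages entry to a character vector, so that list("a", "b") and
# c("a", "b") end up in the same shape.
normalizeRPackages <- function(rPackages) {
  lapply(rPackages, function(pkgs) {
    if (length(pkgs) == 0) {
      return(NULL)  # same value as c(), as used for "bioconductor" above
    }
    pkgs <- unlist(pkgs, use.names = FALSE)
    stopifnot(is.character(pkgs))  # fail early on non-string entries
    pkgs
  })
}

# The rPackages block from the first config in this thread, written with list()
rPackages <- list(
  "cran" = list("doParallel", "here", "dismo", "gbm", "snow"),
  "github" = list("Azure/doAzureParallel"),
  "bioconductor" = list()
)

fixed <- normalizeRPackages(rPackages)
str(fixed)
```

The result could then be assigned back with `clusterConfig$rPackages <- fixed` before the cluster is created. Whether a given doAzureParallel version needs this at all is an assumption based only on the comment above.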
Awesome, this is solved, thanks!
I'm in the middle of running a big job: 200 VMs, 800 tasks. So far 500 tasks have completed but 120 have failed. I looked into the failures and can see that the stderr.txt files for failed nodes indicate doazureparallel failed to load.
stderr for failed job:
running
But then hundreds of the jobs worked, and produced the following with no errors.
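Since the foreach call in this thread sets `.errorhandling = "pass"`, one way to find out which tasks failed is to look for error objects in the returned results. The sketch below uses only base R and a simulated `results` list. It assumes results come back as an ordinary list in which failed tasks carry condition objects, which is how foreach's "pass" mode behaves locally; with `enableCloudCombine = FALSE` the results may have to be fetched separately first. The error message is invented for illustration and is not taken from the real stderr.txt.

```r
# Simulated stand-in for the list returned by foreach(..., .errorhandling = "pass").
# The second element mimics a failed task; its message is made up.
results <- list(
  1,
  simpleError("simulated package load failure"),
  3
)

# TRUE for every task whose result is an error object
failed <- vapply(results, function(r) inherits(r, "error"), logical(1))

# Indices of the failed tasks, e.g. to resubmit only those rows of bp
failedIdx <- which(failed)

# Error messages of the failed tasks, for a quick summary
messages <- vapply(results[failed], conditionMessage, character(1))

print(failedIdx)
print(messages)
```

The failed indices can then be used to rerun only the affected iterations rather than all 800 tasks.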