This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

doazureparallel failing to load on certain nodes #295

Open
ctlamb opened this issue Aug 9, 2018 · 6 comments

Comments

@ctlamb

ctlamb commented Aug 9, 2018

I'm in the middle of running a big job: 200 VMs, 800 tasks. So far 500 tasks have completed but 120 have failed. I looked into the failures and can see that the stderr.txt files for the failed tasks indicate that doAzureParallel failed to load.

stderr for failed job:
running

  '/usr/local/lib/R/bin/R --slave --no-restore --no-save --no-environ --no-restore --no-site-file --file=/mnt/batch/tasks/workitems/occpred09082018/job-1/jobpreparation/wd/worker.R --args 291 291 0 pass'

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
here() starts at /mnt/batch/tasks/workitems/occpred09082018/job-1/291/wd
Loading required package: raster
Loading required package: sp
Loading required package: survival
Loading required package: lattice
Loading required package: splines
Loaded gbm 2.1.3

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

Error in library(packageName, character.only = TRUE) : 
  there is no package called ‘doAzureParallel’
Execution halted

But hundreds of the other tasks worked, and produced the following output with no errors.

running
  '/usr/local/lib/R/bin/R --slave --no-restore --no-save --no-environ --no-restore --no-site-file --file=/mnt/batch/tasks/workitems/occpred09082018/job-1/jobpreparation/wd/worker.R --args 275 275 0 pass'

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
here() starts at /mnt/batch/tasks/workitems/occpred09082018/job-1/275/wd
Loading required package: raster
Loading required package: sp
Loading required package: survival
Loading required package: lattice
Loading required package: splines
Loaded gbm 2.1.3

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster


Attaching package: ‘doAzureParallel’

The following objects are masked from ‘package:snow’:

    makeCluster, stopCluster

The following object is masked from ‘package:raster’:

    getCluster

The following objects are masked from ‘package:parallel’:

    makeCluster, stopCluster

@brnleehng
Collaborator

Hi @ctlamb

Are you installing doAzureParallel through the cluster configuration or in the foreach call?

Thanks,
Brian

@ctlamb
Author

ctlamb commented Aug 9, 2018

In the foreach

  rast.results <- foreach(i = 1:nrow(bp),.packages = c("doParallel", "here", "dismo", "gbm", "snow"),
                        github = c("Azure/doAzureParallel"), .errorhandling="pass",
                        .options.azure = list(enableCloudCombine=FALSE,
                                              job = job_name)) %dopar% {

This is the clusterConfig:


clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_D12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 1,
      "max" = 200
    ),
    "lowPriorityNodes" = list(
      "min" = 0,
      "max" = 0
    ),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = list(),
    "github" = list(),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

@brnleehng
Collaborator

brnleehng commented Aug 9, 2018

I would recommend installing the R packages at the cluster configuration level so they don't need to be installed for every single job. Also, the job will not start if the start tasks of the cluster have failed.

clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_D12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 1,
      "max" = 200
    ),
    "lowPriorityNodes" = list(
      "min" = 0,
      "max" = 0
    ),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = list("doParallel", "here", "dismo", "gbm", "snow"),
    "github" = list("Azure/doAzureParallel"),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

Move the doAzureParallel package name into the regular .packages vector.

  rast.results <- foreach(i = 1:nrow(bp),.packages = c("doParallel", "here", "dismo", "gbm", "snow", "doAzureParallel"), .errorhandling="pass",
                        .options.azure = list(enableCloudCombine=FALSE,
                                              job = job_name)) %dopar% {

I'll need to see the logs from the job preparation tasks of the batch node. However, getClusterFile does not work for job preparation tasks. I've created a separate issue for this.

If you have access to the Azure Batch portal, you can go to:
Batch Pools > (Name of your pool) > Nodes > click on the node > enter "/workitems/<JOB_NAME>/job-1/jobpreparation/stderr.txt" in the search bar

Thanks,
Brian

@ctlamb
Author

ctlamb commented Aug 10, 2018

Thanks, @brnleehng, this makes more sense.

I used the clusterConfig you posted above (plus some debugging of my own afterwards), but it produces the error below. I can confirm the error is not present when I run without listing the packages in the clusterConfig:

=======================================================================================================================================================================================
Name: LambRaster
Configuration:
	Docker Image: rocker/geospatial:latest
	MaxTasksPerNode: 1
	Node Size: Standard_D12_v2
cranPackages: 
	Error in cat(list(...), file, sep, fill, labels, append) : 
  argument 1 (type 'list') cannot be handled by 'cat'

@brnleehng
Collaborator

brnleehng commented Aug 13, 2018

Hi @ctlamb

It appears that when the cluster config is created programmatically, the R packages parameter takes a character vector instead of a list. I'll update the docs for clarification.

clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_D12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 1,
      "max" = 200
    ),
    "lowPriorityNodes" = list(
      "min" = 0,
      "max" = 0
    ),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = c("doParallel", "here", "dismo", "gbm", "snow"),
    "github" = c("Azure/doAzureParallel"),
    "bioconductor" = c()
  ),
  "commandLine" = list()
)
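
If the package names already live in a list elsewhere in a script, one way to avoid the list-versus-vector mismatch is to flatten them before building the config. This is only a rough sketch in base R: the helper name `as_package_vector` is made up for illustration and is not part of doAzureParallel, and how doAzureParallel treats a zero-length character vector compared with `c()` has not been checked here.

```r
# Hypothetical helper (not part of doAzureParallel): flatten a package
# specification into a plain character vector before placing it in rPackages.
as_package_vector <- function(pkgs) {
  # unlist() turns list("a", "b") into c("a", "b"); NULL and list() both
  # become a zero-length character vector after as.character().
  as.character(unlist(pkgs))
}

cranFromList <- as_package_vector(list("doParallel", "here", "dismo", "gbm", "snow"))
cranFromVector <- as_package_vector(c("doParallel", "here", "dismo", "gbm", "snow"))

# Both inputs end up as the same character vector.
stopifnot(is.character(cranFromList))
stopifnot(identical(cranFromList, cranFromVector))

# An empty specification becomes character(0) rather than an empty list.
stopifnot(identical(as_package_vector(list()), character(0)))
```

The result could then be used as `"cran" = as_package_vector(myPackages)`, where `myPackages` stands for whatever list or vector the script already holds.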

Thanks,
Brian

@ctlamb
Author

ctlamb commented Nov 6, 2018

Awesome, this is solved, thanks!
