Jobs don't start after registering cluster with getCluster() #330
Comments
A workaround is just to run makeCluster(). You then get the following message and can run jobs on the cluster:
Thanks,
@brnleehng yes I'm using the monte carlo sample cluster configuration with 2 low priority nodes. However, the same occurs with another cluster I am using with 5 dedicated nodes. Thanks
What region are you currently in? I'm also seeing nodes that report as idle, both low priority and dedicated. I will be investigating the batch node logs. Thanks,
I'm using westeurope. I'll try a different region and let you know if the same issue occurs. Thanks
@brnleehng FYI this issue is still occurring. I've experienced this in every region I've tried including westeurope and southcentralus
My workaround (using makeCluster) is also causing problems. I get the warning: The specified cluster 'rbscl' already exists. Cluster 'rbscl' will be used.
Could this have something to do with the docker container settings? I made a diff of the HTTP verbose logs between registering an existing cluster via
The next request is the
Before submitting a bug please check the following:
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /data/mlserver/9.3.0/runtime/R/lib/libRblas.so
LAPACK: /data/mlserver/9.3.0/runtime/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doAzureParallel_0.7.2 iterators_1.0.9 foreach_1.4.5 RevoUtilsMath_10.0.1
[5] RevoUtils_10.0.7 RevoMods_11.0.0 MicrosoftML_9.3.0 RevoScaleR_9.3.0
[9] lattice_0.20-35 rpart_4.1-11
loaded via a namespace (and not attached):
[1] codetools_0.2-15 CompatibilityAPI_1.1.0 digest_0.6.17 rAzureBatch_0.6.2
[5] mime_0.5 bitops_1.0-6 grid_3.4.3 R6_2.2.2
[9] jsonlite_1.5 httr_1.3.1 curl_3.2 rjson_0.2.20
[13] tools_3.4.3 RCurl_1.95-4.11 yaml_2.2.0 compiler_3.4.3
[17] mrupdate_1.0.1
Description
I have an existing cluster created using the montecarlo_pricing_simulation.R script. In a fresh R session, I use getCluster as follows:
cluster <- getCluster("montecarlo", verbose = TRUE)
which outputs:
nodes:
idle: 2
creating: 0
starting: 0
waitingforstarttask: 0
starttaskfailed: 0
preempted: 0
running: 0
other: 0
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 2
However, when I submit the job on batch, it hangs with the following message:
Id: job20181126153436
chunkSize: 13
enableCloudCombine: TRUE
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
The cluster nodes on the portal remain idle. Eventually, I get the following error:
Error in curl::curl_fetch_memory(url, handle = handle) :
SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
This happens to me with my own code also. I cannot successfully run jobs on an existing cluster that has been retrieved with getCluster().
Instruction to repro the problem if applicable
Create a cluster
Restart R session
Load cluster with getCluster
Try and submit a job
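The steps above can be sketched as a short script to run in the fresh R session (step 3 onward). This is only a minimal sketch, not the reporter's actual code: the function name `repro_get_cluster`, the `credentials.json` file name, and the trivial `foreach` loop are assumptions for illustration. The cluster name "montecarlo" is the one used in this report.

```r
# Hypothetical repro helper (the name and the file path are placeholders).
# Assumes the cluster already exists and that this runs in a fresh R session.
repro_get_cluster <- function(cluster_name = "montecarlo",
                              credentials_file = "credentials.json") {
  library(doAzureParallel)

  # Authenticate against Azure Batch / Storage (assumed credentials file)
  setCredentials(credentials_file)

  # Retrieve the existing cluster instead of creating it with makeCluster()
  cluster <- getCluster(cluster_name, verbose = TRUE)

  # Register the retrieved cluster as the foreach backend
  registerDoAzureParallel(cluster)

  # Submit a trivial job; in this report the submission hangs at this point
  foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
}
```

Calling `repro_get_cluster()` should either return the four square roots or reproduce the hang described above while the nodes stay idle in the portal.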