Jobs don't start after registering cluster with getCluster() #330

angusrtaylor · 2018-11-26T15:46:04Z

Before submitting a bug please check the following:

[x] Start a new R session
[x] Check your credentials file
[x] Install the latest doAzureParallel package
[x] Submit a minimal, reproducible example
[x] run sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /data/mlserver/9.3.0/runtime/R/lib/libRblas.so
LAPACK: /data/mlserver/9.3.0/runtime/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] doAzureParallel_0.7.2 iterators_1.0.9 foreach_1.4.5 RevoUtilsMath_10.0.1
[5] RevoUtils_10.0.7 RevoMods_11.0.0 MicrosoftML_9.3.0 RevoScaleR_9.3.0
[9] lattice_0.20-35 rpart_4.1-11

loaded via a namespace (and not attached):
[1] codetools_0.2-15 CompatibilityAPI_1.1.0 digest_0.6.17 rAzureBatch_0.6.2
[5] mime_0.5 bitops_1.0-6 grid_3.4.3 R6_2.2.2
[9] jsonlite_1.5 httr_1.3.1 curl_3.2 rjson_0.2.20
[13] tools_3.4.3 RCurl_1.95-4.11 yaml_2.2.0 compiler_3.4.3
[17] mrupdate_1.0.1

Description

I have an existing cluster created using the montecarlo_pricing_simulation.R script. In a fresh R session, I use getCluster as follows:

cluster <- getCluster("montecarlo", verbose = TRUE)

which outputs:

nodes:
idle: 2
creating: 0
starting: 0
waitingforstarttask: 0
starttaskfailed: 0
preempted: 0
running: 0
other: 0
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 2

However, when I submit the job on batch, it hangs with the following message:

Id: job20181126153436
chunkSize: 13
enableCloudCombine: TRUE
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE

The cluster nodes on the portal remain idle. Eventually, I get the following error:

Error in curl::curl_fetch_memory(url, handle = handle) :
SSL read: error:00000000:lib(0):func(0):reason(0), errno 104

This happens to me with my own code also. I cannot successfully run jobs on an existing cluster that has been retrieved with getCluster().

Instruction to repro the problem if applicable

Create a cluster
Restart R session
Load cluster with getCluster
Try and submit a job
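
For reference, steps 2 to 4 above can be sketched in R roughly as follows. This is only a sketch: the file name credentials.json and the toy foreach body are assumptions for illustration and are not taken from the sample script. The calls are wrapped in a function so that nothing is sent to Azure until it is called.

```r
# Minimal repro sketch, to be run in a fresh R session after the cluster
# already exists. Assumes doAzureParallel and foreach are installed and that
# "credentials.json" (hypothetical path) holds valid credentials.
reproduce_issue <- function(cluster_name = "montecarlo",
                            credentials_file = "credentials.json") {
  # Make %dopar% available without attaching foreach
  `%dopar%` <- foreach::`%dopar%`

  doAzureParallel::setCredentials(credentials_file)

  # Retrieve the existing cluster instead of creating it
  cluster <- doAzureParallel::getCluster(cluster_name, verbose = TRUE)
  doAzureParallel::registerDoAzureParallel(cluster)

  # Toy job (placeholder for the Monte Carlo simulation)
  foreach::foreach(i = 1:2) %dopar% {
    sqrt(i)
  }
}
```

Calling reproduce_issue() in a new session is where I would expect the hang described above.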


angusrtaylor · 2018-11-26T16:02:22Z

A workaround is just to run makeCluster(). You then get the following message and can run jobs on the cluster:

The specified cluster 'montecarlo' already exists. Cluster 'montecarlo' will be used.
Your cluster has been registered.
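
A rough sketch of that workaround, assuming the cluster configuration file (here the hypothetical path cluster.json) names the same cluster that already exists, and credentials.json is likewise a placeholder:

```r
# Workaround sketch: register the existing cluster through makeCluster()
# rather than getCluster(). Both file paths are assumptions for illustration.
register_via_make_cluster <- function(cluster_config = "cluster.json",
                                      credentials_file = "credentials.json") {
  doAzureParallel::setCredentials(credentials_file)

  # The messages above suggest makeCluster() reuses the cluster when one with
  # the configured name already exists
  cluster <- doAzureParallel::makeCluster(cluster_config)
  doAzureParallel::registerDoAzureParallel(cluster)

  invisible(cluster)
}
```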

brnleehng · 2018-11-26T18:59:13Z

Hi @angusrtaylor

Are you using the monte carlo sample cluster configuration?
How many nodes are you using?

Thanks,
Brian

angusrtaylor · 2018-11-27T12:16:05Z

@brnleehng yes I'm using the monte carlo sample cluster configuration with 2 low priority nodes. However, the same occurs with another cluster I am using with 5 dedicated nodes.

Thanks
Angus

brnleehng · 2018-11-29T07:09:55Z

What region are you currently in?

I'm also having issues with nodes reporting that they are idle, on both low priority and dedicated nodes. I will investigate the Batch node logs.

Thanks,
Brian

angusrtaylor · 2018-11-29T08:15:30Z

I'm using westeurope. I'll try a different region and let you know if the same issue occurs. Thanks

angusrtaylor · 2019-03-23T11:03:44Z

@brnleehng FYI this issue is still occurring. I've experienced it in every region I've tried, including westeurope and southcentralus.

angusrtaylor · 2019-03-23T11:05:43Z

My workaround (using makeCluster) is also causing problems. I get the warning:

The specified cluster 'rbscl' already exists. Cluster 'rbscl' will be used.
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 0
Warning message:
In self$client$extractAzureResponse(response, content) :
Conflict (HTTP 409).

zerweck · 2019-06-28T22:57:45Z

Could this have something to do with the Docker container settings? I diffed the verbose HTTP logs from registering an existing cluster via getCluster and via makeCluster, and found two things:

When running getCluster in a session where makeCluster has already run successfully, it works without problems for me. However, after deleting the cluster object and restarting the session, I can only use makeCluster.
In the case of a non-working cluster object after running getCluster, all requests still work up to a certain point, namely when the following lines are printed:

============================
Id: job20190628223825
chunkSize: 1
enableCloudCombine: TRUE
errorHandling: pass
wait: FALSE
autoDeleteJob: TRUE
============================

The next request is the PUT for jobxxx-metadata.rds, which is the last one to succeed. The POST to /jobs/jobxxx/tasks?api-version=2018-12-01.8.0 HTTP/1.1 that follows it fails. The only differences between the requests are in the Authorization: SharedKey value in the HTTP header and one strange difference in the JSON payload: the containerSettings imageName is empty when running getCluster, but filled when running makeCluster.
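
To illustrate the kind of comparison I did on the payloads, here is a small base R sketch. The two lists are hypothetical, heavily reduced stand-ins for the task payloads (the image name and run options are made up for illustration, not copied from my logs); only the empty versus filled imageName reflects what I observed.

```r
# Hypothetical, reduced task payloads as nested lists
payload_get <- list(
  containerSettings = list(imageName = "", containerRunOptions = "--rm")
)
payload_make <- list(
  containerSettings = list(imageName = "rocker/tidyverse:latest",
                           containerRunOptions = "--rm")
)

# Flatten the nested lists into named character vectors
flat_get <- unlist(payload_get)
flat_make <- unlist(payload_make)

# Keep the fields present in both payloads whose values differ
shared <- intersect(names(flat_get), names(flat_make))
differing <- shared[flat_get[shared] != flat_make[shared]]

print(differing)
```

With these stand-in payloads, the only field reported is containerSettings.imageName.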
