This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Jobs don't start after registering cluster with getCluster() #330

Open
angusrtaylor opened this issue Nov 26, 2018 · 8 comments

@angusrtaylor
Contributor

Before submitting a bug please check the following:

  • [x] Start a new R session
  • [x] Check your credentials file
  • [x] Install the latest doAzureParallel package
  • [x] Submit a minimal, reproducible example
  • [x] Run sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /data/mlserver/9.3.0/runtime/R/lib/libRblas.so
LAPACK: /data/mlserver/9.3.0/runtime/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] doAzureParallel_0.7.2 iterators_1.0.9 foreach_1.4.5 RevoUtilsMath_10.0.1
[5] RevoUtils_10.0.7 RevoMods_11.0.0 MicrosoftML_9.3.0 RevoScaleR_9.3.0
[9] lattice_0.20-35 rpart_4.1-11

loaded via a namespace (and not attached):
[1] codetools_0.2-15 CompatibilityAPI_1.1.0 digest_0.6.17 rAzureBatch_0.6.2
[5] mime_0.5 bitops_1.0-6 grid_3.4.3 R6_2.2.2
[9] jsonlite_1.5 httr_1.3.1 curl_3.2 rjson_0.2.20
[13] tools_3.4.3 RCurl_1.95-4.11 yaml_2.2.0 compiler_3.4.3
[17] mrupdate_1.0.1

Description

I have an existing cluster created using the montecarlo_pricing_simulation.R script. In a fresh R session, I use getCluster as follows:

cluster <- getCluster("montecarlo", verbose = TRUE)

which outputs:

nodes:
idle: 2
creating: 0
starting: 0
waitingforstarttask: 0
starttaskfailed: 0
preempted: 0
running: 0
other: 0
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 2

However, when I submit a job to Azure Batch, it hangs after printing the following:

Id: job20181126153436
chunkSize: 13
enableCloudCombine: TRUE
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE

The cluster nodes on the portal remain idle. Eventually, I get the following error:

Error in curl::curl_fetch_memory(url, handle = handle) :
SSL read: error:00000000:lib(0):func(0):reason(0), errno 104

This also happens with my own code. I cannot successfully run jobs on an existing cluster that has been retrieved with getCluster().

Instructions to reproduce the problem, if applicable

  • Create a cluster
  • Restart the R session
  • Load the cluster with getCluster()
  • Try to submit a job
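
The steps above can be sketched as a small R function. This is a minimal sketch, not a verified reproduction: the function name reproduce_issue and the file name credentials.json are assumptions, and it only runs with valid Azure credentials and a pool that already exists from an earlier session.

```r
# Sketch only: requires doAzureParallel, valid Azure credentials, and an
# existing pool. "credentials.json" is an assumed file name.
reproduce_issue <- function(cluster_name = "montecarlo",
                            credentials_file = "credentials.json") {
  library(doAzureParallel)  # also attaches foreach

  setCredentials(credentials_file)

  # Attach to the pool created in an earlier session
  cluster <- getCluster(cluster_name, verbose = TRUE)
  registerDoAzureParallel(cluster)

  # A trivial job; in the report above, this is the step that hangs
  foreach(i = 1:4) %dopar% sqrt(i)
}
```

Call reproduce_issue() in a fresh R session, after the cluster has been created in a previous one.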

@angusrtaylor
Contributor Author

A workaround is to run makeCluster() instead. You then get the following message and can run jobs on the cluster:

The specified cluster 'montecarlo' already exists. Cluster 'montecarlo' will be used.
Your cluster has been registered.
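
A minimal sketch of this workaround, assuming a cluster configuration file (here called cluster.json, an assumed name) whose pool name matches the existing cluster. The function name attach_via_make_cluster and the file name credentials.json are also assumptions.

```r
# Sketch only: requires doAzureParallel and valid Azure credentials.
# "cluster.json" and "credentials.json" are assumed file names; the cluster
# file must name the pool that already exists.
attach_via_make_cluster <- function(cluster_file = "cluster.json",
                                    credentials_file = "credentials.json") {
  library(doAzureParallel)

  setCredentials(credentials_file)

  # For an existing pool, makeCluster() reports that the cluster already
  # exists and reuses it rather than creating a new one
  cluster <- makeCluster(cluster_file)
  registerDoAzureParallel(cluster)
  cluster
}
```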

@brnleehng
Collaborator

Hi @angusrtaylor

  • Are you using the monte carlo sample cluster configuration?
  • How many nodes are you using?

Thanks,
Brian

@angusrtaylor
Contributor Author

@brnleehng Yes, I'm using the monte carlo sample cluster configuration with 2 low priority nodes. However, the same thing happens with another cluster I am using, which has 5 dedicated nodes.

Thanks
Angus

@brnleehng
Collaborator

What region are you currently in?

I'm also seeing nodes that report as idle, on both low priority and dedicated nodes. I will investigate the Batch node logs.

Thanks,
Brian

@angusrtaylor
Contributor Author

I'm using westeurope. I'll try a different region and let you know if the same issue occurs. Thanks

@angusrtaylor
Contributor Author

@brnleehng FYI, this issue is still occurring. I've experienced it in every region I've tried, including westeurope and southcentralus.

@angusrtaylor
Contributor Author

My workaround (using makeCluster) is also causing problems. I get the following warning:

The specified cluster 'rbscl' already exists. Cluster 'rbscl' will be used.
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 0
Warning message:
In self$client$extractAzureResponse(response, content) :
Conflict (HTTP 409).

@zerweck
zerweck commented Jun 28, 2019

Could this have something to do with the docker container settings? I diffed the verbose HTTP logs from registering an existing cluster via getCluster and via makeCluster. I found two things:

  1. When running getCluster in a session where makeCluster has already been run successfully, it works without problems for me. However, after deleting the cluster object and restarting the session, I can only use makeCluster.
  2. In the case of a non-working cluster object after running getCluster, all requests still work up to a certain point, which is when the following lines are printed:
============================
Id: job20190628223825
chunkSize: 1
enableCloudCombine: TRUE
errorHandling: pass
wait: FALSE
autoDeleteJob: TRUE
============================

The next request is the PUT for jobxxx-metadata.rds. This is the last one that works. The POST to /jobs/jobxxx/tasks?api-version=2018-12-01.8.0 that follows it breaks. The only differences between the requests are in the Authorization: SharedKey HTTP header and one strange difference in the JSON payload: the containerSettings imageName is empty when running getCluster, but filled when running makeCluster.
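
The kind of payload comparison described above can be sketched in base R. This is an illustration under stated assumptions, not the method actually used in this thread: the helper name diff_payloads is hypothetical, and both example payloads are made up. Only the field name containerSettings imageName comes from the logs; the id, commandLine, and image values are placeholders.

```r
# Flatten a nested list into a named character vector, e.g.
# list(a = list(b = "x")) becomes c(a.b = "x").
# Note: NULL entries are dropped and unnamed elements get positional names.
flatten_payload <- function(x) {
  v <- unlist(x)
  if (is.null(v)) v <- character(0)
  storage.mode(v) <- "character"
  v
}

# Return the leaf fields whose values differ between two payloads.
diff_payloads <- function(first, second) {
  fa <- flatten_payload(first)
  fb <- flatten_payload(second)
  keys <- union(names(fa), names(fb))
  va <- unname(fa[keys])
  vb <- unname(fb[keys])
  # A field differs if it is missing on one side or the values are unequal
  differ <- xor(is.na(va), is.na(vb)) |
    (!is.na(va) & !is.na(vb) & va != vb)
  data.frame(field = keys[differ], first = va[differ], second = vb[differ],
             stringsAsFactors = FALSE)
}

# Made-up payloads for illustration only
from_get_cluster <- list(
  id = "task1",
  commandLine = "Rscript worker.R",
  containerSettings = list(imageName = "")
)
from_make_cluster <- list(
  id = "task1",
  commandLine = "Rscript worker.R",
  containerSettings = list(imageName = "rocker/tidyverse:latest")
)

print(diff_payloads(from_get_cluster, from_make_cluster))
```

With real request bodies, parse the JSON into nested lists first (for example with jsonlite::fromJSON) and pass the results to diff_payloads.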
