Describe the bug
We want to use the R package doParallel to use most of the cores on a node of an Azure compute cluster. However, when we submit an experiment as a job on the compute cluster, it fails with the following message:
Failed to create bus connection: No such file or directory
Error in serialize(data, node$con) : error writing to connection
Calls: train ... postNode -> sendData -> sendData.SOCKnode -> serialize
Execution halted
To Reproduce
Steps to reproduce the behavior:
Identify the cores available on the compute cluster node.
Register half of the available cores for parallel processing.
Run xgboost training in parallel.
Code is attached for additional insights.
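For readers without access to the attachments, the steps above can be sketched roughly as follows. This is a minimal illustration, not the attached TrainingScript.txt: the variable names (`n_cores`, `n_workers`, `cl`, `squares`) are made up for the example, and a trivial `foreach` loop stands in for the xgboost training, which is assumed here to run through a backend registered with `registerDoParallel`.

```r
library(parallel)
library(foreach)
library(doParallel)

# Step 1: identify the cores available on the node.
# detectCores() can return NA on some platforms, so fall back to 1.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Step 2: register half of the available cores (at least one worker).
n_workers <- max(1L, n_cores %/% 2L)
cl <- parallel::makeCluster(n_workers)  # socket (PSOCK) cluster by default
doParallel::registerDoParallel(cl)

# Step 3: run work in parallel. This placeholder loop stands in for the
# xgboost training, which is not reproduced here.
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

# Always release the workers when done.
parallel::stopCluster(cl)
```

`makeCluster` creates socket-based workers by default, which matches the `sendData.SOCKnode` frame in the error trace, so a sketch like this exercises the same communication path as the failing job. Whether it reproduces the failure on a given compute cluster is not something this sketch can guarantee.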
Expected behavior
The doParallel package executes the xgboost training in parallel, and the results are obtained much faster than when training on a single core.
Additional context
Based on other answers found on the internet, the problem appears to be related to the service socket bus, but I am not sure how that is configured for a compute cluster.
TrainingScript.txt
Estimator.txt