
Automatic provisioning of VM instances #19

Open
mullerrwd opened this issue Jun 26, 2020 · 7 comments


@mullerrwd

Checking the configuration (deepkit.yml) documentation, I do not see an appropriate method to provision VM instances (e.g. MS Azure) as a node through a REST API.

A current solution is to:

  1. Start a DL VM instance and provision the instance through Deepkit as a node.
  2. Start an experiment.
  3. When the experiment has ended, or is shut down per user request, stop the instance.

However, this does not prevent unnecessary idle time of the instance, which adds to the cost if one does not stop the instance directly after an experiment.

Preferred functionality would be:

  1. Define a target VM instance within the deepkit.yml experiment file through an API.
  2. Let Deepkit start the instance and provision it automatically when the experiment is started by the user.
  3. When the experiment has ended, stop the instance.

Example config file:

vm: <API: start the instance>
image: tensorflow/tensorflow:1.15.2-gpu-py3
command: python model.py
vm_post: <API: stop the instance>
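The start/run/stop lifecycle from the config above could be sketched roughly like this. All names here (`start_instance`, `stop_instance`, `run_experiment`) are hypothetical placeholders for whatever vendor API calls would back the `vm:` and `vm_post:` hooks, not real Deepkit functions:

```python
# Hypothetical sketch of the desired VM lifecycle around an experiment.
# start_instance/stop_instance stand in for vendor REST calls (e.g. Azure);
# run_experiment stands in for the actual training job.

def start_instance(name):
    print(f"starting {name}")
    return {"name": name, "running": True}

def stop_instance(vm):
    vm["running"] = False
    print(f"stopped {vm['name']}")

def run_experiment(vm, command):
    return f"ran '{command}' on {vm['name']}"

def run_with_provisioning(name, command):
    vm = start_instance(name)         # vm: <API: start the instance>
    try:
        return run_experiment(vm, command)
    finally:
        stop_instance(vm)             # vm_post: <API: stop the instance>

result = run_with_provisioning("azure-node-1", "python model.py")
```

The try/finally guarantees the stop hook runs even if the experiment crashes, which is exactly what prevents the idle-cost problem described above.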

If you have a different workaround in place, I would be happy to hear about it!

@marcj
Member

marcj commented Jun 26, 2020

Hey, I already started (8f9ac31) cloud provider support, which does exactly that. It works a bit differently though: you first create a new cluster of type Azure/AWS/GC/Genesis and configure which instance types should be allowed and how many. Once you start a cloud experiment, that configured cluster auto-scales up/down depending on the workload. It works without adding new deepkit.yml configuration options. I think we'll release the first version with this feature, marked experimental, next month.
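A minimal sketch of the scale-up/down decision described here, assuming a cluster configured with a maximum instance count (the function and its parameters are illustrative, not Deepkit's actual code):

```python
def instances_needed(queued_jobs, running_instances, max_instances):
    """Decide how many instances to add (+) or remove (-).

    Scale up toward one instance per queued job, capped at the
    cluster's configured maximum; scale down when instances
    outnumber the workload.
    """
    target = min(queued_jobs, max_instances)
    return target - running_instances

# Scale up: 5 queued jobs, 1 running instance, cap of 4 -> add 3
print(instances_needed(5, 1, 4))
# Scale down: no queued jobs, 2 idle instances -> remove both
print(instances_needed(0, 2, 4))
```

The point is that the scaling policy lives in the scheduler, driven by the experiment queue, rather than in per-experiment deepkit.yml options.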

@mullerrwd
Author

Hi @marcj, awesome to hear! Are you using HorovodRunner to support distributed training, or your own abstraction?

In any case, I would be happy to help and test the functionality with a Microsoft Azure cluster.

@marcj
Member

marcj commented Jun 26, 2020

@mullerrwd currently we don't have our own concept for distributed learning. You can certainly configure multiple instances in your deepkit.yml using essentially the pipeline feature, and then connect them to each other, but that is something I'm still trying to improve. Not yet sure which framework, if any, will be used.

I'm currently planning to support GenesisCloud/AWS/GC at the beginning. Microsoft Azure is something I've never used personally, so this will probably be integrated last.

@mullerrwd
Author

As a first case, I would be interested in a single-job experiment provisioning a single VM instance. I still have to check the cluster and auto-scaling functionality of Azure, since I mostly use GC myself, though I haven't tested the cluster functionality on GC either. So I understand prioritizing support for GenesisCloud, GC and AWS.

I will dive into the process for Microsoft Azure in the meantime.

@marcj
Member

marcj commented Jun 26, 2020

Deepkit doesn't use the auto-scaling functions of any of these vendors. It implements its own algorithm and uses only the most basic API calls of those vendors: createInstance/terminateInstance/getPublicIP. So as long as Azure offers equivalent API calls, it should be easy to integrate.
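That minimal surface area can be expressed as a small provider interface. This is a sketch with hypothetical method names matching the three calls mentioned above, plus a fake in-memory provider standing in for an Azure implementation; it is not Deepkit's actual code:

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """The minimal vendor surface described above: create, terminate, get IP."""

    @abstractmethod
    def create_instance(self, instance_type: str) -> str:
        """Start an instance and return its vendor-specific id."""

    @abstractmethod
    def terminate_instance(self, instance_id: str) -> None:
        """Stop and delete the instance."""

    @abstractmethod
    def get_public_ip(self, instance_id: str) -> str:
        """Return the instance's public IP for connecting to it."""

class FakeProvider(CloudProvider):
    """In-memory stand-in; an Azure version would issue REST calls instead."""

    def __init__(self):
        self._instances = {}
        self._next = 0

    def create_instance(self, instance_type):
        self._next += 1
        iid = f"vm-{self._next}"
        self._instances[iid] = instance_type
        return iid

    def terminate_instance(self, instance_id):
        del self._instances[instance_id]

    def get_public_ip(self, instance_id):
        return f"10.0.0.{instance_id.split('-')[1]}"

provider = FakeProvider()
iid = provider.create_instance("Standard_NC6")
ip = provider.get_public_ip(iid)
provider.terminate_instance(iid)
```

Supporting Azure would then come down to implementing these three methods against Azure's compute API.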

@mullerrwd
Author

Good to know, I'll keep that in mind when testing on MS Azure. Knowing how you support the primary cloud providers, I'll be able to help you with Azure support.

@mullerrwd
Author

Leaving this here for future reference: Microsoft Azure ML cluster

@marcj I assume you are using the API as described here: https://cloud.google.com/dataproc/docs/concepts/compute/gpus
Microsoft Azure seems to use a different basic API abstraction, which can be interfaced with directly from within Python; see the doc link above. Basically, one registers and configures the VM instance or compute cluster using the azureml Python SDK as below:

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# 'ws' is assumed to be an existing azureml.core.Workspace object

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

and

from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

# Create a new runconfig object
run_amlcompute = RunConfiguration()

# Use the cpu_cluster you created above. 
run_amlcompute.target = cpu_cluster

# Enable Docker
run_amlcompute.environment.docker.enabled = True

# Set Docker base image to the default CPU-based image
run_amlcompute.environment.docker.base_image = DEFAULT_CPU_IMAGE

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
run_amlcompute.environment.python.user_managed_dependencies = False

# Specify CondaDependencies obj, add necessary packages
run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])

After that, you can kick off the training experiment using the same azureml SDK. The document linked above describes how to do that both with the Python SDK and with the Azure CLI.

I'm still working my way through the examples, the documentation, and my own tests. I will report my findings here for your convenience.
