
Automatic provisioning of VM instances #19

Open
mullerrwd opened this issue Jun 26, 2020 · 7 comments


@mullerrwd

Checking the configuration (deepkit.yml) documentation, I do not see an appropriate method to provision VM instances (e.g. MS Azure) as a node through a REST API.

A current solution is to:

  1. Start a DL VM instance and provision the instance through Deepkit as a node.
  2. Start an experiment.
  3. When the experiment has ended, or is shut down per user request, stop the instance.

However, this does not prevent unnecessary idle time of the instance, which adds to the cost if one does not stop the instance directly after an experiment.

Preferred functionality would be:

  1. Define a target VM instance within the deepkit.yml experiment file through an API.
  2. Let Deepkit start the instance and provision it automatically when the experiment is started by the user.
  3. When the experiment has ended, stop the instance.

Example config file:

vm: <API: start the instance>
image: tensorflow/tensorflow:1.15.2-gpu-py3
command: python model.py
vm_post: <API: stop the instance>
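The start/run/stop lifecycle from the config above could be sketched roughly like this. All names here (`start_instance`, `stop_instance`, `run_experiment`) are hypothetical placeholders for whatever vendor API calls would back the `vm:` and `vm_post:` hooks, not real Deepkit functions:

```python
# Hypothetical sketch of the desired VM lifecycle around an experiment.
# start_instance/stop_instance stand in for vendor REST calls (e.g. Azure);
# run_experiment stands in for the actual training job.

def start_instance(name):
    print(f"starting {name}")
    return {"name": name, "running": True}

def stop_instance(vm):
    vm["running"] = False
    print(f"stopped {vm['name']}")

def run_experiment(vm, command):
    return f"ran '{command}' on {vm['name']}"

def run_with_provisioning(name, command):
    vm = start_instance(name)         # vm: <API: start the instance>
    try:
        return run_experiment(vm, command)
    finally:
        stop_instance(vm)             # vm_post: <API: stop the instance>

result = run_with_provisioning("azure-node-1", "python model.py")
```

The try/finally guarantees the stop hook runs even if the experiment crashes, which is exactly what prevents the idle-cost problem described above.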

If you have a different workaround in place, I would be happy to hear about it!

@marcj
Member

marcj commented Jun 26, 2020

Hey, I already started (8f9ac31) cloud provider support, which does exactly that. It works a bit differently though: you first create a new cluster of type Azure/AWS/GC/Genesis and configure which instance types should be allowed and how many. Once you start a cloud experiment, that configured cluster auto-scales up/down depending on the workload. It works without adding new deepkit.yml configuration options. I think we'll release the first version with this feature, marked experimental, next month.
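A minimal sketch of the scale-up/down decision described here, assuming a cluster configured with a maximum instance count (the function and its parameters are illustrative, not Deepkit's actual code):

```python
def instances_needed(queued_jobs, running_instances, max_instances):
    """Decide how many instances to add (+) or remove (-).

    Scale up toward one instance per queued job, capped at the
    cluster's configured maximum; scale down when instances
    outnumber the workload.
    """
    target = min(queued_jobs, max_instances)
    return target - running_instances

# Scale up: 5 queued jobs, 1 running instance, cap of 4 -> add 3
print(instances_needed(5, 1, 4))
# Scale down: no queued jobs, 2 idle instances -> remove both
print(instances_needed(0, 2, 4))
```

The point is that the scaling policy lives in the scheduler, driven by the experiment queue, rather than in per-experiment deepkit.yml options.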

@mullerrwd
Author

Hi @marcj, awesome to hear! Are you using HorovodRunner to support distributed training, or your own abstraction?

In any case, I would be happy to help and test the functionality with a Microsoft Azure cluster.

@marcj
Member

marcj commented Jun 26, 2020

@mullerrwd currently we don't have our own concept for distributed learning. You can certainly configure multiple instances in your deepkit.yml using essentially the pipeline feature, and then connect them to each other, but that is something I'm still trying to improve. Not yet sure which framework, if any, will be used.

I'm currently planning to support GenesisCloud/AWS/GC at the beginning. Microsoft Azure is something I've never used personally, so this will probably be integrated last.

@mullerrwd
Author

As a first case, I would be interested in a single-job experiment provisioning a single VM instance. I still have to check the cluster and auto-scaling functionality of Azure, since I mostly use GC myself, though I haven't tested the cluster functionality on GC either. So I understand prioritizing support for GenesisCloud, GC and AWS.

I will dive into the process for Microsoft Azure in the meantime.

@marcj
Member

marcj commented Jun 26, 2020

Deepkit doesn't use the auto-scaling functions of any of these vendors. It implements its own algorithm and uses only the most basic API calls of those vendors: createInstance/terminateInstance/getPublicIP. So as long as Azure offers equivalent API calls, it should be easy to integrate.
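That minimal surface area can be expressed as a small provider interface. This is a sketch with hypothetical method names matching the three calls mentioned above, plus a fake in-memory provider standing in for an Azure implementation; it is not Deepkit's actual code:

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """The minimal vendor surface described above: create, terminate, get IP."""

    @abstractmethod
    def create_instance(self, instance_type: str) -> str:
        """Start an instance and return its vendor-specific id."""

    @abstractmethod
    def terminate_instance(self, instance_id: str) -> None:
        """Stop and delete the instance."""

    @abstractmethod
    def get_public_ip(self, instance_id: str) -> str:
        """Return the instance's public IP for connecting to it."""

class FakeProvider(CloudProvider):
    """In-memory stand-in; an Azure version would issue REST calls instead."""

    def __init__(self):
        self._instances = {}
        self._next = 0

    def create_instance(self, instance_type):
        self._next += 1
        iid = f"vm-{self._next}"
        self._instances[iid] = instance_type
        return iid

    def terminate_instance(self, instance_id):
        del self._instances[instance_id]

    def get_public_ip(self, instance_id):
        return f"10.0.0.{instance_id.split('-')[1]}"

provider = FakeProvider()
iid = provider.create_instance("Standard_NC6")
ip = provider.get_public_ip(iid)
provider.terminate_instance(iid)
```

Supporting Azure would then come down to implementing these three methods against Azure's compute API.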

@mullerrwd
Author

Good to know, I'll keep that in mind when testing on MS Azure. Knowing how you support the primary cloud providers, I'll be able to help you with Azure support.

@mullerrwd
Author

Leaving this here for future reference: Microsoft Azure ML cluster

@marcj I assume you are using the API as described here: https://cloud.google.com/dataproc/docs/concepts/compute/gpus
Microsoft Azure seems to use a different basic API abstraction, which can be interfaced with directly from within Python; see the doc link above. Basically, one registers and configures the VM instance or compute cluster using the azureml Python SDK as below:

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# 'ws' is assumed to be an existing azureml.core.Workspace object

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

and

from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

# Create a new runconfig object
run_amlcompute = RunConfiguration()

# Use the cpu_cluster you created above. 
run_amlcompute.target = cpu_cluster

# Enable Docker
run_amlcompute.environment.docker.enabled = True

# Set Docker base image to the default CPU-based image
run_amlcompute.environment.docker.base_image = DEFAULT_CPU_IMAGE

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
run_amlcompute.environment.python.user_managed_dependencies = False

# Specify CondaDependencies obj, add necessary packages
run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])

After that, you can kick off the training experiment using the same azureml SDK. The document linked above describes how to do that both with the Python SDK and with the Azure CLI.

I'm still working my way through the examples, the documentation, and my own tests. I will report my findings here for your convenience.
