
Cloud Providers

Intro

The big cloud providers offer a vast range of services and a lot of breadth in the MLOps space. These offerings can feel like a bundle of disconnected services until one sees the approach that holds them together. At a high level the cloud providers' offerings can appear very similar, but in detail each provider has a different set of services and a different approach to stitching them together.

Google's approach with Vertex is differentiated by the way it is structured, using Vertex Pipelines as an orchestrator (Vertex Pipelines being a managed and integrated version of the open source Kubeflow Pipelines). AWS SageMaker is differentiated primarily by the range of its services and how they relate to the rest of AWS. Microsoft's strategy with Azure centers on developer experience and quality of integrations.

Google

Vertex is Google's newly-unified AI platform. The main ways in which it is unified are:

  • Everything falls logically under Vertex headings in the Google Cloud console, and the APIs to the services should be consistent. (Previously AutoML was visibly a separate function.)
  • Pipelines can be used as an orchestrator for most of the workflow (including AutoML).

Pipelines for Orchestration

This idea of pipelines as an orchestrator across offerings is illustrated here (from TechCrunch):

Google Vertex AI Components with Pipelines as Orchestration

This could be confusing to those familiar with Kubeflow Pipelines (which is what Vertex Pipelines are under the hood), as Kubeflow Pipelines started out as a distributed training system: each step executes in a separate container, with a UI to inspect runs and ways to resume from a failed step. Pipelines are still usable for distributed training, but they can also be used to perform other tasks beyond training. This is illustrated in the screenshot below:

Google Vertex AI Pipelines screenshot

Here a conditional step decides whether the model is good enough to deploy. If it passes the check then the model is deployed directly from the pipeline.
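
To make the conditional deployment idea concrete, here's a minimal sketch using the open source Kubeflow Pipelines (kfp) v2 SDK, which is what Vertex Pipelines run. The component bodies, the 0.9 threshold and the file names are illustrative placeholders rather than anything Vertex-specific:

from kfp import dsl, compiler

# Hypothetical components for illustration: "train" returns an evaluation
# metric and "deploy" would push the model to an endpoint.
@dsl.component
def train() -> float:
    # ...train the model, save artifacts, return an evaluation metric...
    return 0.92

@dsl.component
def deploy():
    # ...upload the model and create/update an endpoint...
    pass

@dsl.pipeline(name="conditional-deploy-demo")
def pipeline():
    train_task = train()
    # Deploy only if the metric clears a threshold (0.9 here is arbitrary)
    with dsl.Condition(train_task.output > 0.9):
        deploy()

compiler.Compiler().compile(pipeline, "conditional_pipeline.json")
# The compiled pipeline can then be submitted to Vertex, e.g. via
# aiplatform.PipelineJob(display_name="demo", template_path="conditional_pipeline.json").run()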

Revamped AutoML

The AutoML offerings are now more consistent with other parts of Google's AI stack. Different ways of inputting your data can lead to the same training path, and different training paths can lead to the same deployment path. Here's a diagram from Henry Tappen and Brian Kobashikawa (via Lak Lakshmanan):

Google Vertex AI Components with progression flows

There are still some differences, as can be seen in how datasets are handled. The dataset concept is split into managed datasets (which carry specific metadata and live on specific Google data products) and unmanaged ones. Managed datasets are mostly just for AutoML for now.

Training Models

We could think of Google as having three basic routes to training models: AutoML, pipelines (which are intended more as an orchestrator) and custom training jobs. Where pipelines use multiple containers, with each step in a different container, a custom training job is a single-container training function. It has automatic integration with TensorBoard for results. It can do distributed training via the underlying ML framework, provided the framework supports it. It supports GPUs and you can watch/inspect runs, though inspecting runs looks a bit basic compared to pipelines.
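
As a rough sketch (not official sample code), a custom training job submitted with the google-cloud-aiplatform SDK might look like this; the project, bucket and container URI are placeholders:

from google.cloud import aiplatform

# Project, region, bucket and image URI below are placeholders
aiplatform.init(project="my-project", location="us-central1", staging_bucket="gs://my-bucket")

job = aiplatform.CustomTrainingJob(
    display_name="my-training-job",
    script_path="train.py",    # an ordinary training script, nothing Vertex-specific
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-11.py310:latest",  # placeholder pre-built training image
    requirements=["pandas"],
)

job.run(machine_type="n1-standard-4", replica_count=1)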

There's a facility native to custom training jobs for tuning hyperparameters. In addition, Google has launched Vizier. Instead of being native to training jobs, Vizier has an API: you tell it what you've tried and it makes suggestions for what to try next. This may be less integrated, but Vizier is able to go deeper in what it can tune.

Deployment

To deploy your models to get predictions, you have two options. If your model is built using a natively supported framework, you can tell Google to load your model (a serialized artifact) into a pre-built container image. If your model's framework is not supported or you have custom logic, then you can supply a custom image meeting the specification. Google can then create an endpoint for you to call to get predictions. You can create this either through the API or using the web console wizard.
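
A hedged sketch of the pre-built-container route using the google-cloud-aiplatform SDK; the artifact path and serving image URI are placeholders and would need to match your framework:

from google.cloud import aiplatform

# Bucket path and serving image are placeholders; the serving image would be
# one of Google's pre-built prediction containers or your own custom image
model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://my-bucket/model/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)

endpoint = model.deploy(machine_type="n1-standard-2")
print(endpoint.predict(instances=[[1.0, 2.0, 3.0]]))  # example payload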

Monitoring

Running models can be monitored for training/serving skew. For skew detection you supply your training set when you create the monitoring job. The monitoring job stores the data going into your model in a storage bucket and uses it for comparison against the training data. You can get alerts based on configurable thresholds for individual features. Drift detection is similar but doesn't require training data, since it monitors change over time. Feature distributions are shown in the console for any alerts.

Amazon

SageMaker aims to be both comprehensive and integrated. It has services addressing all parts of the ML lifecycle and a variety of ways to interact with them, including its own dedicated web-based IDE (SageMaker Studio, based on JupyterLab).

AWS SageMaker components. Screenshot from SageMaker website

The services marked as 'New' in the diagram above were mostly announced at re:Invent in December 2020.

Prepare

SageMaker Ground Truth is a labelling service, similar to Google's but with more automation features and less reliance on humans. Its automation makes it competitive with specialist labelling tools (a whole area in itself).

Data Wrangler allows data scientists to visualize, transform and analyze data from supported data sources, all from within SageMaker Studio:

AWS SageMaker Data Wrangler screenshot showing analysis - from SageMaker website

AWS SageMaker Data Wrangler screenshot showing dataset - from SageMaker website

Data Wrangler is also integrated with Clarify (which handles explainability) to highlight bias in data. This streamlines feature engineering, and the resulting features can go directly to SageMaker Feature Store. Custom code can be added, and SageMaker also has separate support for Spark processing jobs.

Once features are in the Feature Store, they are available to be searched for and used by other teams. They can also be used at the serving stage as well as the training stage.
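
As an illustrative sketch (not official sample code), features can be written to the Feature Store with the SageMaker Python SDK roughly like this; the DataFrame, names, role and bucket are all placeholders:

import sagemaker
import pandas as pd
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"   # placeholder role ARN

# A toy DataFrame; real features would come from a Data Wrangler flow or other processing
df = pd.DataFrame({
    "customer_id": [1, 2],
    "spend_30d": [12.5, 80.0],
    "event_time": [1700000000.0, 1700000000.0],
})

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)    # infer the schema from the DataFrame
fg.create(
    s3_uri="s3://my-bucket/feature-store/",   # offline store location (placeholder)
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,                 # also serve features at inference time
)
# In practice, wait until the feature group status is 'Created' before ingesting
fg.ingest(data_frame=df, max_workers=2, wait=True)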

Build

The 'Build' heading is for offerings that save time throughout the whole process. AutoPilot is SageMaker's AutoML service that covers automated feature engineering, model building and selection. The various models it builds are all visible so you can evaluate them and choose which to deploy.
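
For illustration, AutoPilot jobs can also be driven from the SageMaker Python SDK's AutoML class; the sketch below assumes a CSV in S3 with a 'label' column, and the role and paths are placeholders:

from sagemaker.automl.automl import AutoML

# Role ARN, target column and S3 path are placeholders
automl = AutoML(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    target_attribute_name="label",
    max_candidates=10,
)
automl.fit(inputs="s3://my-bucket/autopilot/train.csv", wait=False)
# Once complete, the candidate models can be compared and the chosen one deployed,
# e.g. via automl.best_candidate() and automl.deploy(...)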

JumpStart is a set of CloudFormation templates for common ML use cases.

Train and Tune

Training with SageMaker is typically done from the Python SDK, which is used to invoke training jobs. A training job runs inside a container on an EC2 instance. You can use a pre-built Docker image if your training job is for a natively supported algorithm and framework. Otherwise you can use your own Docker image that conforms to the requirements. The SDK can also be used for distributed training jobs for frameworks that support distributed training.
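
A minimal sketch of a training job using the SageMaker Python SDK's PyTorch estimator; the role ARN, bucket path and framework versions are placeholders:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                  # your ordinary training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,                        # >1 for distributed training
    instance_type="ml.m5.xlarge",
    framework_version="1.13",
    py_version="py39",
)
estimator.fit({"training": "s3://my-bucket/training-data/"})  # placeholder S3 path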

Processing and training steps can be chained together in Pipelines. The resulting models can be registered in the model registry and you can track lineage on artifacts from steps. You can also view and execute pipelines from the SageMaker Studio UI.
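
Continuing the sketch above, a single-step pipeline might look roughly like this (names, role and paths are placeholders):

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Wrap the estimator from the previous snippet in a pipeline step
train_step = TrainingStep(name="TrainModel", estimator=estimator,
                          inputs={"training": "s3://my-bucket/training-data/"})

pipeline = Pipeline(name="my-pipeline", steps=[train_step])
pipeline.upsert(role_arn="arn:aws:iam::123456789012:role/SageMakerRole")  # placeholder role
execution = pipeline.start()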

Deploy and Manage

The SageMaker SDK has a 'deploy' operation for which you specify what type of instance you want your model deployed to. As with training, this can use either a custom image or a built-in one. The expectation is that training and deployment will both happen within SageMaker, but this can be worked around if you want to deploy a model that you've trained elsewhere. Serving real-time HTTP requests is the typical case, but you can also perform batch predictions and chain inference steps in inference pipelines.
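
Continuing from the training sketch, deployment and batch transform might look roughly like this; instance types, the example payload and S3 paths are placeholders:

# Real-time endpoint backed by the model the estimator just trained
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.predict([[0.5, 1.2, 3.4]]))   # payload shape depends on your model/serializer

# Batch predictions instead of a live endpoint
transformer = estimator.transformer(instance_count=1, instance_type="ml.m5.large")
transformer.transform("s3://my-bucket/batch-input/", content_type="text/csv")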

Deployed models get some monitoring by default, with integration to CloudWatch for basic invocation metrics. You can also set up scheduled monitoring jobs. SageMaker can be configured to capture request and response data, compare it against the training data, and trigger alerts based on constraints.
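
A hedged sketch of data capture plus a scheduled monitoring job with the SageMaker Python SDK; the buckets, cron expression and role are placeholders, and it reuses the estimator from the training example:

from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::123456789012:role/SageMakerRole"   # placeholder role ARN

# Capture request/response data when deploying the endpoint
capture = DataCaptureConfig(enable_capture=True, sampling_percentage=100,
                            destination_s3_uri="s3://my-bucket/data-capture/")
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large",
                             data_capture_config=capture)

# Baseline statistics/constraints from the training data, then a scheduled comparison job
monitor = DefaultModelMonitor(role=role, instance_count=1, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(baseline_dataset="s3://my-bucket/train.csv",
                         dataset_format=DatasetFormat.csv(header=True),
                         output_s3_uri="s3://my-bucket/baseline/")
monitor.create_monitoring_schedule(endpoint_input=predictor.endpoint_name,
                                   output_s3_uri="s3://my-bucket/monitoring/",
                                   statistics=monitor.baseline_statistics(),
                                   constraints=monitor.suggested_constraints(),
                                   schedule_cron_expression="cron(0 * ? * * *)")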

Azure

The Azure Machine Learning offering is consciously pitched at multiple roles (especially data scientists and developers) and different skill levels. It aims to support team collaboration and to automate the key problems of MLOps across the whole ML lifecycle. This comes across in the prominence Azure gives to workspaces and git repos, and there's also increasing support for Azure ML in VSCode (alongside the web-based GUI called Studio).

The cloud providers are all looking to leverage existing relationships in their MLOps offerings. For Microsoft this appears to be about developer relationships (with GitHub and VSCode) as well as its reputation as a compute provider. They seem keen on integrations with and references to open source tools, and integrations with the Databricks platform are prominent in the documentation.

Workspaces

Azure Machine Learning Components

With Azure Machine Learning everything belongs to a workspace by default and workspaces can be shared between users and teams. The assets under a workspace are shown in the studio web UI.

Azure Machine Learning assets in studio web UI

Let's walk through the key Azure ML concepts to get a feel for the platform.

Datasets

Datasets are references to where data is stored. The data itself isn't in the workspace, but the dataset abstraction lets you work with the data through the workspace; only metadata is copied to the workspace. Datasets come in FileDataset and TabularDataset varieties. The data can live on a range of supported types of storage, including blob storage, databases or the Databricks file system.
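
A small sketch of registering and reading a tabular dataset with the Azure ML Python SDK; the datastore path and dataset name are placeholders:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                         # assumes a local workspace config file
datastore = Datastore.get(ws, "workspaceblobstore")  # the workspace's default blob datastore
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/train.csv"))  # placeholder path
dataset = dataset.register(workspace=ws, name="training-data")
df = dataset.to_pandas_dataframe()                   # materialize the data when needed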

Environments

An environment is a configuration with variables and library dependencies, used both for training and for serving models. It plays a similar role to pipenv but is instantiated through Docker under the hood.
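
For example, an Environment can be given its dependencies like this (names and packages are just examples):

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment(name="training-env")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["scikit-learn", "pandas"]
)
# Alternatively, build from an existing conda file:
# env = Environment.from_conda_specification(name="training-env", file_path="environment.yml")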

Experiments

Experiments are groups of training runs. Each time we train a model with a set of parameters, that falls under an experiment that can be automatically recorded. This allows us to review what was trained when and by whom. Here's a simple script to submit a training run with the Azure ML Python SDK:

from azureml.core import ScriptRunConfig, Experiment, Workspace
from azureml.core.environment import Environment

# Connect to the workspace (reads the workspace config, e.g. when run from studio)
ws = Workspace.from_config()

# Create (or reuse) an experiment in the workspace
exp = Experiment(name="myexp", workspace=ws)

# Instantiate environment
myenv = Environment(name="myenv")

# Configure the ScriptRunConfig and specify the environment
src = ScriptRunConfig(source_directory=".", script="train.py", compute_target="local", environment=myenv)

# Submit run
run = exp.submit(src)

Here we're referring to another script called "train.py" that contains typical model training code, nothing Azure-specific. We connect to the workspace, name the experiment that will be used and also name the environment. Both the experiment and the environment are instantiated automatically, and the submit operation runs the training job.

The above is run from the web studio, with the files already in the cloud. Training can also be run from a notebook or locally, by having the CLI configured and submitting a CLI command with a YAML specification for the environment image and a pointer to the code.

Training parameters and metrics can be automatically logged, as Azure integrates with MLflow's open source approach to tracking. If you submit a run from a directory under git, then git information is also tracked for the run.
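
A small sketch of the MLflow integration (it assumes the azureml-mlflow package is installed; the experiment name and logged values are placeholders):

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())  # send MLflow tracking data to the workspace
mlflow.set_experiment("myexp")

with mlflow.start_run():
    mlflow.autolog()                       # automatic logging for supported frameworks
    mlflow.log_metric("accuracy", 0.93)    # or log values explicitly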

Pipelines

Azure Machine Learning Pipelines are for training jobs that have multiple long-running steps. The steps can be chained to run in different containers, so some can run in parallel, and if an individual step fails you can retry/resume from that point.
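
A minimal sketch of a two-step pipeline with the Azure ML Python SDK; the compute target name and script names are placeholders:

from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
# "cpu-cluster" is a placeholder compute target name
prep = PythonScriptStep(name="prep", script_name="prep.py",
                        source_directory=".", compute_target="cpu-cluster")
train = PythonScriptStep(name="train", script_name="train.py",
                         source_directory=".", compute_target="cpu-cluster")
train.run_after(prep)                     # explicit ordering between the steps

pipeline = Pipeline(workspace=ws, steps=[train])
Experiment(ws, "pipeline-experiment").submit(pipeline)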

Models

Models are either created in Azure through training runs or you can register models created elsewhere. A registered model can be deployed as an endpoint.
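
Registering a model from a file might look like this (the path and name are placeholders):

from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
# The model file could come from a training run or from outside Azure entirely
model = Model.register(workspace=ws, model_path="outputs/model.pkl", model_name="my-model")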

Endpoints

An endpoint sets up hosting so that you can make requests to your model in the cloud and get predictions. Endpoint hosting runs inside a container image - so basically an Environment, which could be the same image/Environment used for training. There are some prebuilt images available to use as a basis, or you can build an image from scratch that conforms to the requirements.
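
A hedged sketch of deploying a registered model with the same classic Python SDK used above, here to Azure Container Instances; the scoring script, model name and conda file are hypothetical. The newer managed online endpoints mentioned below use the v2 CLI/SDK instead:

from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="my-model")     # a previously registered model (name is a placeholder)

# serving-env.yml and score.py are hypothetical files: the conda spec must include
# the scoring dependencies, and score.py implements init()/run() to handle requests
env = Environment.from_conda_specification(name="serving-env", file_path="serving-env.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "my-endpoint", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)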

Azure ML's managed endpoints have traffic-splitting features for rollouts and can work with GPUs. Inference can be real-time or batch. There's also integration with monitoring features. Managed endpoints and monitoring are both in Preview/Beta at the time of writing.