SageMaker Heterogeneous Clusters for Model Training

In July 2022, we launched heterogeneous clusters for Amazon SageMaker model training, which enable you to launch training jobs that use several different instance types and families in a single job. A primary use case is offloading data preprocessing to compute-optimized instance types, while deep neural network (DNN) training continues to run on GPU or ML-accelerated instance types.
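
To make the idea concrete, here is a minimal sketch (not taken from this repository) of how such a job could be configured with the SageMaker Python SDK using instance groups. The entry point, role, bucket, group names, and instance counts are illustrative placeholders.

  from sagemaker.instance_group import InstanceGroup
  from sagemaker.tensorflow import TensorFlow

  # One compute-optimized group for data preprocessing, one GPU group for the DNN.
  data_group = InstanceGroup(
      instance_group_name="data_group",
      instance_type="ml.c5.18xlarge",
      instance_count=2,
  )
  dnn_group = InstanceGroup(
      instance_group_name="dnn_group",
      instance_type="ml.p4d.24xlarge",
      instance_count=1,
  )

  estimator = TensorFlow(
      entry_point="train.py",                    # hypothetical training script
      role="<your-sagemaker-execution-role>",    # placeholder
      framework_version="2.9",
      py_version="py39",
      instance_groups=[data_group, dnn_group],   # replaces instance_type/instance_count
  )
  estimator.fit("s3://<your-bucket>/<training-data>/")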

In this repository, you'll find TensorFlow (tf.data.service) and PyTorch (a custom gRPC-based distributed data loader) examples that demonstrate how to use heterogeneous clusters in your SageMaker training jobs. You can use these examples with minimal code changes to your existing training scripts.

[Figure: Heterogeneous training job diagram]

Examples:

Hello world example

  • Heterogeneous Clusters - a hello world example: This basic example runs a heterogeneous training job consisting of two instance groups, each with a different instance_type. Each instance prints its instance group information and exits. Note: this example only shows how to orchestrate a training job across instance groups; for actual code that helps with a distributed data loader, see the TensorFlow and PyTorch examples below. A sketch of how a script can read its instance group information follows this item.
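
As a hypothetical sketch of the "print my instance group and exit" idea, a training script could read the resource configuration file that SageMaker places in the container. The file path is the standard SageMaker location; the exact keys used below are assumptions for a heterogeneous cluster job.

  import json

  RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"

  def print_instance_group_info():
      with open(RESOURCE_CONFIG_PATH) as f:
          config = json.load(f)
      # Fields such as "current_group_name" and "instance_groups" are assumed to be
      # populated for heterogeneous cluster jobs.
      print("current host:         ", config.get("current_host"))
      print("current instance type:", config.get("current_instance_type"))
      print("current group name:   ", config.get("current_group_name"))
      for group in config.get("instance_groups", []):
          print(f"group {group.get('instance_group_name')}: "
                f"{group.get('instance_type')} x {len(group.get('hosts', []))}")

  if __name__ == "__main__":
      print_instance_group_info()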

TensorFlow examples

  • TensorFlow's tf.data.service based Amazon SageMaker Heterogeneous Clusters: This TensorFlow example runs both a homogeneous and a heterogeneous cluster SageMaker training job and compares their results. The heterogeneous cluster training job runs with two instance groups (a consumption-side sketch follows this list):
    • data_group - this group has two ml.c5.18xlarge instances to which data preprocessing/augmentation is offloaded.
    • dnn_group - this group has one ml.p4d.24xlarge instance (8 GPUs) in a Horovod/MPI distribution.
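
As a rough, hypothetical sketch of the consumption side, a script running in the dnn_group could hand its input pipeline to tf.data.service so the map/augmentation work is executed by workers on the data_group. The dispatcher address and the toy pipeline below are placeholders, not code from this repository.

  import tensorflow as tf

  DISPATCHER_ADDRESS = "grpc://<data-group-dispatcher-host>:6000"  # placeholder

  def make_dataset():
      dataset = tf.data.Dataset.range(1000)
      dataset = dataset.map(lambda x: x * 2)  # preprocessing offloaded to data_group workers
      dataset = dataset.apply(
          tf.data.experimental.service.distribute(
              processing_mode="distributed_epoch",
              service=DISPATCHER_ADDRESS,
          )
      )
      return dataset.prefetch(tf.data.AUTOTUNE)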

PyTorch examples