eugeneteo/Optimizing-GPU-Utilization-for-AI-ML-Workloads
Optimizing GPU utilization for AI/ML workloads on Amazon EC2

This lab is provided as part of AWS Innovate AI/ML Data Edition; it has been adapted from the blog post.

ℹ️ You will run this lab in your own AWS account, and running it will incur some costs. Please follow the directions at the end of the lab to remove the resources and avoid future charges.

Overview

Machine learning workloads can be costly, and AI/ML teams can have a difficult time tracking and maintaining efficient resource utilization. ML workloads often utilize GPUs extensively, so typical application performance metrics such as CPU, memory, and disk usage don’t paint the full picture when it comes to system performance. Additionally, data scientists conduct long-running experiments and model training activities on existing compute instances that fit their unique specifications. Forcing these experiments to be run on newly provisioned infrastructure with proper monitoring systems installed might not be a viable option.

In this lab, you will learn how to track GPU utilization across all of your AI/ML workloads and enable accurate capacity planning without needing teams to use a custom Amazon Machine Image (AMI) or to re-deploy their existing infrastructure. You will use Amazon CloudWatch to track GPU utilization, and leverage AWS Systems Manager Run Command to install and configure the Amazon CloudWatch Agent across your existing fleet of GPU-enabled instances.

Architecture

lab-architecture-diagram

This lab is structured so that you can follow the steps using your existing GPU-enabled EC2 instances, or by deploying a set of new instances for testing purposes in this lab.

  1. First, you will make sure that your existing GPU-enabled EC2 instances, or the new testing ones deployed for this lab, have the Systems Manager Agent installed, and also have the appropriate level of AWS Identity and Access Management (IAM) permissions to run the Amazon CloudWatch Agent.
  2. During the lab, your CloudWatch Agent configuration will be stored in Systems Manager Parameter Store.
  3. You will install and configure the CloudWatch Agent on your existing or test GPU-enabled EC2 instances from AWS Systems Manager, using Systems Manager Documents.
  4. GPU metrics will be published to CloudWatch, which you can then visualize through the CloudWatch Metric Console and Dashboard.

Please note that all the steps described below for this lab use the US West (Oregon) (us-west-2) AWS Region.

Deploy VPC, Subnet and CloudWatch resources

In this section you will deploy the CloudFormation template "optimizing-gpu-lab-infra.yml". The following resources will be deployed:

  • [OPTIONAL] Base Infrastructure: As described in the Architecture section above, you can complete this lab either by using your own set of GPU-enabled EC2 instances, or by following the instructions to deploy a set of g4dn.xlarge instances for this lab. If you decide to launch new instances for this lab, the CloudFormation template has a condition so that it will deploy a simple VPC, Public Subnet, and Security Group for the new instances to be launched in.
  • CloudWatch Resources: Together with the optional "Base Infrastructure" resources, the CloudFormation template will also deploy an AWS Systems Manager Parameter with the CloudWatch Agent configuration for gathering GPU-related metrics; a pre-configured CloudWatch Dashboard for visualizing the GPU metrics published to CloudWatch; and an IAM role to be attached to the GPU-enabled instances. The IAM role created by the template allows the instances to communicate with AWS Systems Manager (to complete the installation of the CloudWatch Agent) and with CloudWatch (to publish the agent's metrics).
  1. Download the CloudFormation template "optimizing-gpu-lab-infra.yml". Then, open the CloudFormation console and click on Create stack

cloudformation-stack-creation-1

  2. Upload the downloaded template ("optimizing-gpu-lab-infra.yml") and click on Next

cloudformation-stack-creation-2

  3. Name the stack gpu-utilization-lab. If you have decided to use your existing GPU-enabled EC2 instances for this lab, set the parameter DeployInfra to false; or set it to true if you will be following the steps in this lab to launch a new set of GPU-enabled instances. Once the name and parameter are defined, click on Next

cloudformation-stack-creation-3

  4. In the Configure stack options page, leave the default values and click on Next at the bottom of the page

  5. In the Review and create page, scroll to the bottom of the page, check the acknowledgement box and click on Submit

cloudformation-stack-creation-4

  6. Wait until the stack deployment reaches the "CREATE_COMPLETE" status

cloudformation-stack-creation-5
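The console steps above can also be scripted. As a rough sketch, assuming the AWS CLI is installed and configured with credentials for your account, the same stack could be created from the command line (the stack name and parameter value mirror the instructions above):

```shell
# Sketch: deploy the lab stack with the AWS CLI instead of the console.
# Assumes the template has been downloaded to the current directory and
# the CLI has credentials for your account.
aws cloudformation create-stack \
  --region us-west-2 \
  --stack-name gpu-utilization-lab \
  --template-body file://optimizing-gpu-lab-infra.yml \
  --parameters ParameterKey=DeployInfra,ParameterValue=true \
  --capabilities CAPABILITY_IAM

# Block until the stack reaches CREATE_COMPLETE.
aws cloudformation wait stack-create-complete \
  --region us-west-2 \
  --stack-name gpu-utilization-lab
```

The --capabilities CAPABILITY_IAM flag is typically required because the template creates an IAM role; set DeployInfra to false if you are using your own instances.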

[Optional] Subscribe to the Amazon Machine Images (AMI) with the NVIDIA drivers pre-installed

NOTE: You only need to follow this section if you would like to launch and use a new set of GPU-enabled EC2 instances for completing this lab. In the case you want to use your existing GPU-enabled instances, you can skip this section and continue with the Preparing your own EC2 instances section.

An instance with an attached NVIDIA GPU, such as a P3 or G4dn instance, must have the appropriate NVIDIA driver installed. Depending on the instance type, you can either download a public NVIDIA driver, download a driver from Amazon S3 that is available only to AWS customers, or use an AMI with the driver pre-installed. For the test instances deployed in this lab, you are going to use the Amazon Machine Image (AMI) Amazon Linux 2 AMI with NVIDIA TESLA GPU Driver with the driver pre-installed.

Before launching an EC2 instance with this AMI, you will need to subscribe to it via the AWS Marketplace. Note that for this particular AWS Marketplace AMI, you are charged only for the EC2 instances eventually launched using the AMI, not for the subscription itself.

  1. Make sure that you are still logged in to your AWS Account. Open the AWS Marketplace Amazon Linux 2 page for AMI with NVIDIA TESLA GPU Driver

  2. Click on the Continue to Subscribe button at the top-right of the page. Note that there is no need to select any Region or EC2 instance type on this page (it is for informational use only)

marketplace-subscribe

  3. On the next page, click on Accept Terms. Wait until the subscription is completed

marketplace-subscribe-accept-terms

  4. It should take around 5 minutes for the top banner to show the subscription success message: "Thank you for subscribing to this product! You can now configure your software.". Once you see that message, you can continue to the next section, Launch GPU-enabled EC2 instances

marketplace-subscribe-completed

[Optional] Launch GPU-enabled EC2 instances

NOTE: You only need to follow this section if you would like to launch and use a new set of GPU-enabled EC2 instances for completing this lab. In the case that you want to use your existing GPU-enabled instances, you can skip this section and continue with the Preparing your own EC2 instances section.

In this section, you are going to launch two EC2 Instances type g4dn.xlarge. The instances will be launched in the sample VPC and Public Subnet deployed during the Deploy VPC, Subnet and CloudWatch resources section. These instances will be using the AMI you have subscribed to from the AWS Marketplace in the previous section.

Note that the g4dn.xlarge instance type is priced at $0.526 (USD) per hour (more details on the EC2 pricing page), so don't forget to follow the Cleaning up lab resources section at the end of the lab.

  1. Open the EC2 Launch Instance console

  2. Change the Number of instances to "2" and add the Name for the instances as My-GPU-Instance

ec2-gpu-instance-launch-1

  3. In the Application and OS Images (Amazon Machine Image) selection search box, search for the Marketplace AMI: Amazon Linux 2 AMI with NVIDIA TESLA GPU Driver

ec2-gpu-instance-launch-2

  4. From the search results, open the AWS Marketplace AMIs tab and select the "Amazon Linux 2 AMI with NVIDIA TESLA GPU Driver" AMI

ec2-gpu-instance-launch-3

  5. On the pop-up window, click on Subscribe Now (as you completed the Marketplace subscription in the previous section, it will take you back to the EC2 Launch page to continue with the rest of the instance launch settings)

ec2-gpu-instance-launch-4

  6. For the Instance type, select g4dn.xlarge. In the Key pair dropdown, select "Proceed without a key pair"

ec2-gpu-instance-launch-5

  7. For the Network settings, select the VPC, Subnet, and Security Group resources that were created as part of the CloudFormation stack in the Deploy VPC, Subnet and CloudWatch resources section of the lab. To do this, open the VPC dropdown and type the same name you used for the stack (e.g. gpu-utilization-lab). Also, make sure to select the existing Security Group whose name also starts with the stack name:

ec2-gpu-instance-launch-6

  8. Scroll to the bottom of the page and open the Advanced details section. In the IAM instance profile dropdown, use the CloudFormation stack name (e.g. gpu-utilization-lab) to search for the IAM profile created by the stack, and make sure to select it

ec2-gpu-instance-launch-7

By selecting this IAM instance profile, you ensure that the instances are attached to the IAM role with the proper permissions to interact with AWS Systems Manager and Amazon CloudWatch.

  9. Finally, scroll down to the bottom of the page, and click on Launch instance

ec2-gpu-instance-launch-8

  10. Now, from the EC2 Instances console, wait for the two new GPU-enabled instances to reach the "Running" state with status checks "passed"

ec2-gpu-instance-launch-9
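If you prefer scripting, the launch above can be sketched with the AWS CLI under the same settings. The AMI_ID, SUBNET_ID, SG_ID, and PROFILE_NAME values are placeholders you would look up from the Marketplace AMI page and the CloudFormation stack's resources:

```shell
# Sketch: launch the two g4dn.xlarge test instances from the CLI.
# AMI_ID, SUBNET_ID, SG_ID and PROFILE_NAME are placeholders for your values.
aws ec2 run-instances \
  --region us-west-2 \
  --image-id "$AMI_ID" \
  --instance-type g4dn.xlarge \
  --count 2 \
  --subnet-id "$SUBNET_ID" \
  --security-group-ids "$SG_ID" \
  --iam-instance-profile Name="$PROFILE_NAME" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=My-GPU-Instance}]'
```

The Name tag matches the one used in the console steps, which also makes it easy to target these instances later with Run Command.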

[Optional] Preparing your own EC2 instances

NOTE: You only need to follow this section if you are using your own GPU-enabled EC2 instances to complete this lab. If you are using the G4dn EC2 test instances launched in the steps above, you can skip this section and continue with the Deploying and installing the CloudWatch Agent with the AWS Systems Manager Console section.

In this section of the lab you will make sure that your EC2 instances have the Systems Manager Agent installed and are attached to an IAM role with the right permissions to interact with CloudWatch and Systems Manager.

Systems Manager Agent

Many AWS-provided AMIs already have the Systems Manager Agent installed. For a full list of the AMIs which have the Systems Manager Agent pre-installed, see Amazon Machine Images (AMIs) with SSM Agent preinstalled. If your AMI doesn’t have the Systems Manager Agent installed, see Working with SSM Agent for instructions on installing based on your operating system (OS).

IAM permissions

Once the Systems Manager Agent is installed, your EC2 instance needs certain permissions so that the CloudWatch Agent can accept commands from Systems Manager, read Systems Manager Parameter Store entries, and publish metrics to CloudWatch. You can make use of the IAM role created in the Deploy VPC, Subnet and CloudWatch resources section of the lab for this:

  1. Open the EC2 Instances console

  2. Click on Actions > Security > Modify IAM Role

attaching-iam-role-1

  3. Open the drop-down and type the same name you used for the stack (e.g. gpu-utilization-lab). Select the IAM role and click on Update IAM role

attaching-iam-role-2
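The same attachment can be sketched with the AWS CLI. The role created by the stack commonly carries permissions equivalent to the AmazonSSMManagedInstanceCore and CloudWatchAgentServerPolicy managed policies; INSTANCE_ID and PROFILE_NAME are placeholders for your own values:

```shell
# Sketch: attach the stack's IAM instance profile to an existing instance.
# INSTANCE_ID and PROFILE_NAME are placeholders for your values.
aws ec2 associate-iam-instance-profile \
  --region us-west-2 \
  --instance-id "$INSTANCE_ID" \
  --iam-instance-profile Name="$PROFILE_NAME"
```

If an instance already has a different profile attached, you must disassociate it first before associating the new one.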

Outbound Internet access

Note that apart from the IAM permissions, your Amazon EC2 instances must have outbound internet access for the CloudWatch Agent to send data to CloudWatch and to interact with Systems Manager. For more information about how to configure internet access, see Internet Gateways in the Amazon VPC User Guide.

Deploying and installing the CloudWatch Agent with the AWS Systems Manager Console

In this section of the lab you will use AWS Systems Manager to deploy and install the CloudWatch Agent on your GPU-enabled instances.

  1. Open the AWS Systems Manager console and go to the Run Command page. On this page, make sure that the "AWS-ConfigureAWSPackage" document is selected

ssm-run-command-install-cwagent-1

  2. For the Command parameters, make sure that the parameters are set as below:
    • Action: Install
    • Installation Type: Uninstall and reinstall
    • Name: AmazonCloudWatchAgent
    • Version: latest
    • Additional Arguments: Leave it as the default ({})

ssm-run-command-install-cwagent-2

  3. In the Target selection section, select Choose instances manually and select the two "My-GPU-Instance" instances that you launched in the previous section (or just select the existing GPU-enabled instances you have decided to use for this lab)

ssm-run-command-install-cwagent-3

  4. Leave the other parameters at their defaults, scroll to the bottom of the page and click on Run

  5. Once the installation finishes, you should see the command's Overall status as "Success"

ssm-run-command-install-cwagent-4

At this point the CloudWatch Agent is installed on the selected instances.
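The same Run Command invocation can also be issued from the AWS CLI; a sketch, assuming the test instances carry the My-GPU-Instance Name tag used earlier:

```shell
# Sketch: install the CloudWatch Agent via SSM Run Command from the CLI,
# targeting instances by their Name tag instead of selecting them manually.
aws ssm send-command \
  --region us-west-2 \
  --document-name "AWS-ConfigureAWSPackage" \
  --targets "Key=tag:Name,Values=My-GPU-Instance" \
  --parameters '{"action":["Install"],"installationType":["Uninstall and reinstall"],"name":["AmazonCloudWatchAgent"],"version":["latest"]}'
```

The returned CommandId can be checked with aws ssm list-command-invocations to confirm the Success status.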

Configuring the CloudWatch Agent with the AWS Systems Manager Console

In this section of the lab you will use AWS Systems Manager to configure the CloudWatch Agent with the necessary configuration for publishing GPU consumption metrics to CloudWatch.

Note: You can download the CloudWatch Agent JSON configuration file cw-agent-gpu-conf.json and review it in your text editor. For this lab, this same configuration was already stored as a Systems Manager Parameter in the Deploy VPC, Subnet and CloudWatch resources section of the lab. Visit the public documentation for more details about the GPU metrics collected by the agent.
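The full stored configuration is in the downloadable file above; for orientation, a minimal GPU section of a CloudWatch Agent configuration follows this documented shape (the CWAgentGPU namespace matches the one used later in the lab; the measurement list here is an illustrative subset, not necessarily the lab's exact file):

```json
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "CWAgentGPU",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "nvidia_gpu": {
        "measurement": [
          "utilization_gpu",
          "utilization_memory",
          "memory_total",
          "memory_used"
        ]
      }
    }
  }
}
```

The nvidia_gpu section tells the agent to query the NVIDIA driver (via nvidia-smi) for the listed measurements every collection interval.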

  1. Open the AWS Systems Manager console and go to the Run Command page. On this page, make sure that the "AmazonCloudWatch-ManageAgent" document is selected

ssm-run-command-configure-cwagent-1

  2. For the Command parameters, make sure that the parameters are set as below:
    • Action: configure
    • Mode: ec2
    • Optional Configuration Source: ssm
    • Optional Configuration Location: CloudWatch-Agent-Config-GPU-Lab (this is the Systems Manager Parameter containing the CloudWatch Agent configuration)
    • Optional Restart: yes

ssm-run-command-configure-cwagent-2

  3. In the Target selection section, select Choose instances manually and select the two "My-GPU-Instance" instances that you launched in the previous section (or just select the existing GPU-enabled instances you have decided to use for this lab)

ssm-run-command-configure-cwagent-3

  4. Leave the other parameters at their defaults, scroll to the bottom of the page and click on Run

  5. Once the configuration finishes, you should see the command's Overall status as "Success"

At this point the CloudWatch Agent is configured to collect GPU metrics from the selected instances.
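The equivalent CLI invocation, again targeting by Name tag, could look like this (a sketch with the same parameter values as the console steps):

```shell
# Sketch: apply the CloudWatch Agent configuration stored in Parameter Store
# via SSM Run Command, targeting instances by their Name tag.
aws ssm send-command \
  --region us-west-2 \
  --document-name "AmazonCloudWatch-ManageAgent" \
  --targets "Key=tag:Name,Values=My-GPU-Instance" \
  --parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["CloudWatch-Agent-Config-GPU-Lab"],"optionalRestart":["yes"]}'
```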

Visualize your instance's GPU metrics in CloudWatch

Now that your GPU-enabled EC2 instances are publishing their utilization metrics to CloudWatch, you can visualize and analyze these metrics to better understand your resource utilization patterns.

The GPU metrics collected by the CloudWatch Agent for this lab are within the CWAgentGPU namespace. You can explore these GPU metrics using the CloudWatch Metrics console, like below:

cloudwatch-console

For exploring and comparing the GPU metrics, you can also use the sample CloudWatch Dashboard named My-GPU-Usage that was deployed as part of the Deploy VPC, Subnet and CloudWatch resources section. Open the CloudWatch Dashboard My-GPU-Usage, and you will see dashboard widgets like those below:

cloudwatch-dashboard

Note that if you decided to launch a new set of GPU-enabled EC2 test instances to complete this lab, you won't see any GPU-related utilization (it will show 0% utilization). This is expected, as the new test instances deployed for this lab won't be running any GPU-enabled application by default unless you configure and run one. If you are using your existing instances with an actual GPU-enabled application running on them, you should see utilization patterns as in the example above.
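Beyond the console, the published metrics can be inspected from the CLI. A sketch (INSTANCE_ID is a placeholder; the exact dimension set for get-metric-statistics must match what list-metrics reports, and the date arithmetic assumes GNU date):

```shell
# Sketch: list the GPU metrics the agent has published.
aws cloudwatch list-metrics --region us-west-2 --namespace CWAgentGPU

# Pull the last five minutes of average GPU utilization for one instance.
# nvidia_smi_utilization_gpu is the name the agent publishes for GPU
# utilization; copy the full dimension set from the list-metrics output.
aws cloudwatch get-metric-statistics \
  --region us-west-2 \
  --namespace CWAgentGPU \
  --metric-name nvidia_smi_utilization_gpu \
  --dimensions Name=InstanceId,Value="$INSTANCE_ID" \
  --statistics Average \
  --period 60 \
  --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```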

Cleaning up lab resources

  1. Only if you launched new G4dn instances for this lab, navigate to the EC2 Console page and select the G4dn instances (named My-GPU-Instance as per the instructions in the lab). Then click on Instance state > Terminate instance

cleanup-resources-1

  2. Open the CloudFormation console, select the stack deployed for this lab (gpu-utilization-lab) and click on Delete

cleanup-resources-2
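The cleanup can also be scripted; a sketch, where INSTANCE_ID_1 and INSTANCE_ID_2 are placeholders for the My-GPU-Instance IDs (skip the terminate call if you used your own instances):

```shell
# Sketch: tear down the lab resources from the CLI.
# Terminate the test instances first so the stack's VPC can be deleted.
aws ec2 terminate-instances --region us-west-2 \
  --instance-ids "$INSTANCE_ID_1" "$INSTANCE_ID_2"

# Delete the CloudFormation stack and wait for completion.
aws cloudformation delete-stack --region us-west-2 --stack-name gpu-utilization-lab
aws cloudformation wait stack-delete-complete --region us-west-2 \
  --stack-name gpu-utilization-lab
```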

Conclusion

Throughout this lab, you have learned how to deploy and configure the CloudWatch Agent across your GPU-enabled EC2 instances to track GPU utilization without pausing in-progress experiments and model training. You then learned how to visualize the GPU utilization metrics of your instances with a CloudWatch Dashboard. Using this approach in a real scenario will allow you to better understand your workload's GPU usage and make more informed scaling and cost decisions.

As described in the AWS Well-Architected Framework, not collecting usage metrics for your accelerated computing GPU instances is a common anti-pattern against Well-Architected best practices.

Collecting performance-related metrics for your GPU-enabled workloads and instances will help you align application performance with your business requirements to ensure that you are meeting your workload needs. It can also help you to continually improve the resource performance and utilization in your workloads.

Also, collecting GPU utilization metrics will allow you to compare your workload's actual usage with the anticipated usage level. This way you can make informed decisions about cost optimization, and choose the correct resource type and size for your GPU workloads.

Visit the AWS Well-Architected Framework whitepaper and the related best practices for more information.

Survey

Let us know what you thought of this session and how we can improve the presentation experience for you in the future by completing this event session poll. Participants who complete the surveys from AWS Innovate - AI/ML and Data Edition will receive a gift code for USD25 in AWS credits [1][2][3]. AWS credits will be sent via email by March 29, 2024. Note: Only registrants of AWS Innovate - AI/ML and Data Edition who complete the surveys will receive a gift code for USD25 in AWS credits via email.

[1] AWS Promotional Credits Terms and conditions apply: https://aws.amazon.com/awscredits/

[2] Limited to 1 x USD25 AWS credits per participant.

[3] Participants will be required to provide their business email addresses to receive the gift code for AWS credits.
