Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(CustomSageMakerEndpoint): (node.add_dependency not working) #432

Open
1 task done
airmonitor opened this issue May 6, 2024 · 7 comments
Open
1 task done

(CustomSageMakerEndpoint): (node.add_dependency not working) #432

airmonitor opened this issue May 6, 2024 · 7 comments
Labels
bug Something isn't working needs-triage This issue or PR still needs to be triaged.

Comments

@airmonitor
Copy link

Describe the bug

The generated CloudFormation stack for the below 2 resources fails to deploy

        iam_role = iam.Role(
            self,
            id="sagemaker_execution_role",
            assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")],
            role_name=f"{config_vars.stage}-{config_vars.project}-{model_name}-execution-role",
        )

        custom_sagemaker_endpoint = CustomSageMakerEndpoint(
            self,
            id=model_name,
            container=ContainerImage.from_ecr_repository(repository=self.whisper_ecr_repo, tag=config_vars.image_tag),
            enable_operational_metric=True,
            endpoint_name=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            instance_count=1,
            instance_type=SageMakerInstanceType.ML_G5_2_XLARGE  # type: ignore
            model_data_url=f"s3://{self.model_data_s3_bucket.bucket_name}/model.tar.gz",
            model_id=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            role=iam_role,
            environment={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": f"whisper-{model_name}"},
        )

The stack is rolled back because the iam_role is not ready for custom_sagemaker_endpoint to be used.

Adding explicit dependency via the below code is also not helping

custom_sagemaker_endpoint.node.add_dependency(iam_role)

as the generated cloudformation output is missing DependsOn section in the resource manifest for "AWS::SageMaker::EndpointConfig"

Expected Behavior

The stack deployment will not fail due to missing/not ready IAM Role.

Current Behavior

Could not access model data at s3://dev-backend-whisper-model-data/model.tar.gz. Please ensure that the role "arn:aws:iam::112233:role/dev-backend-whisper-large-v2-vanilla-exe
cution-role" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:
GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permis
sion. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 7570c458-14c4-4c9a-bc97-a5287e74faaa; Proxy: null)

Reproduction Steps

Using the following code

        iam_role = iam.Role(
            self,
            id="sagemaker_execution_role",
            assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")],
            role_name=f"{config_vars.stage}-{config_vars.project}-{model_name}-execution-role",
        )

        custom_sagemaker_endpoint = CustomSageMakerEndpoint(
            self,
            id=model_name,
            container=ContainerImage.from_ecr_repository(repository=self.whisper_ecr_repo, tag=config_vars.image_tag),
            enable_operational_metric=True,
            endpoint_name=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            instance_count=1,
            instance_type=SageMakerInstanceType.ML_G5_2_XLARGE  # type: ignore
            model_data_url=f"s3://{self.model_data_s3_bucket.bucket_name}/model.tar.gz",
            model_id=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            role=iam_role,
            environment={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": f"whisper-{model_name}"},
        )

generate output and deploy via cdk deploy

Possible Solution

the method add_dependency will update the DependsOn section in the generated template correctly

Additional Information/Context

No response

CDK CLI Version

2.140.0 (build 46168aa)

Framework Version

No response

Node.js Version

v18.20.2

OS

macOS Sonoma 14.4.1

Language

Python

Language Version

No response

Region experiencing the issue

us-east-1

Code modification

N/A

Other information

No response

Service quota

  • I have reviewed the service quotas for this construct
@airmonitor airmonitor added bug Something isn't working needs-triage This issue or PR still needs to be triaged. labels May 6, 2024
@krokoko
Copy link
Collaborator

krokoko commented May 7, 2024

Hi @airmonitor , thank you for reporting this. Did you have a look at the associated sample we have for this construct ? https://github.com/aws-samples/generative-ai-cdk-constructs-samples/blob/main/samples/sagemaker_custom_endpoint/lib/sagemaker_custom_endpoint-stack.ts

@airmonitor
Copy link
Author

Hello @krokoko

Thx for reaching out.

I will try with your suggestion.

Do you have also a working example for adding application autoscaling?

I need to have scalability added there so the number of endpoint instances will scale base on metric.

I've tried with application autoscaling but it was failing due to not working add_dependency method.

In short the endpoint creation was still in progress when application autoscaling tried to add required functionality.

AWS support added DependsOn section to the synthesised template but I was not able to implement the same using CDK due to missing add dependency method.

Regards
Tom

@airmonitor
Copy link
Author

@krokoko

I've tested the proposed solution to use the IAM role that is created with CustomSageMakerEndpoint but apparently, I'm doing something wrong here

For the below scenario I'm getting errors from the cloudformation:

        custom_sagemaker_endpoint = CustomSageMakerEndpoint(
            self,
            id=model_name,
            container=ContainerImage.from_ecr_repository(repository=self.whisper_ecr_repo, tag=config_vars.image_tag),
            enable_operational_metric=True,
            endpoint_name=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            instance_count=1,
            instance_type=SageMakerInstanceType.ML_G5_2_XLARGE  # type: ignore
            if config_vars.stage == "prod"
            else SageMakerInstanceType.ML_G5_XLARGE,  # type: ignore
            model_data_url=f"s3://{self.model_data_s3_bucket.bucket_name}/model.tar.gz",
            model_id=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            environment={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": f"whisper-{model_name}"},
        )

        self.whisper_ecr_repo.grant_pull(custom_sagemaker_endpoint.role) # ECR Repository
        self.model_data_s3_bucket.grant_read_write(custom_sagemaker_endpoint.role) # S3 bucket

Error from CloudFormation

9:56:57 AM | CREATE_FAILED        | AWS::SageMaker::Model                       | devbackendwhisperl...odellargev2vanilla
Could not access model data at s3://s3-bucket/model.tar.gz. Please ensure that the role "arn:aws:iam::11223344:role/dev-backend-whisper-core--largev2vanilla
RoleC7CE1EE-xaSDM2XaG6fE" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the
role has "s3:GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListB
ucket" permission. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 591d86b1-0e24-4d83-a7ec-f3f12b30402d; Proxy: null)


 ❌  backend-whisper-pipeline/dev-backend-whisper-core-stage/core-stack (dev-backend-whisper-core-stage-core-stack) failed: Error: The stack named dev-backend-whisper-core-stage-core-stack failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Could not access model data at s3://s3-bucket/model.tar.gz. Please ensure that the role "arn:aws:iam::11223344:role/dev-backend-whisper-core--largev2vanillaRoleC7CE1EE-xaSDM2XaG6fE" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 591d86b1-0e24-4d83-a7ec-f3f12b30402d; Proxy: null)
    at FullCloudFormationDeployment.monitorDeployment (/opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:435:10568)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.deployStack2 [as deployStack] (/opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:438:199515)
    at async /opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:438:181237

 ❌ Deployment failed: Error: The stack named dev-backend-whisper-core-stage-core-stack failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Could not access model data at s3://s3-bucket/model.tar.gz. Please ensure that the role "arn:aws:iam::11223344:role/dev-backend-whisper-core--largev2vanillaRoleC7CE1EE-xaSDM2XaG6fE" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 591d86b1-0e24-4d83-a7ec-f3f12b30402d; Proxy: null)
    at FullCloudFormationDeployment.monitorDeployment (/opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:435:10568)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.deployStack2 [as deployStack] (/opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:438:199515)
    at async /opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:438:181237

The stack named dev-backend-whisper-core-stage-core-stack failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Could not access model data at s3://s3-bucket/model.tar.gz. Please ensure that the role "arn:aws:iam::11223344:role/dev-backend-whisper-core--largev2vanillaRoleC7CE1EE-xaSDM2XaG6fE" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 591d86b1-0e24-4d83-a7ec-f3f12b30402d; Proxy: null)

@krokoko
Copy link
Collaborator

krokoko commented May 8, 2024

Hi @airmonitor , did you provide your role with permissions to access the S3 bucket storing your artifacts ? We have this demonstrated in the sample here: https://github.com/aws-samples/generative-ai-cdk-constructs-samples/blob/main/samples/sagemaker_custom_endpoint/lib/sagemaker_custom_endpoint-stack.ts#L43C5-L57C7

@airmonitor
Copy link
Author

Hey @krokoko.

I'm not sure if I get your intention here.

The example for extending IAM permissions is using non optimal way for granting necessary permissions

    customEndpoint.addToRolePolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          's3:GetObject',
          's3:GetObject*',
          's3:GetBucket*',
          's3:List*',
        ],
        resources: [
          'BUCKET_ARN',
          'BUCKET_ARN/*',
        ],
      }),
    );

In CDK we have a better way, shorter way to achieve the required results using the following examples

        self.whisper_ecr_repo.grant_pull(custom_sagemaker_endpoint.role) # ECR Repository
        self.model_data_s3_bucket.grant_read_write(custom_sagemaker_endpoint.role) # S3 bucket

grant_pull, grant_read_write method for ecr repository and s3 bucket should do the trick.
I'm not sure why we need to get back mentally a few years later when those methods weren't available in CDK?

Regards
Tom

@krokoko
Copy link
Collaborator

krokoko commented May 8, 2024

Hi @airmonitor , sorry I just missed the last line of your previous example, I was thinking that permissions were missing on the role to access the model artifacts in S3.
Is the endpoint created in the same region where the bucket was created ?

@airmonitor
Copy link
Author

Hello,

Yes, all resources are deployed in us-east-1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-triage This issue or PR still needs to be triaged.
Development

No branches or pull requests

2 participants