A terraform module to productionalize MLflow on top of AWS (Fargate + Aurora Serverless + S3)

humn-ai/tf-mod-aws-mlflow

tf-mod-aws-mlflow

👋 Introduction

This Terraform module deploys a cluster of MLflow servers and the MLflow UI using:

  • ECS Fargate as the compute engine
  • Amazon Aurora Serverless as the backend store
  • S3 as the default artifact root

👨‍🎨 Design principles

When designing this module, we've made some decisions about technologies and configuration that might not apply to all use cases. In doing so, we've applied the following principles, in this order:

  • High availability and recovery. All components are meant to be highly available and to provide backups, so that important data can be recovered after a failure. Database backups are enabled, and versioning is enabled for the S3 bucket.
  • Least privilege. We've created dedicated security groups and IAM roles, and restricted traffic/permissions to the minimum necessary to run MLflow.
  • Smallest maintenance overhead. We've chosen serverless technologies like Fargate and Aurora Serverless to minimize the cost of ownership of an MLflow cluster.
  • Smallest cost overhead. We've tried to choose technologies that minimize costs, under the assumption that MLflow will be an internal tool that is used during working hours, and with a very lightweight use of the database.
  • Private by default. As of version 1.9.1, MLflow doesn't provide native authentication/authorization mechanisms. When using the default values, the module will create resources that are not exposed to the Internet. Moreover, the module provides server-side encryption for the S3 bucket and the database through different KMS keys.
  • Flexibility. Where possible, we've tried to make this module usable under different circumstances. For instance, you can deploy MLflow to a private VPC and access it through a VPN, or you can use the ALB's integration with Cognito/OIDC to let users reach MLflow through your SSO solution.

🏗️ Architecture

The following diagram illustrates the components the module creates with the default configuration:

Architecture Diagram

🔨 Usage

To use this module, add a block like the following to your configuration:

module "mlflow" {
  source  = "glovo/mlflow/aws"
  version = "1.0.0"

  unique_name                       = "mlflow-team-x"
  vpc_id                            = "my-vpc"
  load_balancer_subnet_ids          = ["public-subnet-az-1", "public-subnet-az-2", "public-subnet-az-3"]
  load_balancer_ingress_cidr_blocks = ["192.0.2.0/24"]
  service_subnet_ids                = ["private-subnet-az-1", "private-subnet-az-2", "private-subnet-az-3"]
  database_subnet_ids               = ["db-private-subnet-az-1", "db-private-subnet-az-2", "db-private-subnet-az-3"]
  database_password_secret_arn      = "mlflow-team-x-db-password-arn"
}

You can find a more complete usage example in terratest/examples/main.tf.
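
The Inputs table in this README marks more variables as required than the snippet above sets (certificates, Route 53 zones, Cognito and a KMS key). A fuller call might look like the sketch below. Every value is a placeholder in the style of the snippet above, and `source`/`version` are simply copied from it; point them at wherever you actually consume this module from.

```hcl
module "mlflow" {
  # Copied from the example above; adjust to your own module source.
  source  = "glovo/mlflow/aws"
  version = "1.0.0"

  unique_name = "mlflow-team-x"
  vpc_id      = "my-vpc"

  # Networking
  load_balancer_subnet_ids          = ["public-subnet-az-1", "public-subnet-az-2", "public-subnet-az-3"]
  load_balancer_ingress_cidr_blocks = ["192.0.2.0/24"]
  service_subnet_ids                = ["private-subnet-az-1", "private-subnet-az-2", "private-subnet-az-3"]
  database_subnet_ids               = ["db-private-subnet-az-1", "db-private-subnet-az-2", "db-private-subnet-az-3"]

  # Secrets and encryption (placeholder ARNs)
  database_password_secret_arn = "mlflow-team-x-db-password-arn"
  key_arn                      = "my-kms-key-arn"

  # DNS and TLS for the UI (placeholder values)
  zone_id         = "my-zone-id"
  zone_name       = "example.com"
  record_name     = "mlflow"
  certificate_arn = "my-certificate-arn"

  # DNS and TLS for API access (placeholder values)
  api_zone_id   = "my-api-zone-id"
  api_zone_name = "api.example.com"
  api_cert_arn  = "my-api-certificate-arn"

  # Cognito (placeholder values)
  aws_cognito_user_pool_arn       = "my-user-pool-arn"
  aws_cognito_user_pool_client_id = "my-user-pool-client-id"
  aws_cognito_user_pool_domain    = "my-user-pool-domain"
}
```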

Note you may also:

  • Add sidecar containers (e.g. a datadog agent for Fargate)
  • Provide your own bucket/path as the default artifact root
  • Attach an autoscaling policy to the service (for instance, you may scale down to 0 instances during the night)
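
As an illustration of the last point, the module's `service_autoscaling_target_*` outputs can be wired into scheduled scaling actions. The following is a minimal sketch, assuming a `module "mlflow"` block like the one above; the resource names and cron schedules are made up for the example, and the restored capacity of 2 mirrors the default of `service_min_capacity` and `service_max_capacity`.

```hcl
# Scale the MLflow service down to 0 tasks on weekday evenings (UTC).
resource "aws_appautoscaling_scheduled_action" "mlflow_night" {
  name               = "mlflow-scale-in-at-night" # hypothetical name
  service_namespace  = module.mlflow.service_autoscaling_target_service_namespace
  resource_id        = module.mlflow.service_autoscaling_target_resource_id
  scalable_dimension = module.mlflow.service_autoscaling_target_scalable_dimension
  schedule           = "cron(0 20 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Bring the service back up on weekday mornings (UTC).
resource "aws_appautoscaling_scheduled_action" "mlflow_morning" {
  name               = "mlflow-scale-out-in-the-morning" # hypothetical name
  service_namespace  = module.mlflow.service_autoscaling_target_service_namespace
  resource_id        = module.mlflow.service_autoscaling_target_resource_id
  scalable_dimension = module.mlflow.service_autoscaling_target_scalable_dimension
  schedule           = "cron(0 7 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 2 # module default for service_min_capacity
    max_capacity = 2 # module default for service_max_capacity
  }
}
```

Scheduled actions change the limits of the scalable target itself, so Terraform may report drift on the target between the evening and morning actions.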

Requirements

No requirements.

Providers

| Name | Version |
|------|---------|
| aws | n/a |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| aws_apigatewayv2_api.mlflow | resource |
| aws_apigatewayv2_api_mapping.mlflow | resource |
| aws_apigatewayv2_authorizer.lambda | resource |
| aws_apigatewayv2_domain_name.mlflow | resource |
| aws_apigatewayv2_integration.mlflow | resource |
| aws_apigatewayv2_route.default | resource |
| aws_apigatewayv2_stage.default | resource |
| aws_apigatewayv2_vpc_link.mlflow | resource |
| aws_appautoscaling_target.mlflow | resource |
| aws_cloudwatch_log_group.mlflow | resource |
| aws_db_subnet_group.rds | resource |
| aws_ecs_cluster.mlflow | resource |
| aws_ecs_service.mlflow | resource |
| aws_ecs_task_definition.mlflow | resource |
| aws_iam_role.ecs_execution | resource |
| aws_iam_role.ecs_task | resource |
| aws_iam_role.lambda | resource |
| aws_iam_role_policy.db_secrets | resource |
| aws_iam_role_policy.default_bucket | resource |
| aws_iam_role_policy_attachment.ecs_execution | resource |
| aws_iam_role_policy_attachment.lambda | resource |
| aws_lambda_function.mlflow | resource |
| aws_lb.mlflow | resource |
| aws_lb_listener.http | resource |
| aws_lb_listener.https | resource |
| aws_lb_listener_rule.api | resource |
| aws_lb_listener_rule.http | resource |
| aws_lb_listener_rule.https | resource |
| aws_lb_target_group.mlflow | resource |
| aws_rds_cluster.backend_store | resource |
| aws_route53_record.api | resource |
| aws_route53_record.record | resource |
| aws_s3_bucket.default | resource |
| aws_security_group.ecs_service | resource |
| aws_security_group.lb | resource |
| aws_security_group.rds | resource |
| aws_security_group_rule.lb_egress | resource |
| aws_security_group_rule.lb_egress_idp | resource |
| aws_security_group_rule.lb_ingress_http | resource |
| aws_security_group_rule.lb_ingress_https | resource |
| aws_availability_zones.available | data source |
| aws_region.current | data source |
| aws_secretsmanager_secret.db_password | data source |
| aws_secretsmanager_secret_version.db_password | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| api_cert_arn | (Required) - The ARN of the ACM certificate for the API DNS zone. | string | n/a | yes |
| api_zone_id | (Required) - The ID of the hosted zone MLflow API access will be hosted at. | string | n/a | yes |
| api_zone_name | (Required) - The name of the hosted zone MLflow API access will be hosted at. | string | n/a | yes |
| artifact_bucket_encryption_algorithm | Algorithm used for encrypting the default bucket. | string | "AES256" | no |
| artifact_bucket_encryption_key_arn | ARN of the key used to encrypt the bucket. Only needed if you set aws:kms as the encryption algorithm. | string | null | no |
| artifact_bucket_id | If specified, MLflow will use this bucket to store artifacts. Otherwise, this module will create a dedicated bucket. When overriding this value, you need to allow the task role to access the root you specified. | string | null | no |
| artifact_bucket_path | The path within the bucket where MLflow will store its artifacts. | string | "/" | no |
| artifact_buckets_mlflow_will_read | A list of bucket IDs MLflow will need read access to, in order to show the stored artifacts. It accepts any valid IAM resource, including ARNs with wildcards, so you can do something like arn:aws:s3:::bucket-prefix-* | list(string) | [] | no |
| aws_account_id | The AWS account ID of the provider being deployed to (e.g. 12345678). Autoloaded from account.tfvars. | string | "" | no |
| aws_assume_role_arn | (Optional) - ARN of the IAM role when optionally connecting to AWS via an assumed role. Autoloaded from account.tfvars. | string | "" | no |
| aws_cognito_user_pool_arn | (Required) - AWS Cognito user pool ARN. | string | n/a | yes |
| aws_cognito_user_pool_client_id | (Required) - AWS Cognito user pool client ID. | string | n/a | yes |
| aws_cognito_user_pool_domain | (Required) - AWS Cognito user pool domain. | string | n/a | yes |
| aws_region | The AWS region (e.g. ap-southeast-2). Autoloaded from region.tfvars. | string | "" | no |
| certificate_arn | (Required) - The ARN of the certificate MLflow traffic will be encrypted with. | string | n/a | yes |
| database_auto_pause | Pause Aurora Serverless after a given amount of time with no activity. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume | bool | true | no |
| database_max_capacity | The maximum capacity for the Aurora Serverless cluster. Aurora will scale automatically in this range. See: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html | number | 1 | no |
| database_min_capacity | The minimum capacity for the Aurora Serverless cluster. Aurora will scale automatically in this range. See: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html | number | 1 | no |
| database_password_secret_arn | The ARN of the Secrets Manager secret that defines the database password. It needs to be created before calling the module. | string | n/a | yes |
| database_seconds_until_auto_pause | The number of seconds without activity before Aurora Serverless is paused. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume | number | 300 | no |
| database_skip_final_snapshot | n/a | bool | false | no |
| database_subnet_ids | List of subnets where the RDS database will be deployed. | list(string) | n/a | yes |
| gunicorn_opts | Additional command line options forwarded to gunicorn processes (https://mlflow.org/docs/latest/cli.html#cmdoption-mlflow-server-gunicorn-opts) | string | "" | no |
| key_arn | (Required) - The KMS key used to encrypt the secrets. | string | n/a | yes |
| load_balancer_ingress_cidr_blocks | CIDR blocks from where to allow traffic to the Load Balancer. With an internal LB, we've left this | list(string) | n/a | yes |
| load_balancer_is_internal | By default, the load balancer is internal. This is because as of v1.9.1, MLflow doesn't have native authentication or authorization. We recommend exposing MLflow behind a VPN or using OIDC/Cognito together with the LB listener. | bool | true | no |
| load_balancer_subnet_ids | List of subnets where the Load Balancer will be deployed. | list(string) | n/a | yes |
| rds_cluster_engine_version | AWS RDS cluster engine version. | string | "5.7.mysql_aurora.2.08.3" | no |
| record_name | (Required) - The name of the record MLflow will use. | string | n/a | yes |
| service_cpu | The number of CPU units reserved for the MLflow container. | number | 2048 | no |
| service_image_tag | The MLflow version to deploy. Note that this version has to be available as a tag here: https://hub.docker.com/r/larribas/mlflow | string | "1.9.1" | no |
| service_log_retention_in_days | The number of days to keep logs around. | number | 90 | no |
| service_max_capacity | Maximum number of instances for the ECS service. This will create an aws_appautoscaling_target that can later on be used to autoscale the MLflow instance. | number | 2 | no |
| service_memory | The amount (in MiB) of memory reserved for the MLflow container. | number | 4096 | no |
| service_min_capacity | Minimum number of instances for the ECS service. This will create an aws_appautoscaling_target that can later on be used to autoscale the MLflow instance. | number | 2 | no |
| service_sidecar_container_definitions | A list of container definitions to deploy alongside the main container. See: https://www.terraform.io/docs/providers/aws/r/ecs_task_definition.html#container_definitions | list | [] | no |
| service_subnet_ids | List of subnets where the MLflow ECS service will be deployed (the recommendation is to use subnets that cannot be accessed directly from the Internet). | list(string) | n/a | yes |
| tags | AWS tags common to all the resources created. | map(string) | {} | no |
| unique_name | A unique name for this application (e.g. mlflow-team-name). | string | n/a | yes |
| vpc_id | AWS VPC to deploy MLflow into. | string | n/a | yes |
| zone_id | (Required) - The ID of the hosted zone MLflow will be hosted at. | string | n/a | yes |
| zone_name | (Required) - The name of the hosted zone MLflow will be hosted at. | string | n/a | yes |
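
Several of the optional inputs above are typically tuned together. The fragment below is a sketch rather than a recommended configuration: the values are illustrative, the KMS key ARN and the Datadog sidecar definition are assumptions, and the required inputs are left out for brevity (see the Usage section). `source` and `version` are copied from the Usage example.

```hcl
module "mlflow" {
  source  = "glovo/mlflow/aws" # copied from the Usage example
  version = "1.0.0"

  # ... required inputs omitted; see the Usage section ...

  # Encrypt the module-created artifact bucket with a customer-managed key.
  artifact_bucket_encryption_algorithm = "aws:kms"
  artifact_bucket_encryption_key_arn   = "my-bucket-kms-key-arn" # placeholder

  # Let Aurora Serverless scale a little further and pause after 30 idle minutes.
  database_min_capacity             = 1
  database_max_capacity             = 4
  database_auto_pause               = true
  database_seconds_until_auto_pause = 1800

  # Keep the load balancer private (this is already the default).
  load_balancer_is_internal = true

  # Hypothetical sidecar: a Datadog agent running next to the MLflow container.
  # A real deployment would also need to pass the agent its API key.
  service_sidecar_container_definitions = [
    {
      name      = "datadog-agent"
      image     = "datadog/agent:latest"
      essential = false
      environment = [
        { name = "ECS_FARGATE", value = "true" },
      ]
    },
  ]
}
```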

Outputs

| Name | Description |
|------|-------------|
| artifact_bucket_id | n/a |
| cluster_id | n/a |
| load_balancer_arn | n/a |
| load_balancer_dns_name | n/a |
| load_balancer_target_group_id | n/a |
| load_balancer_zone_id | n/a |
| service_autoscaling_target_max_capacity | n/a |
| service_autoscaling_target_min_capacity | n/a |
| service_autoscaling_target_resource_id | n/a |
| service_autoscaling_target_scalable_dimension | n/a |
| service_autoscaling_target_service_namespace | n/a |
| service_execution_role_id | n/a |
| service_task_role_id | n/a |
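
The outputs are undocumented, so the following is only a sketch of how they might be consumed. It assumes the module was called with `artifact_bucket_id = "my-existing-artifact-bucket"` (a hypothetical bucket), and that `service_task_role_id` is the name of the ECS task role, which is what the `id` attribute of an `aws_iam_role` normally holds. Under those assumptions, you could grant the task role access to your own bucket like this:

```hcl
# Permissions MLflow needs on a bucket you manage yourself (hypothetical bucket name).
data "aws_iam_policy_document" "mlflow_artifacts" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-existing-artifact-bucket"]
  }

  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::my-existing-artifact-bucket/*"]
  }
}

# Attach the policy to the task role exposed by the module.
resource "aws_iam_role_policy" "mlflow_artifacts" {
  name   = "mlflow-custom-artifact-bucket" # hypothetical name
  role   = module.mlflow.service_task_role_id
  policy = data.aws_iam_policy_document.mlflow_artifacts.json
}

# Re-export the load balancer address for other configurations to use.
output "mlflow_load_balancer_dns_name" {
  value = module.mlflow.load_balancer_dns_name
}
```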

🔗 Related Projects

You can find more Terraform modules by visiting the links below:

❓ Help

Got a question? File a GitHub issue, or message the DevOps team on Slack.

🥳 Contributors

  • Lorenzo Arribas
  • Yosuke Adachi
  • Lawrence "Loz" Warren
  • Razvan Tudorica
  • Alaa
  • Atlantis Bot
