tf-mod-aws-mlflow

👋 Introduction

This Terraform module deploys a cluster of MLflow servers and the MLflow UI using:

ECS Fargate as the compute engine
Amazon Aurora Serverless as the backend store
S3 as the default artifact root

👨‍🎨 Design principles

When designing this module, we've made some decisions about technologies and configuration that might not apply to all use cases. In doing so, we've applied the following principles, in this order:

High availability and recovery. All components are meant to be highly available and to provide backups, so that important data can be recovered after a failure. Database backups are enabled, and versioning is enabled for the S3 bucket.
Least privilege. We've created dedicated security groups and IAM roles, and restricted traffic/permissions to the minimum necessary to run MLflow.
Smallest maintenance overhead. We've chosen serverless technologies like Fargate and Aurora Serverless to minimize the cost of ownership of an MLflow cluster.
Smallest cost overhead. We've tried to choose technologies that minimize costs, under the assumption that MLflow will be an internal tool that is used during working hours, and with a very lightweight use of the database.
Private by default. As of version 1.9.1, MLflow doesn't provide native authentication/authorization mechanisms. When using the default values, the module will create resources that are not exposed to the Internet. Moreover, the module provides server-side encryption for the S3 bucket and the database through different KMS keys.
Flexibility. Where possible, we've tried to make this module usable under different circumstances. For instance, you can use it to deploy MLflow to a private VPC and access it through a VPN, or you can leverage ALB's integration with Cognito/OIDC to allow users to access MLflow from your SSO solution.

🏗️ Architecture

The following diagram illustrates the components the module creates with the default configuration:

🔨 Usage

To use this module, you can simply:

module "mlflow" {
  source  = "glovo/mlflow/aws"
  version = "1.0.0"

  unique_name                       = "mlflow-team-x"
  vpc_id                            = "my-vpc"
  load_balancer_subnet_ids          = ["public-subnet-az-1", "public-subnet-az-2", "public-subnet-az-3"]
  load_balancer_ingress_cidr_blocks = ["192.0.2.0/24"]
  service_subnet_ids                = ["private-subnet-az-1", "private-subnet-az-2", "private-subnet-az-3"]
  database_subnet_ids               = ["db-private-subnet-az-1", "db-private-subnet-az-2", "db-private-subnet-az-3"]
  database_password_secret_arn      = "mlflow-team-x-db-password-arn"
}

You can find a more complete usage example in terratest/examples/main.tf.
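The Inputs table in this README marks more variables as required than the minimal call above sets (certificates, hosted zones, Cognito settings, and a KMS key). A fuller call might look like the following sketch. Every value is a placeholder, and the `source` address is an assumption based on the repository name; adjust it to wherever you consume the module from.

```hcl
module "mlflow" {
  # Assumption: module consumed straight from the Git repository.
  source = "git::https://github.com/humn-ai/tf-mod-aws-mlflow.git"

  unique_name = "mlflow-team-x"
  vpc_id      = "vpc-0123456789abcdef0" # placeholder

  # Networking (placeholder subnet IDs and CIDR range)
  load_balancer_subnet_ids          = ["subnet-aaa", "subnet-bbb", "subnet-ccc"]
  load_balancer_ingress_cidr_blocks = ["192.0.2.0/24"]
  service_subnet_ids                = ["subnet-ddd", "subnet-eee", "subnet-fff"]
  database_subnet_ids               = ["subnet-ggg", "subnet-hhh", "subnet-iii"]

  # Secrets and encryption (placeholder ARNs)
  database_password_secret_arn = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:mlflow-team-x-db-password"
  key_arn                      = "arn:aws:kms:eu-west-1:123456789012:key/00000000-0000-0000-0000-000000000000"

  # DNS and TLS for the UI (placeholder zone, record and certificate)
  zone_id         = "Z0123456789EXAMPLE"
  zone_name       = "example.com"
  record_name     = "mlflow"
  certificate_arn = "arn:aws:acm:eu-west-1:123456789012:certificate/11111111-1111-1111-1111-111111111111"

  # DNS and TLS for API access (placeholder zone and certificate)
  api_zone_id   = "Z0123456789EXAMPLE"
  api_zone_name = "example.com"
  api_cert_arn  = "arn:aws:acm:eu-west-1:123456789012:certificate/22222222-2222-2222-2222-222222222222"

  # Cognito user pool used for authentication (placeholder values)
  aws_cognito_user_pool_arn       = "arn:aws:cognito-idp:eu-west-1:123456789012:userpool/eu-west-1_EXAMPLE"
  aws_cognito_user_pool_client_id = "example-client-id"
  aws_cognito_user_pool_domain    = "mlflow-team-x"
}
```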

Note you may also:

Add sidecar containers (e.g. a Datadog agent for Fargate)
Provide your own bucket/path as the default artifact root
Attach an autoscaling policy to the service (for instance, you may scale down to 0 instances during the night)
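As a rough illustration of the first two options, the fragment below passes one sidecar definition and points MLflow at an existing bucket. It is a sketch under assumptions: the `source` address, the bucket name `my-team-artifacts`, and the sidecar settings are placeholders, a real Datadog agent also needs an API key (omitted here), and the IAM policy assumes the `service_task_role_id` output can be used as the `role` argument.

```hcl
module "mlflow" {
  source = "git::https://github.com/humn-ai/tf-mod-aws-mlflow.git" # assumed address

  # ... the required inputs from the Inputs table go here ...

  # Extra containers that run next to the MLflow container in the same task.
  service_sidecar_container_definitions = [
    {
      name      = "datadog-agent"
      image     = "public.ecr.aws/datadog/agent:latest"
      essential = false
      environment = [
        { name = "ECS_FARGATE", value = "true" },
      ]
    },
  ]

  # Use an existing bucket and prefix as the default artifact root.
  artifact_bucket_id   = "my-team-artifacts" # hypothetical bucket
  artifact_bucket_path = "/mlflow"
}

# When overriding the bucket, the task role needs access to it.
resource "aws_iam_role_policy" "mlflow_custom_bucket" {
  name = "mlflow-custom-artifact-bucket"
  role = module.mlflow.service_task_role_id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::my-team-artifacts"
      },
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::my-team-artifacts/mlflow/*"
      },
    ]
  })
}
```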

Requirements

No requirements.

Providers

Name	Version
aws	n/a

Modules

No modules.

Resources

Name	Type
aws_apigatewayv2_api.mlflow	resource
aws_apigatewayv2_api_mapping.mlflow	resource
aws_apigatewayv2_authorizer.lambda	resource
aws_apigatewayv2_domain_name.mlflow	resource
aws_apigatewayv2_integration.mlflow	resource
aws_apigatewayv2_route.default	resource
aws_apigatewayv2_stage.default	resource
aws_apigatewayv2_vpc_link.mlflow	resource
aws_appautoscaling_target.mlflow	resource
aws_cloudwatch_log_group.mlflow	resource
aws_db_subnet_group.rds	resource
aws_ecs_cluster.mlflow	resource
aws_ecs_service.mlflow	resource
aws_ecs_task_definition.mlflow	resource
aws_iam_role.ecs_execution	resource
aws_iam_role.ecs_task	resource
aws_iam_role.lambda	resource
aws_iam_role_policy.db_secrets	resource
aws_iam_role_policy.default_bucket	resource
aws_iam_role_policy_attachment.ecs_execution	resource
aws_iam_role_policy_attachment.lambda	resource
aws_lambda_function.mlflow	resource
aws_lb.mlflow	resource
aws_lb_listener.http	resource
aws_lb_listener.https	resource
aws_lb_listener_rule.api	resource
aws_lb_listener_rule.http	resource
aws_lb_listener_rule.https	resource
aws_lb_target_group.mlflow	resource
aws_rds_cluster.backend_store	resource
aws_route53_record.api	resource
aws_route53_record.record	resource
aws_s3_bucket.default	resource
aws_security_group.ecs_service	resource
aws_security_group.lb	resource
aws_security_group.rds	resource
aws_security_group_rule.lb_egress	resource
aws_security_group_rule.lb_egress_idp	resource
aws_security_group_rule.lb_ingress_http	resource
aws_security_group_rule.lb_ingress_https	resource
aws_availability_zones.available	data source
aws_region.current	data source
aws_secretsmanager_secret.db_password	data source
aws_secretsmanager_secret_version.db_password	data source

Inputs

Name	Description	Type	Default	Required
api_cert_arn	(Required) - The ARN of the ACM certificate for the API DNS zone.	`string`	n/a	yes
api_zone_id	(Required) - The ID of the hosted zone MLflow API access will be hosted at.	`string`	n/a	yes
api_zone_name	(Required) - The name of the hosted zone MLflow API access will be hosted at.	`string`	n/a	yes
artifact_bucket_encryption_algorithm	Algorithm used for encrypting the default bucket.	`string`	`"AES256"`	no
artifact_bucket_encryption_key_arn	ARN of the key used to encrypt the bucket. Only needed if you set aws:kms as encryption algorithm.	`string`	`null`	no
artifact_bucket_id	If specified, MLflow will use this bucket to store artifacts. Otherwise, this module will create a dedicated bucket. When overriding this value, you need to enable the task role to access the root you specified	`string`	`null`	no
artifact_bucket_path	The path within the bucket where MLflow will store its artifacts	`string`	`"/"`	no
artifact_buckets_mlflow_will_read	A list of bucket IDs MLflow will need read access to, in order to show the stored artifacts. It accepts any valid IAM resource, including ARNs with wildcards, so you can do something like arn:aws:s3:::bucket-prefix-*	`list(string)`	`[]`	no
aws_account_id	The AWS account ID of the provider being deployed to (e.g. 123456789012). Autoloaded from account.tfvars.	`string`	`""`	no
aws_assume_role_arn	(Optional) - ARN of the IAM role when optionally connecting to AWS via assumed role. Autoloaded from account.tfvars.	`string`	`""`	no
aws_cognito_user_pool_arn	(Required) - AWS Cognito user pool arn	`string`	n/a	yes
aws_cognito_user_pool_client_id	(Required) - AWS Cognito user pool client id	`string`	n/a	yes
aws_cognito_user_pool_domain	(Required) - AWS Cognito user pool domain	`string`	n/a	yes
aws_region	The AWS region (e.g. ap-southeast-2). Autoloaded from region.tfvars.	`string`	`""`	no
certificate_arn	(Required) - The ARN of the certificate MLflow traffic will be encrypted with.	`string`	n/a	yes
database_auto_pause	Pause Aurora Serverless after a given amount of time with no activity. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume	`bool`	`true`	no
database_max_capacity	The maximum capacity for the Aurora Serverless cluster. Aurora will scale automatically in this range. See: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html	`number`	`1`	no
database_min_capacity	The minimum capacity for the Aurora Serverless cluster. Aurora will scale automatically in this range. See: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html	`number`	`1`	no
database_password_secret_arn	The ARN of the Secrets Manager secret that defines the database password. It needs to be created before calling the module.	`string`	n/a	yes
database_seconds_until_auto_pause	The number of seconds without activity before Aurora Serverless is paused. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume	`number`	`300`	no
database_skip_final_snapshot	n/a	`bool`	`false`	no
database_subnet_ids	List of subnets where the RDS database will be deployed	`list(string)`	n/a	yes
gunicorn_opts	Additional command line options forwarded to gunicorn processes (https://mlflow.org/docs/latest/cli.html#cmdoption-mlflow-server-gunicorn-opts)	`string`	`""`	no
key_arn	(Required) - The KMS key used to encrypt the secrets.	`string`	n/a	yes
load_balancer_ingress_cidr_blocks	CIDR blocks from where to allow traffic to the Load Balancer. With an internal LB, we've left this	`list(string)`	n/a	yes
load_balancer_is_internal	By default, the load balancer is internal. This is because as of v1.9.1, MLflow doesn't have native authentication or authorization. We recommend exposing MLflow behind a VPN or using OIDC/Cognito together with the LB listener.	`bool`	`true`	no
load_balancer_subnet_ids	List of subnets where the Load Balancer will be deployed	`list(string)`	n/a	yes
rds_cluster_engine_version	AWS RDS cluster engine version	`string`	`"5.7.mysql_aurora.2.08.3"`	no
record_name	(Required) - The name of the record MLflow will use.	`string`	n/a	yes
service_cpu	The number of CPU units reserved for the MLflow container	`number`	`2048`	no
service_image_tag	The MLflow version to deploy. Note that this version has to be available as a tag here: https://hub.docker.com/r/larribas/mlflow	`string`	`"1.9.1"`	no
service_log_retention_in_days	The number of days to keep logs around	`number`	`90`	no
service_max_capacity	Maximum number of instances for the ecs service. This will create an aws_appautoscaling_target that can later on be used to autoscale the MLflow instance	`number`	`2`	no
service_memory	The amount (in MiB) of memory reserved for the MLflow container	`number`	`4096`	no
service_min_capacity	Minimum number of instances for the ecs service. This will create an aws_appautoscaling_target that can later on be used to autoscale the MLflow instance	`number`	`2`	no
service_sidecar_container_definitions	A list of container definitions to deploy alongside the main container. See: https://www.terraform.io/docs/providers/aws/r/ecs_task_definition.html#container_definitions	`list`	`[]`	no
service_subnet_ids	List of subnets where the MLflow ECS service will be deployed (the recommendation is to use subnets that cannot be accessed directly from the Internet)	`list(string)`	n/a	yes
tags	AWS Tags common to all the resources created	`map(string)`	`{}`	no
unique_name	A unique name for this application (e.g. mlflow-team-name)	`string`	n/a	yes
vpc_id	AWS VPC to deploy MLflow into	`string`	n/a	yes
zone_id	(Required) - The ID of the hosted zone MLflow will be hosted at.	`string`	n/a	yes
zone_name	(Required) - The name of the hosted zone MLflow will be hosted at.	`string`	n/a	yes
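To show how some of the optional inputs above combine, here is a hedged fragment that lengthens the Aurora auto-pause window, widens the capacity range, switches the default bucket to KMS encryption, and shortens log retention. The values are illustrative only, the KMS key ARN is a placeholder, and the `source` address is an assumption.

```hcl
module "mlflow" {
  source = "git::https://github.com/humn-ai/tf-mod-aws-mlflow.git" # assumed address

  # ... the required inputs from the table above go here ...

  # Pause the database after 30 minutes of inactivity instead of 5.
  database_auto_pause               = true
  database_seconds_until_auto_pause = 1800

  # Let Aurora Serverless scale between 1 and 4 capacity units.
  database_min_capacity = 1
  database_max_capacity = 4

  # Encrypt the default artifact bucket with a customer-managed key.
  artifact_bucket_encryption_algorithm = "aws:kms"
  artifact_bucket_encryption_key_arn   = "arn:aws:kms:eu-west-1:123456789012:key/00000000-0000-0000-0000-000000000000"

  # Keep container logs for 30 days and tag every resource.
  service_log_retention_in_days = 30
  tags = {
    team = "team-x"
  }
}
```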

Outputs

Name	Description
artifact_bucket_id	n/a
cluster_id	n/a
load_balancer_arn	n/a
load_balancer_dns_name	n/a
load_balancer_target_group_id	n/a
load_balancer_zone_id	n/a
service_autoscaling_target_max_capacity	n/a
service_autoscaling_target_min_capacity	n/a
service_autoscaling_target_resource_id	n/a
service_autoscaling_target_scalable_dimension	n/a
service_autoscaling_target_service_namespace	n/a
service_execution_role_id	n/a
service_task_role_id	n/a
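The `service_autoscaling_target_*` outputs are meant to be fed into your own autoscaling resources. The sketch below scales the service down to zero in the evening and back up in the morning, assuming a `module "mlflow"` block exists in the same configuration and that the min/max capacity outputs mirror the corresponding inputs. The schedules and action names are illustrative.

```hcl
# Scale the MLflow service down to 0 tasks at 20:00 UTC on weekdays.
resource "aws_appautoscaling_scheduled_action" "mlflow_night" {
  name               = "mlflow-scale-down-at-night"
  service_namespace  = module.mlflow.service_autoscaling_target_service_namespace
  resource_id        = module.mlflow.service_autoscaling_target_resource_id
  scalable_dimension = module.mlflow.service_autoscaling_target_scalable_dimension
  schedule           = "cron(0 20 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Restore the configured capacity at 07:00 UTC on weekdays.
resource "aws_appautoscaling_scheduled_action" "mlflow_morning" {
  name               = "mlflow-scale-up-in-the-morning"
  service_namespace  = module.mlflow.service_autoscaling_target_service_namespace
  resource_id        = module.mlflow.service_autoscaling_target_resource_id
  scalable_dimension = module.mlflow.service_autoscaling_target_scalable_dimension
  schedule           = "cron(0 7 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = module.mlflow.service_autoscaling_target_min_capacity
    max_capacity = module.mlflow.service_autoscaling_target_max_capacity
  }
}
```

With the database auto-pause enabled as well, this keeps both compute and database idle outside working hours.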

🔗 Related Projects

You can find more Terraform modules by visiting the links below:

terraform-aws-mlflow - The upstream repository this module was forked from

❓ Help

Got a question? File a GitHub issue, or message the DevOps team on Slack.

🥳 Contributors

Lorenzo Arribas

Yosuke Adachi

Lawrence "Loz" Warren

Razvan Tudorica

Alaa

Atlantis Bot
