This Terraform module allows you to deploy a cluster of MLflow servers + UI using:
- ECS Fargate as the compute engine
- Amazon Aurora Serverless as the backend store
- S3 as the default artifact root
When designing this module, we've made some decisions about technologies and configuration that might not apply to all use cases. In doing so, we've applied the following principles, in this order:
- High availability and recovery. All components are meant to be highly available and provide backups so that important data can be recovered in case of a failure. Database backups are activated, and versioning is enabled for the S3 bucket.
- Least privilege. We've created dedicated security groups and IAM roles, and restricted traffic/permissions to the minimum necessary to run MLflow.
- Smallest maintenance overhead. We've chosen serverless technologies like Fargate and Aurora Serverless to minimize the cost of ownership of an MLflow cluster.
- Smallest cost overhead. We've tried to choose technologies that minimize costs, under the assumption that MLflow will be an internal tool that is used during working hours, and with a very lightweight use of the database.
- Private by default. As of version 1.9.1, MLflow doesn't provide native authentication/authorization mechanisms. When using the default values, the module will create resources that are not exposed to the Internet. Moreover, the module provides server-side encryption for the S3 bucket and the database through different KMS keys.
- Flexibility. Where possible, we've tried to make this module usable under different circumstances. For instance, you can use it to deploy MLflow to a private VPC and access it through a VPN, or you can leverage ALB's integration with Cognito/OIDC to allow users to access MLflow from your SSO solution.
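As an illustration of the Cognito option mentioned in the last principle, the following is a minimal sketch of an HTTPS listener that authenticates users against a Cognito user pool before forwarding traffic to MLflow. It is not taken from this module's source: every variable name below is a hypothetical placeholder, and the module may wire the listener differently.

```hcl
# Hypothetical inputs for this sketch; they are not inputs of the module.
variable "load_balancer_arn" { type = string }
variable "target_group_arn" { type = string }
variable "certificate_arn" { type = string }
variable "cognito_user_pool_arn" { type = string }
variable "cognito_user_pool_client_id" { type = string }
variable "cognito_user_pool_domain" { type = string }

resource "aws_lb_listener" "mlflow_https" {
  load_balancer_arn = var.load_balancer_arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  # First action: require a valid Cognito session.
  default_action {
    type  = "authenticate-cognito"
    order = 1

    authenticate_cognito {
      user_pool_arn       = var.cognito_user_pool_arn
      user_pool_client_id = var.cognito_user_pool_client_id
      user_pool_domain    = var.cognito_user_pool_domain
    }
  }

  # Second action: forward authenticated requests to the MLflow target group.
  default_action {
    type             = "forward"
    order            = 2
    target_group_arn = var.target_group_arn
  }
}
```

Authentication actions are only supported on HTTPS listeners, which is why the sketch requires a certificate.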
The following diagram illustrates the components the module creates with the default configuration:
To use this module, add a block like the following to your configuration:
module "mlflow" {
  source  = "glovo/mlflow/aws"
  version = "1.0.0"

  unique_name                       = "mlflow-team-x"
  vpc_id                            = "my-vpc"
  load_balancer_subnet_ids          = ["public-subnet-az-1", "public-subnet-az-2", "public-subnet-az-3"]
  load_balancer_ingress_cidr_blocks = ["192.0.2.0/24"]
  service_subnet_ids                = ["private-subnet-az-1", "private-subnet-az-2", "private-subnet-az-3"]
  database_subnet_ids               = ["db-private-subnet-az-1", "db-private-subnet-az-2", "db-private-subnet-az-3"]
  database_password_secret_arn      = "mlflow-team-x-db-password-arn"
}
You can find a more complete usage example in terratest/examples/main.tf.
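The `database_password_secret_arn` input must point to a Secrets Manager secret that exists before the module is called. One possible way to create it in the same configuration is sketched below. The resource names and the secret name are illustrative, the sketch assumes the `hashicorp/random` provider is available, and it assumes the module reads the secret value as a plain string, which you should confirm against the module source.

```hcl
# Generate a password without special characters, to stay clear of the
# characters that MySQL-compatible Aurora rejects in master passwords.
resource "random_password" "mlflow_db" {
  length  = 32
  special = false
}

# Hypothetical secret name; pick whatever fits your naming scheme.
resource "aws_secretsmanager_secret" "mlflow_db_password" {
  name = "mlflow-team-x-db-password"
}

# Store the generated password as the current secret value.
resource "aws_secretsmanager_secret_version" "mlflow_db_password" {
  secret_id     = aws_secretsmanager_secret.mlflow_db_password.id
  secret_string = random_password.mlflow_db.result
}
```

The module call would then use `database_password_secret_arn = aws_secretsmanager_secret.mlflow_db_password.arn`. Keep in mind that the generated password also ends up in the Terraform state, so the state backend needs to be protected accordingly.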
Note that you may also:
- Add sidecar containers (e.g. a Datadog agent for Fargate)
- Provide your own bucket/path as the default artifact root
- Attach an autoscaling policy to the service (for instance, you may scale down to 0 instances during the night)
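The options above could look roughly like the sketch below. It defines a sidecar container list that could be passed to `service_sidecar_container_definitions`, and two scheduled actions that scale the service to 0 at night and back up in the morning. All names, ARNs, the image, and the schedule are assumptions for illustration. In particular, the ECS cluster and service names are hypothetical placeholders that must match what the module actually creates.

```hcl
locals {
  # Hypothetical sidecar: a Datadog agent running next to the MLflow container.
  mlflow_sidecars = [
    {
      name      = "datadog-agent"
      image     = "public.ecr.aws/datadog/agent:latest"
      essential = false
      environment = [
        { name = "ECS_FARGATE", value = "true" },
      ]
      secrets = [
        # Placeholder ARN; the API key should live in a secret, not in code.
        { name = "DD_API_KEY", valueFrom = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:datadog-api-key" },
      ]
    },
  ]

  # Placeholders: replace with the cluster and service the module creates.
  mlflow_ecs_cluster_name    = "mlflow-team-x"
  mlflow_ecs_service_name    = "mlflow-team-x"
  mlflow_scaling_resource_id = "service/${local.mlflow_ecs_cluster_name}/${local.mlflow_ecs_service_name}"
}

# Scale down to 0 instances on weekday evenings (20:00 UTC).
resource "aws_appautoscaling_scheduled_action" "mlflow_night" {
  name               = "mlflow-scale-down-at-night"
  service_namespace  = "ecs"
  resource_id        = local.mlflow_scaling_resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  schedule           = "cron(0 20 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Scale back up to 2 instances on weekday mornings (07:00 UTC).
resource "aws_appautoscaling_scheduled_action" "mlflow_morning" {
  name               = "mlflow-scale-up-in-the-morning"
  service_namespace  = "ecs"
  resource_id        = local.mlflow_scaling_resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  schedule           = "cron(0 7 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 2
  }
}
```

The scheduled actions rely on the `aws_appautoscaling_target` that the module registers for the service. With this schedule the service stays at 0 instances over the weekend, since the last action before Saturday is the Friday evening scale-down.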
No requirements.
Name | Version |
---|---|
aws | n/a |
No modules.
Name | Description | Type | Default | Required |
---|---|---|---|---|
api_cert_arn | (Required) - The ARN of the ACM certificate for the API DNS zone | string | n/a | yes |
api_zone_id | (Required) - The ID of the hosted zone MLflow API access will be hosted at. | string | n/a | yes |
api_zone_name | (Required) - The name of the hosted zone MLflow API access will be hosted at. | string | n/a | yes |
artifact_bucket_encryption_algorithm | Algorithm used for encrypting the default bucket. | string | "AES256" | no |
artifact_bucket_encryption_key_arn | ARN of the key used to encrypt the bucket. Only needed if you set aws:kms as encryption algorithm. | string | null | no |
artifact_bucket_id | If specified, MLflow will use this bucket to store artifacts. Otherwise, this module will create a dedicated bucket. When overriding this value, you need to enable the task role to access the root you specified | string | null | no |
artifact_bucket_path | The path within the bucket where MLflow will store its artifacts | string | "/" | no |
artifact_buckets_mlflow_will_read | A list of bucket IDs MLflow will need read access to, in order to show the stored artifacts. It accepts any valid IAM resource, including ARNs with wildcards, so you can do something like arn:aws:s3:::bucket-prefix-* | list(string) | [] | no |
aws_account_id | The AWS account id of the provider being deployed to (e.g. 12345678). Autoloaded from account.tfvars | string | "" | no |
aws_assume_role_arn | (Optional) - ARN of the IAM role when optionally connecting to AWS via assumed role. Autoloaded from account.tfvars. | string | "" | no |
aws_cognito_user_pool_arn | (Required) - AWS Cognito user pool ARN | string | n/a | yes |
aws_cognito_user_pool_client_id | (Required) - AWS Cognito user pool client ID | string | n/a | yes |
aws_cognito_user_pool_domain | (Required) - AWS Cognito user pool domain | string | n/a | yes |
aws_region | The AWS region (e.g. ap-southeast-2). Autoloaded from region.tfvars. | string | "" | no |
certificate_arn | (Required) - The ARN of the certificate MLflow traffic will be encrypted with. | string | n/a | yes |
database_auto_pause | Pause Aurora Serverless after a given amount of time with no activity. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume | bool | true | no |
database_max_capacity | The maximum capacity for the Aurora Serverless cluster. Aurora will scale automatically in this range. See: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html | number | 1 | no |
database_min_capacity | The minimum capacity for the Aurora Serverless cluster. Aurora will scale automatically in this range. See: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html | number | 1 | no |
database_password_secret_arn | The ARN of the Secrets Manager secret that defines the database password. It needs to be created before calling the module | string | n/a | yes |
database_seconds_until_auto_pause | The number of seconds without activity before Aurora Serverless is paused. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume | number | 300 | no |
database_skip_final_snapshot | n/a | bool | false | no |
database_subnet_ids | List of subnets where the RDS database will be deployed | list(string) | n/a | yes |
gunicorn_opts | Additional command line options forwarded to gunicorn processes (https://mlflow.org/docs/latest/cli.html#cmdoption-mlflow-server-gunicorn-opts) | string | "" | no |
key_arn | (Required) - The KMS key used to encrypt the secrets. | string | n/a | yes |
load_balancer_ingress_cidr_blocks | CIDR blocks from where to allow traffic to the Load Balancer. With an internal LB, we've left this | list(string) | n/a | yes |
load_balancer_is_internal | By default, the load balancer is internal. This is because as of v1.9.1, MLflow doesn't have native authentication or authorization. We recommend exposing MLflow behind a VPN or using OIDC/Cognito together with the LB listener. | bool | true | no |
load_balancer_subnet_ids | List of subnets where the Load Balancer will be deployed | list(string) | n/a | yes |
rds_cluster_engine_version | AWS RDS cluster engine version | string | "5.7.mysql_aurora.2.08.3" | no |
record_name | (Required) - The name of the record MLflow will use. | string | n/a | yes |
service_cpu | The number of CPU units reserved for the MLflow container | number | 2048 | no |
service_image_tag | The MLflow version to deploy. Note that this version has to be available as a tag here: https://hub.docker.com/r/larribas/mlflow | string | "1.9.1" | no |
service_log_retention_in_days | The number of days to keep logs around | number | 90 | no |
service_max_capacity | Maximum number of instances for the ECS service. This will create an aws_appautoscaling_target that can later on be used to autoscale the MLflow instance | number | 2 | no |
service_memory | The amount (in MiB) of memory reserved for the MLflow container | number | 4096 | no |
service_min_capacity | Minimum number of instances for the ECS service. This will create an aws_appautoscaling_target that can later on be used to autoscale the MLflow instance | number | 2 | no |
service_sidecar_container_definitions | A list of container definitions to deploy alongside the main container. See: https://www.terraform.io/docs/providers/aws/r/ecs_task_definition.html#container_definitions | list | [] | no |
service_subnet_ids | List of subnets where the MLflow ECS service will be deployed (the recommendation is to use subnets that cannot be accessed directly from the Internet) | list(string) | n/a | yes |
tags | AWS Tags common to all the resources created | map(string) | {} | no |
unique_name | A unique name for this application (e.g. mlflow-team-name) | string | n/a | yes |
vpc_id | AWS VPC to deploy MLflow into | string | n/a | yes |
zone_id | (Required) - The ID of the hosted zone MLflow will be hosted at. | string | n/a | yes |
zone_name | (Required) - The name of the hosted zone MLflow will be hosted at. | string | n/a | yes |
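For reference, the following sketch sets every input marked as required in the table above, plus a few optional overrides. All values are placeholders (ARNs, IDs, domain names, and the account number are made up). The `source` and `version` are copied from the usage example and may need to point at this fork instead, so treat them as assumptions.

```hcl
module "mlflow" {
  # Assumption: copied from the usage example; adjust to the module you deploy.
  source  = "glovo/mlflow/aws"
  version = "1.0.0"

  # Naming and networking
  unique_name                       = "mlflow-team-x"
  vpc_id                            = "my-vpc"
  load_balancer_subnet_ids          = ["public-subnet-az-1", "public-subnet-az-2", "public-subnet-az-3"]
  load_balancer_ingress_cidr_blocks = ["192.0.2.0/24"]
  service_subnet_ids                = ["private-subnet-az-1", "private-subnet-az-2", "private-subnet-az-3"]
  database_subnet_ids               = ["db-private-subnet-az-1", "db-private-subnet-az-2", "db-private-subnet-az-3"]

  # DNS and TLS (placeholder zone and certificate values)
  record_name     = "mlflow"
  zone_id         = "Z0000000EXAMPLE"
  zone_name       = "example.com"
  certificate_arn = "arn:aws:acm:eu-west-1:123456789012:certificate/example-ui"
  api_zone_id     = "Z0000000EXAMPLEAPI"
  api_zone_name   = "api.example.com"
  api_cert_arn    = "arn:aws:acm:eu-west-1:123456789012:certificate/example-api"

  # Authentication through Cognito (placeholder pool values)
  aws_cognito_user_pool_arn       = "arn:aws:cognito-idp:eu-west-1:123456789012:userpool/eu-west-1_EXAMPLE"
  aws_cognito_user_pool_client_id = "example-client-id"
  aws_cognito_user_pool_domain    = "mlflow-team-x"

  # Secrets and encryption
  key_arn                      = "arn:aws:kms:eu-west-1:123456789012:key/example-secrets-key"
  database_password_secret_arn = "mlflow-team-x-db-password-arn"

  # Optional overrides: KMS encryption for the artifact bucket
  artifact_bucket_encryption_algorithm = "aws:kms"
  artifact_bucket_encryption_key_arn   = "arn:aws:kms:eu-west-1:123456789012:key/example-bucket-key"

  # Optional overrides: read access to other buckets via a wildcard ARN
  artifact_buckets_mlflow_will_read = ["arn:aws:s3:::bucket-prefix-*"]

  # Optional overrides: let Aurora Serverless pause after 10 idle minutes
  database_auto_pause               = true
  database_seconds_until_auto_pause = 600
  database_min_capacity             = 1
  database_max_capacity             = 2

  tags = {
    team = "team-x"
  }
}
```

Inputs that are omitted fall back to the defaults listed in the table.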
You can find more Terraform modules by visiting the links below:
- terraform-aws-mlflow - The upstream fork of this repository
Got a question? File a GitHub issue, or message the DevOps team on Slack.
- Lorenzo Arribas
- Yosuke Adachi
- Lawrence "Loz" Warren
- Razvan Tudorica
- Alaa
- Atlantis Bot