
Release 18+ Upgrade Guide Breaks Existing Deployments #1744

Closed
jseiser opened this issue Jan 6, 2022 · 131 comments · Fixed by #1981

@jseiser commented Jan 6, 2022

Description

Attempted to follow the upgrade guide to get to 18+. Our Terraform deployments generally run from a Jenkins worker pod that exists on the same cluster we are upgrading. The pod has a service account attached, using the IRSA setup, which grants it access to the cluster.
This all worked before the upgrade.

Reproduction

Attempt to follow the upgrade guide for 18.

Code Snippet to Reproduce

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.0.1"

  cluster_name    = format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env)
  cluster_version = var.cluster_version

  subnet_ids = data.aws_subnet_ids.private.ids
  vpc_id     = data.terraform_remote_state.vpc.outputs.vpc_id

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false

  cluster_security_group_additional_rules = {
    admin_access = {
      description = "Admin ingress to Kubernetes API"
      cidr_blocks = [data.terraform_remote_state.vpc.outputs.vpc_cidr_block]
      protocol    = "tcp"
      from_port   = 443
      to_port     = 443
      type        = "ingress"
    }
  }

  eks_managed_node_group_defaults = {
    ami_type                   = "AL2_x86_64"
    disk_size                  = var.node_group_default_disk_size
    enable_bootstrap_user_data = true
    pre_bootstrap_user_data    = templatefile("${path.module}/templates/userdata.tpl", {})
    desired_size               = lower(var.platform_env) == "prod" ? 3 : 2
    max_size                   = lower(var.platform_env) == "prod" ? 6 : 3
    min_size                   = lower(var.platform_env) == "prod" ? 3 : 1
    instance_types             = lower(var.platform_env) == "prod" ? var.prod_instance_types : var.dev_instance_types
    capacity_type              = "ON_DEMAND"
    additional_tags = {
      Name = format("%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env)
    }
    update_config = {
      max_unavailable_percentage = 50
    }
    update_launch_template_default_version = true
    create_launch_template                 = true
    create_iam_role                        = true
    iam_role_name                          = format("iam-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env)
    iam_role_use_name_prefix               = false
    iam_role_description                   = "EKS managed node group Role"
    iam_role_tags                          = local.tags
    iam_role_additional_policies = [
      "arn:aws-us-gov:iam::aws:policy/AmazonSSMManagedInstanceCore"
    ]
  }
  eks_managed_node_groups = {
    private1 = {
      subnet_ids = [tolist(data.aws_subnet_ids.private.ids)[0]]
    }
    private2 = {
      subnet_ids = [tolist(data.aws_subnet_ids.private.ids)[1]]
    }
    private3 = {
      subnet_ids = [tolist(data.aws_subnet_ids.private.ids)[2]]
    }
  }

  cluster_enabled_log_types              = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cloudwatch_log_group_retention_in_days = 7

  enable_irsa = true

  cluster_encryption_config = [
    {
      provider_key_arn = aws_kms_key.eks.arn
      resources        = ["secrets"]
    }
  ]

  tags = merge(
    local.tags,
    {
      "Name"        = format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env),
      "EKS_VERSION" = var.cluster_version
    }
  )

}

resource "null_resource" "patch" {
  triggers = {
    kubeconfig = base64encode(local.kubeconfig)
    cmd_patch  = "kubectl patch configmap/aws-auth --patch \"${local.aws_auth_configmap_yaml}\" -n kube-system --kubeconfig <(echo $KUBECONFIG | base64 --decode)"
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    environment = {
      KUBECONFIG = self.triggers.kubeconfig
    }
    command = self.triggers.cmd_patch
  }
}

locals {

  kubeconfig = yamlencode({
    apiVersion      = "v1"
    kind            = "Config"
    current-context = "terraform"
    clusters = [{
      name = module.eks.cluster_id
      cluster = {
        certificate-authority-data = module.eks.cluster_certificate_authority_data
        server                     = module.eks.cluster_endpoint
      }
    }]
    contexts = [{
      name = "terraform"
      context = {
        cluster = module.eks.cluster_id
        user    = "terraform"
      }
    }]
    users = [{
      name = "terraform"
      user = {
        token = data.aws_eks_cluster_auth.eks.token
      }
    }]
  })

  aws_auth_configmap_yaml = <<-EOT
  ${chomp(module.eks.aws_auth_configmap_yaml)}
      - rolearn: arn:${var.iam_partition}:iam::${data.aws_caller_identity.current.account_id}:role/role-gitlab-runner-eks-${var.platform_env}
        username: gitlab:{{SessionName}}
        groups:
          - system:masters
      - rolearn: arn:${var.iam_partition}:iam::${data.aws_caller_identity.current.account_id}:role/role-jenkins-worker-eks-${var.platform_env}
        username: jenkins:{{SessionName}}
        groups:
          - system:masters
      - rolearn: arn:${var.iam_partition}:iam::${data.aws_caller_identity.current.account_id}:role/AWSReservedSSO_AdministratorAccess_f50fcd43baf05a89
        username: AWSAdministratorAccess:{{SessionName}}
        groups:
          - system:masters
  EOT
}

Expected behavior

The module runs to completion.

Actual behavior

Current aws-auth

sh-4.2$ kubectl get configmap aws-auth -n kube-system -o yaml
apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      "rolearn": "arn:aws-us-gov:iam:::role/eks-ops-eks-dev20211104211936784200000009"
      "username": "system:node:{{EC2PrivateDNSName}}"
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws-us-gov:iam:::role/role-gitlab-runner-eks-dev"
      "username": "gitlab-runner-dev"
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws-us-gov:iam:::role/role-jenkins-worker-eks-dev"
      "username": "jenkins-dev"
  mapUsers: |
    - "groups":
      - "system:masters"
      "userarn": "arn:aws-us-gov:iam:::user/justin.seiser"
      "username": "jseiser"
kind: ConfigMap

The SA on the pod that Terraform is running from:

sh-4.2$ kubectl get sa jenkins-worker -n jenkins -o yaml
apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws-us-gov:iam:::role/role-jenkins-worker-eks-dev

The error Terraform returns:

module.eks.kubernetes_config_map.aws_auth[0]: Refreshing state... [id=kube-system/aws-auth]

Error: configmaps "aws-auth" is forbidden: User "system:serviceaccount:jenkins:jenkins-worker" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

Additional context

I do not doubt that I'm missing something, but that something does not appear to be covered in any documentation I can find.

@jseiser changed the title from "Release 18+ Breaks Existing Deployments" to "Release 18+ Upgrade Guide Breaks Existing Deployments" on Jan 6, 2022
@jseiser (Author) commented Jan 6, 2022

It doesn't work if I run from a bastion host, which is also granted access to the cluster.

Error

╷
│ Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp 127.0.0.1:80: connect: connection refused
│
│
╵

Same server showing access

sh-4.2$ kubectl get configmap aws-auth -n kube-system
NAME       DATA   AGE
aws-auth   3      62d
sh-4.2$ rm ~/.kube/config
sh-4.2$ terraform plan
module.eks.aws_iam_openid_connect_provider.oidc_provider[0]: Refreshing state... [id=arn:aws-us-gov:iam:::oidc-provider/oidc.eks.us-gov-west-1.amazonaws.com/id/]
╷
│ Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp 127.0.0.1:80: connect: connection refused
│
sh-4.2$ kubectl get pods -n jenkins
The connection to the server localhost:8080 was refused - did you specify the right host or port?
sh-4.2$ aws eks update-kubeconfig --name eks-ops-eks-dev
Added new context arn:aws-us-gov:eks:us-gov-west-1::cluster/eks-ops-eks-dev to /home/ssm-user/.kube/config
sh-4.2$ kubectl get pods -n jenkins
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   2/2     Running   0          23d

So it doesn't matter whether the server is configured to access the cluster or not; it fails the same way.

@jseiser (Author) commented Jan 6, 2022

I ran it the first few times with the Terraform Kubernetes provider configured as it was when it was working, and have re-run it with that provider completely removed.

I've also run it with and without the null_resource patch.

@jseiser (Author) commented Jan 7, 2022

Are there any requirements to upgrade an existing cluster from 17 to 18?

I'm not getting 'variable not expected here' warnings, so I think I have everything renamed, but I'm not making any progress on the aws-auth configmap. Does it need to be removed from the state?

@jseiser (Author) commented Jan 7, 2022

Also, not sure if this matters, but in the current working deployment we have

  kubeconfig_aws_authenticator_command = "aws"
  kubeconfig_aws_authenticator_command_args = [
    "eks",
    "get-token",
    "--cluster-name",
    format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env),
    "--region",
    data.aws_region.current.name
  ]
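(For context: in v18 the module no longer renders a kubeconfig, so settings like these typically move onto the kubernetes provider itself, if one is used. A minimal sketch, assuming a recent AWS CLI on the runner and reusing the names from the config above:)

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  # Equivalent of the 17.x kubeconfig_aws_authenticator_* settings
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args = [
      "eks",
      "get-token",
      "--cluster-name",
      format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env),
      "--region",
      data.aws_region.current.name,
    ]
  }
}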

@sparqueur commented:

Same Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth" error on my side. You are not alone man ;-)

@bryantbiggs (Member) commented:

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?

By default, this port is not open on the security groups created by this module - https://github.com/terraform-aws-modules/terraform-aws-eks#security-groups

@sparqueur commented:

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?

By default, this port is not open on the security groups created by this module - https://github.com/terraform-aws-modules/terraform-aws-eks#security-groups

It seems more like a wrong-URL error than a security group issue, no? I would expect this call to be made from my laptop, not from a remote server. But I may be wrong.

@bryantbiggs (Member) commented:

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?
By default, this port is not open on the security groups created by this module - terraform-aws-modules/terraform-aws-eks#security-groups

It's more like if it's a wrong url error more than a security group issue, no ? I expect this call being done from my laptop, not from a remote server. But I may be wrong.

The error message I was referring to was the one provided by @jseiser further up.

The error message you have provided does not give enough detail. The module does not construct any URLs, so I would suspect it's also a security group access issue you are facing, but that's just a hunch based on what's provided.

@GeoBSI commented Jan 7, 2022

Hello, I'm having the exact same issue when upgrading from v17 to v18. It happens during the state refresh of the terraform plan command.

The problem disappears if I manually remove the module.eks-cluster.local_file.kubeconfig[0] and module.eks-cluster.kubernetes_config_map.aws_auth[0] from the v17 state file, but I'm not exactly sure if there are any consequences in doing this.

@jseiser (Author) commented Jan 7, 2022

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?

By default, this port is not open on the security groups created by this module - https://github.com/terraform-aws-modules/terraform-aws-eks#security-groups

@bryantbiggs

I guess I don't follow. This is a cluster/TF deployment created using the <18.0 module version. We are trying to upgrade this environment to the latest module version. I'm not able to run a plan because of the error. The security groups have not changed.

@bryantbiggs (Member) commented:

Hello, I'm having the exact same issue when upgrading from v17 to v18. It happens during the state refresh of the terraform plan command.

The problem disappears if I manually remove the module.eks-cluster.local_file.kubeconfig[0] and module.eks-cluster.kubernetes_config_map.aws_auth[0] from the v17 state file, but I'm not exactly sure if there are any consequences in doing this.

I believe that would be the appropriate change (remove those from your state) - v18.x removes native support for kubeconfig and aws-auth configmap https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/UPGRADE-18.0.md#list-of-backwards-incompatible-changes
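For reference, the state surgery described above is roughly the following; the exact resource addresses depend on your module label, so confirm them with terraform state list first:

terraform state list | grep -E 'kubeconfig|aws_auth'
terraform state rm 'module.eks-cluster.local_file.kubeconfig[0]'
terraform state rm 'module.eks-cluster.kubernetes_config_map.aws_auth[0]'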

@PadillaBraulio commented:

How should we handle ConfigMap roles now?
I had a role we used to connect to the cluster, but since v18.x drops support for the aws-auth configmap, how can we address that?

For example, I had something like this:

  mapRoles: |
.
.
.
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws:iam::000000000000:role/assumedRole
      "username": "administrator"
and that role was set in the map_roles variable, something like this:

module "eks" {
  source                      = "terraform-aws-modules/eks/aws"
  version                     = "17.24.0"

  map_roles = [
    {
      rolearn = aws_iam_role.eks_administrator.arn
      username = "administrator"
      groups = [ "system:masters" ]
    }
  ]

}

Moving forward, what would be the best way to add these types of roles to aws-auth?

@bryantbiggs (Member) commented:

@PadillaBraulio this is left up to users to decide what suits them best (Terraform, Helm, some flavor of GitOps, bash scripts, etc.)

@martijnvdp (Contributor) commented Jan 7, 2022

These errors also occur on certain changes when your provider config depends on output of the EKS module.
I got the same when I used coalesce(module.eks.cluster_id, "cluster1") for the name filter in the data block for the provider;
after changing that to the static name, the error is gone.

example:

data "aws_eks_cluster" "cluster1" {
  // name= coalesce(module.cluster1.cluster_id, "cluster1")  will generate error on certain changes
  name  = "cluster1"// works
}

data "aws_eks_cluster_auth" "cluster1" {
  name  = "cluster1"
}

provider "kubernetes" {
  host                   = element(concat(data.aws_eks_cluster.cluster1.*.endpoint, [""]), 0)
  cluster_ca_certificate = base64decode(element(concat(data.aws_eks_cluster.cluster1.*.certificate_authority.0.data, [""]), 0))
  token                  = element(concat(data.aws_eks_cluster_auth.cluster1.*.token, [""]), 0)
}

@johngmyers commented:

Since I only had a test cluster, I went ahead and approved the plan to replace the cluster.

This is probably a provider issue, but the execution failed because it tried to create a new cluster before tearing down the old one. But it couldn't create the new cluster because an existing cluster, the old one, had the same name.

@jseiser (Author) commented Jan 8, 2022

These errors also occur on certain changes when your the provider config depends on output of the eks module i got the same when i used coalesce(module.eks.cluster_id, "cluster1") for name name filter in the datablock for the provider after changing that to the static name the error is gone

example:

data "aws_eks_cluster" "cluster1" {
  // name= coalesce(module.cluster1.cluster_id, "cluster1")  will generate error on certain changes
  name  = "cluster1"// works
}

data "aws_eks_cluster_auth" "cluster1" {
  name  = "cluster1"
}

provider "kubernetes" {
  host                   = element(concat(data.aws_eks_cluster.cluster1.*.endpoint, [""]), 0)
  cluster_ca_certificate = base64decode(element(concat(data.aws_eks_cluster.cluster1.*.certificate_authority.0.data, [""]), 0))
  token                  = element(concat(data.aws_eks_cluster_auth.cluster1.*.token, [""]), 0)
}

I get the error even when removing the k8s provider.

They also have the data sources in their example:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/complete/main.tf#L235

@martijnvdp (Contributor) commented Jan 8, 2022

I get the error even when removing the k8s provider

they also have the data in there example..

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/complete/main.tf#L235

Yes, I normally have it like in the example, but if I encounter these kinds of errors (dial tcp 127.0.0.1:80: connect: connection refused) I change it to the static cluster name, only to be able to run the plan.

I think it is caused by the cluster IAM role ARN or the cluster security group changing, which triggers the replacement of the cluster.

In my case: because I'm upgrading an existing cluster, the cluster IAM role and security group attached to the existing cluster must not be changed, otherwise it triggers a cluster replacement, and then I can only run a plan with the static cluster name.

If I adjust the config of the 18.0.4 module so it keeps using the same cluster IAM role and cluster security group, the cluster isn't affected and I can run a plan normally. I had to set the security group description and name to match the current config (see the sketch below):

cluster_security_group_description = "EKS cluster security group." (pre-v18 the default description ended with ".")
cluster_security_group_name = var.cluster_name
prefix_separator = ""

and the cluster IAM role ARN:

iam_role_arn = "current cluster iam role arn"

@joncolby commented Jan 8, 2022

Have you checked the security groups, as already suggested? You mentioned the Jenkins pod runs in the same cluster, or at least inside the same VPC, as I understood it. If you have the private API server endpoint enabled, it could be that the Jenkins pod will try to connect to the private endpoint and require a security group rule that is not provided by the EKS module by default. I actually ran into this same problem.

@jseiser (Author) commented Jan 9, 2022

have you checked the security groups, as already suggested? You mentioned the jenkins pod runs in the same cluster or at least inside the same vpc as I understood. If you have the private api-server endpoint enabled, it could be the jenkins pod will try to connect to the private endpoint, and require a security group rule that is not provided by the eks module by default. i actually ran into this same problem.

It's definitely not that. Nothing has changed yet, since this is a working 17.x deployment; it only fails when trying to run the plan with the 18.x module.

I also showed above that it errors out on an external server as well.

Thanks.

@jseiser (Author) commented Jan 9, 2022

if i adjust the config of the 18.0.4 module so it keeps using the same iam cluster role and cluster security group

@martijnvdp

I'll have to give this a try on Monday.

@sepehrmavedati commented:

Hoping this saves time for others dealing with the aws-auth configmap management change. This worked for us, modifying the example code from here.

locals {
  kubeconfig = ...

  current_auth_configmap = yamldecode(module.eks.aws_auth_configmap_yaml)

  updated_auth_configmap_data = {
    data = {
      mapRoles = yamlencode(
        distinct(concat(
          yamldecode(local.current_auth_configmap.data.mapRoles),
          local.map_roles,
        ))
      )
      mapUsers = yamlencode(local.map_users)
    }
  }
}

resource "null_resource" "patch_aws_auth_configmap" {
  triggers = {
    cmd_patch = "kubectl patch configmap/aws-auth -n kube-system --type merge -p '${chomp(jsonencode(local.updated_auth_configmap_data))}' --kubeconfig <(echo $KUBECONFIG | base64 --decode)"
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = self.triggers.cmd_patch

    environment = {
      KUBECONFIG = base64encode(local.kubeconfig)
    }
  }
}

Users and roles follow the syntax from the 17.x version (see the sketch below).
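(local.map_roles and local.map_users are not shown in the snippet above; they are assumed to be lists in the same shape the 17.x inputs used, for example:)

locals {
  map_roles = [
    {
      rolearn  = "arn:aws:iam::111122223333:role/role-jenkins-worker-eks-dev" # example ARN
      username = "jenkins:{{SessionName}}"
      groups   = ["system:masters"]
    },
  ]

  map_users = [
    {
      userarn  = "arn:aws:iam::111122223333:user/some.admin" # example ARN
      username = "some.admin"
      groups   = ["system:masters"]
    },
  ]
}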

@martijnvdp (Contributor) commented Jan 10, 2022

I'm also still looking for a good solution for the aws-auth configmap. I have copied the aws-auth.tf from 17.24 for now, which still works but needs that forked http provider; using it without the http wait will error with "resource already exists".
The only alternative I found is to use kubectl_manifest, which seems to be able to patch as well, but it's a bit hacky,
and I can't use the kubectl null_resource workaround as we are using Terraform Cloud agents without kubectl.
I would rather not add kubectl to the TFC cloud container image.

@marcuz commented Jan 10, 2022

i'm also still looking for a good solution for the aws-auth config map, have copied the aws-auth.tf from 17.24 for now which still works but it needs that forked http provider. using it without the http wait will error with resource already exists, only alternative i found is to use kubectl_manifest which seems to be able to patch as well but its a bit hacky and i cant use the kubectl null resource workaround as we are using terraform cloud agents without kubectl. i would rather not add kubectl to tthe tfc cloud container image

kubernetes_patch would have been useful to solve the aws-auth configmap problem, unfortunately it looks like it won't happen. 😞

@jseiser (Author) commented Jan 10, 2022

@bryantbiggs

When you were testing upgrades pre-merge, how were you handling this situation? I'm going to spin up a test env and try to walk through some of the suggestions above, but I wanted to know about your experience. If I get something working, I have no issue creating a pull request to update the documentation.

@bryantbiggs (Member) commented:

#1680 (comment)

Again, it is a BREAKING change - if there were a clean and straightforward path to upgrade without change or disruption, it would not be a breaking change. This module had grown quite quickly and was carrying a lot of pre-0.12 syntax which was severely holding it back (extensive lists of lists, index lookups, etc.), and the changes added over the years (most notably due to the numerous changes of EKS itself) led to a patchwork. I can't stress enough that this change was EXTENSIVE and I am sorry we cannot provide the copious amount of detail and upgrade steps to make the process smooth and seamless - the module is complex, EKS is complex, and the changes were substantial.

that said, this is how we generally test modules in this org:

  1. Checkout master
  2. terraform init ; terraform apply
  3. Checkout feature/branch
  4. terraform init ; terraform plan
  5. At this point you have to start interpreting the plan and deducing what Terraform is trying to do and the impact of those changes, etc. With the change of adding the variable prefix_separator, that should have simplified a lot of the "breaking" aspects of the Terraform changes from v17.x to v18.x but that was after the PR and I did not test v17.x to current with the addition of that change

@oofnikj commented Jan 10, 2022

@PadillaBraulio this is left up to users to decide what suites them best (Terraform, Helm, some flavor of GitOps, bash scripts, etc.)

I understand the motive to introduce breaking changes in order to refactor this module to rid it of historic cruft, but I too am hesitant to upgrade our existing clusters from the latest 17.x version as we are making use of the managed aws-auth functionality in such a way that we cannot re-implement that functionality with 18.x in a straightforward manner.

IMO it would be greatly appreciated if instead of telling users that it's now up to them to figure out how to re-implement functionality that was removed in the interest of tidying up, explicit examples be provided that satisfy the same set of design constraints satisfied by the previous version, i.e., the ability to provision an accessible cluster exclusively with Terraform. Several users have already noted that it is not feasible to rely on local-exec calling kubectl in their deployment environments, us included.

Perhaps I missed the part of the discussion that led up to the removal of managed aws-auth, but I'm struggling to see the justification for removal of that functionality entirely.

Is it safe to rely on the forked HTTP provider and the pure Terraform implementation used in 17.x if users choose to do so?

@bryantbiggs (Member) commented Jan 10, 2022

@PadillaBraulio this is left up to users to decide what suites them best (Terraform, Helm, some flavor of GitOps, bash scripts, etc.)

I understand the motive to introduce breaking changes in order to refactor this module to rid it of historic cruft, but I too am hesitant to upgrade our existing clusters from the latest 17.x version as we are making use of the managed aws-auth functionality in such a way that we cannot re-implement that functionality with 18.x in a straightforward manner.

IMO it would be greatly appreciated if instead of telling users that it's now up to them to figure out how to re-implement functionality that was removed in the interest of tidying up, explicit examples be provided that satisfy the same set of design constraints satisfied by the previous version, i.e., the ability to provision an accessible cluster exclusively with Terraform. Several users have already noted that it is not feasible to rely on local-exec calling kubectl in their deployment environments, us included.

Perhaps I missed the part of the discussion that led up to the removal of managed aws-auth, but I'm struggling to see the justification for removal of that functionality entirely.

Is it safe to rely on the forked HTTP provider and the pure Terraform implementation used in 17.x if users choose to do so?

There is no need for me to go into the depths of aws-auth issues when we can just look at history https://github.com/terraform-aws-modules/terraform-aws-eks/issues?q=is%3Aissue+sort%3Aupdated-desc+aws-auth+is%3Aclosed

Again, a clear boundary line was created with this change and I understand it's very controversial - this module provisions AWS infrastructure resources via the AWS API (via the Terraform AWS provider), and any internal cluster provisioning and management is left up to users.

As for the forked http provider, I do not know what its fate is. Most likely it will be archived in its current state so users can continue to utilize it - if we're lucky, HashiCorp incorporates the change upstream and the fork can still be archived, with users moving off the fork and onto the official provider.

@oofnikj commented Jan 11, 2022

There is no need for me to go into the depths of aws-auth issues when we can just look at history https://github.com/terraform-aws-modules/terraform-aws-eks/issues?q=is%3Aissue+sort%3Aupdated-desc+aws-auth+is%3Aclosed

Fair enough. This is a complex problem due to the automatic creation of the aws-auth configmap by EKS in certain configurations.

Again, a clear boundary line was created with this change and I understand its very controversial - this module provisions AWS infrastructure resources via the AWS API (via the Terraform AWS provider) and any internal cluster provisioning and management is left up to users

Point taken, but I would venture to suggest that part of provisioning stateful resources such as Kubernetes clusters, EC2 instances, etc. includes ensuring access control is properly configured. I don't expect this module to install a monitoring and logging workload, for example, but I do expect it to provision my resources in such a way that I can connect to them. I respect your decision to remove this functionality but I'm just trying to determine the best course of action to avoid headaches going forward with the upgrade.

I appreciate the work that went into refactoring and the local-exec examples provided. AFAICT there isn't really a good way to handle this case without using some external tooling. We'll have to figure out how to work that into our existing CI infrastructure.

As for the forked http provider, I do not know what its fate is. Most likely what will end up happening is that it gets archived in its current state so users can continue to utilize it - if we're lucky, Hashicorp incorporates the change upstream and the fork can still be archived but users can move off the fork and onto the official provider

Re: your second point, I'm not holding my breath. https://www.hashicorp.com/blog/terraform-community-contributions

@jcam commented Mar 30, 2022

My solution when I had problems like you're running into (this was back on EKS module v13) was to split things up: have one Terraform run build and deploy the cluster itself, and a separate Terraform run in a separate directory build and deploy all the Helm charts and Kubernetes resources onto it.

@pen-pal (Contributor) commented Mar 30, 2022

My solution when I had problems like you're running into (this was back on eks module v13) was to split things up... have one terraform run build and deploy the cluster itself, and a separate terraform run in a separate directory build and deploy all the helm charts and kubernetes resource entries onto it...

@jcam, I am all in support of that, but since it's in production, I am approaching it as: first upgrade the EKS module, then eventually break out the pieces, moving those resources outside and keeping the EKS module independent.
Correct me if there is a much better approach than this.

@jcam commented Mar 30, 2022

I would separate it first, and upgrade second. That way there's no chance the EKS cluster upgrade terraform run could impact all your deployed applications.

I don't use Terraform Cloud, but with my backend I simply did a terraform state pull, made a new folder for all the app components and did a terraform init there, did a terraform state push, then did a terraform state rm for all the app components in the cluster folder and a terraform state rm for all the cluster components in the app folder (see the sketch below).
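A rough sketch of that sequence with a plain remote backend; the resource addresses are only examples, so list your own with terraform state list first:

# In the existing (cluster) folder: download a copy of the current state
terraform state pull > full.tfstate

# In the new app folder: initialize its backend and seed it with the pulled state
cd ../apps
terraform init
terraform state push ../cluster/full.tfstate

# Remove the app components from the cluster state...
cd ../cluster
terraform state rm 'helm_release.ingress_nginx' 'kubernetes_namespace.apps'   # example addresses

# ...and the cluster components from the app state
cd ../apps
terraform state rm 'module.eks'   # example address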

@pen-pal (Contributor) commented Mar 30, 2022

I would separate it first, and upgrade second. That way there's no chance the EKS cluster upgrade terraform run could impact all your deployed applications.

I don't use terraform cloud, but with my backend I simply did a terraform state pull, made the new folder for all the app components and did a terraform init there, did a terraform state push, then did terraform state rm for all the app-components in the cluster folder, and a terraform state rm for all the cluster components in the app folder

Were there no conflicts with the resource names? For example, logging for the ALB load balancer is done in an S3 bucket, and that bucket is also created alongside the EKS cluster and lives inside ingress.tf.

@jcam commented Mar 30, 2022

I just needed to split things so they were in one place or the other and not both. In your case, I would put the logging bucket in the app deploy stage, or I would keep it in the cluster stage and use a data object in the app stage instead of a resource object.

@pen-pal (Contributor) commented Mar 31, 2022

I don't use terraform cloud, but with my backend I simply did a terraform state pull, made the new folder for all the app components and did a terraform init there, did a terraform state push, then did terraform state rm for all the app-components in the cluster folder, and a terraform state rm for all the cluster components in the app folder

Do you have any link or directions I can follow for the steps you mentioned above? I am kind of confused by them, to be honest.

@antonbabenko (Member) commented:

This issue has been resolved in version 18.19.0 🎉

@bpesics commented Jun 9, 2022

BTW, if someone needs to solve this via a PR without direct access to state commands, you need to "fork" the module temporarily in order to be able to use the moved block:

To reduce coupling between separately-packaged modules, Terraform only allows declarations of moves between modules in the same package. In other words, Terraform would not have allowed moving into module.x above if the source address of that call had not been a local path.
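For illustration only, a moved block inside such a temporarily forked module might look like the following; the from/to addresses here are hypothetical and should be taken from your own plan output:

# Inside the forked module's own .tf files (not the root module)
moved {
  from = aws_security_group.workers[0] # hypothetical pre-v18 address
  to   = aws_security_group.node[0]    # hypothetical v18 address
}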

@AndreKR commented Jun 16, 2022

With v18 I am unable to configure a cluster so that pods have network access.

Here's a simple cluster in v17:

provider "aws" {
  region = "eu-central-1"
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  token                  = data.aws_eks_cluster_auth.cluster.token
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
}

data "aws_availability_zones" "available" {}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.2.0"

  name                 = "test-cluster-vpc"
  cidr                 = "10.0.0.0/16"
  azs                  = data.aws_availability_zones.available.names
  private_subnets      = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets       = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "17.24.0"
  cluster_name    = "test-cluster"
  cluster_version = "1.22"
  subnets         = module.vpc.private_subnets

  vpc_id = module.vpc.vpc_id

  node_groups = [
    {
      instance_type = "t2.small"
      capacity_type = "SPOT"
    }
  ]
}

If I apply this, I can then run a pod on it and ping an internet host:

$ kubectl run -it --attach --rm --image=alpine/k8s:1.22.9 andre-temp
If you don't see a command prompt, try pressing enter.
/apps # ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=53 time=2.157 ms

If I adapt the terraform file to v18:

[...]

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "~> 18"
  cluster_name    = "test-cluster"
  cluster_version = "1.22"
  subnet_ids         = module.vpc.private_subnets

  vpc_id = module.vpc.vpc_id

  eks_managed_node_groups = {
    main = {
      instance_type = "t2.small"
      capacity_type = "SPOT"
    }
  }
}

This doesn't work anymore:

$ kubectl run -it --attach --rm --image=alpine/k8s:1.22.9 andre-temp
If you don't see a command prompt, try pressing enter.
/apps # ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
^C
--- 8.8.8.8 ping statistics ---
55 packets transmitted, 0 packets received, 100% packet loss

@AndreKR commented Jun 16, 2022

Ah, there it is. The link in the migration guide is broken.
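For anyone else hitting the same no-egress symptom: v18 creates a much more restrictive node security group by default, and the usual remedy is to open up the traffic you need via node_security_group_additional_rules. A minimal sketch (allow-all node egress plus node-to-node traffic; tighten as appropriate):

module "eks" {
  # ... as above ...

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description = "Node all egress"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "egress"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}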

@AndreKR commented Jun 18, 2022

So, I think I finally worked out what security groups there are for a simple cluster with an eks_managed_node_group in version 18.23.0 of this module. Leaving an overview here for anyone interested:

| Name | Created by | Attached to | Default rules | Opt out | Tagged [1] |
| --- | --- | --- | --- | --- | --- |
| default | AWS EKS | Nothing | Allow all traffic | - | |
| CLUSTER-cluster | Terraform module, main.tf | Nothing | Allow some traffic | create_cluster_security_group | |
| eks-cluster-sg-CLUSTER-NNN | AWS EKS | Nothing | Allow all traffic | - | Yes |
| NODEGROUP-eks-node-group | Terraform module, eks-managed-node-group/main.tf | Nodes | None | eks_managed_node_groups -> <nodegroup> -> create_security_group | |
| CLUSTER-node | Terraform module, node_groups.tf | Nodes | Allow some traffic | create_node_security_group | Yes |

Footnotes

  1. Whether the security group is tagged with kubernetes.io/cluster/<CLUSTER NAME>. This is relevant because the load balancer controller expects exactly one attached security group to have this tag.

@isatfg commented Jun 24, 2022

Hi

I have tried to update from 17.24.0 to 18.x, however Terraform wants to destroy my cluster and recreate a new one. I have added all these variables as mentioned above, but without any success.

  cluster_security_group_description = "EKS cluster security group."
  cluster_security_group_name = var.cluster_name
  prefix_separator = ""
  iam_role_arn = "$IAM_ROLE_ARN"

For testing I did not include workers

17.24.0 Config

module "eks" {
  source           = "terraform-aws-modules/eks/aws"
  version          = "17.24.0"

  cluster_name     = var.cluster_name
  cluster_version  = var.cluster_version
  vpc_id           = var.vpc_id
  subnets          = var.subnet
  tags = {
    GithubRepo  = "terraform-aws-eks"
    GithubOrg   = "terraform-aws-modules"
    environment = var.environment
  }


}
data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

18.24.1 config

module "eks" {
  source           = "terraform-aws-modules/eks/aws"
  version          = "18.24.1"

  cluster_name     = var.cluster_name
  cluster_version  = var.cluster_version
  vpc_id           = var.vpc_id
  subnet_ids          = var.subnet
  cluster_security_group_description = "EKS cluster security group."
  cluster_security_group_name = var.cluster_name
  prefix_separator = ""
  iam_role_arn = "$IAM_ROLE_ARN"
  tags = {
    GithubRepo  = "terraform-aws-eks"
    GithubOrg   = "terraform-aws-modules"
    environment = var.environment
  }


}
data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

This forces the replacement

        name                      = "sandbox"
      ~ platform_version          = "eks.7" -> (known after apply)
      ~ role_arn                  = "$IAM_ROLE_ARN" -> (known after apply) # forces replacement
      ~ status                    = "ACTIVE" -> (known after apply)

@bryantbiggs (Member) commented:

@isatfg see https://github.com/clowdhaus/eks-v17-v18-migrate#control-plane-changes

@isatfg commented Jun 27, 2022

Thank you @bryantbiggs that worked for me.

@dusansusic commented:

Thanks @ArchiFleKs for the steps! One additional step I had to define, because I used:

manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/OLD_NODE_GROUP_IAM_ROLE_NAME"
      username = "system:node:{{EC2PrivateDNSName}}"
      groups   = ["system:bootstrappers", "system:nodes"]
    },
  ]

If this is not defined, the nodes, and all deployments on them, will become unreachable.
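Worth noting for anyone copying this: manage_aws_auth_configmap only works when a kubernetes provider is configured alongside the module, since the module writes the ConfigMap through that provider. A minimal sketch using a token data source (exec-based auth, as shown earlier in the thread, works too):

data "aws_eks_cluster_auth" "this" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
  token                  = data.aws_eks_cluster_auth.this.token
}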

@qlikcoe commented Sep 27, 2022

If this is not defined, nodes will become unreachable and all deployments on it.

This is very important! I did this for one environment and it worked well; I was able to gradually drain and terminate the old nodes. I forgot this step for another environment, and right after terraform apply all the old nodes were lost instantly. Major downtime 😱

@junaid-ali (Contributor) commented:

@qlikcoe @dusansusic it was mentioned later by a couple of others as well; GitHub has collapsed the majority of that discussion. For example, this was my experience: #1744 (comment)

@github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked this issue as resolved and limited conversation to collaborators on Nov 8, 2022