
Release 18+ Upgrade Guide Breaks Existing Deployments #1744

Closed
jseiser opened this issue Jan 6, 2022 · 131 comments · Fixed by #1981

@jseiser commented Jan 6, 2022

Description

Attempted to follow the upgrade guide to get to 18+. Our Terraform deployments generally run from a Jenkins worker pod that exists on the same cluster we are upgrading. The pod has a service account attached, using the IRSA setup, which grants it access to the cluster.
This all worked before the upgrade.

Reproduction

Attempt to follow the upgrade guide for 18.

Code Snippet to Reproduce

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.0.1"

  cluster_name    = format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env)
  cluster_version = var.cluster_version

  subnet_ids = data.aws_subnet_ids.private.ids
  vpc_id     = data.terraform_remote_state.vpc.outputs.vpc_id

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false

  cluster_security_group_additional_rules = {
    admin_access = {
      description = "Admin ingress to Kubernetes API"
      cidr_blocks = [data.terraform_remote_state.vpc.outputs.vpc_cidr_block]
      protocol    = "tcp"
      from_port   = 443
      to_port     = 443
      type        = "ingress"
    }
  }

  eks_managed_node_group_defaults = {
    ami_type                   = "AL2_x86_64"
    disk_size                  = var.node_group_default_disk_size
    enable_bootstrap_user_data = true
    pre_bootstrap_user_data    = templatefile("${path.module}/templates/userdata.tpl", {})
    desired_size               = lower(var.platform_env) == "prod" ? 3 : 2
    max_size                   = lower(var.platform_env) == "prod" ? 6 : 3
    min_size                   = lower(var.platform_env) == "prod" ? 3 : 1
    instance_types             = lower(var.platform_env) == "prod" ? var.prod_instance_types : var.dev_instance_types
    capacity_type              = "ON_DEMAND"
    additional_tags = {
      Name = format("%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env)
    }
    update_config = {
      max_unavailable_percentage = 50
    }
    update_launch_template_default_version = true
    create_launch_template                 = true
    create_iam_role                        = true
    iam_role_name                          = format("iam-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env)
    iam_role_use_name_prefix               = false
    iam_role_description                   = "EKS managed node group Role"
    iam_role_tags                          = local.tags
    iam_role_additional_policies = [
      "arn:aws-us-gov:iam::aws:policy/AmazonSSMManagedInstanceCore"
    ]
  }
  eks_managed_node_groups = {
    private1 = {
      subnet_ids = [tolist(data.aws_subnet_ids.private.ids)[0]]
    }
    private2 = {
      subnet_ids = [tolist(data.aws_subnet_ids.private.ids)[1]]
    }
    private3 = {
      subnet_ids = [tolist(data.aws_subnet_ids.private.ids)[2]]
    }
  }

  cluster_enabled_log_types              = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cloudwatch_log_group_retention_in_days = 7

  enable_irsa = true

  cluster_encryption_config = [
    {
      provider_key_arn = aws_kms_key.eks.arn
      resources        = ["secrets"]
    }
  ]

  tags = merge(
    local.tags,
    {
      "Name"        = format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env),
      "EKS_VERSION" = var.cluster_version
    }
  )

}

resource "null_resource" "patch" {
  triggers = {
    kubeconfig = base64encode(local.kubeconfig)
    cmd_patch  = "kubectl patch configmap/aws-auth --patch \"${local.aws_auth_configmap_yaml}\" -n kube-system --kubeconfig <(echo $KUBECONFIG | base64 --decode)"
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    environment = {
      KUBECONFIG = self.triggers.kubeconfig
    }
    command = self.triggers.cmd_patch
  }
}

locals {

  kubeconfig = yamlencode({
    apiVersion      = "v1"
    kind            = "Config"
    current-context = "terraform"
    clusters = [{
      name = module.eks.cluster_id
      cluster = {
        certificate-authority-data = module.eks.cluster_certificate_authority_data
        server                     = module.eks.cluster_endpoint
      }
    }]
    contexts = [{
      name = "terraform"
      context = {
        cluster = module.eks.cluster_id
        user    = "terraform"
      }
    }]
    users = [{
      name = "terraform"
      user = {
        token = data.aws_eks_cluster_auth.eks.token
      }
    }]
  })

  aws_auth_configmap_yaml = <<-EOT
  ${chomp(module.eks.aws_auth_configmap_yaml)}
      - rolearn: arn:${var.iam_partition}:iam::${data.aws_caller_identity.current.account_id}:role/role-gitlab-runner-eks-${var.platform_env}
        username: gitlab:{{SessionName}}
        groups:
          - system:masters
      - rolearn: arn:${var.iam_partition}:iam::${data.aws_caller_identity.current.account_id}:role/role-jenkins-worker-eks-${var.platform_env}
        username: jenkins:{{SessionName}}
        groups:
          - system:masters
      - rolearn: arn:${var.iam_partition}:iam::${data.aws_caller_identity.current.account_id}:role/AWSReservedSSO_AdministratorAccess_f50fcd43baf05a89
        username: AWSAdministratorAccess:{{SessionName}}
        groups:
          - system:masters
  EOT
}

Expected behavior

The module runs to completion.

Actual behavior

Current aws-auth

sh-4.2$ kubectl get configmap aws-auth -n kube-system -o yaml
apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      "rolearn": "arn:aws-us-gov:iam:::role/eks-ops-eks-dev20211104211936784200000009"
      "username": "system:node:{{EC2PrivateDNSName}}"
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws-us-gov:iam:::role/role-gitlab-runner-eks-dev"
      "username": "gitlab-runner-dev"
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws-us-gov:iam:::role/role-jenkins-worker-eks-dev"
      "username": "jenkins-dev"
  mapUsers: |
    - "groups":
      - "system:masters"
      "userarn": "arn:aws-us-gov:iam:::user/justin.seiser"
      "username": "jseiser"
kind: ConfigMap

The SA on the pod that Terraform is running from:

sh-4.2$ kubectl get sa jenkins-worker -n jenkins -o yaml
apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws-us-gov:iam:::role/role-jenkins-worker-eks-dev

The error Terraform returns:

module.eks.kubernetes_config_map.aws_auth[0]: Refreshing state... [id=kube-system/aws-auth]

Error: configmaps "aws-auth" is forbidden: User "system:serviceaccount:jenkins:jenkins-worker" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

Additional context

I do not doubt that I'm missing something, but that something does not appear to be covered in any documentation I can find.

@jseiser changed the title from "Release 18+ Breaks Existing Deployments" to "Release 18+ Upgrade Guide Breaks Existing Deployments" on Jan 6, 2022
@jseiser (Author) commented Jan 6, 2022

It doesn't work if I run from a bastion host, which is also granted access to the cluster.

Error

╷
│ Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp 127.0.0.1:80: connect: connection refused
│
│
╵

Same server showing access

sh-4.2$ kubectl get configmap aws-auth -n kube-system
NAME       DATA   AGE
aws-auth   3      62d
sh-4.2$ rm ~/.kube/config
sh-4.2$ terraform plan
module.eks.aws_iam_openid_connect_provider.oidc_provider[0]: Refreshing state... [id=arn:aws-us-gov:iam:::oidc-provider/oidc.eks.us-gov-west-1.amazonaws.com/id/]
╷
│ Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp 127.0.0.1:80: connect: connection refused
│
sh-4.2$ kubectl get pods -n jenkins
The connection to the server localhost:8080 was refused - did you specify the right host or port?
sh-4.2$ aws eks update-kubeconfig --name eks-ops-eks-dev
Added new context arn:aws-us-gov:eks:us-gov-west-1::cluster/eks-ops-eks-dev to /home/ssm-user/.kube/config
sh-4.2$ kubectl get pods -n jenkins
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   2/2     Running   0          23d

So it doesn't matter whether the server is configured to access the cluster or not; it fails the same way.

@jseiser (Author) commented Jan 6, 2022

I ran it the first few times with the Terraform Kubernetes provider configured as it was when it was working, and have re-run it with that provider completely removed.

I've also run it with and without the null_resource patch.

@jseiser (Author) commented Jan 7, 2022

Are there any requirements to upgrade an existing cluster from 17 to 18?

I'm not getting 'variable not expected here' warnings, so I think I have everything renamed, but I'm not making any progress on the aws-auth configmap. Does it need to be removed from the state?

@jseiser (Author) commented Jan 7, 2022

Also, not sure if this matters, but in the current working deployment we have

  kubeconfig_aws_authenticator_command = "aws"
  kubeconfig_aws_authenticator_command_args = [
    "eks",
    "get-token",
    "--cluster-name",
    format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env),
    "--region",
    data.aws_region.current.name
  ]
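(For context: in v18 the module no longer renders a kubeconfig, so settings like these typically move onto the kubernetes provider itself, if one is used. A minimal sketch, assuming a recent AWS CLI on the runner and reusing the names from the config above:)

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  # Equivalent of the 17.x kubeconfig_aws_authenticator_* settings
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args = [
      "eks",
      "get-token",
      "--cluster-name",
      format("eks-%s-%s-%s", var.layer, var.vpc_id_tag, var.platform_env),
      "--region",
      data.aws_region.current.name,
    ]
  }
}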

@sparqueur commented:

Same Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth" error on my side. You are not alone man ;-)

@bryantbiggs (Member) commented:

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?

By default, this port is not open on the security groups created by this module - https://github.com/terraform-aws-modules/terraform-aws-eks#security-groups

@sparqueur commented:

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?

By default, this port is not open on the security groups created by this module - https://github.com/terraform-aws-modules/terraform-aws-eks#security-groups

It seems more like a wrong-URL error than a security group issue, no? I would expect this call to be made from my laptop, not from a remote server. But I may be wrong.

@bryantbiggs (Member) commented:

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?
By default, this port is not open on the security groups created by this module - terraform-aws-modules/terraform-aws-eks#security-groups

It's more like if it's a wrong url error more than a security group issue, no ? I expect this call being done from my laptop, not from a remote server. But I may be wrong.

The error message I was referring to was the one provided by @jseiser further up.

The error message you have provided does not give enough detail. The module does not construct any URLs, so I would suspect it's also a security group access issue you are facing, but that's just a hunch based on what's provided.

@GeoBSI commented Jan 7, 2022

Hello, I'm having the exact same issue when upgrading from v17 to v18. It happens during the state refresh of the terraform plan command.

The problem disappears if I manually remove the module.eks-cluster.local_file.kubeconfig[0] and module.eks-cluster.kubernetes_config_map.aws_auth[0] from the v17 state file, but I'm not exactly sure if there are any consequences in doing this.

@jseiser (Author) commented Jan 7, 2022

I believe your error message states the issue The connection to the server localhost:8080 was refused - did you specify the right host or port?

By default, this port is not open on the security groups created by this module - https://github.com/terraform-aws-modules/terraform-aws-eks#security-groups

@bryantbiggs

I guess I don't follow. This is a cluster/TF deployment created using the <18.0 module version. We are trying to upgrade this environment to the latest module version. I'm not able to run a plan because of the error. The security groups have not changed.

@bryantbiggs (Member) commented:

Hello, I'm having the exact same issue when upgrading from v17 to v18. It happens during the state refresh of the terraform plan command.

The problem disappears if I manually remove the module.eks-cluster.local_file.kubeconfig[0] and module.eks-cluster.kubernetes_config_map.aws_auth[0] from the v17 state file, but I'm not exactly sure if there are any consequences in doing this.

I believe that would be the appropriate change (remove those from your state) - v18.x removes native support for kubeconfig and aws-auth configmap https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/UPGRADE-18.0.md#list-of-backwards-incompatible-changes
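For reference, the state surgery described above is roughly the following; the exact resource addresses depend on your module label, so confirm them with terraform state list first:

terraform state list | grep -E 'kubeconfig|aws_auth'
terraform state rm 'module.eks-cluster.local_file.kubeconfig[0]'
terraform state rm 'module.eks-cluster.kubernetes_config_map.aws_auth[0]'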

@PadillaBraulio commented:

How should we handle ConfigMap roles now?
I had a role we used to connect to the cluster, but since v18.x drops support for the aws-auth configmap, how can we address that?

For example, I had something like this:

  mapRoles: |
.
.
.
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws:iam::000000000000:role/assumedRole
      "username": "administrator"
and that role was set in the map_roles variable, something like this:

module "eks" {
  source                      = "terraform-aws-modules/eks/aws"
  version                     = "17.24.0"

  map_roles = [
    {
      rolearn = aws_iam_role.eks_administrator.arn
      username = "administrator"
      groups = [ "system:masters" ]
    }
  ]

}

Moving forward, what would be the best way to add these types of roles to aws-auth?

@bryantbiggs (Member) commented:

@PadillaBraulio this is left up to users to decide what suits them best (Terraform, Helm, some flavor of GitOps, bash scripts, etc.)

@martijnvdp (Contributor) commented Jan 7, 2022

These errors also occur on certain changes when your provider config depends on output of the EKS module.
I got the same when I used coalesce(module.eks.cluster_id, "cluster1") for the name filter in the data block for the provider;
after changing that to the static name, the error is gone.

example:

data "aws_eks_cluster" "cluster1" {
  // name= coalesce(module.cluster1.cluster_id, "cluster1")  will generate error on certain changes
  name  = "cluster1"// works
}

data "aws_eks_cluster_auth" "cluster1" {
  name  = "cluster1"
}

provider "kubernetes" {
  host                   = element(concat(data.aws_eks_cluster.cluster1.*.endpoint, [""]), 0)
  cluster_ca_certificate = base64decode(element(concat(data.aws_eks_cluster.cluster1.*.certificate_authority.0.data, [""]), 0))
  token                  = element(concat(data.aws_eks_cluster_auth.cluster1.*.token, [""]), 0)
}

@johngmyers commented:

Since I only had a test cluster, I went ahead and approved the plan to replace the cluster.

This is probably a provider issue, but the execution failed because it tried to create a new cluster before tearing down the old one. But it couldn't create the new cluster because an existing cluster, the old one, had the same name.

@jseiser (Author) commented Jan 8, 2022

These errors also occur on certain changes when your the provider config depends on output of the eks module i got the same when i used coalesce(module.eks.cluster_id, "cluster1") for name name filter in the datablock for the provider after changing that to the static name the error is gone

example:

data "aws_eks_cluster" "cluster1" {
  // name= coalesce(module.cluster1.cluster_id, "cluster1")  will generate error on certain changes
  name  = "cluster1"// works
}

data "aws_eks_cluster_auth" "cluster1" {
  name  = "cluster1"
}

provider "kubernetes" {
  host                   = element(concat(data.aws_eks_cluster.cluster1.*.endpoint, [""]), 0)
  cluster_ca_certificate = base64decode(element(concat(data.aws_eks_cluster.cluster1.*.certificate_authority.0.data, [""]), 0))
  token                  = element(concat(data.aws_eks_cluster_auth.cluster1.*.token, [""]), 0)
}

I get the error even when removing the k8s provider.

They also have the data sources in their example:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/complete/main.tf#L235

@martijnvdp (Contributor) commented Jan 8, 2022

I get the error even when removing the k8s provider

they also have the data in there example..

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/complete/main.tf#L235

Yes, I normally have it like in the example, but if I encounter these kinds of errors (dial tcp 127.0.0.1:80: connect: connection refused) I change it to the static cluster name, only to be able to run the plan.

I think it is caused by the cluster IAM role ARN or the cluster security group changing, which triggers the replacement of the cluster.

In my case: because I'm upgrading an existing cluster, the cluster IAM role and security group attached to the existing cluster must not be changed, otherwise it triggers a cluster replacement, and then I can only run a plan with the static cluster name.

If I adjust the config of the 18.0.4 module so it keeps using the same cluster IAM role and cluster security group, the cluster isn't affected and I can run a plan normally. I had to set the security group description and name to match the current config (see the sketch below):

cluster_security_group_description = "EKS cluster security group." (pre-v18 the default description ended with ".")
cluster_security_group_name = var.cluster_name
prefix_separator = ""

and the cluster IAM role ARN:

iam_role_arn = "current cluster iam role arn"

@joncolby commented Jan 8, 2022

Have you checked the security groups, as already suggested? You mentioned the Jenkins pod runs in the same cluster, or at least inside the same VPC, as I understood it. If you have the private API server endpoint enabled, it could be that the Jenkins pod will try to connect to the private endpoint and require a security group rule that is not provided by the EKS module by default. I actually ran into this same problem.

@jseiser (Author) commented Jan 9, 2022

have you checked the security groups, as already suggested? You mentioned the jenkins pod runs in the same cluster or at least inside the same vpc as I understood. If you have the private api-server endpoint enabled, it could be the jenkins pod will try to connect to the private endpoint, and require a security group rule that is not provided by the eks module by default. i actually ran into this same problem.

It's definitely not that. Nothing has changed yet, since this is a working 17.x deployment; it only fails when trying to run the plan with the 18.x module.

I also showed above that it errors out on an external server as well.

Thanks.

@jseiser (Author) commented Jan 9, 2022

if i adjust the config of the 18.0.4 module so it keeps using the same iam cluster role and cluster security group

@martijnvdp

I'll have to give this a try on Monday.

@sepehrmavedati commented:

Hoping this saves time for others dealing with the aws-auth configmap management change. This worked for us, modifying the example code from here.

locals {
  kubeconfig = ...

  current_auth_configmap = yamldecode(module.eks.aws_auth_configmap_yaml)

  updated_auth_configmap_data = {
    data = {
      mapRoles = yamlencode(
        distinct(concat(
          yamldecode(local.current_auth_configmap.data.mapRoles),
          local.map_roles,
        ))
      )
      mapUsers = yamlencode(local.map_users)
    }
  }
}

resource "null_resource" "patch_aws_auth_configmap" {
  triggers = {
    cmd_patch = "kubectl patch configmap/aws-auth -n kube-system --type merge -p '${chomp(jsonencode(local.updated_auth_configmap_data))}' --kubeconfig <(echo $KUBECONFIG | base64 --decode)"
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = self.triggers.cmd_patch

    environment = {
      KUBECONFIG = base64encode(local.kubeconfig)
    }
  }
}

Users and roles follow the syntax from the 17.x version (see the sketch below).
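(local.map_roles and local.map_users are not shown in the snippet above; they are assumed to be lists in the same shape the 17.x inputs used, for example:)

locals {
  map_roles = [
    {
      rolearn  = "arn:aws:iam::111122223333:role/role-jenkins-worker-eks-dev" # example ARN
      username = "jenkins:{{SessionName}}"
      groups   = ["system:masters"]
    },
  ]

  map_users = [
    {
      userarn  = "arn:aws:iam::111122223333:user/some.admin" # example ARN
      username = "some.admin"
      groups   = ["system:masters"]
    },
  ]
}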

@martijnvdp (Contributor) commented Jan 10, 2022

I'm also still looking for a good solution for the aws-auth configmap. I have copied the aws-auth.tf from 17.24 for now, which still works but needs that forked http provider; using it without the http wait will error with "resource already exists".
The only alternative I found is to use kubectl_manifest, which seems to be able to patch as well, but it's a bit hacky,
and I can't use the kubectl null_resource workaround as we are using Terraform Cloud agents without kubectl.
I would rather not add kubectl to the TFC cloud container image.

@marcuz commented Jan 10, 2022

i'm also still looking for a good solution for the aws-auth config map, have copied the aws-auth.tf from 17.24 for now which still works but it needs that forked http provider. using it without the http wait will error with resource already exists, only alternative i found is to use kubectl_manifest which seems to be able to patch as well but its a bit hacky and i cant use the kubectl null resource workaround as we are using terraform cloud agents without kubectl. i would rather not add kubectl to tthe tfc cloud container image

kubernetes_patch would have been useful to solve the aws-auth configmap problem, unfortunately it looks like it won't happen. 😞

@jseiser (Author) commented Jan 10, 2022

@bryantbiggs

When you were testing upgrades pre-merge, how were you handling this situation? I'm going to spin up a test env and try to walk through some of the suggestions above, but I wanted to know about your experience. If I get something working, I have no issue creating a pull request to update the documentation.

@bryantbiggs (Member) commented:

#1680 (comment)

Again, it is a BREAKING change - if there were a clean and straightforward path to upgrade without change or disruption, it would not be a breaking change. This module had grown quite quickly and was carrying a lot of pre-0.12 syntax which was severely holding it back (extensive lists of lists, index lookups, etc.), and the changes added over the years (most notably due to the numerous changes of EKS itself) led to a patchwork. I can't stress enough that this change was EXTENSIVE and I am sorry we cannot provide the copious amount of detail and upgrade steps to make the process smooth and seamless - the module is complex, EKS is complex, and the changes were substantial.

that said, this is how we generally test modules in this org:

  1. Checkout master
  2. terraform init ; terraform apply
  3. Checkout feature/branch
  4. terraform init ; terraform plan
  5. At this point you have to start interpreting the plan and deducing what Terraform is trying to do and the impact of those changes, etc. With the change of adding the variable prefix_separator, that should have simplified a lot of the "breaking" aspects of the Terraform changes from v17.x to v18.x but that was after the PR and I did not test v17.x to current with the addition of that change

@oofnikj commented Jan 10, 2022

@PadillaBraulio this is left up to users to decide what suites them best (Terraform, Helm, some flavor of GitOps, bash scripts, etc.)

I understand the motive to introduce breaking changes in order to refactor this module to rid it of historic cruft, but I too am hesitant to upgrade our existing clusters from the latest 17.x version as we are making use of the managed aws-auth functionality in such a way that we cannot re-implement that functionality with 18.x in a straightforward manner.

IMO it would be greatly appreciated if instead of telling users that it's now up to them to figure out how to re-implement functionality that was removed in the interest of tidying up, explicit examples be provided that satisfy the same set of design constraints satisfied by the previous version, i.e., the ability to provision an accessible cluster exclusively with Terraform. Several users have already noted that it is not feasible to rely on local-exec calling kubectl in their deployment environments, us included.

Perhaps I missed the part of the discussion that led up to the removal of managed aws-auth, but I'm struggling to see the justification for removal of that functionality entirely.

Is it safe to rely on the forked HTTP provider and the pure Terraform implementation used in 17.x if users choose to do so?

@bryantbiggs (Member) commented Jan 10, 2022

@PadillaBraulio this is left up to users to decide what suites them best (Terraform, Helm, some flavor of GitOps, bash scripts, etc.)

I understand the motive to introduce breaking changes in order to refactor this module to rid it of historic cruft, but I too am hesitant to upgrade our existing clusters from the latest 17.x version as we are making use of the managed aws-auth functionality in such a way that we cannot re-implement that functionality with 18.x in a straightforward manner.

IMO it would be greatly appreciated if instead of telling users that it's now up to them to figure out how to re-implement functionality that was removed in the interest of tidying up, explicit examples be provided that satisfy the same set of design constraints satisfied by the previous version, i.e., the ability to provision an accessible cluster exclusively with Terraform. Several users have already noted that it is not feasible to rely on local-exec calling kubectl in their deployment environments, us included.

Perhaps I missed the part of the discussion that led up to the removal of managed aws-auth, but I'm struggling to see the justification for removal of that functionality entirely.

Is it safe to rely on the forked HTTP provider and the pure Terraform implementation used in 17.x if users choose to do so?

There is no need for me to go into the depths of aws-auth issues when we can just look at history https://github.com/terraform-aws-modules/terraform-aws-eks/issues?q=is%3Aissue+sort%3Aupdated-desc+aws-auth+is%3Aclosed

Again, a clear boundary line was created with this change and I understand it's very controversial - this module provisions AWS infrastructure resources via the AWS API (via the Terraform AWS provider), and any internal cluster provisioning and management is left up to users.

As for the forked http provider, I do not know what its fate is. Most likely it will be archived in its current state so users can continue to utilize it - if we're lucky, HashiCorp incorporates the change upstream and the fork can still be archived, with users moving off the fork and onto the official provider.

@oofnikj commented Jan 11, 2022

There is no need for me to go into the depths of aws-auth issues when we can just look at history https://github.com/terraform-aws-modules/terraform-aws-eks/issues?q=is%3Aissue+sort%3Aupdated-desc+aws-auth+is%3Aclosed

Fair enough. This is a complex problem due to the automatic creation of the aws-auth configmap by EKS in certain configurations.

Again, a clear boundary line was created with this change and I understand its very controversial - this module provisions AWS infrastructure resources via the AWS API (via the Terraform AWS provider) and any internal cluster provisioning and management is left up to users

Point taken, but I would venture to suggest that part of provisioning stateful resources such as Kubernetes clusters, EC2 instances, etc. includes ensuring access control is properly configured. I don't expect this module to install a monitoring and logging workload, for example, but I do expect it to provision my resources in such a way that I can connect to them. I respect your decision to remove this functionality but I'm just trying to determine the best course of action to avoid headaches going forward with the upgrade.

I appreciate the work that went into refactoring and the local-exec examples provided. AFAICT there isn't really a good way to handle this case without using some external tooling. We'll have to figure out how to work that into our existing CI infrastructure.

As for the forked http provider, I do not know what its fate is. Most likely what will end up happening is that it gets archived in its current state so users can continue to utilize it - if we're lucky, Hashicorp incorporates the change upstream and the fork can still be archived but users can move off the fork and onto the official provider

Re: your second point, I'm not holding my breath. https://www.hashicorp.com/blog/terraform-community-contributions

@jcam commented Mar 30, 2022

My solution when I had problems like you're running into (this was back on EKS module v13) was to split things up: have one Terraform run build and deploy the cluster itself, and a separate Terraform run in a separate directory build and deploy all the Helm charts and Kubernetes resources onto it.

@pen-pal (Contributor) commented Mar 30, 2022

My solution when I had problems like you're running into (this was back on eks module v13) was to split things up... have one terraform run build and deploy the cluster itself, and a separate terraform run in a separate directory build and deploy all the helm charts and kubernetes resource entries onto it...

@jcam, I am all in support of that, but since it's in production, I am approaching it as: first upgrade the EKS module, then eventually break out the pieces, moving those resources outside and keeping the EKS module independent.
Correct me if there is a much better approach than this.

@jcam commented Mar 30, 2022

I would separate it first, and upgrade second. That way there's no chance the EKS cluster upgrade terraform run could impact all your deployed applications.

I don't use Terraform Cloud, but with my backend I simply did a terraform state pull, made a new folder for all the app components and did a terraform init there, did a terraform state push, then did a terraform state rm for all the app components in the cluster folder and a terraform state rm for all the cluster components in the app folder (see the sketch below).
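A rough sketch of that sequence with a plain remote backend; the resource addresses are only examples, so list your own with terraform state list first:

# In the existing (cluster) folder: download a copy of the current state
terraform state pull > full.tfstate

# In the new app folder: initialize its backend and seed it with the pulled state
cd ../apps
terraform init
terraform state push ../cluster/full.tfstate

# Remove the app components from the cluster state...
cd ../cluster
terraform state rm 'helm_release.ingress_nginx' 'kubernetes_namespace.apps'   # example addresses

# ...and the cluster components from the app state
cd ../apps
terraform state rm 'module.eks'   # example address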

@pen-pal (Contributor) commented Mar 30, 2022

I would separate it first, and upgrade second. That way there's no chance the EKS cluster upgrade terraform run could impact all your deployed applications.

I don't use terraform cloud, but with my backend I simply did a terraform state pull, made the new folder for all the app components and did a terraform init there, did a terraform state push, then did terraform state rm for all the app-components in the cluster folder, and a terraform state rm for all the cluster components in the app folder

Were there no conflicts with the resource names? For example, logging for the ALB load balancer is done in an S3 bucket, and that bucket is also created alongside the EKS cluster and lives inside ingress.tf.

@jcam commented Mar 30, 2022

I just needed to split things so they were in one place or the other and not both. In your case, I would put the logging bucket in the app deploy stage, or I would keep it in the cluster stage and use a data object in the app stage instead of a resource object.

@pen-pal (Contributor) commented Mar 31, 2022

I don't use terraform cloud, but with my backend I simply did a terraform state pull, made the new folder for all the app components and did a terraform init there, did a terraform state push, then did terraform state rm for all the app-components in the cluster folder, and a terraform state rm for all the cluster components in the app folder

Do you have any link or directions I can follow for the steps you mentioned above? I am kind of confused by them, to be honest.

@antonbabenko (Member) commented:

This issue has been resolved in version 18.19.0 🎉

@bpesics commented Jun 9, 2022

BTW, if someone needs to solve this via a PR without direct access to state commands, you need to "fork" the module temporarily in order to be able to use the moved block:

To reduce coupling between separately-packaged modules, Terraform only allows declarations of moves between modules in the same package. In other words, Terraform would not have allowed moving into module.x above if the source address of that call had not been a local path.
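For illustration only, a moved block inside such a temporarily forked module might look like the following; the from/to addresses here are hypothetical and should be taken from your own plan output:

# Inside the forked module's own .tf files (not the root module)
moved {
  from = aws_security_group.workers[0] # hypothetical pre-v18 address
  to   = aws_security_group.node[0]    # hypothetical v18 address
}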

@AndreKR commented Jun 16, 2022

With v18 I am unable to configure a cluster so that pods have network access.

Here's a simple cluster in v17:

provider "aws" {
  region = "eu-central-1"
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  token                  = data.aws_eks_cluster_auth.cluster.token
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
}

data "aws_availability_zones" "available" {}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.2.0"

  name                 = "test-cluster-vpc"
  cidr                 = "10.0.0.0/16"
  azs                  = data.aws_availability_zones.available.names
  private_subnets      = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets       = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "17.24.0"
  cluster_name    = "test-cluster"
  cluster_version = "1.22"
  subnets         = module.vpc.private_subnets

  vpc_id = module.vpc.vpc_id

  node_groups = [
    {
      instance_type = "t2.small"
      capacity_type = "SPOT"
    }
  ]
}

If I apply this, I can then run a pod on it and ping an internet host:

$ kubectl run -it --attach --rm --image=alpine/k8s:1.22.9 andre-temp
If you don't see a command prompt, try pressing enter.
/apps # ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=53 time=2.157 ms

If I adapt the terraform file to v18:

[...]

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "~> 18"
  cluster_name    = "test-cluster"
  cluster_version = "1.22"
  subnet_ids         = module.vpc.private_subnets

  vpc_id = module.vpc.vpc_id

  eks_managed_node_groups = {
    main = {
      instance_type = "t2.small"
      capacity_type = "SPOT"
    }
  }
}

This doesn't work anymore:

$ kubectl run -it --attach --rm --image=alpine/k8s:1.22.9 andre-temp
If you don't see a command prompt, try pressing enter.
/apps # ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
^C
--- 8.8.8.8 ping statistics ---
55 packets transmitted, 0 packets received, 100% packet loss

@AndreKR commented Jun 16, 2022

Ah, there it is. The link in the migration guide is broken.
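For anyone else hitting the same no-egress symptom: v18 creates a much more restrictive node security group by default, and the usual remedy is to open up the traffic you need via node_security_group_additional_rules. A minimal sketch (allow-all node egress plus node-to-node traffic; tighten as appropriate):

module "eks" {
  # ... as above ...

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description = "Node all egress"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "egress"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}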

@AndreKR commented Jun 18, 2022

So, I think I finally worked out what security groups there are for a simple cluster with an eks_managed_node_group in version 18.23.0 of this module. Leaving an overview here for anyone interested:

| Name | Created by | Attached to | Default rules | Opt out | Tagged [1] |
| --- | --- | --- | --- | --- | --- |
| default | AWS EKS | Nothing | Allow all traffic | - | |
| CLUSTER-cluster | Terraform module, main.tf | Nothing | Allow some traffic | create_cluster_security_group | |
| eks-cluster-sg-CLUSTER-NNN | AWS EKS | Nothing | Allow all traffic | - | Yes |
| NODEGROUP-eks-node-group | Terraform module, eks-managed-node-group/main.tf | Nodes | None | eks_managed_node_groups -> <nodegroup> -> create_security_group | |
| CLUSTER-node | Terraform module, node_groups.tf | Nodes | Allow some traffic | create_node_security_group | Yes |

Footnotes

  1. Whether the security group is tagged with kubernetes.io/cluster/<CLUSTER NAME>. This is relevant because the load balancer controller expects exactly one attached security group to have this tag.

@isatfg commented Jun 24, 2022

Hi

I have tried to update from 17.24.0 to 18.x, however Terraform wants to destroy my cluster and recreate a new one. I have added all these variables as mentioned above, but without any success.

  cluster_security_group_description = "EKS cluster security group."
  cluster_security_group_name = var.cluster_name
  prefix_separator = ""
  iam_role_arn = "$IAM_ROLE_ARN"

For testing I did not include workers

17.24.0 Config

module "eks" {
  source           = "terraform-aws-modules/eks/aws"
  version          = "17.24.0"

  cluster_name     = var.cluster_name
  cluster_version  = var.cluster_version
  vpc_id           = var.vpc_id
  subnets          = var.subnet
  tags = {
    GithubRepo  = "terraform-aws-eks"
    GithubOrg   = "terraform-aws-modules"
    environment = var.environment
  }


}
data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

18.24.1 config

module "eks" {
  source           = "terraform-aws-modules/eks/aws"
  version          = "18.24.1"

  cluster_name     = var.cluster_name
  cluster_version  = var.cluster_version
  vpc_id           = var.vpc_id
  subnet_ids          = var.subnet
  cluster_security_group_description = "EKS cluster security group."
  cluster_security_group_name = var.cluster_name
  prefix_separator = ""
  iam_role_arn = "$IAM_ROLE_ARN"
  tags = {
    GithubRepo  = "terraform-aws-eks"
    GithubOrg   = "terraform-aws-modules"
    environment = var.environment
  }


}
data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

This forces the replacement

        name                      = "sandbox"
      ~ platform_version          = "eks.7" -> (known after apply)
      ~ role_arn                  = "$IAM_ROLE_ARN" -> (known after apply) # forces replacement
      ~ status                    = "ACTIVE" -> (known after apply)

@bryantbiggs (Member) commented:

@isatfg see https://github.com/clowdhaus/eks-v17-v18-migrate#control-plane-changes

@isatfg commented Jun 27, 2022

Thank you @bryantbiggs that worked for me.

@dusansusic commented:

Thanks @ArchiFleKs for the steps! One additional step I had to define, because I used:

manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/OLD_NODE_GROUP_IAM_ROLE_NAME"
      username = "system:node:{{EC2PrivateDNSName}}"
      groups   = ["system:bootstrappers", "system:nodes"]
    },
  ]

If this is not defined, the nodes, and all deployments on them, will become unreachable.
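Worth noting for anyone copying this: manage_aws_auth_configmap only works when a kubernetes provider is configured alongside the module, since the module writes the ConfigMap through that provider. A minimal sketch using a token data source (exec-based auth, as shown earlier in the thread, works too):

data "aws_eks_cluster_auth" "this" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
  token                  = data.aws_eks_cluster_auth.this.token
}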

@qlikcoe commented Sep 27, 2022

If this is not defined, nodes will become unreachable and all deployments on it.

This is very important! I did this for one environment and it worked well; I was able to gradually drain and terminate the old nodes. I forgot this step for another environment, and right after terraform apply all the old nodes were lost instantly. Major downtime 😱

@junaid-ali (Contributor) commented:

@qlikcoe @dusansusic it was mentioned later by a couple of others as well; GitHub has collapsed the majority of that discussion. For example, this was my experience: #1744 (comment)

@github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked this issue as resolved and limited conversation to collaborators on Nov 8, 2022