After update to v20 with API_AND_CONFIG_MAP cluster cannot launch Fargate pods #2912
Comments
What steps did you follow when upgrading from v19 to v20?
Hey, we saw this when upgrading our clusters to use the EKS API. Recreating the Fargate profile might be worth a try, as it will not cause a loss of nodes (only a short window without autoscaling).
@bryantbiggs it doesn't really matter, tbh, since the issue reproduces on a new cluster created from scratch with the v20 module
Tried this; after the Fargate profile was recreated, entries were added to the aws-auth ConfigMap by EKS itself. The Fargate profile works so far (after removing the entries from the ConfigMap).
Update: some time after the entries were deleted from the ConfigMap, the Fargate profile stopped working again:
@dmitriishaburov are you saying that creating a brand new cluster using the latest v20 module and EKS Fargate profiles with `API_AND_CONFIG_MAP` reproduces this?
Yes, after creating a brand new cluster with the v20 module and Fargate, pods initially launch but later fail (existing pods keep running, but no new pods can be launched)
Ok, thank you - let me dig into this
@dmitriishaburov do you have a way to reproduce? I launched the Fargate example that we have in this module, scaled the sample deployment, and am still not seeing any issues so far:
@bryantbiggs have you checked that the aws-auth ConfigMap doesn't have entries for Fargate?
yes, there are ConfigMap entries - these are created by EKS:

```yaml
# k get configmap -n kube-system aws-auth -o yaml
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::111111111111:role/kube-system-20240208133840563900000002
      username: system:node:{{SessionName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::111111111111:role/karpenter-20240208133840563500000001
      username: system:node:{{SessionName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2024-02-08T13:49:11Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "1442"
  uid: 990a01cc-c9cb-4e5a-a0b5-e278ebfdefce
```
I've manually deleted those entries from the aws-auth ConfigMap.
Still no signs of auth issues after an hour. For now I am going to park this; I don't think there is anything module related, since I am unable to reproduce.
Yeah, it seems like it's quite hard to replicate. I've created one more cluster to reproduce it, keeping the configuration as small as possible, and was trying to restart coredns. Here's the entire Terraform code for the cluster:
@dmitriishaburov So moving from v19 to v20, you need to add those roles to the mapping:
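For anyone following along, a minimal sketch of what that mapping might look like with the standalone aws-auth submodule added in v20 (the role ARN and names below are placeholders, not taken from this thread):

```hcl
# Hypothetical sketch using the v20 standalone aws-auth submodule.
# The role ARN is a placeholder - use the roles the v19 module previously
# mapped for you (managed node groups, Fargate profiles, Karpenter, etc.).
module "aws_auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "~> 20.0"

  manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::111111111111:role/managed-node-group" # placeholder
      username = "system:node:{{EC2PrivateDNSName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
      ]
    },
  ]
}
```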
@bryantbiggs I see you posted the aws-auth ConfigMap that was recreated after being deleted; could you paste the TF code you are using to create that with the aws-auth module? I'm assuming you are passing it the fargate_profiles similar to what I am, but probably doing it via for_each (which would be better).
This is not true - EKS will create both the aws-auth ConfigMap entries and the access entries for Fargate profiles itself.
It is documented; I created an entire replica of the module to make this transition easier: https://github.com/clowdhaus/terraform-aws-eks-migrate-v19-to-v20. Unless users are using
The docs are not entirely clear, but it seems like during the migration to access entries you shouldn't actually remove the Fargate (or managed node group) entries from the ConfigMap: https://docs.aws.amazon.com/eks/latest/userguide/migrating-access-entries.html
In v19, the ConfigMap entries were created automatically by Terraform; in v20, any change to the ConfigMap via Terraform removes the AWS-created entries from it. It would probably make sense to keep the behavior the same in the aws-auth module.
we cannot maintain the same functionality, because that would mean keeping the Kubernetes provider in the module, which we are absolutely not doing
This is not true. You need to understand how EKS handles access, as I've stated above.
That's just the EKS portion; that's the behavior of the EKS API, both past and present. Coming back to this module: in v19 we automatically mapped the roles from both managed node groups and Fargate profiles created by this module into the aws-auth ConfigMap. Finally, we come to the migration from v19 to v20:
What is happening in step 2 is that we are enabling cluster access entry but not modifying the aws-auth ConfigMap.
Just for the sake of completeness - if the
I'm running into the same issues, and reading all of this doesn't clear up what is happening. I've run the migration from v19 to v20 using the migration fork, and the Fargate pods are starting correctly. However, now that I'm back on the standard eks module source with the version set to ~> 20.0, a terraform plan says it would like to destroy the aws-auth ConfigMap. Any help clarifying this would be greatly appreciated.
You only need to move/remove the aws-auth ConfigMap resources in your Terraform state.
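As a hedged illustration of the "move" option (the resource addresses are assumptions based on the default module layout; check `terraform state list` for your own):

```sh
# Move the v19-managed ConfigMap resource under the new aws-auth submodule
# instead of letting Terraform destroy it. Addresses below are assumptions.
terraform state mv \
  'module.eks.kubernetes_config_map_v1_data.aws_auth[0]' \
  'module.aws_auth.kubernetes_config_map_v1_data.aws_auth[0]'
```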
I faced the same challenge as well, and I identified the issue: as per the AWS docs, we need to have the following trust policy
instead, the module is creating only
Once I added the additional trust, things started working. I am not sure why it works on v19, though.
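For reference, the trust condition being described appears in the AWS documentation for the Fargate pod execution role; a sketch in Terraform, where the region, account ID, cluster name, and role name are placeholders:

```hcl
# Trust policy for the Fargate pod execution role per the AWS docs, including
# the aws:SourceArn condition scoping it to this cluster's Fargate profiles.
data "aws_iam_policy_document" "fargate_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["eks-fargate-pods.amazonaws.com"]
    }

    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"
      values   = ["arn:aws:eks:eu-west-1:111111111111:fargateprofile/my-cluster/*"] # placeholder
    }
  }
}

resource "aws_iam_role" "fargate_pod_execution" {
  name               = "fargate-pod-execution-role" # placeholder
  assume_role_policy = data.aws_iam_policy_document.fargate_assume_role.json
}
```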
Hey @bryantbiggs, I'm observing a similar thing and would like clarification before I make the upgrade, to avoid any disruption in access. When going from v19 to v20, setting the version to ~> 20.0 and having the auth roles created using the submodule as below,
the terraform plan shows destruction and creation of the ConfigMap.
My concern with this delete and create: will I lose access for some time, or will the EKS access entries (which I'm assuming are created automatically) cover the disruption? Could this turn into a shoot-yourself-in-the-foot scenario? And is this the reason we need to go with the https://github.com/clowdhaus/terraform-aws-eks-migrate-v19-to-v20 approach rather than a direct upgrade from v19 to v20?
This is tough to reproduce, but I ran into it as well, in
I also ran into the same issue and the same error message. Steps I followed:
that is far from what is outlined in the upgrade guide, and I would expect issues when following that route
I just brought up a brand new EKS cluster on 20.5.0 and it's having the same issue:
So this has nothing to do with the upgrade.
I looked at the pod execution role for coredns and it has "AmazonEKS_CNI_Policy", "AmazonEKSFargatePodExecutionRolePolicy", and "My additional role policy" attached, with a trust policy of:
So it looks correct, I think. I know @kuntalkumarbasu said above that you need to add the condition to make it work, but that doesn't seem right, does it?
I used to add the Fargate executor ARNs here before, as described in a previous reply, but since the module creates the access entry I removed those from being added here as
As @bryantbiggs mentioned, I followed these steps and I am not seeing this error anymore.
Confirmed using kubectl rollout after 2 hours and after 24 hours: EKS is able to deploy pods on Fargate nodes.
Team, I'm facing the same issue after migrating from v19 to v20. I get the same error as the OP on my pending Karpenter pods.
What are some things I could try?
eks-cluster
aws-auth
karpenter
I had the same issue:
The only thing that solved it for me was manually adding the Fargate pod execution role to aws-auth (using the new submodule), like this:
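(A sketch of what such an entry might look like; the role ARN is a placeholder. Note the system:node-proxier group and the {{SessionName}} username, mirroring the entries EKS itself writes for Fargate, as shown in the ConfigMap earlier in this thread.)

```hcl
module "aws_auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "~> 20.0"

  manage_aws_auth_configmap = true

  # Hypothetical entry for the Fargate pod execution role (ARN is a placeholder)
  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::111111111111:role/fargate-pod-execution-role"
      username = "system:node:{{SessionName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
        "system:node-proxier",
      ]
    },
  ]
}
```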
For some reason, it only worked by using
For those on this issue/thread: can you please open an AWS support case with your cluster ARN and the time period when you encountered this behavior?
We have encountered this issue on all of our ~12 clusters. It is definitely an EKS issue and not a Terraform issue, since deleting and recreating the Fargate profile (either via Terraform or the console) fixes it... temporarily. We've opened an AWS ticket for the matter.
I've also created an AWS support case for this and got the following response (truncated most of it, just the conclusion; case ID 170852436301392):
I've just left the cluster on
@dmitriishaburov I tried this originally and it did not seem to work. And clearly this explanation does not make sense anyway, because deleting and recreating the Fargate profile allows it to schedule without adding the execution role to the aws-auth ConfigMap.
I'm seeing the same issues on our clusters when we try to finish off the migration steps. AWS support has so far told me that access entries are not correctly created for existing Fargate profiles when migrating between authentication modes, with a hint to this bit in the documentation:
The support engineer also told me that he has reached out to internal teams; apparently the internal product team is tracking this exact GitHub issue.
I'm facing the same issue. I don't know if this can help, but I want to share what I have figured out so far:
Also ran into the same. To resolve it, I had to re-apply the aws-auth ConfigMap, delete all Fargate profiles, re-apply again, and rollout-restart all deployments. After an hour it still looks stable.
@hefnat curious as to why not keep the default authentication mode?
@cdenneen it was an attempt to keep the setting as it was and not introduce anything new, and to check if that worked
Description
After updating the cluster from v19 to v20 and switching to `API_AND_CONFIG_MAP` auth mode, the cluster cannot launch new Fargate pods. New clusters with `API_AND_CONFIG_MAP` mode cannot launch Fargate pods either.

Versions
Reproduction Code [Required]
Basic stripped down version of what we're using:
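A minimal sketch of a configuration with this shape (the cluster name, version, and network references below are placeholders, not the exact code from this issue):

```hcl
# Minimal sketch of an affected configuration. Cluster name, version,
# and VPC/subnet references are placeholders.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "example"
  cluster_version = "1.29"

  # v20 default: both access entries and the aws-auth ConfigMap are honored
  authentication_mode = "API_AND_CONFIG_MAP"

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  fargate_profiles = {
    kube_system = {
      name = "kube-system"
      selectors = [
        { namespace = "kube-system" }
      ]
    }
  }
}
```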
Expected behavior
Fargate pods should be able to launch
Actual behavior
After updating to v20, we're seeing the following errors when trying to launch new pods:
Additional context
From what I can see, there is an access entry created for Fargate, but no aws-auth ConfigMap entry.
While that's probably expected, maybe it affects the ability to run Fargate?