Describe the bug
The aws-load-balancer-controller revoked the Shared Backend SG for LoadBalancer from the security group used by EKS nodes, causing an outage. After a few minutes, the controller added the security group back.
Steps to reproduce
Unable to reproduce this.
Expected outcome
SecurityGroups are not modified in a way that causes pods to be unreachable.
Environment
AWS Load Balancer controller version: v2.4.1
Kubernetes version: 1.24
Using EKS (yes/no), if so version? yes v1.24.17-eks-508b6b3
Additional Context:
I recently had an issue where the load balancer controller revoked the Shared Backend SecurityGroup for LoadBalancer (sg-0fe###) as an Inbound rule from the security group used by nodes in our EKS cluster (sg-085###). This caused requests to the cluster to return a 503 error.
This is the CloudTrail log for the breaking security group change:
A few minutes later, the ingress controller added the Shared Backend SecurityGroup for LoadBalancer back as an inbound rule and the cluster started serving requests again.
This is the CloudTrail log for the security group change that fixed the issue:
@cjhawkins
Are the controller pod logs still around? They should show why the controller decided to remove the security group rule from the worker nodes.
Would you mind cutting a ticket to EKS support and attaching the controller logs?
This probably doesn't add much, but these are the logs from around the time the security group changes were made:
Not sure if it's relevant, but the EKS cluster has 5 different ALBs in front of it, each with its own domain, certificate and OIDC configuration.
Any insights into what caused this change, and how to make sure it doesn't happen again? Thank you.
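In case it helps anyone correlating a similar incident, here is a minimal sketch of how the revoke/re-authorize pair could be pulled out of exported CloudTrail records to measure the outage window. It assumes the records are already loaded as Python dicts and that the EC2 `requestParameters` follow the usual `ipPermissions.items[].groups.items[].groupId` shape; the function names, the group IDs and the timestamps in `sample_events` are hypothetical placeholders, not values from this cluster.

```python
from datetime import datetime

# The two EC2 API calls that add or remove inbound security group rules.
SG_EVENTS = {"RevokeSecurityGroupIngress", "AuthorizeSecurityGroupIngress"}


def parse_time(value):
    # Assumes CloudTrail-style UTC timestamps such as 2023-10-01T12:00:00Z.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def source_groups(event):
    # Collect the source security group IDs referenced by the rule change.
    params = event.get("requestParameters") or {}
    perms = (params.get("ipPermissions") or {}).get("items") or []
    found = set()
    for perm in perms:
        for group in (perm.get("groups") or {}).get("items") or []:
            if "groupId" in group:
                found.add(group["groupId"])
    return found


def sg_rule_changes(events, node_sg, lb_sg):
    # Keep only changes on node_sg whose rule references lb_sg, oldest first.
    hits = []
    for event in events:
        if event.get("eventName") not in SG_EVENTS:
            continue
        params = event.get("requestParameters") or {}
        if params.get("groupId") != node_sg:
            continue
        if lb_sg not in source_groups(event):
            continue
        hits.append((parse_time(event["eventTime"]), event["eventName"]))
    return sorted(hits)


def outage_windows(changes):
    # Pair each revoke with the next authorize that follows it.
    windows = []
    revoked_at = None
    for when, name in changes:
        if name == "RevokeSecurityGroupIngress" and revoked_at is None:
            revoked_at = when
        elif name == "AuthorizeSecurityGroupIngress" and revoked_at is not None:
            windows.append((revoked_at, when))
            revoked_at = None
    return windows


def make_event(name, when, node_sg, lb_sg):
    # Helper that builds a hypothetical record in the assumed shape.
    return {
        "eventName": name,
        "eventTime": when,
        "requestParameters": {
            "groupId": node_sg,
            "ipPermissions": {"items": [{"groups": {"items": [{"groupId": lb_sg}]}}]},
        },
    }


# Hypothetical sample data, for illustration only.
sample_events = [
    make_event("AuthorizeSecurityGroupIngress", "2023-10-01T12:04:30Z",
               "sg-nodes-example", "sg-backend-example"),
    make_event("RevokeSecurityGroupIngress", "2023-10-01T12:00:00Z",
               "sg-nodes-example", "sg-backend-example"),
    make_event("RevokeSecurityGroupIngress", "2023-10-01T12:01:00Z",
               "sg-other-example", "sg-backend-example"),
]
```

Calling `outage_windows(sg_rule_changes(sample_events, "sg-nodes-example", "sg-backend-example"))` returns one `(revoked_at, restored_at)` pair per gap, which can then be lined up against the controller pod logs for the same period.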