Security Group unexpectedly revoked by Controller #3637

cjhawkins · 2024-04-05T21:51:58Z

Describe the bug
The aws-load-balancer-controller revoked the inbound rule that references the Shared Backend SG for LoadBalancer from the security group used by our EKS nodes, causing an outage. A few minutes later, the controller added the rule back.

Steps to reproduce
Unable to reproduce this.

Expected outcome
SecurityGroups are not modified in a way that causes pods to be unreachable.

Environment

AWS Load Balancer controller version: v2.4.1
Kubernetes version: 1.24
Using EKS (yes/no), if so version? yes v1.24.17-eks-508b6b3

Additional Context:
I recently had an issue where the load balancer controller removed the inbound rule that references the Shared Backend SecurityGroup for LoadBalancer (sg-0fe###) from the security group used by the nodes in our EKS cluster (sg-085###). This caused requests to the cluster to return 503 errors.

This is the CloudTrail log for the breaking security group change:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "#####################:######",
        "arn": "arn:aws:sts::############:assumed-role/irsa-production-alb-load-balancer-controller/######",
        "accountId": "############",
        "accessKeyId": "#####################",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "#####################",
                "arn": "arn:aws:iam::############:role/irsa-production-alb-load-balancer-controller",
                "accountId": "############",
                "userName": "irsa-production-alb-load-balancer-controller"
            },
            "webIdFederationData": {
                "federatedProvider": "arn:aws:iam::############:oidc-provider/oidc.eks.ca-central-1.amazonaws.com/id/#####################",
                "attributes": {}
            },
            "attributes": {
                "creationDate": "2024-03-24T14:32:16Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2024-03-24T14:39:04Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RevokeSecurityGroupIngress",
    "awsRegion": "ca-central-1",
    "sourceIPAddress": "###.###.###.###",
    "userAgent": "elbv2.k8s.aws/v2.4.1 aws-sdk-go/1.42.27 (go1.17.8; linux; amd64)",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "groups": {
                        "items": [
                            {
                                "userId": "############",
                                "groupId": "sg-0fe###",
                                "description": "elbv2.k8s.aws/targetGroupBinding=shared"
                            }
                        ]
                    },
                    "ipRanges": {},
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "c0db6584-48ed-415b-9762-3feccf1789fa",
        "_return": true,
        "revokedSecurityGroupRuleSet": {
            "items": [
                {
                    "groupId": "sg-085###",
                    "securityGroupRuleId": "sgr-0f1bc2e98ba0f9786",
                    "description": "elbv2.k8s.aws/targetGroupBinding=shared",
                    "isEgress": false,
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "referencedGroupId": "sg-0fe###"
                }
            ]
        }
    },
    "requestID": "c0db6584-48ed-415b-9762-3feccf1789fa",
    "eventID": "ac2601f1-5ed9-42fa-ba40-99d92d7e53bf",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "############",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "ec2.ca-central-1.amazonaws.com"
    }
}

A few minutes later, the ingress controller added the Shared Backend SecurityGroup for LoadBalancer back as an inbound rule and the cluster started serving requests again.

This is the CloudTrail log for the security group change that fixed the issue:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "#####################:######",
        "arn": "arn:aws:sts::############:assumed-role/irsa-production-alb-load-balancer-controller/######",
        "accountId": "############",
        "accessKeyId": "#####################",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "#####################",
                "arn": "arn:aws:iam::############:role/irsa-production-alb-load-balancer-controller",
                "accountId": "############",
                "userName": "irsa-production-alb-load-balancer-controller"
            },
            "webIdFederationData": {
                "federatedProvider": "arn:aws:iam::############:oidc-provider/oidc.eks.ca-central-1.amazonaws.com/id/#####################",
                "attributes": {}
            },
            "attributes": {
                "creationDate": "2024-03-24T14:32:16Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2024-03-24T14:47:32Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "AuthorizeSecurityGroupIngress",
    "awsRegion": "ca-central-1",
    "sourceIPAddress": "###.###.###.###",
    "userAgent": "elbv2.k8s.aws/v2.4.1 aws-sdk-go/1.42.27 (go1.17.8; linux; amd64)",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "groups": {
                        "items": [
                            {
                                "groupId": "sg-0fe###",
                                "description": "elbv2.k8s.aws/targetGroupBinding=shared"
                            }
                        ]
                    },
                    "ipRanges": {},
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "ac85fba4-09b3-4a76-95be-21de5879907a",
        "_return": true,
        "securityGroupRuleSet": {
            "items": [
                {
                    "groupOwnerId": "############",
                    "groupId": "sg-085###",
                    "securityGroupRuleId": "sgr-066###",
                    "description": "elbv2.k8s.aws/targetGroupBinding=shared",
                    "isEgress": false,
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "referencedGroupInfo": {
                        "userId": "############",
                        "groupId": "sg-0fe###"
                    }
                }
            ]
        }
    },
    "requestID": "ac85fba4-09b3-4a76-95be-21de5879907a",
    "eventID": "9de31dfa-9982-4af8-a5cb-0fc3eef65dee",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "############",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "ec2.ca-central-1.amazonaws.com"
    }
}
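
As a rough cross-check of the two events above, the following Python sketch confirms that the revoke and the authorize refer to the same rule and measures the time between them. It is only an illustration: the dictionaries are trimmed copies of the CloudTrail events with the redacted IDs left as posted, and the helper names (`rule_keys`, `event_time`, `outage_window`) are hypothetical, not part of any AWS tooling.

```python
from datetime import datetime, timezone

# Trimmed copies of the two CloudTrail events above; only the fields used here are kept.
REVOKE = {
    "eventTime": "2024-03-24T14:39:04Z",
    "eventName": "RevokeSecurityGroupIngress",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {"items": [{
            "ipProtocol": "tcp", "fromPort": 80, "toPort": 80,
            "groups": {"items": [{"groupId": "sg-0fe###"}]},
        }]},
    },
}
AUTHORIZE = {
    "eventTime": "2024-03-24T14:47:32Z",
    "eventName": "AuthorizeSecurityGroupIngress",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {"items": [{
            "ipProtocol": "tcp", "fromPort": 80, "toPort": 80,
            "groups": {"items": [{"groupId": "sg-0fe###"}]},
        }]},
    },
}


def rule_keys(event):
    """Return (target SG, protocol, fromPort, toPort, source SG) tuples for an event."""
    params = event["requestParameters"]
    keys = set()
    for perm in params["ipPermissions"]["items"]:
        # "groups" can be an empty object when the rule uses CIDR ranges instead.
        for group in perm.get("groups", {}).get("items", []):
            keys.add((params["groupId"], perm["ipProtocol"],
                      perm["fromPort"], perm["toPort"], group["groupId"]))
    return keys


def event_time(event):
    """Parse the CloudTrail eventTime string as a UTC datetime."""
    parsed = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def outage_window(revoke, authorize):
    """Return the time between the two events if they share a rule, else None."""
    if not rule_keys(revoke) & rule_keys(authorize):
        return None
    return event_time(authorize) - event_time(revoke)


gap = outage_window(REVOKE, AUTHORIZE)
print(gap)  # 0:08:28
```

With the timestamps from the two events, the rule for TCP port 80 from sg-0fe### into sg-085### was missing for 8 minutes and 28 seconds.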

This probably doesn't add much, but these are the logs from around the time the security group changes were made:

Not sure if it's relevant, but the EKS cluster has 5 different ALBs in front of it, each with its own domain, certificate, and OIDC configuration.

Any insights into what caused this change, and how to make sure it doesn't happen again? Thank you.


M00nF1sh · 2024-04-10T22:20:21Z

@cjhawkins
Are the logs of the controller pod still around? The controller pod logs should show why it decided to remove the security group rule from the worker nodes.

Would you mind cutting a ticket to EKS support with the controller logs?

shraddhabang added the triage/needs-information label (indicates an issue needs more information in order to work on it) · Apr 24, 2024
