Security Group unexpectedly revoked by Controller #3637

Open
cjhawkins opened this issue Apr 5, 2024 · 1 comment
Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@cjhawkins

Describe the bug
The aws-load-balancer-controller revoked the inbound rule for the Shared Backend SG for LoadBalancer on the security group used by the EKS nodes, causing an outage. About eight minutes later (per the CloudTrail events under Additional Context), the controller added the rule back.

Steps to reproduce
Unable to reproduce this.

Expected outcome
SecurityGroups are not modified in a way that causes pods to be unreachable.

Environment

  • AWS Load Balancer controller version: v2.4.1
  • Kubernetes version: 1.24
  • Using EKS (yes/no), if so version? yes v1.24.17-eks-508b6b3

Additional Context:
I recently had an issue where the load balancer controller revoked the Shared Backend SecurityGroup for LoadBalancer (sg-0fe###) as an Inbound rule from the security group used by nodes in our EKS cluster (sg-085###). This caused requests to the cluster to return a 503 error.

This is the CloudTrail log for the breaking security group change:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "#####################:######",
        "arn": "arn:aws:sts::############:assumed-role/irsa-production-alb-load-balancer-controller/######",
        "accountId": "############",
        "accessKeyId": "#####################",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "#####################",
                "arn": "arn:aws:iam::############:role/irsa-production-alb-load-balancer-controller",
                "accountId": "############",
                "userName": "irsa-production-alb-load-balancer-controller"
            },
            "webIdFederationData": {
                "federatedProvider": "arn:aws:iam::############:oidc-provider/oidc.eks.ca-central-1.amazonaws.com/id/#####################",
                "attributes": {}
            },
            "attributes": {
                "creationDate": "2024-03-24T14:32:16Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2024-03-24T14:39:04Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RevokeSecurityGroupIngress",
    "awsRegion": "ca-central-1",
    "sourceIPAddress": "###.###.###.###",
    "userAgent": "elbv2.k8s.aws/v2.4.1 aws-sdk-go/1.42.27 (go1.17.8; linux; amd64)",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "groups": {
                        "items": [
                            {
                                "userId": "############",
                                "groupId": "sg-0fe###",
                                "description": "elbv2.k8s.aws/targetGroupBinding=shared"
                            }
                        ]
                    },
                    "ipRanges": {},
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "c0db6584-48ed-415b-9762-3feccf1789fa",
        "_return": true,
        "revokedSecurityGroupRuleSet": {
            "items": [
                {
                    "groupId": "sg-085###",
                    "securityGroupRuleId": "sgr-0f1bc2e98ba0f9786",
                    "description": "elbv2.k8s.aws/targetGroupBinding=shared",
                    "isEgress": false,
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "referencedGroupId": "sg-0fe###"
                }
            ]
        }
    },
    "requestID": "c0db6584-48ed-415b-9762-3feccf1789fa",
    "eventID": "ac2601f1-5ed9-42fa-ba40-99d92d7e53bf",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "############",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "ec2.ca-central-1.amazonaws.com"
    }
}

A few minutes later, the ingress controller added the Shared Backend SecurityGroup for LoadBalancer back as an inbound rule and the cluster started serving requests again.

This is the CloudTrail log for the security group change that fixed the issue:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "#####################:######",
        "arn": "arn:aws:sts::############:assumed-role/irsa-production-alb-load-balancer-controller/######",
        "accountId": "############",
        "accessKeyId": "#####################",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "#####################",
                "arn": "arn:aws:iam::############:role/irsa-production-alb-load-balancer-controller",
                "accountId": "############",
                "userName": "irsa-production-alb-load-balancer-controller"
            },
            "webIdFederationData": {
                "federatedProvider": "arn:aws:iam::############:oidc-provider/oidc.eks.ca-central-1.amazonaws.com/id/#####################",
                "attributes": {}
            },
            "attributes": {
                "creationDate": "2024-03-24T14:32:16Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2024-03-24T14:47:32Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "AuthorizeSecurityGroupIngress",
    "awsRegion": "ca-central-1",
    "sourceIPAddress": "###.###.###.###",
    "userAgent": "elbv2.k8s.aws/v2.4.1 aws-sdk-go/1.42.27 (go1.17.8; linux; amd64)",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "groups": {
                        "items": [
                            {
                                "groupId": "sg-0fe###",
                                "description": "elbv2.k8s.aws/targetGroupBinding=shared"
                            }
                        ]
                    },
                    "ipRanges": {},
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "ac85fba4-09b3-4a76-95be-21de5879907a",
        "_return": true,
        "securityGroupRuleSet": {
            "items": [
                {
                    "groupOwnerId": "############",
                    "groupId": "sg-085###",
                    "securityGroupRuleId": "sgr-066###",
                    "description": "elbv2.k8s.aws/targetGroupBinding=shared",
                    "isEgress": false,
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "referencedGroupInfo": {
                        "userId": "############",
                        "groupId": "sg-0fe###"
                    }
                }
            ]
        }
    },
    "requestID": "ac85fba4-09b3-4a76-95be-21de5879907a",
    "eventID": "9de31dfa-9982-4af8-a5cb-0fc3eef65dee",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "############",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "ec2.ca-central-1.amazonaws.com"
    }
}
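
For anyone comparing a pair of events like the two above, here is a minimal Python sketch that computes the gap between the revoke and the following authorize for the same rule. The function names (`parse_event_time`, `references_group`, `outage_window`) are made up for illustration, and the sample dicts are trimmed copies of the two CloudTrail events with only the fields the sketch reads.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_event_time(value: str) -> datetime:
    """Parse a CloudTrail eventTime such as '2024-03-24T14:39:04Z' as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def references_group(event: dict, group_id: str, source_group_id: str) -> bool:
    """True if the event changes group_id and names source_group_id as the referenced group."""
    params = event.get("requestParameters", {})
    if params.get("groupId") != group_id:
        return False
    for perm in params.get("ipPermissions", {}).get("items", []):
        for grp in perm.get("groups", {}).get("items", []):
            if grp.get("groupId") == source_group_id:
                return True
    return False


def outage_window(events: list, group_id: str, source_group_id: str) -> Optional[timedelta]:
    """Time between the first matching revoke and the next matching authorize, or None."""
    revoked_at = None
    for event in sorted(events, key=lambda e: parse_event_time(e["eventTime"])):
        if not references_group(event, group_id, source_group_id):
            continue
        when = parse_event_time(event["eventTime"])
        if event["eventName"] == "RevokeSecurityGroupIngress" and revoked_at is None:
            revoked_at = when
        elif event["eventName"] == "AuthorizeSecurityGroupIngress" and revoked_at is not None:
            return when - revoked_at
    return None


# Trimmed copies of the two events from this issue (IDs are redacted as in the logs).
rule = {"groupId": "sg-085###",
        "ipPermissions": {"items": [{"groups": {"items": [{"groupId": "sg-0fe###"}]}}]}}
revoke = {"eventTime": "2024-03-24T14:39:04Z",
          "eventName": "RevokeSecurityGroupIngress",
          "requestParameters": rule}
authorize = {"eventTime": "2024-03-24T14:47:32Z",
             "eventName": "AuthorizeSecurityGroupIngress",
             "requestParameters": rule}

print(outage_window([authorize, revoke], "sg-085###", "sg-0fe###"))  # 0:08:28
```

With the timestamps from the two events, the rule was missing for 8 minutes 28 seconds.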

This probably doesn't add much, but these are the logs from around the time the security group changes were made:
[Image: Screenshot 2024-04-05 at 11 18 42 AM]

Not sure if it's relevant, but the EKS cluster has 5 different ALBs in front of it, each with its own domain, certificate, and OIDC configuration.

Any insights into what caused this change, and how to make sure it doesn't happen again? Thank you.

@M00nF1sh
Collaborator

@cjhawkins
Are the controller pod logs still around? They should show why the controller decided to remove the security group rule from the worker nodes.

Would you mind cutting a ticket with EKS support and attaching the controller log?
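
If the logs are still available, one way to narrow them down before attaching them is to keep only the lines that mention the affected security groups. This is a rough Python sketch, assuming the logs were saved to a text file (for example with `kubectl logs`). The function name `lines_mentioning` is made up here, and it does plain substring matching, so it assumes nothing about the controller's log format.

```python
from typing import Iterable, Iterator


def lines_mentioning(lines: Iterable[str], needles: Iterable[str]) -> Iterator[str]:
    """Yield each log line that contains at least one of the given substrings."""
    wanted = [n for n in needles if n]  # ignore empty search strings
    for line in lines:
        if any(n in line for n in wanted):
            yield line.rstrip("\n")
```

For example, opening the saved log file and passing it in with the two security group IDs from the CloudTrail events as `needles` would return only the lines that reference either group.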

@shraddhabang shraddhabang added the triage/needs-information Indicates an issue needs more information in order to work on it. label Apr 24, 2024