
[Bug] eksctl delete cluster leaks network interface, subnet and VPC #7589

Open
mauriciovasquezbernal opened this issue Feb 22, 2024 · 12 comments

@mauriciovasquezbernal

What were you trying to accomplish?

We're using eksctl as part of the CI system of Inspektor Gadget. For each CI run, we need to create a cluster and destroy it after running our tests.

What happened?

After some days, it's not possible to create new clusters:

"Resource handler returned message: "The maximum number of VPCs has been reached. (Service: Ec2, Status Code: 400, Request ID: xxx)"

This is happening because the deletion of the cluster sometimes fails, leaking resources. The eksctl delete cluster logs don't contain any relevant information:

Run eksctl delete cluster --name ig-ci-eks-amd64-8741 --wait=false
2024-02-16 16:32:33 [ℹ] deleting EKS cluster "ig-ci-eks-amd64-8741"
2024-02-16 16:32:33 [ℹ] will drain 0 unmanaged nodegroup(s) in cluster "ig-ci-eks-amd64-8741"
2024-02-16 16:32:33 [ℹ] starting parallel draining, max in-flight of 1
2024-02-16 16:32:33 [ℹ] deleted 0 Fargate profile(s)
2024-02-16 16:32:34 [✔] kubeconfig has been updated
2024-02-16 16:32:34 [ℹ] cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress
2024-02-16 16:32:34 [ℹ]
2 sequential tasks: { delete nodegroup "ng-f74722d8", delete cluster control plane "ig-ci-eks-amd64-8741" [async]
}
2024-02-16 16:32:34 [ℹ] will delete stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:32:34 [ℹ] waiting for stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8" to get deleted
2024-02-16 16:32:34 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:33:05 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:33:49 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:35:31 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:36:10 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:36:51 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:38:07 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:39:14 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:40:08 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:40:08 [ℹ] will delete stack "eksctl-ig-ci-eks-amd64-8741-cluster"
2024-02-16 16:40:08 [✔] all cluster resources were deleted

But the logs from CloudFormation indicate a subnet couldn't be deleted:

Resource handler returned message: "The subnet 'subnet-foo-bar' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400, Request ID: xxx (RequestToken: yyy HandlerErrorCode: InvalidRequest)

The subnet can't be deleted because it has a network interface attached:

[screenshot: the network interface still attached to the leftover subnet]

I can manually remove the network interface and then the CloudFormation stack.
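Roughly, the manual cleanup looks like this (the ENI ID below is a placeholder; if the interface is still in use it has to be detached first):

# find the network interface(s) still attached to the leftover subnet
aws ec2 describe-network-interfaces --filters Name=subnet-id,Values=subnet-foo-bar
# delete the blocking interface, then retry the stack deletion
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0
aws cloudformation delete-stack --stack-name eksctl-ig-ci-eks-amd64-8741-cluster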

This is something that happens often; after one week or so our limit of 20 VPCs is reached:

[screenshot: the account's VPC quota being exhausted]
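For reference, a quick way to count how many of these leaked VPCs are still around (assuming eksctl tags its VPCs with alpha.eksctl.io/cluster-name; adjust the filter if not):

aws ec2 describe-vpcs --filters Name=tag-key,Values=alpha.eksctl.io/cluster-name --query 'length(Vpcs)'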

How to reproduce it?

I suppose trying to create and remove a cluster multiple times will reproduce this behavior.

Logs

Anything else we need to know?

eksctl was downloaded from the latest release of this repository.

Versions

I don't have access to this eksctl instance as it was running on GitHub Actions, but the version reported was 0.171.0.

Contributor

Hello mauriciovasquezbernal 👋 Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find more information about eksctl on our website.

@yuxiang-zhang
Member

Hi @mauriciovasquezbernal I believe the error isn't surfaced because you set --wait=false.

@eatmyrust

I am also having this issue when attempting to delete clusters. We use the --wait flag so the command is correctly exiting with a failure status code, but the real issue is why eksctl is failing to properly tear down a VPC every time. It looks like the delete command is leaving behind endpoints which still have network interfaces attached to them. This results in the subnet failing to delete and a VPC being left behind.
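For example, something like the following shows the leftover endpoints and their interfaces in one of the orphaned VPCs (the VPC ID is a placeholder):

aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-0123456789abcdef0
aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=vpc-0123456789abcdef0 --query 'NetworkInterfaces[].[NetworkInterfaceId,InterfaceType,Description,Status]' --output table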

@eiffel-fl

Hi @yuxiang-zhang!

Can you please share more context on --wait=false?
Indeed, we used this flag before but removed it as it is the default behavior:
inspektor-gadget/inspektor-gadget#2534 (comment)

cmd.Wait = false
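In other words, if --wait=false is indeed the default, these two invocations should behave the same (the cluster name is just an example):

eksctl delete cluster --name ig-ci-eks-amd64-8741
eksctl delete cluster --name ig-ci-eks-amd64-8741 --wait=false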

Best regards.

@yuxiang-zhang
Member

Hi @eiffel-fl, we'll need more details on how you configure the cluster and what your tests do to the VPC/subnets. We also create clusters to run tests and tear them down afterwards, but we haven't seen this issue occur.

If you set --wait, or equivalently --wait=true, then as @eatmyrust noted:

using the --wait flag so the command is correctly exiting with a failure status code

@eiffel-fl

Hi!

Hi @eiffel-fl, we'll need more details on how you configure the cluster and what your tests do to the VPC/subnets. We also create clusters to run tests and tear them down afterwards, but we haven't seen this issue occur.

Sure!
We are using EKS to run our integration tests:
https://github.com/inspektor-gadget/inspektor-gadget/blob/362c82d462264b3dd0ab753efc53987aebe9791a/.github/workflows/inspektor-gadget.yml#L1544
After some configuration, we create the cluster using eksctl:
https://github.com/inspektor-gadget/inspektor-gadget/blob/362c82d462264b3dd0ab753efc53987aebe9791a/.github/workflows/inspektor-gadget.yml#L1606
Then we run our tests and we finally remove everything using eksctl:
https://github.com/inspektor-gadget/inspektor-gadget/blob/362c82d462264b3dd0ab753efc53987aebe9791a/.github/workflows/inspektor-gadget.yml#L1621

In particular, we are not creating the VPCs ourselves; they are created by the eksctl call.
The same goes for deletion, and it seems something goes wrong there, as some VPCs cannot be deleted, which leaves them dangling and prevents us from running tests later.

If you set --wait, or equivalently --wait=true, then as @eatmyrust noted:

using the --wait flag so the command is correctly exiting with a failure status code

I would like to avoid using --wait=true; indeed, the cluster creation already takes around 15 minutes:
https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8479044411/job/23238203432#step:8:1
So, if we used --wait, we would add extra time waiting for everything to be deleted.
Also, I do not really understand: shouldn't eksctl just send the request to the server, with the whole delete operation handled asynchronously by the server?

If you need any other information, please let me know.

Best regards.

@TiberiuGC
Collaborator

Hi @eiffel-fl - thanks for explaining your workflow!

I would like to avoid using --wait=true; indeed, the cluster creation already takes around 15 minutes:
https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8479044411/job/23238203432#step:8:1
So, if we used --wait, we would add extra time waiting for everything to be deleted.
Also, I do not really understand: shouldn't eksctl just send the request to the server, with the whole delete operation handled asynchronously by the server?

Setting --wait=true would've just helped to reveal that an error occurred (not fix it) during cluster control plane CFN stack deletion, but since you're already aware of it, there's no point in setting it now. Indeed, the deletion is being handled asynchronously when --wait=false.

We're trying to determine the underlying problem, hence we are asking for details: what actually happens inside those integration tests? I understand there may be a lot going on, but we should try to make some guesses as to what can influence eksctl delete cluster behaviour.

How to reproduce it?

I suppose trying to create and remove a cluster multiple times will reproduce this behavior.

Simply running eksctl create cluster followed by eksctl delete cluster many times will most likely not reproduce the issue. My suspicion is that some integration test alters the cluster configuration in a way that makes the delete command subsequently fail. Hence, the first questions that come to mind (a couple of AWS CLI calls to check both are sketched below):

  • is the subnet that fails to be deleted one of the subnets created during eksctl create cluster, or is it created later as part of an integration test?
  • is the network interface attached to this subnet created during eksctl create cluster, or is it created later as part of an integration test?
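For example (the subnet ID below is the placeholder from your logs; the tag keys assume the usual CloudFormation and eksctl tagging):

# was the subnet created during eksctl create cluster? Its tags should tell:
# CloudFormation-managed subnets normally carry aws:cloudformation:* tags,
# and eksctl adds alpha.eksctl.io/* tags
aws ec2 describe-subnets --subnet-ids subnet-foo-bar --query 'Subnets[].Tags'
# what owns the leftover network interface? Description/InterfaceType usually tell
# (e.g. a VPC endpoint, or a load balancer created by a Kubernetes Service)
aws ec2 describe-network-interfaces --filters Name=subnet-id,Values=subnet-foo-bar --query 'NetworkInterfaces[].[NetworkInterfaceId,InterfaceType,Description]' --output table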

@eiffel-fl

Hi!

I appreciate your reply 😄!

Hi @eiffel-fl - thanks for explaining your workflow!

I would like to avoid using --wait=true; indeed, the cluster creation already takes around 15 minutes:
https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8479044411/job/23238203432#step:8:1
So, if we used --wait, we would add extra time waiting for everything to be deleted.
Also, I do not really understand: shouldn't eksctl just send the request to the server, with the whole delete operation handled asynchronously by the server?

Setting --wait=true would've just helped to reveal that an error occurred (not fix it) during cluster control plane CFN stack deletion, but since you're already aware of it, there's no point in setting it now. Indeed, the deletion is being handled asynchronously when --wait=false.

OK, this makes sense, thank you for shedding some light.

We're trying to determine the underlying problem, hence we are asking for details: what actually happens inside those integration tests? I understand there may be a lot going on, but we should try to make some guesses as to what can influence eksctl delete cluster behaviour.

Basically, we deploy Inspektor Gadget to the cluster and then run our integration tests.
These tests mainly consist of deploying a test pod that generates some events (e.g. syscalls or I/O) and monitoring everything with Inspektor Gadget.
The test succeeds if the expected event was monitored, and fails otherwise.
Note that Inspektor Gadget relies on eBPF to monitor all these events.

How to reproduce it?
I suppose trying to create and remove a cluster multiple times will reproduce this behavior.

Simply running eksctl create cluster followed by eksctl delete cluster many times will most likely not reproduce the issue. My suspicion is that some integration test alters the cluster configuration in a way that makes the delete command subsequently fail. Hence, first questions that come to mind:

* is the subnet that fails to be deleted one of the subnets created during `eksctl create cluster`, or is it created later as part of an integration test?

I did not dive into which subnet fails and which one succeeds to be deleted.
Can you please tell me how I can list these subnets? I may add a debug command to list them after we create the cluster and right before we delete it; this may help the understanding.
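Would something like this, run right after cluster creation and right before deletion, be enough (assuming filtering on the eksctl cluster-name tag is the right approach)?

aws ec2 describe-subnets --filters Name=tag:alpha.eksctl.io/cluster-name,Values=ig-ci-eks-amd64-8741 --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table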

* is the network interface attached to this subnet created during `eksctl create cluster`, or is it created later as part of an integration test?

We only call eksctl at two moments:

  1. Before running the test to create the cluster.
  2. After running the test to delete the cluster.

So, unless some kubectl command has the side effect of creating a subnet, we do not create any subnet on our own.
Please note that we also run these integration tests on other cloud platforms and we do not see issues with network resources not being deleted there (I am not comparing, just noting that our integration tests are "platform agnostic").

If you have ideas of what I can check, please share.
Also, if you need further information, I will provide it, abstracting as much as possible so you do not need to dive deep into our integration tests.

Best regards.

@mauriciovasquezbernal
Author

My suspicion is that some integration test alters the cluster configuration in a way that makes the delete command subsequently fail. Hence, first questions that come to mind:

* is the subnet that fails to be deleted one of the subnets created during `eksctl create cluster`, or is it created later as part of an integration test?

* is the network interface attached to this subnet created during `eksctl create cluster`, or is it created later as part of an integration test?

We don't create any subnet or anything related to the networking stack of the clusters during the integration tests. We only deploy Inspektor Gadget (there is nothing special about it that could affect the cluster networking) and some workloads to generate events (network traffic, DNS requests, opening files, executing processes, etc.). I'll try to create a reproducer without Inspektor Gadget.
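The reproducer would be something along these lines, creating and deleting clusters in a loop with a few plain workloads deployed in between (names and region are placeholders):

for i in $(seq 1 10); do
  eksctl create cluster --name leak-repro-$i --region us-east-1
  # deploy some plain test workloads here (no Inspektor Gadget)
  eksctl delete cluster --name leak-repro-$i --wait
done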

@burak-ok

In our logs we see the following (X's added by me):

Resource handler returned message: "The subnet 'subnet-0f2381XXXXXX' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400, Request ID: XXXXXX-XXX-XXX-XXXX-XXXXX)" (RequestToken: XXXXXXXXXXX HandlerErrorCode: InvalidRequest)

This is the error we see in the events of the CloudFormation stack. Since it can't delete this subnet (but deleted the other subnets successfully), it leaves the VPC alive and the CloudFormation stack stays in the DELETE_FAILED state.
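The stuck stacks and the exact failure reason can be listed with (the stack name is a placeholder):

aws cloudformation list-stacks --stack-status-filter DELETE_FAILED --query 'StackSummaries[].StackName'
aws cloudformation describe-stack-events --stack-name eksctl-my-cluster-cluster --query 'StackEvents[?ResourceStatus==`DELETE_FAILED`].[LogicalResourceId,ResourceStatusReason]'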

Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label May 27, 2024
@mauriciovasquezbernal
Author

I haven't had the chance to test it more. We implemented a workaround (inspektor-gadget/inspektor-gadget#2686) to clean up the leaked resources.
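Roughly, the workaround looks for leftover eksctl VPCs, deletes the dangling network interfaces in them and then retries the stack deletion; a simplified sketch (not the exact script from #2686; IDs are placeholders):

# find VPCs left behind by eksctl (assuming the alpha.eksctl.io/cluster-name tag)
aws ec2 describe-vpcs --filters Name=tag-key,Values=alpha.eksctl.io/cluster-name --query 'Vpcs[].VpcId' --output text
# for each of them, delete the leftover network interfaces...
aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=vpc-0123456789abcdef0 --query 'NetworkInterfaces[].NetworkInterfaceId' --output text
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0
# ...and retry deleting the corresponding CloudFormation stack
aws cloudformation delete-stack --stack-name eksctl-ig-ci-eks-amd64-8741-cluster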

@github-actions github-actions bot removed the stale label May 28, 2024