
[Bug] eksctl delete cluster leaks network interface, subnet and VPC #7589

Open
mauriciovasquezbernal opened this issue Feb 22, 2024 · 12 comments

@mauriciovasquezbernal

What were you trying to accomplish?

We're using eksctl as part of the CI system of Inspektor Gadget. For each CI run, we need to create a cluster and destroy it after running our tests.

What happened?

After some days, it's not possible to create new clusters:

"Resource handler returned message: "The maximum number of VPCs has been reached. (Service: Ec2, Status Code: 400, Request ID: xxx)"

This is happening because the deletion of the cluster sometimes fails, leaking resources. The eksctl delete cluster logs don't contain any relevant information:

Run eksctl delete cluster --name ig-ci-eks-amd64-8741 --wait=false
2024-02-16 16:32:33 [ℹ] deleting EKS cluster "ig-ci-eks-amd64-8741"
2024-02-16 16:32:33 [ℹ] will drain 0 unmanaged nodegroup(s) in cluster "ig-ci-eks-amd64-8741"
2024-02-16 16:32:33 [ℹ] starting parallel draining, max in-flight of 1
2024-02-16 16:32:33 [ℹ] deleted 0 Fargate profile(s)
2024-02-16 16:32:34 [✔] kubeconfig has been updated
2024-02-16 16:32:34 [ℹ] cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress
2024-02-16 16:32:34 [ℹ]
2 sequential tasks: { delete nodegroup "ng-f74722d8", delete cluster control plane "ig-ci-eks-amd64-8741" [async]
}
2024-02-16 16:32:34 [ℹ] will delete stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:32:34 [ℹ] waiting for stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8" to get deleted
2024-02-16 16:32:34 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:33:05 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:33:49 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:35:31 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:36:10 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:36:51 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:38:07 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:39:14 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:40:08 [ℹ] waiting for CloudFormation stack "eksctl-ig-ci-eks-amd64-8741-nodegroup-ng-f74722d8"
2024-02-16 16:40:08 [ℹ] will delete stack "eksctl-ig-ci-eks-amd64-8741-cluster"
2024-02-16 16:40:08 [✔] all cluster resources were deleted

But the logs from CloudFormation indicate a subnet couldn't be deleted:

Resource handler returned message: "The subnet 'subnet-foo-bar' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400, Request ID: xxx (RequestToken: yyy HandlerErrorCode: InvalidRequest)

The subnet can't be deleted because it has a network interface attached:

[screenshot: the network interface still attached to the leftover subnet]

I can manually remove the network interface and then the CloudFormation stack.
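Roughly, the manual cleanup looks like this (the ENI ID below is a placeholder; if the interface is still in use it has to be detached first):

# find the network interface(s) still attached to the leftover subnet
aws ec2 describe-network-interfaces --filters Name=subnet-id,Values=subnet-foo-bar
# delete the blocking interface, then retry the stack deletion
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0
aws cloudformation delete-stack --stack-name eksctl-ig-ci-eks-amd64-8741-cluster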

This is something that happens often; after one week or so our limit of 20 VPCs is reached:

[screenshot: the account's VPC quota being exhausted]
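For reference, a quick way to count how many of these leaked VPCs are still around (assuming eksctl tags its VPCs with alpha.eksctl.io/cluster-name; adjust the filter if not):

aws ec2 describe-vpcs --filters Name=tag-key,Values=alpha.eksctl.io/cluster-name --query 'length(Vpcs)'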

How to reproduce it?

I suppose trying to create and remove a cluster multiple times will reproduce this behavior.

Logs

Anything else we need to know?

eksctl was downloaded from the latest release of this repository.

Versions

I don't have access to this eksctl instance as it was running on GitHub Actions, but the version reported was 0.171.0.

Contributor

Hello mauriciovasquezbernal 👋 Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find more information about eksctl on our website.

@yuxiang-zhang
Member

Hi @mauriciovasquezbernal I believe the error isn't surfaced because you set --wait=false.

@eatmyrust

I am also having this issue when attempting to delete clusters. We use the --wait flag so the command is correctly exiting with a failure status code, but the real issue is why eksctl is failing to properly tear down a VPC every time. It looks like the delete command is leaving behind endpoints which still have network interfaces attached to them. This results in the subnet failing to delete and a VPC being left behind.
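For example, something like the following shows the leftover endpoints and their interfaces in one of the orphaned VPCs (the VPC ID is a placeholder):

aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-0123456789abcdef0
aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=vpc-0123456789abcdef0 --query 'NetworkInterfaces[].[NetworkInterfaceId,InterfaceType,Description,Status]' --output table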

@eiffel-fl

Hi @yuxiang-zhang!

Can you please share more context on --wait=false?
Indeed, we used this flag before but removed it as it is the default behavior:
inspektor-gadget/inspektor-gadget#2534 (comment)

cmd.Wait = false
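In other words, if --wait=false is indeed the default, these two invocations should behave the same (the cluster name is just an example):

eksctl delete cluster --name ig-ci-eks-amd64-8741
eksctl delete cluster --name ig-ci-eks-amd64-8741 --wait=false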

Best regards.

@yuxiang-zhang
Member

Hi @eiffel-fl, we'll need more details on how you configure the cluster and what your tests do to the VPC/subnets. We also create clusters to run tests and tear them down afterwards, but we haven't seen this issue occur.

If you set --wait, or equivalently --wait=true, then as @eatmyrust noted:

using the --wait flag so the command is correctly exiting with a failure status code

@eiffel-fl

Hi!

Hi @eiffel-fl, we'll need more details on how you configure the cluster and what your tests do to the VPC/subnets. We also create clusters to run tests and tear them down afterwards, but we haven't seen this issue occur.

Sure!
We are using EKS to run our integration tests:
https://github.com/inspektor-gadget/inspektor-gadget/blob/362c82d462264b3dd0ab753efc53987aebe9791a/.github/workflows/inspektor-gadget.yml#L1544
After some configuration, we create the cluster using eksctl:
https://github.com/inspektor-gadget/inspektor-gadget/blob/362c82d462264b3dd0ab753efc53987aebe9791a/.github/workflows/inspektor-gadget.yml#L1606
Then we run our tests and we finally remove everything using eksctl:
https://github.com/inspektor-gadget/inspektor-gadget/blob/362c82d462264b3dd0ab753efc53987aebe9791a/.github/workflows/inspektor-gadget.yml#L1621

In particular, we are not creating the VPCs ourselves; they are created by the eksctl call.
The same goes for deletion, and it seems something goes wrong there, as some VPCs cannot be deleted, which leaves them dangling and prevents us from running tests later.

If you set --wait, or equivalently --wait=true, then as @eatmyrust noted:

using the --wait flag so the command is correctly exiting with a failure status code

I would like to avoid using --wait=true; indeed, the cluster creation already takes around 15 minutes:
https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8479044411/job/23238203432#step:8:1
So, if we used --wait, we would add extra time waiting for everything to be deleted.
Also, I do not really understand: shouldn't eksctl just send the request to the server, with the whole delete operation handled asynchronously by the server?

If you need any other information, please let me know.

Best regards.

@TiberiuGC
Collaborator

Hi @eiffel-fl - thanks for explaining your workflow!

I would like to avoid using --wait=true; indeed, the cluster creation already takes around 15 minutes:
https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8479044411/job/23238203432#step:8:1
So, if we used --wait, we would add extra time waiting for everything to be deleted.
Also, I do not really understand: shouldn't eksctl just send the request to the server, with the whole delete operation handled asynchronously by the server?

Setting --wait=true would've just helped to reveal that an error occurred (not fix it) during cluster control plane CFN stack deletion, but since you're already aware of it, there's no point in setting it now. Indeed, the deletion is being handled asynchronously when --wait=false.

We're trying to determine the underlying problem, hence we are asking for details: what actually happens inside those integration tests? I understand there may be a lot going on, but we should try to make some guesses as to what can influence eksctl delete cluster behaviour.

How to reproduce it?

I suppose trying to create and remove a cluster multiple times will reproduce this behavior.

Simply running eksctl create cluster followed by eksctl delete cluster many times will most likely not reproduce the issue. My suspicion is that some integration test alters the cluster configuration in a way that makes the delete command subsequently fail. Hence, the first questions that come to mind (a couple of AWS CLI calls to check both are sketched below):

  • is the subnet that fails to be deleted one of the subnets created during eksctl create cluster, or is it created later as part of an integration test?
  • is the network interface attached to this subnet created during eksctl create cluster, or is it created later as part of an integration test?
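For example (the subnet ID below is the placeholder from your logs; the tag keys assume the usual CloudFormation and eksctl tagging):

# was the subnet created during eksctl create cluster? Its tags should tell:
# CloudFormation-managed subnets normally carry aws:cloudformation:* tags,
# and eksctl adds alpha.eksctl.io/* tags
aws ec2 describe-subnets --subnet-ids subnet-foo-bar --query 'Subnets[].Tags'
# what owns the leftover network interface? Description/InterfaceType usually tell
# (e.g. a VPC endpoint, or a load balancer created by a Kubernetes Service)
aws ec2 describe-network-interfaces --filters Name=subnet-id,Values=subnet-foo-bar --query 'NetworkInterfaces[].[NetworkInterfaceId,InterfaceType,Description]' --output table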

@eiffel-fl

Hi!

I appreciate your reply 😄!

Hi @eiffel-fl - thanks for explaining your workflow!

I would like to avoid using --wait=true; indeed, the cluster creation already takes around 15 minutes:
https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8479044411/job/23238203432#step:8:1
So, if we used --wait, we would add extra time waiting for everything to be deleted.
Also, I do not really understand: shouldn't eksctl just send the request to the server, with the whole delete operation handled asynchronously by the server?

Setting --wait=true would've just helped to reveal that an error occurred (not fix it) during cluster control plane CFN stack deletion, but since you're already aware of it, there's no point in setting it now. Indeed, the deletion is being handled asynchronously when --wait=false.

OK, this makes sense, thank you for shedding some light.

We're trying to determine the underlying problem, hence we are asking for details: what actually happens inside those integration tests? I understand there may be a lot going on, but we should try to make some guesses as to what can influence eksctl delete cluster behaviour.

Basically, we deploy Inspektor Gadget to the cluster and then run our integration tests.
These tests mainly consist of deploying a test pod that generates some events (e.g. syscalls or I/O) and monitoring everything with Inspektor Gadget.
The test succeeds if the expected event was monitored, and fails otherwise.
Note that Inspektor Gadget relies on eBPF to monitor all these events.

How to reproduce it?
I suppose trying to create and remove a cluster multiple times will reproduce this behavior.

Simply running eksctl create cluster followed by eksctl delete cluster many times will most likely not reproduce the issue. My suspicion is that some integration test alters the cluster configuration in a way that makes the delete command subsequently fail. Hence, first questions that come to mind:

* is the subnet that fails to be deleted one of the subnets created during `eksctl create cluster`, or is it created later as part of an integration test?

I did not dive into which subnet fails and which one succeeds to be deleted.
Can you please tell me how I can list these subnets? I may add a debug command to list them after we create the cluster and right before we delete it; this may help the understanding.
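Would something like this, run right after cluster creation and right before deletion, be enough (assuming filtering on the eksctl cluster-name tag is the right approach)?

aws ec2 describe-subnets --filters Name=tag:alpha.eksctl.io/cluster-name,Values=ig-ci-eks-amd64-8741 --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table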

* is the network interface attached to this subnet created during `eksctl create cluster`, or is it created later as part of an integration test?

We only call eksctl at two moments:

  1. Before running the test to create the cluster.
  2. After running the test to delete the cluster.

So, unless some kubectl command has the side effect of creating a subnet, we do not create any subnet on our own.
Please note that we also run these integration tests on other cloud platforms and we do not see issues with network resources not being deleted there (I am not comparing, just noting that our integration tests are "platform agnostic").

If you have ideas of what I can check, please share.
Also, if you need further information, I will provide it, abstracting as much as possible so you do not need to dive deep into our integration tests.

Best regards.

@mauriciovasquezbernal
Author

My suspicion is that some integration test alters the cluster configuration in a way that makes the delete command subsequently fail. Hence, first questions that come to mind:

* is the subnet that fails to be deleted one of the subnets created during `eksctl create cluster`, or is it created later as part of an integration test?

* is the network interface attached to this subnet created during `eksctl create cluster`, or is it created later as part of an integration test?

We don't create any subnet or anything related to the networking stack of the clusters during the integration tests. We only deploy Inspektor Gadget (there is nothing special about it that could affect the cluster networking) and some workloads to generate events (network traffic, DNS requests, opening files, executing processes, etc.). I'll try to create a reproducer without Inspektor Gadget.
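The reproducer would be something along these lines, creating and deleting clusters in a loop with a few plain workloads deployed in between (names and region are placeholders):

for i in $(seq 1 10); do
  eksctl create cluster --name leak-repro-$i --region us-east-1
  # deploy some plain test workloads here (no Inspektor Gadget)
  eksctl delete cluster --name leak-repro-$i --wait
done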

@burak-ok

In our logs we see the following (X's added by me):

Resource handler returned message: "The subnet 'subnet-0f2381XXXXXX' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400, Request ID: XXXXXX-XXX-XXX-XXXX-XXXXX)" (RequestToken: XXXXXXXXXXX HandlerErrorCode: InvalidRequest)

This is the error we see in the events of the CloudFormation stack. Since it can't delete this subnet (but deleted the other subnets successfully), it leaves the VPC alive and the CloudFormation stack stays in the DELETE_FAILED state.
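The stuck stacks and the exact failure reason can be listed with (the stack name is a placeholder):

aws cloudformation list-stacks --stack-status-filter DELETE_FAILED --query 'StackSummaries[].StackName'
aws cloudformation describe-stack-events --stack-name eksctl-my-cluster-cluster --query 'StackEvents[?ResourceStatus==`DELETE_FAILED`].[LogicalResourceId,ResourceStatusReason]'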

Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label May 27, 2024
@mauriciovasquezbernal
Author

I haven't had the chance to test it more. We implemented a workaround (inspektor-gadget/inspektor-gadget#2686) to clean up the leaked resources.
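Roughly, the workaround looks for leftover eksctl VPCs, deletes the dangling network interfaces in them and then retries the stack deletion; a simplified sketch (not the exact script from #2686; IDs are placeholders):

# find VPCs left behind by eksctl (assuming the alpha.eksctl.io/cluster-name tag)
aws ec2 describe-vpcs --filters Name=tag-key,Values=alpha.eksctl.io/cluster-name --query 'Vpcs[].VpcId' --output text
# for each of them, delete the leftover network interfaces...
aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=vpc-0123456789abcdef0 --query 'NetworkInterfaces[].NetworkInterfaceId' --output text
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0
# ...and retry deleting the corresponding CloudFormation stack
aws cloudformation delete-stack --stack-name eksctl-ig-ci-eks-amd64-8741-cluster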

@github-actions github-actions bot removed the stale label May 28, 2024