
Instance Type "" not found #416

Open
zackgalbreath opened this issue Mar 22, 2023 · 2 comments


@zackgalbreath
Collaborator

Description

We've noticed some GitLab CI worker pods failing to get scheduled. The typical output you see after the job times out is:

ERROR: Job failed (system failure): prepare environment: waiting for pod running:
timed out waiting for pod to start.

I caught one such pod before it timed out and ran kubectl describe on it. I saw:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Normal   Nominated         55s   karpenter          Pod should schedule on ip-10-0-168-134.ec2.internal

Hmm, that's suspicious. Karpenter sees a node where it can schedule this pod, but the pod never actually lands there. And sure enough, when I ran kubectl describe on that node, I saw some telling output:

Lease:              Failed to get lease: leases.coordination.k8s.io "ip-10-0-168-134.ec2.internal" not found

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                   Message
  ----             ------    -----------------                 ------------------                ------                   -------
  Ready            Unknown   Fri, 17 Mar 2023 05:26:56 -0400   Fri, 17 Mar 2023 05:28:00 -0400   NodeStatusNeverUpdated   Kubelet never posted node status.

Events:
  Type     Reason               Age                    From       Message
  ----     ------               ----                   ----       -------
  Warning  FailedInflightCheck  4m3s (x737 over 5d2h)  karpenter  Instance Type "" not found

Relevant upstream issues

kubernetes-sigs/karpenter#750

aws/karpenter-provider-aws#3156

aws/karpenter-provider-aws#3311

Mitigation

For now, I manually found the affected node's instance in the AWS web console and terminated it. If we can't properly resolve this issue, then we should strive to automatically detect and terminate such nodes.
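
For reference, that manual step could in principle be scripted. Below is a minimal boto3 sketch, not something we run today: the region is an assumption, and `terminate_node_instance` is just an illustrative helper name. On EKS the node name matches the instance's private DNS name, which is what the filter relies on.

```python
# Rough sketch only: terminate the EC2 instance backing a stuck node,
# looked up by its private DNS name (which matches the Kubernetes node
# name on EKS). Region is an assumption, not confirmed from this cluster.
import boto3


def terminate_node_instance(node_name: str, region: str = "us-east-1") -> None:
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[{"Name": "private-dns-name", "Values": [node_name]}]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)


terminate_node_instance("ip-10-0-168-134.ec2.internal")
```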

@jjnesbitt
Collaborator

I suppose it wouldn't be that hard to set up a cronjob/deployment that checked for stale NotReady nodes and spun them down, if it came to that.
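
Something along those lines might look like the following rough sketch. It assumes the Python kubernetes client, in-cluster credentials with RBAC to list and delete nodes, and that Karpenter's termination finalizer cleans up the backing instance once the Node object is deleted; the 30-minute threshold is an arbitrary placeholder.

```python
# Sketch of a periodic check for stale NotReady nodes. Assumptions:
# runs in-cluster with RBAC to list/delete nodes, and Karpenter's
# finalizer terminates the backing instance after the Node is deleted.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

STALE_AFTER = timedelta(minutes=30)  # arbitrary threshold


def reap_stale_nodes() -> None:
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    now = datetime.now(timezone.utc)
    for node in v1.list_node().items:
        ready = next(
            (c for c in node.status.conditions or [] if c.type == "Ready"), None
        )
        if ready is None or ready.status == "True":
            continue  # node is Ready (or has no conditions reported yet)
        # Ready is False or Unknown; only act once it has been stuck a while
        if now - ready.last_transition_time > STALE_AFTER:
            print(f"Deleting stale NotReady node {node.metadata.name}")
            v1.delete_node(node.metadata.name)


if __name__ == "__main__":
    reap_stale_nodes()
```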

@bollig
Member

bollig commented Mar 24, 2023
