Failing to schedule pod with default configuration #529

JM322 opened this issue May 13, 2024 · 2 comments

JM322 commented May 13, 2024

Description

After applying the JupyterHub on EKS blueprint, I am able to access the home screen via port forwarding. I can then select one of the provided options to set up a server:

  • Data Engineering (CPU)
  • Trainium (trn1)
  • Inferentia (inf2)
  • Data Science ...
  • ...

All of these options fail immediately with the same or similar error messages:

Server requested
2024-05-13T13:55:13.094920Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
2024-05-13T13:55:14Z [Warning] Failed to schedule pod, incompatible with nodepool "trainium", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate aws.amazon.com/neuroncore=true:NoSchedule; did not tolerate aws.amazon.com/neuron=true:NoSchedule; incompatible with nodepool "inferentia", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate aws.amazon.com/neuroncore=true:NoSchedule; did not tolerate aws.amazon.com/neuron=true:NoSchedule; incompatible with nodepool "gpu-ts", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate nvidia.com/gpu=:NoSchedule; incompatible with nodepool "gpu", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate nvidia.com/gpu=:NoSchedule; incompatible with nodepool "default", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, no instance type satisfied resources {"cpu":"2710m","memory":"8932Mi","pods":"9"} and requirements NodeGroupType In [default], NodePool In [default], hub.jupyter.org/node-purpose In [user], karpenter.k8s.aws/instance-family In [c5 m5 r5], karpenter.k8s.aws/instance-size In [16xlarge 24xlarge 2xlarge 4xlarge 8xlarge and 1 others], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [default], kubernetes.io/arch In [amd64] (no instance type met the scheduling requirements or had a required offering)
2024-05-13T13:55:23Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
2024-05-13T14:00:35.378113Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
Spawn failed: pod jupyterhub/jupyter-user1 did not start in 1200 seconds!

The documentation does not mention customizing any node labels. The only changes made to the blueprint were replacing the VPC module with a VPC data source and updating the references to the VPC module in the other Terraform files.
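
For reference, the data source swap in vpc.tf looked roughly like this (the Name and Tier tag values are placeholders for identifying our existing VPC and private subnets; the exact filters differ in our environment):

```hcl
# vpc.tf: the VPC module was replaced with data sources for an existing VPC
# (tag values below are placeholders for our environment)
data "aws_vpc" "this" {
  tags = {
    Name = "existing-vpc" # placeholder
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.this.id]
  }
  tags = {
    Tier = "private" # placeholder tag on the private subnets
  }
}
```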

  • ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: v1.0.2

  • Terraform version: v1.8.3

  • Provider version(s):

+ provider registry.terraform.io/hashicorp/archive v2.4.0
+ provider registry.terraform.io/hashicorp/aws v5.49.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.3.4
+ provider registry.terraform.io/hashicorp/helm v2.13.2
+ provider registry.terraform.io/hashicorp/kubernetes v2.30.0
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/hashicorp/time v0.11.1
+ provider registry.terraform.io/hashicorp/tls v4.0.5

Reproduction Code [Required]

Steps to reproduce the behavior:

  1. Update vpc.tf to use a VPC data source
  2. Update references to the VPC module and specify the subnets to deploy the workloads to (see the sketch below)
  3. terraform apply
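
The reference updates from step 2 looked roughly like this (the attribute names follow the terraform-aws-modules conventions; the actual module blocks in the blueprint may differ slightly):

```hcl
# eks.tf and similar files: VPC module outputs swapped for the data
# sources defined in vpc.tf (data.aws_vpc.this / data.aws_subnets.private)
module "eks" {
  # ... other arguments unchanged ...
  vpc_id     = data.aws_vpc.this.id          # was: module.vpc.vpc_id
  subnet_ids = data.aws_subnets.private.ids  # was: module.vpc.private_subnets
}
```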

Expected behavior

  1. JupyterHub resources are created
  2. JupyterHub is reachable
  3. Starting a server (e.g. Data Engineering (CPU)) creates a pod on a matching node or provisions a new node

Actual behavior

  1. JupyterHub resources are created
  2. JupyterHub is reachable
  3. Starting a server (e.g. Data Engineering (CPU)) fails

Terminal Output Screenshot(s)

See the error output quoted under Description above.
@vara-bonthu
Contributor

@lusoal @ratnopamc @askulkarni2 fyi..


JM322 commented May 21, 2024

Any ideas whether this is actually a bug or an error on my side?
