Failing to schedule pod with default configuration #529

JM322 opened this issue May 13, 2024 · 2 comments

JM322 commented May 13, 2024

Description

After applying the JupyterHub on EKS blueprint, I am able to access the home screen via port forwarding. I can then select one of the provided options to set up a server:

  • Data Engineering (CPU)
  • Trainium (trn1)
  • Inferentia (inf2)
  • Data Science ...
  • ...

All of these options fail immediately with the same or similar error messages:

Server requested
2024-05-13T13:55:13.094920Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
2024-05-13T13:55:14Z [Warning] Failed to schedule pod, incompatible with nodepool "trainium", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate aws.amazon.com/neuroncore=true:NoSchedule; did not tolerate aws.amazon.com/neuron=true:NoSchedule; incompatible with nodepool "inferentia", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate aws.amazon.com/neuroncore=true:NoSchedule; did not tolerate aws.amazon.com/neuron=true:NoSchedule; incompatible with nodepool "gpu-ts", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate nvidia.com/gpu=:NoSchedule; incompatible with nodepool "gpu", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, did not tolerate nvidia.com/gpu=:NoSchedule; incompatible with nodepool "default", daemonset overhead={"cpu":"710m","memory":"740Mi","pods":"8"}, no instance type satisfied resources {"cpu":"2710m","memory":"8932Mi","pods":"9"} and requirements NodeGroupType In [default], NodePool In [default], hub.jupyter.org/node-purpose In [user], karpenter.k8s.aws/instance-family In [c5 m5 r5], karpenter.k8s.aws/instance-size In [16xlarge 24xlarge 2xlarge 4xlarge 8xlarge and 1 others], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [default], kubernetes.io/arch In [amd64] (no instance type met the scheduling requirements or had a required offering)
2024-05-13T13:55:23Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
2024-05-13T14:00:35.378113Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
Spawn failed: pod jupyterhub/jupyter-user1 did not start in 1200 seconds!

The documentation does not mention customizing any node labels. The only changes made to the blueprint were replacing the VPC module with a VPC data source and updating the references to the VPC module in the other Terraform files.
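
For reference, the data source swap in vpc.tf looked roughly like this (the Name and Tier tag values are placeholders for identifying our existing VPC and private subnets; the exact filters differ in our environment):

```hcl
# vpc.tf: the VPC module was replaced with data sources for an existing VPC
# (tag values below are placeholders for our environment)
data "aws_vpc" "this" {
  tags = {
    Name = "existing-vpc" # placeholder
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.this.id]
  }
  tags = {
    Tier = "private" # placeholder tag on the private subnets
  }
}
```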

  • ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: v1.0.2

  • Terraform version: v1.8.3

  • Provider version(s):

+ provider registry.terraform.io/hashicorp/archive v2.4.0
+ provider registry.terraform.io/hashicorp/aws v5.49.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.3.4
+ provider registry.terraform.io/hashicorp/helm v2.13.2
+ provider registry.terraform.io/hashicorp/kubernetes v2.30.0
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/hashicorp/time v0.11.1
+ provider registry.terraform.io/hashicorp/tls v4.0.5

Reproduction Code [Required]

Steps to reproduce the behavior:

  1. Update vpc.tf to use a VPC data source
  2. Update references to the VPC module and specify the subnets to deploy the workloads to (see the sketch below)
  3. terraform apply
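
The reference updates from step 2 looked roughly like this (the attribute names follow the terraform-aws-modules conventions; the actual module blocks in the blueprint may differ slightly):

```hcl
# eks.tf and similar files: VPC module outputs swapped for the data
# sources defined in vpc.tf (data.aws_vpc.this / data.aws_subnets.private)
module "eks" {
  # ... other arguments unchanged ...
  vpc_id     = data.aws_vpc.this.id          # was: module.vpc.vpc_id
  subnet_ids = data.aws_subnets.private.ids  # was: module.vpc.private_subnets
}
```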

Expected behavior

  1. JupyterHub resources are created
  2. JupyterHub is reachable
  3. Starting a server (e.g. Data Engineering (CPU)) creates a pod on a matching node or provisions a new node

Actual behavior

  1. JupyterHub resources are created
  2. JupyterHub is reachable
  3. Starting a server (e.g. Data Engineering (CPU)) fails

Terminal Output Screenshot(s)

See the error output quoted under Description above.
@vara-bonthu
Contributor

@lusoal @ratnopamc @askulkarni2 fyi..


JM322 commented May 21, 2024

Any ideas whether this is actually a bug or an error on my side?
