This repository has been archived by the owner on Dec 30, 2020. It is now read-only.

Pending status for cow-job #202

Open
pouya-codes opened this issue Jul 2, 2020 · 5 comments

Comments

@pouya-codes

pouya-codes commented Jul 2, 2020

I was trying to run the cow-job after setting up the environment with the following commands:
vagrant up && vagrant ssh k8s-master
kubectl apply -f examples/cow.yaml

but when I run kubectl get pods, my cow-job is stuck in "Pending":
NAME READY STATUS RESTARTS AGE
cow-job 0/1 Pending 0 13s
wlm-operator-ffddd8795-lz98t 1/1 Running 0 16m
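
For reference, examples/cow.yaml defines a SlurmJob roughly along these lines (a sketch, not the verbatim file; the apiVersion and batch script here are assumptions and may differ in the repo):

apiVersion: wlm.sylabs.io/v1alpha1   # version assumed
kind: SlurmJob
metadata:
  name: cow
spec:
  batch: |
    #!/bin/sh
    # assumed example batch script: run the lolcow image via Singularity
    srun singularity run library://sylabsed/examples/lolcow
  nodeSelector:
    # the generated pod must land on a virtual-kubelet node carrying this label
    wlm.sylabs.io/containers: singularity

The operator turns this into the cow-job pod shown above, so whatever nodeSelector ends up on that pod has to match a node's labels for scheduling to succeed.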

@adamwoolhether

Have you figured it out? I'm having the same problem of SlurmJobs not initiating.

It seems like they aren't being assigned to the virtual-kubelets, despite ensuring the virtual kubelets have both labels:
k describe pod cow-job

Name:           cow-job
Namespace:      default
Priority:       0
Node:           <none>
Labels:         <none>
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  SlurmJob/cow
Containers:
  jt1:
    Image:        no-image
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b86xw (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-b86xw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-b86xw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  type=virtual-kubelet
                 wlm.sylabs.io/containers=singularity
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 virtual-kubelet.io/provider=wlm:NoSchedule
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/7 nodes are available: 7 node(s) didn't match node selector.

k get nodes --show-labels

NAME                   STATUS   ROLES    AGE   VERSION          LABELS
qpod3-cn01             Ready    <none>   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn01,kubernetes.io/os=linux
qpod3-cn02             Ready    <none>   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn02,kubernetes.io/os=linux
qpod3-cn03             Ready    <none>   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn03,kubernetes.io/os=linux
qpod3-k8s-master       Ready    master   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
slurm-qpod3-cn01-cpn   Ready    agent    49m   v1.13.1-vk-N/A   alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn01-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
slurm-qpod3-cn02-cpn   Ready    agent    49m   v1.13.1-vk-N/A   alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn02-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
slurm-qpod3-cn03-cpn   Ready    agent    49m   v1.13.1-vk-N/A   alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn03-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
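
One quick sanity check, using the two selectors from the pod spec above, is to ask the API server which nodes actually satisfy them:

kubectl get nodes -l type=virtual-kubelet,wlm.sylabs.io/containers=singularity

If the three slurm-*-cpn nodes come back from that query, the labels themselves match and the "didn't match node selector" message is coming from something else.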

@pouya-codes
Author

Hi, yeah, you need to change the nodeSelector in your cow-job config file so it targets one of the existing virtual-kubelet nodes (e.g., slurm-qpod3-cn01-cpn).
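
Something like this in the SlurmJob spec (a sketch; the field layout is assumed to follow examples/cow.yaml, and the operator is assumed to propagate the selector onto the generated pod):

spec:
  nodeSelector:
    # pin the job to one specific virtual-kubelet node
    kubernetes.io/hostname: slurm-qpod3-cn01-cpn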

@adamwoolhether

adamwoolhether commented Dec 16, 2020

@pisarukv I really appreciate the response. I assume you're referring to the "virtual-kubelet" node?

The SlurmJob's pod still isn't being assigned to any node, even after adding kubernetes.io/hostname: slurm-qpod3-cn03-cp to the YAML. No matter how congruent the labels are, the pod still fails scheduling, citing no matching node selectors.

If it's not too much trouble, would you mind showing me the output for the following commands?
kubectl get nodes -o wide --show-labels
kubectl describe pods cow-job
kubectl describe slurmjobs.wlm.sylabs.io cow
kubectl logs wlm-operator.......
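
(For the operator logs, assuming it runs as a Deployment named wlm-operator, kubectl logs deploy/wlm-operator should work without having to copy the generated pod-name suffix.)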

Thanks again.

@pouya-codes
Author

Yes, I'm referring to the virtual-kubelet nodes.
I attached the output of the commands you mentioned.
logs.txt
describeSlurmCow.txt
describePods.txt
getNodes.txt

@adamwoolhether

adamwoolhether commented Dec 21, 2020

Many thanks! I think my issue may stem from the fact that I was running the k8s master and the Slurm master (with slurmctld) on the same node. I've set up a separate test env apart from our dev environment and got it working.

Thanks again!
