This repository has been archived by the owner on Dec 30, 2020. It is now read-only.

sycri can't create container #365

Open
malixian opened this issue Sep 25, 2019 · 23 comments

@malixian

malixian commented Sep 25, 2019

My kubelet.service config is:

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=docker.service
Requires=docker.service

[Service]
WorkingDirectory=/var/lib/kubelet
ExecStart=/opt/kube/bin/kubelet \
  --address=10.2.152.182 \
  --allow-privileged=true \
  --anonymous-auth=false \
  --authentication-token-webhook \
  --authorization-mode=Webhook \
  --client-ca-file=/etc/kubernetes/ssl/ca.pem \
  --cluster-dns=10.70.0.2 \
  --cluster-domain=cluster.local. \
  --cni-bin-dir=/opt/kube/bin \
  --cni-conf-dir=/etc/cni/net.d \
  --fail-swap-on=false \
  --hairpin-mode hairpin-veth \
  --hostname-override=10.2.152.182 \
  --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
  --max-pods=110 \
  --network-plugin=cni \
  --pod-infra-container-image=mirrorgooglecontainers/pause-amd64:3.1 \
  --register-node=true \
  --root-dir=/var/lib/kubelet \
  --tls-cert-file=/etc/kubernetes/ssl/kubelet.pem \
  --tls-private-key-file=/etc/kubernetes/ssl/kubelet-key.pem \
  --v=2 \
  --container-runtime=remote \
  --container-runtime-endpoint=unix:///var/run/singularity.sock \
  --image-service-endpoint=unix:///var/run/singularity.sock

ExecStartPost=/sbin/iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 172.16.0.0/12 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -p tcp --dport 4194 -j DROP
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

I think this config is OK, but the pod events show this error:

Error: could not create container: could not spawn container: could not create oci bundle: could not create SIF bundle: failed to load SIF image /var/lib/singularity/cf5d9eea227371037e614fc7dec7c1f437a6398f9b08250b89ef5c92aab7e737: image format not recognized

@sashayakovtseva
Contributor

Hi @malixian,

Please provide the following information:

  • OS distribution and version
  • go version
  • Singularity-CRI version (sycri version)
  • singularity version
  • kubectl version
  • pod specification you are trying to run

BTW do you have docker running? I see you require docker.service in kubelet service, which is not needed if you use other runtime.
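
For example, the [Unit] section could depend on the CRI service instead of docker (a minimal sketch; the sycri.service unit name is an assumption based on your setup):

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=sycri.service
Requires=sycri.service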

@malixian
Author

malixian commented Sep 26, 2019

Thanks @sashayakovtseva,
Here is my environment:
1. arch is arm64
2. singularity version 3.4.0-1
3. kubelet version v1.13.5
Since my arch is arm64, I built a custom image on this node and uploaded it to cloud.sylabs.io.
The pod yaml is simple:

apiVersion: v1
kind: Pod
metadata:
  name: test-arm64
  namespace: default
spec:
  containers:
    - name: test-arm64
      image: cloud.sylabs.io/malixian/default/test-arm64:latest
  nodeSelector:
    beta.kubernetes.io/arch: arm64

journalctl -u kubelet -e shows these errors:

provider.go:116] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
SyncLoop (PLEG): "test-arm64_default(d938b836-e003-11e9-9f5c-ac1f6bac1d10)", event: &pleg.PodLifecycleEvent{ID:"d938b836-e003-11e9-9f5c-ac1f6bac1d10", Type:"ContainerDied", Data:"29950f6fcfc74d1e6268caba276c898280949621a5e31fc9e9ff91e50b4a360e"}

journalctl -u sycri -e shows these errors:

Sep 26 14:20:40 localhost sycri[28374]:E0926 14:20:40.218152   28374 main.go:276] /runtime.v1alpha2.ImageService/PullImage
Sep 26 14:20:40 localhost sycri[28374]:         Request: {"image":{"image":"cloud.sylabs.io/malixian/default/test-arm64:latest"}}
Sep 26 14:20:40 localhost sycri[28374]:         Response: null
Sep 26 14:20:40 localhost sycri[28374]:         Error: rpc error: code = Internal desc = could not get cloud.sylabs.io/malixian/default/test-arm64:latest image metadata: could not get library image info: error making request to server:
Sep 26 14:20:40 localhost sycri[28374]:         Get https://library.sylabs.io/v1/images/malixian/default/test-arm64:latest: net/http: TLS handshake timeout
Sep 26 14:34:31 localhost sycri[28374]: Calico CNI releasing IP address
Sep 26 14:34:31 localhost sycri[28374]: Calico CNI deleting device in netns /var/run/singularity/pods/19d9118ee6ac1e19ce475a4650ae0021f752c635009b0888b3f5d03b0a803f2a/namesp
Sep 26 14:34:31 localhost sycri[28374]: Calico CNI deleted device in netns /var/run/singularity/pods/19d9118ee6ac1e19ce475a4650ae0021f752c635009b0888b3f5d03b0a803f2a/namespa
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI IPAM request count IPv4=1 IPv6=0
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI IPAM handle=k8s-pod-network.1a08e33abd1973f28f4b6d0ec580ca335376eaba7e052f29dc94c93a10bb6696
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI IPAM assigned addresses IPv4=[172.22.20.159] IPv6=[]
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI using IPs: [172.22.20.159/32]
Sep 26 14:35:00 localhost sycri[28374]: E0926 14:35:00.032013   28374 container.go:177] Could not fetch container 393b15294a318ec6dd6fcb96a02a663345ec346f937e904982fa6acfd02
Sep 26 14:35:30 localhost sycri[28374]: E0926 14:35:30.725947   28374 pod.go:164] Could not update pod state: could not get pod state: could not query state: FATAL:   no con
Sep 26 14:45:46 localhost sycri[28374]: E0926 14:45:46.791990   28374 container.go:177] Could not fetch container 1e60e5d99769517eaa4add0bc7c70daecd053171434ff4ecaba5dd517f9
Sep 26 14:56:14 localhost sycri[28374]: E0926 14:56:14.904313   28374 container.go:177] Could not fetch container 498ddec28abee523f56c20ab1f173e690e382b4c5d3ec10c1f89993b701

It looks like the image doesn't exist, but I have pushed the image with

> singularity push image.sif library://malixian/default/test-arm64


@sashayakovtseva
Contributor

@malixian What version of singularity-cri are you using?

Can you confirm the same network error appears if you do singularity pull library://malixian/default/test-arm64:latest on that host?

@malixian
Author

malixian commented Sep 26, 2019

> @malixian What version of singularity-cri are you using?
> Can you confirm the same network error appears if you do singularity pull library://malixian/default/test-arm64:latest on that host?

It's OK on that host.
And I tried to deploy sycri on x86, but errors appear there as well. The yaml file is the official example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-service-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: image-service
  template:
    metadata:
      labels:
        app: image-service
      name: image-service
      namespace: default
    spec:
      containers:
      - name: image-server
        image: cloud.sylabs.io/sashayakovtseva/test/image-server
        ports:
        - containerPort: 8080
      securityContext:
        runAsUser: 1000
      nodeSelector:
        kubernetes.io/hostname: 10.18.127.1
---
apiVersion: v1
kind: Service
metadata:
  name: image-service
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: image-service

journalctl -u kubelet -e

pod_workers.go:190] Error syncing pod e0f2d839-e038-11e9-9f5c-ac1f6bac1d10 ("image-service-deployment-655d89d94d-rfl5f_default(e0f2d839-e038-11e9-9f5c-ac1f6bac1d10)"), skipping: failed to "StartContainer" for "image-server" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=image-server pod=image-service-deployment-655d89d94d-rfl5f_default(e0f2d11e9-9f5c-ac1f6bac1d10)"
cri_stats_provider.go:320] Failed to get the info of the filesystem with mountpoint "/run": failed to ge
Sep 26 17:06:18 comput1 kubelet[2792]: E0926 17:06:18.623068    2792 cri_stats_provider.go:320] Failed to get the info of the filesystem with mountpoint "/run": failed to get device for dir "/run": could not find device with major: 0, minor: 20 in cached partitions map.

journalctl -u sycri -e

Sep 26 16:43:55 comput1 sycri[1787]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 26 16:44:41 comput1 sycri[1787]: E0926 16:44:41.549846    1787 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 26 16:44:41 comput1 sycri[1787]: Request: {"container_id":"adcec04f0d24bf166b94b202dc7b1ebdd2dae6a0437c6982cc5f8111f8c6ead5"}
Sep 26 16:44:41 comput1 sycri[1787]: Response: null
Sep 26 16:44:41 comput1 sycri[1787]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 26 16:46:05 comput1 sycri[1787]: E0926 16:46:05.197723    1787 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 26 16:46:05 comput1 sycri[1787]: Request: {"container_id":"d33181a4531a36e3f365c1d9b3b6107137a28c64ad752d762ce603fe1b95a7cd"}
Sep 26 16:46:05 comput1 sycri[1787]: Response: null
Sep 26 16:46:05 comput1 sycri[1787]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 26 16:46:06 comput1 sycri[1787]: E0926 16:46:06.262393    1787 container.go:177] Could not fetch container adcec04f0d24bf166b94b202dc7b1ebdd2dae6a0437c6982cc5f8111f8c6ea
Sep 26 16:48:50 comput1 sycri[1787]: E0926 16:48:50.075495    1787 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 26 16:48:50 comput1 sycri[1787]: Request: {"container_id":"82800d31b039ae364530530a4138e1ac245e222c6f80b83c74c491ea0f84a94e"}
Sep 26 16:48:50 comput1 sycri[1787]: Response: null

@sashayakovtseva
Contributor

sashayakovtseva commented Sep 26, 2019

@malixian Is this not an issue anymore? Looks like you have accidentally closed it.

And I need a full output of kubectl describe no <your node> and sycri version.
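
For example (the node name here is an assumption based on your pod yaml):

$ kubectl describe no 10.18.127.1
$ sycri version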

@malixian
Author

@sashayakovtseva Yes, you're right. sycri version is 1.0.0-beta.5. Node information is:

Name:               10.18.127.3
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=10.18.127.3
                    kubernetes.io/role=node
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 12 Jun 2019 16:20:55 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 26 Sep 2019 17:56:07 +0800   Wed, 12 Jun 2019 16:20:55 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 26 Sep 2019 17:56:07 +0800   Wed, 12 Jun 2019 16:20:55 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 26 Sep 2019 17:56:07 +0800   Wed, 12 Jun 2019 16:20:55 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 26 Sep 2019 17:56:07 +0800   Tue, 24 Sep 2019 17:55:59 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.18.127.3
  Hostname:    10.18.127.3
Capacity:
 cpu:                32
 ephemeral-storage:  511750Mi
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65393136Ki
 pods:               110
Allocatable:
 cpu:                32
 ephemeral-storage:  482947890401
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65290736Ki
 pods:               110
System Info:
 Machine ID:                 2e4e94b510a14392ab58491d3e377c96
 System UUID:                00000000-0000-0000-0000-AC1F6BAC404E
 Boot ID:                    d5e7100c-4a2f-4a96-a776-179cf47676ef
 Kernel Version:             3.10.0-957.21.2.el7.x86_64
 OS Image:                   CentOS Linux 7 (Core)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.9.2
 Kubelet Version:            v1.13.5
 Kube-Proxy Version:         v1.13.5
PodCIDR:                     172.22.1.0/24
Non-terminated Pods:         (40 in total)
  Namespace                  Name                                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                         ------------  ----------  ---------------  -------------  ---
  kube-system                calico-kube-controllers-84db645bdf-ggptb                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kube-system                calico-node-r85wt                                            250m (0%)     0 (0%)      0 (0%)           0 (0%)         106d
  kube-system                coredns-7c5785cbcc-2f4r6                                     100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     2d
  kube-system                coredns-7c5785cbcc-pqvfs                                     100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     2d
  kube-system                heapster-5b9b6b6597-d5n67                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kube-system                kubernetes-dashboard-76479d66bb-j9bpg                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kube-system                metrics-server-79558444c6-9gx75                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   ambassador-6f7cc986df-5qg87                                  200m (0%)     1 (3%)      100Mi (0%)       400Mi (0%)     2d
  kubeflow                   ambassador-6f7cc986df-lp6bq                                  200m (0%)     1 (3%)      100Mi (0%)       400Mi (0%)     2d
  kubeflow                   ambassador-6f7cc986df-m66w7                                  200m (0%)     1 (3%)      100Mi (0%)       400Mi (0%)     2d
  kubeflow                   argo-ui-db7cf456c-dlhlf                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   centraldashboard-79f6448bb7-x6q55                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   config-controller-6d84df4f66-6g85b                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   jupyter-0                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   jupyter-web-app-78844bd57-lhsc2                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   katib-ui-bf44885cd-6bk4l                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   metacontroller-0                                             500m (1%)     4 (12%)     1Gi (1%)         4Gi (6%)       2d
  kubeflow                   minio-6d879f8d6c-k9sg9                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   ml-pipeline-5dfc9cc665-4cvrz                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   ml-pipeline-persistenceagent-5c5d669f5d-xg7lt                0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   ml-pipeline-scheduledworkflow-84ddd9886d-hg7kr               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   ml-pipeline-ui-58c78c9ffb-dqlhd                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   ml-pipeline-viewer-controller-deployment-547bb45844-96xp9    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   mysql-58cfd7c97b-9vcjs                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   notebooks-controller-86c8944799-5x952                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   profiles-7896d9bd97-phd44                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   pytorch-operator-54484d9b6c-jpgp5                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   spartakus-volunteer-6798cc9878-kjpd8                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   studyjob-controller-58bccc4747-n4vcn                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   tf-job-dashboard-56564f6f99-9c6gb                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   tf-job-operator-6bfd5c7db8-qlctd                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   vizier-core-cfd9566b-mzsv8                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   vizier-core-rest-6c69cd9656-nkdwl                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   vizier-db-6885dbd6cb-twm5p                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   vizier-suggestion-bayesianoptimization-7ddbbd49b6-4tspd      0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   vizier-suggestion-grid-ccc744bfb-bwvhp                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   vizier-suggestion-hyperband-5bfbd98c78-snc7m                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   vizier-suggestion-random-f69bf84f4-4dx7p                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   workflow-controller-6866879d86-jhlsl                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kubeflow                   wzy-0                                                        4 (12%)       0 (0%)      8Gi (12%)        0 (0%)         2d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                5550m (17%)   7 (21%)
  memory             9656Mi (15%)  5636Mi (8%)
  ephemeral-storage  0 (0%)        0 (0%)
Events:              <none>

@sashayakovtseva
Contributor

Can you please update sycri to the latest version?
Also, the node information says you are running docker, so I think you pasted the wrong node info here. According to your pod yaml you schedule it to 10.18.127.1, but this info is for 10.18.127.3.

@malixian
Author

malixian commented Sep 26, 2019

Sorry, it's my fault. So your suggestion is to update sycri to 1.0.0-beta.6? The current version is 1.0.0-beta.5.

Name:               10.18.127.1
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=10.18.127.1
                    kubernetes.io/role=node
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 12 Jun 2019 16:20:03 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 26 Sep 2019 18:30:27 +0800   Fri, 21 Jun 2019 14:36:04 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 26 Sep 2019 18:30:27 +0800   Fri, 21 Jun 2019 14:36:04 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 26 Sep 2019 18:30:27 +0800   Fri, 21 Jun 2019 14:36:04 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 26 Sep 2019 18:30:27 +0800   Thu, 26 Sep 2019 16:29:11 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.18.127.1
  Hostname:    10.18.127.1
Capacity:
 cpu:                32
 ephemeral-storage:  511750Mi
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65393136Ki
 pods:               110
Allocatable:
 cpu:                32
 ephemeral-storage:  482947890401
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65290736Ki
 pods:               110
System Info:
 Machine ID:                 8b1ca9216f384e5c90f309b4af7066b1
 System UUID:                00000000-0000-0000-0000-AC1F6BAC1D10
 Boot ID:                    b3e7c019-5477-4a70-8e16-0b0ceac4da50
 Kernel Version:             3.10.0-957.21.2.el7.x86_64
 OS Image:                   CentOS Linux 7 (Core)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  singularity://3.4.0-1
 Kubelet Version:            v1.13.5
 Kube-Proxy Version:         v1.13.5
PodCIDR:                     172.22.0.0/24
Non-terminated Pods:         (2 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                         ------------  ----------  ---------------  -------------  ---
  default                    image-service-deployment-655d89d94d-rfl5f    0 (0%)        0 (0%)      0 (0%)           0 (0%)         113m
  kube-system                calico-node-ps7bb                            250m (0%)     0 (0%)      0 (0%)           0 (0%)         106d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                250m (0%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
Events:              <none>

@sashayakovtseva
Contributor

> So your suggestion is to update sycri to 1.0.0-beta.6?

Yes, there've been some fixes

@sashayakovtseva
Contributor

sashayakovtseva commented Sep 26, 2019

Also, enabling debug logs will help us understand what is wrong.
In the sycri service, add the -v 6 option to the sycri command and then restart the services (assuming systemd is used):

$ sudo systemctl daemon-reload
$ sudo systemctl stop kubelet sycri 
$ sudo systemctl restart sycri kubelet
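
For example, the ExecStart line of the sycri unit might then look like this (the binary path is an assumption; keep whatever your unit already uses):

ExecStart=/usr/local/bin/sycri -v 6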

@malixian
Author

@sashayakovtseva Unfortunately, it doesn't work, but now I can see detailed sycri execution information:

Sep 27 11:16:46 comput1 sycri[189766]: DEBUG   [U=1000,P=1]       startup()                     oci runtime engine selected
Sep 27 11:16:46 comput1 sycri[189766]: VERBOSE [U=1000,P=1]       startup()                     Execute stage 2
Sep 27 11:16:46 comput1 sycri[189766]: DEBUG   [U=1000,P=1]       StageTwo()                    Entering stage 2
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.874076  189766 sync.go:76] Received state 2 at /var/run/singularity/containers/2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a/sync.sock
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.877823  189766 client_oci.go:125] Stream copying returned: context canceled
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.989017  189766 container.go:288] Starting container 2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.989140  189766 client.go:87] Executing [singularity -d oci start 2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a]
Sep 27 11:16:47 comput1 sycri[189766]: DEBUG   [U=0,P=175124]     createConfDir()               /root/.singularity already exists. Not creating.
Sep 27 11:16:47 comput1 sycri[189766]: I0927 11:16:47.071929  189766 sync.go:76] Received state 4 at /var/run/singularity/containers/2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a/sync.sock
Sep 27 11:16:47 comput1 sycri[189766]: E0927 11:16:47.079662  189766 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 27 11:16:47 comput1 sycri[189766]: Request: {"container_id":"2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a"}
Sep 27 11:16:47 comput1 sycri[189766]: Response: null
Sep 27 11:16:47 comput1 sycri[189766]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 27 11:16:48 comput1 sycri[189766]: I0927 11:16:48.612509  189766 container_files.go:104] Removing bundle at /var/run/singularity/containers/1d632b4b2ce0d403c730b400469f5d663e8cc900992db8fd0ce0a1bf499e68a8/bundle
Sep 27 11:16:48 comput1 sycri[189766]: I0927 11:16:48.657972  189766 container_files.go:118] Removing container base directory /var/run/singularity/containers/1d632b4b2ce0d403c730b400469f5d663e8cc900992db8fd0ce0a1bf499e68a8
Sep 27 11:16:53 comput1 sycri[189766]: I0927 11:16:53.742925  189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:03 comput1 sycri[189766]: I0927 11:17:03.742912  189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:13 comput1 sycri[189766]: I0927 11:17:13.742929  189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:23 comput1 sycri[189766]: I0927 11:17:23.742745  189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:33 comput1 sycri[189766]: I0927 11:17:33.743061  189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]

You can see the execution process doesn't reach stage 3 and ends with unexpected container state: 4.
Can you explain what's wrong with it?

@sashayakovtseva
Contributor

Is that for sashayakovtseva/test/image-server? That image was built for amd64, so if you are scheduling it to arm it fails to start.

@malixian
Author

malixian commented Sep 27, 2019

> Is that for sashayakovtseva/test/image-server? That image was built for amd64, so if you are scheduling it to arm it fails to start.

I changed the arch to amd64, and that error doesn't look like an image compatibility problem. The core error is what I have shown: "StartContainer" for "image-server" with RunContainerError: "could not start container: unexpected container state: 4"

@sashayakovtseva
Contributor

I see this error, but that only means container fails to start. The reason is somewhere in the logs hopefully. However, I think you may not see some of them because of #360.

I would appreciate it if you could try to run the image with Singularity directly on that host and paste the full output here:

$ singularity pull image.sif library://sashayakovtseva/test/image-server:latest
$ sudo singularity oci mount image.sif image
$ sudo singularity -d oci run -b image server
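
When you are done, the bundle can be unmounted again, e.g.:

$ sudo singularity oci umount image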

@malixian
Author

The output seems normal.

DEBUG   [U=0,P=243837]     createConfDir()               /root/.singularity already exists. Not creating.
VERBOSE [U=0,P=243848]     print()                       Set messagelevel to: 5
VERBOSE [U=0,P=243848]     init()                        Starter initialization
DEBUG   [U=0,P=243848]     get_pipe_exec_fd()            PIPE_EXEC_FD value: 8
VERBOSE [U=0,P=243848]     is_suid()                     Check if we are running as setuid
DEBUG   [U=0,P=243848]     init()                        Read engine configuration
DEBUG   [U=0,P=243848]     init()                        Wait completion of stage1
DEBUG   [U=0,P=243849]     set_parent_death_signal()     Set parent death signal to 9
VERBOSE [U=0,P=243849]     init()                        Spawn stage 1
DEBUG   [U=0,P=243849]     startup()                     oci runtime engine selected
VERBOSE [U=0,P=243849]     startup()                     Execute stage 1
DEBUG   [U=0,P=243849]     StageOne()                    Entering stage 1
VERBOSE [U=0,P=243848]     wait_child()                  stage 1 exited with status 0
DEBUG   [U=0,P=243848]     cleanup_fd()                  Close file descriptor 4
DEBUG   [U=0,P=243848]     init()                        Set child signal mask
VERBOSE [U=0,P=243848]     init()                        Run as instance
DEBUG   [U=0,P=243856]     init()                        Create socketpair for master communication channel
DEBUG   [U=0,P=243856]     init()                        Create RPC socketpair for communication between stage 2 and RPC server
VERBOSE [U=0,P=243856]     priv_escalate()               Get root privileges
VERBOSE [U=0,P=243856]     priv_escalate()               Change filesystem uid to 0
VERBOSE [U=0,P=243856]     pid_namespace_init()          Create pid namespace
VERBOSE [U=0,P=243856]     init()                        Spawn master process
DEBUG   [U=0,P=1]          set_parent_death_signal()     Set parent death signal to 9
VERBOSE [U=0,P=1]          create_namespace()            Create network namespace
VERBOSE [U=0,P=1]          create_namespace()            Create uts namespace
VERBOSE [U=0,P=1]          create_namespace()            Create ipc namespace
VERBOSE [U=0,P=1]          create_namespace()            Create mount namespace
DEBUG   [U=0,P=2]          set_parent_death_signal()     Set parent death signal to 9
VERBOSE [U=0,P=2]          init()                        Spawn RPC server
DEBUG   [U=0,P=243856]     startup()                     oci runtime engine selected
VERBOSE [U=0,P=243856]     startup()                     Execute master process
DEBUG   [U=0,P=243856]     func1()                       Using singularity directory "/root/.singularity"
DEBUG   [U=0,P=2]          startup()                     oci runtime engine selected
VERBOSE [U=0,P=2]          startup()                     Serve RPC requests
DEBUG   [U=0,P=243856]     addRootfsMount()              Parent rootfs: /run/singularity/containers/image/rootfs
DEBUG   [U=0,P=243856]     CreateContainer()             Mount all
DEBUG   [U=0,P=243856]     mount()                       Checking if /proc/243857/root/run/singularity/containers/image/rootfs exists
DEBUG   [U=0,P=243856]     mount()                       Mount /run/singularity/containers/image/rootfs to /run/singularity/containers/image/rootfs :  []
DEBUG   [U=0,P=243856]     mount()                       Checking if /proc/243857/root/run/singularity/containers/image/rootfs/proc exists
DEBUG   [U=0,P=243856]     mount()                       Mount proc to /run/singularity/containers/image/rootfs/proc : proc []
DEBUG   [U=0,P=243856]     mount()                       Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev exists
DEBUG   [U=0,P=243856]     mount()                       Mount tmpfs to /run/singularity/containers/image/rootfs/dev : tmpfs [mode=755,size=65536k]
DEBUG   [U=0,P=243856]     mount()                       Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev/pts exists
DEBUG   [U=0,P=243856]     mount()                       Creating /proc/243857/root/run/singularity/containers/image/rootfs/dev/pts
DEBUG   [U=0,P=243856]     mount()                       Mount devpts to /run/singularity/containers/image/rootfs/dev/pts : devpts [newinstance,ptmxmode=0666,mode=0620,gid=5]
DEBUG   [U=0,P=243856]     mount()                       Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev/shm exists
DEBUG   [U=0,P=243856]     mount()                       Creating /proc/243857/root/run/singularity/containers/image/rootfs/dev/shm
DEBUG   [U=0,P=243856]     mount()                       Mount shm to /run/singularity/containers/image/rootfs/dev/shm : tmpfs [mode=1777,size=65536k]
DEBUG   [U=0,P=243856]     mount()                       Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev/mqueue exists
DEBUG   [U=0,P=243856]     mount()                       Creating /proc/243857/root/run/singularity/containers/image/rootfs/dev/mqueue
DEBUG   [U=0,P=243856]     mount()                       Mount mqueue to /run/singularity/containers/image/rootfs/dev/mqueue : mqueue []
DEBUG   [U=0,P=243856]     mount()                       Checking if /proc/243857/root/run/singularity/containers/image/rootfs/sys exists
DEBUG   [U=0,P=243856]     mount()                       Mount sysfs to /run/singularity/containers/image/rootfs/sys : sysfs []
DEBUG   [U=0,P=2]          Chroot()                      Change current directory to /run/singularity/containers/image/rootfs
DEBUG   [U=0,P=2]          Chroot()                      Hold reference to host / directory
DEBUG   [U=0,P=2]          Chroot()                      Called pivot_root on /run/singularity/containers/image/rootfs
DEBUG   [U=0,P=2]          Chroot()                      Change current directory to host / directory
DEBUG   [U=0,P=2]          Chroot()                      Apply slave mount propagation for host / directory
DEBUG   [U=0,P=2]          Chroot()                      Called unmount(/, syscall.MNT_DETACH)
DEBUG   [U=0,P=2]          Chroot()                      Changing directory to / to avoid getpwd issues
VERBOSE [U=0,P=1]          wait_child()                  rpc server exited with status 0
DEBUG   [U=0,P=1]          apply_container_privileges()  Set main group ID to 0
DEBUG   [U=0,P=1]          apply_container_privileges()  Set 1 additional group IDs
DEBUG   [U=0,P=1]          apply_container_privileges()  Set user ID to 0
DEBUG   [U=0,P=1]          set_parent_death_signal()     Set parent death signal to 9
DEBUG   [U=0,P=1]          startup()                     oci runtime engine selected
VERBOSE [U=0,P=1]          startup()                     Execute stage 2
DEBUG   [U=0,P=1]          StageTwo()                    Entering stage 2
2019/09/27 10:05:29 Listening on 8080

My kubelet.service config is:

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
WorkingDirectory=/var/lib/kubelet
ExecStart=/opt/kube/bin/kubelet \
  --address=10.18.127.1 \
  --allow-privileged=true \
  --anonymous-auth=false \
  --authentication-token-webhook \
  --authorization-mode=Webhook \
  --client-ca-file=/etc/kubernetes/ssl/ca.pem \
  --cluster-dns=10.70.0.2 \
  --cluster-domain=cluster.local. \
  --cni-bin-dir=/opt/kube/bin \
  --cni-conf-dir=/etc/cni/net.d \
  --fail-swap-on=false \
  --hairpin-mode hairpin-veth \
  --hostname-override=10.18.127.1 \
  --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
  --max-pods=110 \
  --network-plugin=cni \
  --pod-infra-container-image=mirrorgooglecontainers/pause-amd64:3.1 \
  --register-node=true \
  --root-dir=/var/lib/kubelet \
  --tls-cert-file=/etc/kubernetes/ssl/kubelet.pem \
  --tls-private-key-file=/etc/kubernetes/ssl/kubelet-key.pem \
  --v=2 \
  --container-runtime=remote \
  --container-runtime-endpoint=unix:///var/run/singularity.sock \
  --image-service-endpoint=unix:///var/run/singularity.sock
ExecStartPost=/sbin/iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 172.16.0.0/12 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -p tcp --dport 4194 -j DROP
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

@sashayakovtseva
Contributor

sashayakovtseva commented Sep 27, 2019

Weird...
While I am working on a fix for #360, could you try to launch the image-service with an allocated tty, i.e.:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-service-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: image-service
  template:
    metadata:
      labels:
        app: image-service
      name: image-service
      namespace: default
    spec:
      containers:
        - name: image-server
          image: cloud.sylabs.io/sashayakovtseva/test/image-server
          ports:
            - containerPort: 8080
          tty: true
      securityContext:
        runAsUser: 1000
      nodeSelector:
        kubernetes.io/hostname: 10.18.127.1

This should prevent the logs from being truncated. Then please post the output here.

@malixian
Author

sycri only shows the same output:

Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.825818  189766 sync.go:76] Received state 2 at /var/run/singularity/containers/c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669/sync.sock
Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.829802  189766 client_oci.go:125] Stream copying returned: context canceled
Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.928710  189766 container.go:288] Starting container c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669
Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.928841  189766 client.go:87] Executing [singularity -d oci start c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669]
Sep 29 09:05:29 comput1 sycri[189766]: DEBUG   [U=0,P=332186]     createConfDir()               /root/.singularity already exists. Not creating.
Sep 29 09:05:30 comput1 sycri[189766]: I0929 09:05:30.017106  189766 sync.go:76] Received state 4 at /var/run/singularity/containers/c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669/sync.sock
Sep 29 09:05:30 comput1 sycri[189766]: E0929 09:05:30.025262  189766 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 29 09:05:30 comput1 sycri[189766]: Request: {"container_id":"c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669"}
Sep 29 09:05:30 comput1 sycri[189766]: Response: null
Sep 29 09:05:30 comput1 sycri[189766]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 29 09:05:31 comput1 sycri[189766]: I0929 09:05:31.394787  189766 container_files.go:104] Removing bundle at /var/run/singularity/containers/645be6b9aadf616290a5c038d3f435c4d7b2fe078ac1a47941ce1158a06b3370/bundle
Sep 29 09:05:31 comput1 sycri[189766]: I0929 09:05:31.442043  189766 container_files.go:118] Removing container base directory /var/run/singularity/containers/645be6b9aadf616290a5c038d3f435c4d7b2fe078ac1a47941ce1158a06b3370
Sep 29 09:05:33 comput1 sycri[189766]: I0929 09:05:33.743126  189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 29 09:05:43 comput1 sycri[189766]: I0929 09:05:43.742937  189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]

@sashayakovtseva
Contributor

What do container logs show?

sashayakovtseva self-assigned this Oct 1, 2019
@malixian
Author

malixian commented Oct 21, 2019

The problem has been found: maybe the shell in the container finishes too quickly, so the logs show unexpected container state: 4. If I add sleep 30, for example, the pod status is Running as expected.
Is that a bug? Because the container has an exposed port, the pod status should be Running.

@sashayakovtseva
Contributor

Execution time will not result in an unexpected container state.
Try to fetch the pod logs (they remain even if the container is recreated) and also please provide the shell script you are trying to run.

Btw there is a beta7 version, so feel free to update singularity-cri :)
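
For example, to fetch logs of the previous (crashed) container instance:

$ kubectl logs <your pod> --previous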

@malixian
Author

malixian commented Oct 22, 2019

Hi @sashayakovtseva, I tried to execute a yaml file like this:

apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker-03
spec:
  containers:
    - command:
        - /bin/sh
        - -c
        - |
          mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
      image: volcanosh/example-mpi:0.0.1
      name: mpi-worker-03
      ports:
        - containerPort: 22
          name: mpijob-port
      workingDir: /home
  nodeSelector:
     kubernetes.io/hostname: k8s03
  restartPolicy: OnFailure

And I find the pod status is CrashLoopBackOff, but when I append sleep 600 to the shell in the container, like mkdir -p /var/run/sshd; /usr/sbin/sshd -D; sleep 600, the pod status is Running as expected, and after 600s the pod status is Completed. Whether I append sleep 600 or not, when I run the same yaml file with the docker runtime the pod status is always Running, because we set containerPort in the yaml. If you have time you can try it, as in the sketch below. I will be very grateful if you can answer my doubts.
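
For reference, the sleep 600 variant that stays Running changes only the command block:

containers:
  - command:
      - /bin/sh
      - -c
      - |
        mkdir -p /var/run/sshd; /usr/sbin/sshd -D; sleep 600;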

sashayakovtseva removed their assignment Dec 10, 2019
@ruijf

ruijf commented Mar 27, 2020

@malixian @sashayakovtseva
I set up one node for testing and have the same issue. Could you share an update?

This is my environment:

# kubectl describe no amax
Name:               amax
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=amax
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/singularity.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.118.45/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.195.192
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 26 Mar 2020 18:26:52 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  amax
  AcquireTime:     <unset>
  RenewTime:       Fri, 27 Mar 2020 11:20:07 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 27 Mar 2020 09:41:40 +0800   Fri, 27 Mar 2020 09:41:40 +0800   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Fri, 27 Mar 2020 11:15:31 +0800   Thu, 26 Mar 2020 18:26:49 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 27 Mar 2020 11:15:31 +0800   Thu, 26 Mar 2020 18:26:49 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 27 Mar 2020 11:15:31 +0800   Thu, 26 Mar 2020 18:26:49 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Fri, 27 Mar 2020 11:15:31 +0800   Thu, 26 Mar 2020 18:31:16 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.118.45
  Hostname:    amax
Capacity:
  cpu:                8
  ephemeral-storage:  95800732Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7137484Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  88289954466
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7035084Ki
  pods:               110
System Info:
  Machine ID:                 0d2830c8aec14057baf9ef5796780648
  System UUID:                7F437ED2-FA52-4367-AD96-06C00FF55E38
  Boot ID:                    9300e908-a60c-4943-9fdf-d4d84c4f4506
  Kernel Version:             4.15.0-30deepin-generic
  OS Image:                   Deepin 15
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  singularity://3.5.3
  Kubelet Version:            v1.17.4
  Kube-Proxy Version:         v1.17.4
PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24
Non-terminated Pods:          (18 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
  default                     centos7-758596459c-ncmjh                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m54s
  default                     hello                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  default                     hello-kubernetes-8764bc78f-ff84z              0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  default                     hello-kubernetes-8764bc78f-ql4d7              0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  default                     hello-kubernetes-8764bc78f-z8cpc              0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  default                     image-service-deployment-766979ff9d-2rwbw     0 (0%)        0 (0%)      0 (0%)           0 (0%)         19m
  default                     sif-scheduler-extender                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         26m
  kube-system                 calico-kube-controllers-bc44d789c-hszpp       0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 calico-node-ckhmq                             250m (3%)     0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 coredns-9d85f5447-g689n                       100m (1%)     0 (0%)      70Mi (1%)        170Mi (2%)     16h
  kube-system                 coredns-9d85f5447-lj9td                       100m (1%)     0 (0%)      70Mi (1%)        170Mi (2%)     16h
  kube-system                 etcd-amax                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-apiserver-amax                           250m (3%)     0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-controller-manager-amax                  200m (2%)     0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-proxy-jjdvj                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-scheduler-amax                           100m (1%)     0 (0%)      0 (0%)           0 (0%)         20m
  kubernetes-dashboard        dashboard-metrics-scraper-7b8b58dc8b-6ddnb    0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  kubernetes-dashboard        kubernetes-dashboard-755dcb9575-99ktd         0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1 (12%)     0 (0%)
  memory             140Mi (2%)  340Mi (4%)
  ephemeral-storage  0 (0%)      0 (0%)
Events:              <none>

I tested example/k8s/image-service.yaml and got this error log:

# kubectl describe pod image-service-deployment-766979ff9d-2rwbw 
Name:         image-service-deployment-766979ff9d-2rwbw
Namespace:    default
Priority:     0
Node:         amax/192.168.118.45
Start Time:   Fri, 27 Mar 2020 11:01:04 +0800
Labels:       app=image-service
              pod-template-hash=766979ff9d
Annotations:  cni.projectcalico.org/podIP: 192.168.195.217/32
Status:       Running
IP:           192.168.195.217
IPs:
  IP:           192.168.195.217
Controlled By:  ReplicaSet/image-service-deployment-766979ff9d
Containers:
  image-server:
    Container ID:   singularity://d5216a3a308a4ae5a30289ff01b9c60c9fc0603ffc0a99072575d63918edac6e
    Image:          cloud.sylabs.io/sashayakovtseva/test/image-server
    Image ID:       cf5d9eea227371037e614fc7dec7c1f437a6398f9b08250b89ef5c92aab7e737
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Message:      exited with code 255
      Exit Code:    255
      Started:      Thu, 01 Jan 1970 08:00:00 +0800
      Finished:     Fri, 27 Mar 2020 11:22:39 +0800
    Ready:          False
    Restart Count:  9
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bkrkc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-bkrkc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bkrkc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  <unknown>            default-scheduler  Successfully assigned default/image-service-deployment-766979ff9d-2rwbw to amax
  Normal   Created    24m (x4 over 25m)    kubelet, amax      Created container image-server
  Warning  Failed     24m (x4 over 25m)    kubelet, amax      Error: could not start container: unexpected container state: exited
  Normal   Pulling    23m (x5 over 25m)    kubelet, amax      Pulling image "cloud.sylabs.io/sashayakovtseva/test/image-server"
  Normal   Pulled     23m (x5 over 25m)    kubelet, amax      Successfully pulled image "cloud.sylabs.io/sashayakovtseva/test/image-server"
  Warning  BackOff    12s (x112 over 25m)  kubelet, amax      Back-off restarting failed container
