Continuously getting "no space left on device" even though node has enough space #2042

Open
kavita1205 opened this issue May 9, 2024 · 1 comment
Labels
area/koordlet · kind/bug · kind/question

Comments


kavita1205 commented May 9, 2024

What happened:
We installed Koordinator using the Helm chart described in the documentation (https://koordinator.sh/docs/installation/), and the koordlet pod is continuously in CrashLoopBackOff. I already raised this in issue #2028, where the suggested fix was to check the node's disk space. That resolved the issue for most of the pods, but one koordlet pod still hits this error even though its node has enough space.

kubectl get po -n koordinator-system |grep -v Running
NAME                                 READY   STATUS             RESTARTS          AGE
koordlet-hx5gh                       0/1     CrashLoopBackOff   309 (2m54s ago)   2d23h

Error message:

kubectl logs -n koordinator-system koordlet-hx5gh
I0509 08:20:27.399574  638200 cgroup_driver.go:212] Node lv01-****-l03 use 'systemd' as cgroup driver guessed with the cgroup name
I0509 08:20:27.425079  638200 feature_gate.go:245] feature gates: &{map[Accelerators:true BECPUEvict:true BEMemoryEvict:true CgroupReconcile:true]}
I0509 08:20:27.425329  638200 main.go:70] Setting up kubeconfig for koordlet
I0509 08:20:27.425565  638200 koordlet.go:76] NODE_NAME is lv**-m****-l03, start time 1.715268027e+09
I0509 08:20:27.427391  638200 version.go:45] [/host-cgroup/cpu/cpu.bvt_warp_ns] PathExists exists false, err: <nil>
I0509 08:20:27.428156  638200 version.go:52] [/host-cgroup/memory/*/memory.wmark_ratio] PathExists wmark_ratio exists [], err: <nil>
I0509 08:20:27.431395  638200 resctrl.go:74] isResctrlAvailableByCpuInfo result, isCatFlagSet: true, isMbaFlagSet: true
I0509 08:20:27.431747  638200 resctrl.go:89] isResctrlAvailableByKernelCmd result, isCatFlagSet: false, isMbaFlagSet: false
I0509 08:20:27.431770  638200 resctrl.go:106] IsSupportResctrl result, cpuSupport: true, kernelSupport: false
I0509 08:20:27.431788  638200 config.go:73] resctrl supported: true
I0509 08:20:27.431804  638200 koordlet.go:80] sysconf: &{CgroupRootDir:/host-cgroup/ CgroupKubePath:kubepods/ SysRootDir:/host-sys/ SysFSRootDir:/host-sys-fs/ ProcRootDir:/proc/ VarRunRootDir:/host-var-run/ RunRootDir:/host-run/ RuntimeHooksConfigDir:/host-etc-hookserver/ ContainerdEndPoint: PouchEndpoint: DockerEndPoint: DefaultRuntimeType:containerd}, agentMode: dsMode
I0509 08:20:27.431855  638200 koordlet.go:81] kernel version INFO: {IsAnolisOS:false}
panic: preallocate: no space left on device

goroutine 8538 [running]:
github.com/prometheus/prometheus/tsdb.handleChunkWriteError({0x2943760?, 0xc0026bfdd0?})
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:893 +0x76
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk(0xc00106e1e0, 0x41afe7?, 0x28?, 0x421545?, {0x2965af8, 0xc00039a040}, 0x2598c90)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/chunks/head_chunks.go:418 +0x151
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xc000b561a0, 0x63eaf7f9?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:882 +0x53
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xc000b561a0, 0x18f57814ede, 0x4164a00000000000?, 0x1b7740)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:826 +0x2f
github.com/prometheus/prometheus/tsdb.(*memSeries).append(0xc000b561a0, 0x18f57814ede, 0x4164a00000000000, 0x0, 0x0?, 0x1b7740)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:797 +0x1ea
github.com/prometheus/prometheus/tsdb.(*walSubsetProcessor).processWALSamples(0xc0021c81b0, 0xc00061a900, 0x0?, 0x0?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:463 +0x3f0
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func7(0x0?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:110 +0x45
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:109 +0x414

Node space:

kavsingh@lv**-m****-l03:~$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  277G   88G  190G  32% /
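
For completeness, a hedged diagnostic sketch: since the root filesystem clearly has free space, it may also be worth checking memory-backed (tmpfs) filesystems on the node and the volume that backs the koordlet metrics/TSDB directory. The DaemonSet name "koordlet" and the grep pattern "tsdb" below are assumptions, not confirmed names.

# list tmpfs mounts and overall memory on the node, in case the failing write
# targets a memory-backed filesystem rather than the root disk
df -h -t tmpfs
free -h

# inspect which volume backs the koordlet TSDB directory
# (assumes the DaemonSet is named "koordlet" and the mount name contains "tsdb")
kubectl -n koordinator-system get ds koordlet -o yaml | grep -i -A 3 tsdb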

Below is the values.yaml we used:

# Default values for koordinator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

crds:
  managed: true

# values for koordinator installation
installation:
  namespace: koordinator-system
  roleListGroups:
    - '*'

featureGates: ""

imageRepositoryHost: ghcr.io

koordlet:
  image:
    repository: koordinator-sh/koordlet
    tag: "v1.4.1"
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi
  features: ""
  log:
    # log level for koordlet
    level: "4"
  hostDirs:
    kubeletConfigDir: /etc/kubernetes/
    kubeletLibDir: /var/lib/kubelet/
    koordProxyRegisterDir: /etc/runtime/hookserver.d/
    koordletSockDir: /var/run/koordlet
    predictionCheckpointDir: /var/run/koordlet/prediction-checkpoints
    # if not specified, use tmpfs by default
    koordletTSDBDir: ""
  enableServiceMonitor: false


manager:
  # settings for log print
  log:
    # log level for koord-manager
    level: "4"

  replicas: 5
  image:
    repository: koordinator-sh/koord-manager
    tag: "v1.4.1"
  webhook:
    port: 9876
  metrics:
    port: 8080
  healthProbe:
    port: 8000

  resyncPeriod: "0"

  # resources of koord-manager container
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

  hostNetwork: false

  nodeAffinity: {}
  nodeSelector: 
    node-role.kubernetes.io/control-plane: ""
  tolerations: 
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      effect: "NoSchedule"

webhookConfiguration:
  failurePolicy:
    pods: Ignore
    elasticquotas: Ignore
    nodeStatus: Ignore
    nodes: Ignore
  timeoutSeconds: 30

serviceAccount:
  annotations: {}


scheduler:
  # settings for log print
  log:
    # log level for koord-scheduler
    level: "4"

  replicas: 5
  image:
    repository: koordinator-sh/koord-scheduler
    tag: "v1.4.1"
  port: 10251

  # feature-gates for k8s > 1.22
  featureGates: ""
  # feature-gates for k8s 1.22
  compatible122FeatureGates: "CompatibleCSIStorageCapacity=true"
  # feature-gates for k8s < 1.22
  compatibleBelow122FeatureGates: "DisableCSIStorageCapacityInformer=true,CompatiblePodDisruptionBudget=true"

  # resources of koord-scheduler container
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

  hostNetwork: false

  nodeAffinity: {}
  nodeSelector: 
    node-role.kubernetes.io/control-plane: ""
  tolerations: 
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      effect: "NoSchedule"

descheduler:
  # settings for log print
  log:
    # log level for koord-descheduler
    level: "4"

  replicas: 2
  image:
    repository: koordinator-sh/koord-descheduler
    tag: "v1.4.1"
  port: 10251

  featureGates: ""

  # resources of koord-descheduler container
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

  hostNetwork: false

  nodeAffinity: {}
  nodeSelector: 
    node-role.kubernetes.io/control-plane: ""
  tolerations: 
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      effect: "NoSchedule"

What you expected to happen:
All pods should run properly without error.
How to reproduce it (as minimally and precisely as possible):
You can use the values.yaml mentioned above and reproduce this issue.
Anything else we need to know?:

Environment:

  • App version: 1.4.1
  • Kubernetes version (use kubectl version): 1.24
  • Install details (e.g. helm install args): helm install koordinator koordinator-sh/koordinator --version 1.4.1 -f values.yaml
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version: Ubuntu 22.04
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
kavita1205 added the kind/bug label on May 9, 2024
saintube added the kind/question label on May 10, 2024

saintube commented May 10, 2024

Hi @kavita1205, as mentioned in #2028, please check whether there is enough memory (RAM), because by default the koordlet mounts its metrics store on tmpfs rather than on disk. For more details, see the koordlet DaemonSet template. As a workaround, you can set the chart value koordlet.hostDirs.koordletTSDBDir to a host path so that the metrics store uses disk space instead.
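
For example, a hedged sketch of that workaround: the host path /var/lib/koordlet/tsdb below is only an illustrative choice (any directory with enough free disk space works), and the release name koordinator matches the install command above.

# point the koordlet TSDB at a host directory instead of the default tmpfs;
# /var/lib/koordlet/tsdb is a hypothetical example path
helm upgrade koordinator koordinator-sh/koordinator --version 1.4.1 \
  --reuse-values \
  --set koordlet.hostDirs.koordletTSDBDir=/var/lib/koordlet/tsdb

Equivalently, set koordlet.hostDirs.koordletTSDBDir in values.yaml and run helm upgrade with -f values.yaml; since this changes the pod template, the koordlet DaemonSet pods should roll and pick up the new mount.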
