
save-demo-agent pod failure #1986

Closed
sanyavertolet opened this issue Mar 15, 2023 · 13 comments · Fixed by #1996
Assignees: sanyavertolet
Labels: bug (Something isn't working), demo (Issues and PRs related to save-demo service), infra (Issues related to build or deploy infrastructure)

Comments

@sanyavertolet
Member

sanyavertolet commented Mar 15, 2023

In save-demo we create a job using fabric8io/kubernetes-client.
This job should run one pod with one of the images from here: saveourtool packages. The problem is that it seems to work fine (right now there are 3 jobs running) until, at some point, a new pod fails:

Events:
  Type     Reason                 Age                    From                Message
  ----     ------                 ----                   ----                -------
  Normal   NotTriggerScaleUp      9m43s (x42 over 16m)   cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling       9m38s (x10 over 16m)   default-scheduler   0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 2 node(s) didn't have free ports for the requested pod ports.
  Normal   Scheduled              9m27s                  default-scheduler   Successfully assigned save-cloud/demo-cqfn.org-diktat-warn-1-4pcrw to 172.16.0.55
  Normal   Pulling                9m25s                  kubelet             Pulling image "ghcr.io/saveourtool/save-base:eclipse-temurin-17"
  Normal   Pulled                 5m23s                  kubelet             Successfully pulled image "ghcr.io/saveourtool/save-base:eclipse-temurin-17" in 4m2.296098444s
  Normal   SuccessfulCreate       5m21s                  kubelet             Created container save-demo-agent-pod
  Normal   Started                5m21s                  kubelet             Started container save-demo-agent-pod
  Normal   SuccessfulMountVolume  5m20s (x2 over 9m27s)  kubelet             Successfully mounted volumes for pod "demo-cqfn.org-diktat-warn-1-4pcrw_save-cloud(791fc987-054c-4723-8e46-e80f67e27fa9)"
  Warning  Evicted                106s                   kubelet             The node was low on resource: ephemeral-storage. Container save-demo-agent-pod was using 78844Ki, which exceeds its request of 0.
  Normal   Killing                106s                   kubelet             Stopping container save-demo-agent-pod

It seems that the same problem is discussed here.
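
The eviction above says the container "was using 78844Ki, which exceeds its request of 0", i.e. the pod spec requests no ephemeral-storage at all. A minimal sketch of what declaring it could look like with the fabric8 builders (the metadata, function name and quantities below are illustrative, not the actual save-demo values):

import io.fabric8.kubernetes.api.model.Quantity
import io.fabric8.kubernetes.api.model.ResourceRequirementsBuilder
import io.fabric8.kubernetes.api.model.batch.v1.Job
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder

// Declare ephemeral-storage requests/limits on the agent container so the scheduler
// accounts for it and the kubelet has a non-zero baseline when picking eviction victims.
fun demoAgentJob(): Job = JobBuilder()
    .withNewMetadata().withName("demo-agent-job").endMetadata()
    .withNewSpec()
        .withNewTemplate()
            .withNewSpec()
                .withRestartPolicy("Never")
                .addNewContainer()
                    .withName("save-demo-agent-pod")
                    .withImage("ghcr.io/saveourtool/save-base:eclipse-temurin-17")
                    .withResources(
                        ResourceRequirementsBuilder()
                            .addToRequests("ephemeral-storage", Quantity("200Mi"))
                            .addToLimits("ephemeral-storage", Quantity("500Mi"))
                            .build()
                    )
                .endContainer()
            .endSpec()
        .endTemplate()
    .endSpec()
    .build()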

@sanyavertolet sanyavertolet added the bug Something isn't working label Mar 15, 2023
@sanyavertolet sanyavertolet self-assigned this Mar 15, 2023
@sanyavertolet
Member Author

Got the same old stuff again:

Events:
  Type     Reason                 Age    From               Message
  ----     ------                 ----   ----               -------
  Normal   Scheduled              3m45s  default-scheduler  Successfully assigned save-cloud/demo-cqfn.org-diktat-warn-1-rrgcj to 172.16.0.55
  Normal   SuccessfulMountVolume  3m45s  kubelet            Successfully mounted volumes for pod "demo-cqfn.org-diktat-warn-1-rrgcj_save-cloud(bdb2abe5-ff26-49d4-882b-c80e1546aae6)"
  Normal   Pulling                3m43s  kubelet            Pulling image "ghcr.io/saveourtool/save-base:eclipse-temurin-17"
  Normal   Pulled                 15s    kubelet            Successfully pulled image "ghcr.io/saveourtool/save-base:eclipse-temurin-17" in 3m27.624967525s
  Warning  Evicted                13s    kubelet            The node was low on resource: ephemeral-storage.
  Normal   SuccessfulCreate       12s    kubelet            Created container save-demo-agent-pod
  Normal   Started                12s    kubelet            Started container save-demo-agent-pod
  Normal   Killing                11s    kubelet            Stopping container save-demo-agent-pod

@sanyavertolet
Member Author

Now the pod is stuck in the Pending state (after the pod spec update):

Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   7m32s (x1091 over 23h)  default-scheduler   0/3 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't have free ports for the requested pod ports.
  Warning  FailedScheduling   3m12s (x518 over 23h)   default-scheduler   0/3 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) were unschedulable.
  Normal   NotTriggerScaleUp  6s (x959 over 160m)     cluster-autoscaler  pod didn't trigger scale-up:

@sanyavertolet sanyavertolet reopened this Mar 21, 2023
@sanyavertolet sanyavertolet added infra Issues related to build or deploy infrastructure demo Issues and PRs related to save-demo service labels Mar 21, 2023
@sanyavertolet
Member Author

upd

  1. It was found out that assigning a hostPort to the containerPort is a bad idea, as it occupies a port on the node (this way, there can be only two Jobs at a time); see the sketch below.
  2. After the disk was extended on all three nodes, the problem with ephemeral-storage didn't go away.
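
For item 1, a minimal sketch of how the container port could be declared without a hostPort, assuming the fabric8 builders are used (the port name and number are illustrative, not the actual save-demo values):

import io.fabric8.kubernetes.api.model.ContainerPort
import io.fabric8.kubernetes.api.model.ContainerPortBuilder

// Expose only a containerPort; omitting hostPort means the pod no longer reserves
// a fixed port on the node, so scheduling is not limited to "one such pod per node".
fun agentPort(): ContainerPort = ContainerPortBuilder()
    .withName("agent-server")
    .withContainerPort(23456)
    // .withHostPort(23456) is what produced "didn't have free ports for the requested pod ports"
    .build()

Traffic can then be routed through an in-cluster Service instead of the node's ports.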

@sanyavertolet
Member Author

sanyavertolet commented Mar 29, 2023

Here is a new error event:

apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2023-03-29T09:25:23Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{save-demo-agent-pod}
  kind: Pod
  name: demo-testnariman-pylink-1-7jl49
  namespace: save-cloud
  resourceVersion: "150879703"
  uid: b7051a56-fdd6-441f-9283-fe93686be5fe
kind: Event
lastTimestamp: "2023-03-29T09:25:23Z"
message: 'Failed to pull image "ghcr.io/saveourtool/save-base:python-3.10": rpc error:
  code = FailedPrecondition desc = failed to pull and unpack image "ghcr.io/saveourtool/save-base:python-3.10":
  failed commit on ref "layer-sha256:a6daac42f0daee2fe11dd237f570a94067fd69ec126838af3bc82a4d57f11f8c":
  "layer-sha256:a6daac42f0daee2fe11dd237f570a94067fd69ec126838af3bc82a4d57f11f8c"
  failed size validation: 6239744 != 17386982: failed precondition'
metadata:
  creationTimestamp: "2023-03-29T09:25:23Z"
  name: demo-testnariman-pylink-1-7jl49.1750d99bb05ded5b
  namespace: save-cloud
  resourceVersion: "1421615"
  uid: 677b50a7-8393-4dfc-b521-bc2d1c957834
reason: FailedPullImage
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: 172.16.0.55
type: Warning

which leads us to these lines

{
      "digest": "sha256:a6daac42f0daee2fe11dd237f570a94067fd69ec126838af3bc82a4d57f11f8c",
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "size": 17386982
    },

of saveourtool/save-base:python-3.10.
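
To double-check the expected layer size against the registry, here is a rough sketch that fetches the OCI manifest for this tag with the plain JDK HTTP client (assuming anonymous token auth works for this public package; error handling omitted):

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Fetch the manifest of ghcr.io/saveourtool/save-base:python-3.10 and print it;
// the "size" next to layer sha256:a6daac42... should read 17386982.
fun main() {
    val client = HttpClient.newHttpClient()

    // anonymous pull token for the public package
    val tokenJson = client.send(
        HttpRequest.newBuilder(URI("https://ghcr.io/token?service=ghcr.io&scope=repository:saveourtool/save-base:pull")).GET().build(),
        HttpResponse.BodyHandlers.ofString()
    ).body()
    val token = Regex("\"token\"\\s*:\\s*\"([^\"]+)\"").find(tokenJson)!!.groupValues[1]

    val manifest = client.send(
        HttpRequest.newBuilder(URI("https://ghcr.io/v2/saveourtool/save-base/manifests/python-3.10"))
            .header("Authorization", "Bearer $token")
            .header("Accept", "application/vnd.oci.image.manifest.v1+json")
            .GET()
            .build(),
        HttpResponse.BodyHandlers.ofString()
    ).body()
    println(manifest)
}

In other words, the registry reports the full 17386982 bytes, while only 6239744 bytes made it into the node's content store before the pull failed.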

@sanyavertolet
Member Author

screenshot

I have absolutely no idea how these events could happen in this order, or what the connection is between the ephemeral-storage issue and FailedPullImage.

@sanyavertolet
Member Author

@petertrr maybe you could lend a helping hand?

@petertrr
Member

I think the sequence here is like this:

  • There was an error pulling the image (there is a potential fix, but apparently we are not running a very recent version of containerd)
  • A new attempt was started by the back-off strategy
  • The pod was created and started successfully
  • The pod was killed because the node was low on ephemeral storage
  • The Job didn't attempt to schedule a new one because it ran out of attempts (its backoffLimit; see the sketch below)
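
About the last point: the retry budget is the Job's spec.backoffLimit, which Kubernetes defaults to 6. A minimal sketch of raising it, assuming the fabric8 model classes are used (the function name and value are illustrative):

import io.fabric8.kubernetes.api.model.batch.v1.Job

// Allow more pod retries before the Job gives up and stops creating new pods.
fun withMoreRetries(job: Job): Job = job.apply { spec.backoffLimit = 10 }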

Also, I'm not sure why the logs of save-demo-agent are not being scraped, given that kubectl claims the container was started.

I think it's worth ssh-ing into the nodes and running df -h to see what occupies so much storage. kubectl describe node gives some insight:

Warning  FreeDiskSpaceFailed   3m29s (x33739 over 117d)  kubelet  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 7993606144 bytes, but freed 0 bytes

So I would check the contents and size of /var/lib/containerd/.

@sanyavertolet
Member Author

sanyavertolet commented Mar 29, 2023

Thanks a lot!

I would check the contents and size of /var/lib/containerd/

I thought we had extended the nodes' disk space, but it was probably extended under the wrong path.

@nulls
Member

nulls commented Mar 31, 2023

Another failed pod:

nulls@vdjuceu263557:~/d-projects/save-cloud-deployment$ kubectl-save describe pods/gateway-58c5969ff7-jbktq
Name:             gateway-58c5969ff7-jbktq
Namespace:        save-cloud
Priority:         0
Service Account:  default
Node:             172.16.0.55/
Start Time:       Mon, 27 Mar 2023 18:18:58 +0300
Labels:           io.kompose.service=gateway
                  pod-template-hash=58c5969ff7
                  version=0.4.0-alpha.0.190-933f8fa
Annotations:      kubernetes.io/psp: psp-global
                  prometheus.io/path: /actuator/prometheus
                  prometheus.io/port: 5801
                  prometheus.io/scrape: true
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage.

@sanyavertolet
Member Author

Not sure if it is connected with the problem, but this is what I see when connecting to the 55th node:
screenshot

@sanyavertolet
Member Author

sanyavertolet commented Mar 31, 2023

So I would check the contents and size of /var/lib/containerd/

It seems there is nothing suspicious except 89% disk usage for some dirs:

screenshot

@sanyavertolet
Member Author

sanyavertolet commented Mar 31, 2023

There seems to be an open issue, so the problem is neither new nor particularly rare. I also found a nice post with an explanation on Stack Overflow.

Here is a fragment of the get output for node 172.16.0.55:

  - lastHeartbeatTime: "2023-03-31T12:11:44Z"
    lastTransitionTime: "2023-03-31T09:16:59Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure

which means that there is no disk pressure (at least at the time of this message).

Moreover, here is a fragment of the describe output for 172.16.0.55:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                310m (16%)   800m (41%)
  memory             600Mi (21%)  1256Mi (44%)
  ephemeral-storage  100Mi (1%)   500Mi (5%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)

from which it seems that there should be no problem with ephemeral-storage 🤡

P.S. It was found that nodes 172.16.0.52 and 172.16.0.87 have lots and lots of different versions of the same save images (demo, demo-cpg, etc.), which means that we are bad at image cleanup.

@sanyavertolet
Member Author

Fixed in one of the PRs.
