
save-demo-agent pod failure #1986

Closed
sanyavertolet opened this issue Mar 15, 2023 · 13 comments · Fixed by #1996
Assignees: sanyavertolet
Labels: bug (Something isn't working), demo (Issues and PRs related to save-demo service), infra (Issues related to build or deploy infrastructure)

Comments

@sanyavertolet
Member

sanyavertolet commented Mar 15, 2023

In save-demo we create a job using fabric8io/kubernetes-client.
This job should run one pod with one of the images from here: saveourtool packages. The problem is that it seems to work fine (right now there are 3 jobs running) until, at some point, a new pod fails:

Events:
  Type     Reason                 Age                    From                Message
  ----     ------                 ----                   ----                -------
  Normal   NotTriggerScaleUp      9m43s (x42 over 16m)   cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling       9m38s (x10 over 16m)   default-scheduler   0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 2 node(s) didn't have free ports for the requested pod ports.
  Normal   Scheduled              9m27s                  default-scheduler   Successfully assigned save-cloud/demo-cqfn.org-diktat-warn-1-4pcrw to 172.16.0.55
  Normal   Pulling                9m25s                  kubelet             Pulling image "ghcr.io/saveourtool/save-base:eclipse-temurin-17"
  Normal   Pulled                 5m23s                  kubelet             Successfully pulled image "ghcr.io/saveourtool/save-base:eclipse-temurin-17" in 4m2.296098444s
  Normal   SuccessfulCreate       5m21s                  kubelet             Created container save-demo-agent-pod
  Normal   Started                5m21s                  kubelet             Started container save-demo-agent-pod
  Normal   SuccessfulMountVolume  5m20s (x2 over 9m27s)  kubelet             Successfully mounted volumes for pod "demo-cqfn.org-diktat-warn-1-4pcrw_save-cloud(791fc987-054c-4723-8e46-e80f67e27fa9)"
  Warning  Evicted                106s                   kubelet             The node was low on resource: ephemeral-storage. Container save-demo-agent-pod was using 78844Ki, which exceeds its request of 0.
  Normal   Killing                106s                   kubelet             Stopping container save-demo-agent-pod

It seems that the same problem is discussed here.
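
The eviction above says the container "was using 78844Ki, which exceeds its request of 0", i.e. the pod spec requests no ephemeral-storage at all. A minimal sketch of what declaring it could look like with the fabric8 builders (the metadata, function name and quantities below are illustrative, not the actual save-demo values):

import io.fabric8.kubernetes.api.model.Quantity
import io.fabric8.kubernetes.api.model.ResourceRequirementsBuilder
import io.fabric8.kubernetes.api.model.batch.v1.Job
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder

// Declare ephemeral-storage requests/limits on the agent container so the scheduler
// accounts for it and the kubelet has a non-zero baseline when picking eviction victims.
fun demoAgentJob(): Job = JobBuilder()
    .withNewMetadata().withName("demo-agent-job").endMetadata()
    .withNewSpec()
        .withNewTemplate()
            .withNewSpec()
                .withRestartPolicy("Never")
                .addNewContainer()
                    .withName("save-demo-agent-pod")
                    .withImage("ghcr.io/saveourtool/save-base:eclipse-temurin-17")
                    .withResources(
                        ResourceRequirementsBuilder()
                            .addToRequests("ephemeral-storage", Quantity("200Mi"))
                            .addToLimits("ephemeral-storage", Quantity("500Mi"))
                            .build()
                    )
                .endContainer()
            .endSpec()
        .endTemplate()
    .endSpec()
    .build()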

@sanyavertolet sanyavertolet added the bug Something isn't working label Mar 15, 2023
@sanyavertolet sanyavertolet self-assigned this Mar 15, 2023
@sanyavertolet
Member Author

Got the same old stuff again:

Events:
  Type     Reason                 Age    From               Message
  ----     ------                 ----   ----               -------
  Normal   Scheduled              3m45s  default-scheduler  Successfully assigned save-cloud/demo-cqfn.org-diktat-warn-1-rrgcj to 172.16.0.55
  Normal   SuccessfulMountVolume  3m45s  kubelet            Successfully mounted volumes for pod "demo-cqfn.org-diktat-warn-1-rrgcj_save-cloud(bdb2abe5-ff26-49d4-882b-c80e1546aae6)"
  Normal   Pulling                3m43s  kubelet            Pulling image "ghcr.io/saveourtool/save-base:eclipse-temurin-17"
  Normal   Pulled                 15s    kubelet            Successfully pulled image "ghcr.io/saveourtool/save-base:eclipse-temurin-17" in 3m27.624967525s
  Warning  Evicted                13s    kubelet            The node was low on resource: ephemeral-storage.
  Normal   SuccessfulCreate       12s    kubelet            Created container save-demo-agent-pod
  Normal   Started                12s    kubelet            Started container save-demo-agent-pod
  Normal   Killing                11s    kubelet            Stopping container save-demo-agent-pod

@sanyavertolet
Member Author

Now the pod is stuck in the Pending state (after the pod spec update):

Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   7m32s (x1091 over 23h)  default-scheduler   0/3 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't have free ports for the requested pod ports.
  Warning  FailedScheduling   3m12s (x518 over 23h)   default-scheduler   0/3 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) were unschedulable.
  Normal   NotTriggerScaleUp  6s (x959 over 160m)     cluster-autoscaler  pod didn't trigger scale-up:

@sanyavertolet sanyavertolet reopened this Mar 21, 2023
@sanyavertolet sanyavertolet added infra Issues related to build or deploy infrastructure demo Issues and PRs related to save-demo service labels Mar 21, 2023
@sanyavertolet
Member Author

upd

  1. It was found out that assigning a hostPort to the containerPort is a bad idea, as it occupies a port on the node (this way, there can be only two Jobs at a time); see the sketch below.
  2. After the disk was extended on all three nodes, the problem with ephemeral-storage didn't go away.
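
For item 1, a minimal sketch of how the container port could be declared without a hostPort, assuming the fabric8 builders are used (the port name and number are illustrative, not the actual save-demo values):

import io.fabric8.kubernetes.api.model.ContainerPort
import io.fabric8.kubernetes.api.model.ContainerPortBuilder

// Expose only a containerPort; omitting hostPort means the pod no longer reserves
// a fixed port on the node, so scheduling is not limited to "one such pod per node".
fun agentPort(): ContainerPort = ContainerPortBuilder()
    .withName("agent-server")
    .withContainerPort(23456)
    // .withHostPort(23456) is what produced "didn't have free ports for the requested pod ports"
    .build()

Traffic can then be routed through an in-cluster Service instead of the node's ports.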

@sanyavertolet
Member Author

sanyavertolet commented Mar 29, 2023

Here is a new error event:

apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2023-03-29T09:25:23Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{save-demo-agent-pod}
  kind: Pod
  name: demo-testnariman-pylink-1-7jl49
  namespace: save-cloud
  resourceVersion: "150879703"
  uid: b7051a56-fdd6-441f-9283-fe93686be5fe
kind: Event
lastTimestamp: "2023-03-29T09:25:23Z"
message: 'Failed to pull image "ghcr.io/saveourtool/save-base:python-3.10": rpc error:
  code = FailedPrecondition desc = failed to pull and unpack image "ghcr.io/saveourtool/save-base:python-3.10":
  failed commit on ref "layer-sha256:a6daac42f0daee2fe11dd237f570a94067fd69ec126838af3bc82a4d57f11f8c":
  "layer-sha256:a6daac42f0daee2fe11dd237f570a94067fd69ec126838af3bc82a4d57f11f8c"
  failed size validation: 6239744 != 17386982: failed precondition'
metadata:
  creationTimestamp: "2023-03-29T09:25:23Z"
  name: demo-testnariman-pylink-1-7jl49.1750d99bb05ded5b
  namespace: save-cloud
  resourceVersion: "1421615"
  uid: 677b50a7-8393-4dfc-b521-bc2d1c957834
reason: FailedPullImage
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: 172.16.0.55
type: Warning

which leads us to these lines

{
      "digest": "sha256:a6daac42f0daee2fe11dd237f570a94067fd69ec126838af3bc82a4d57f11f8c",
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "size": 17386982
    },

of saveourtool/save-base:python-3.10.
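
To double-check the expected layer size against the registry, here is a rough sketch that fetches the OCI manifest for this tag with the plain JDK HTTP client (assuming anonymous token auth works for this public package; error handling omitted):

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Fetch the manifest of ghcr.io/saveourtool/save-base:python-3.10 and print it;
// the "size" next to layer sha256:a6daac42... should read 17386982.
fun main() {
    val client = HttpClient.newHttpClient()

    // anonymous pull token for the public package
    val tokenJson = client.send(
        HttpRequest.newBuilder(URI("https://ghcr.io/token?service=ghcr.io&scope=repository:saveourtool/save-base:pull")).GET().build(),
        HttpResponse.BodyHandlers.ofString()
    ).body()
    val token = Regex("\"token\"\\s*:\\s*\"([^\"]+)\"").find(tokenJson)!!.groupValues[1]

    val manifest = client.send(
        HttpRequest.newBuilder(URI("https://ghcr.io/v2/saveourtool/save-base/manifests/python-3.10"))
            .header("Authorization", "Bearer $token")
            .header("Accept", "application/vnd.oci.image.manifest.v1+json")
            .GET()
            .build(),
        HttpResponse.BodyHandlers.ofString()
    ).body()
    println(manifest)
}

In other words, the registry reports the full 17386982 bytes, while only 6239744 bytes made it into the node's content store before the pull failed.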

@sanyavertolet
Member Author

screenshot

I have absolutely no idea how these events could happen in this order, or what the connection is between the ephemeral-storage issue and FailedPullImage.

@sanyavertolet
Member Author

@petertrr maybe you could lend a helping hand?

@petertrr
Member

I think the sequence here is like this:

  • There was an error pulling the image (there is a potential fix, but apparently we are not running a very recent version of containerd)
  • A new attempt was started by the back-off strategy
  • The pod was created and started successfully
  • The pod was killed because the node was low on ephemeral storage
  • The Job didn't attempt to schedule a new one because it ran out of attempts (its backoffLimit; see the sketch below)
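
About the last point: the retry budget is the Job's spec.backoffLimit, which Kubernetes defaults to 6. A minimal sketch of raising it, assuming the fabric8 model classes are used (the function name and value are illustrative):

import io.fabric8.kubernetes.api.model.batch.v1.Job

// Allow more pod retries before the Job gives up and stops creating new pods.
fun withMoreRetries(job: Job): Job = job.apply { spec.backoffLimit = 10 }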

Also, I'm not sure why the logs of save-demo-agent are not being scraped, given that kubectl claims the container was started.

I think it's worth ssh-ing into the nodes and running df -h to see what occupies so much storage. kubectl describe node gives some insight:

Warning  FreeDiskSpaceFailed   3m29s (x33739 over 117d)  kubelet  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 7993606144 bytes, but freed 0 bytes

So I would check the contents and size of /var/lib/containerd/.

@sanyavertolet
Member Author

sanyavertolet commented Mar 29, 2023

Thanks a lot!

I would check the contents and size of /var/lib/containerd/

I thought we had extended the nodes' disk space, but it was probably extended under the wrong path.

@nulls
Member

nulls commented Mar 31, 2023

Another failed pod:

nulls@vdjuceu263557:~/d-projects/save-cloud-deployment$ kubectl-save describe pods/gateway-58c5969ff7-jbktq
Name:             gateway-58c5969ff7-jbktq
Namespace:        save-cloud
Priority:         0
Service Account:  default
Node:             172.16.0.55/
Start Time:       Mon, 27 Mar 2023 18:18:58 +0300
Labels:           io.kompose.service=gateway
                  pod-template-hash=58c5969ff7
                  version=0.4.0-alpha.0.190-933f8fa
Annotations:      kubernetes.io/psp: psp-global
                  prometheus.io/path: /actuator/prometheus
                  prometheus.io/port: 5801
                  prometheus.io/scrape: true
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage.

@sanyavertolet
Member Author

Not sure if it is connected with the problem, but this is what I see when connecting to the 55th node:
screenshot

@sanyavertolet
Member Author

sanyavertolet commented Mar 31, 2023

So I would check the contents and size of /var/lib/containerd/

It seems there is nothing suspicious except 89% disk usage for some dirs:

screenshot

@sanyavertolet
Member Author

sanyavertolet commented Mar 31, 2023

There seems to be an open issue, so the problem is neither new nor particularly rare. I also found a nice post with an explanation on Stack Overflow.

Here is a fragment of the get output for node 172.16.0.55:

  - lastHeartbeatTime: "2023-03-31T12:11:44Z"
    lastTransitionTime: "2023-03-31T09:16:59Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure

which means that there is no disk pressure (at least at the time of this message).

Moreover, here is a fragment of the describe output for 172.16.0.55:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                310m (16%)   800m (41%)
  memory             600Mi (21%)  1256Mi (44%)
  ephemeral-storage  100Mi (1%)   500Mi (5%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)

from which it seems that there should be no problem with ephemeral-storage 🤡

P.S. It was found that nodes 172.16.0.52 and 172.16.0.87 have lots and lots of different versions of the same save images (demo, demo-cpg, etc.), which means that we are bad at image cleanup.

@sanyavertolet
Member Author

Fixed in one of the PRs.
