delete pods API call latencies shot up on large cluster tests #51899

Closed
shyamjvs opened this issue Sep 4, 2017 · 142 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. milestone/needs-approval priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

@shyamjvs
Member

shyamjvs commented Sep 4, 2017

Updated issue description with latest findings:

#50350 introduced a change to kubelet pod deletion that results in delete pod API calls from kubelets being concentrated immediately after container garbage collection.

When deleting large numbers (thousands) of pods across large numbers (hundreds) of nodes, the resulting concentrated delete calls from the kubelets push the latency of delete pods API calls above the target threshold:

Sep 27 14:34:14.484: INFO: WARNING Top latency metric: {Resource:pods Subresource: Verb:DELETE Latency:{Perc50:4.551ms Perc90:10.216ms Perc99:1.403002s Perc100:0s} Count:122966}

Seen on https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/

Not seen on https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/

Hoisted details from #51899 (comment) into the description:
Graphs over a three-minute window (both latency in ms and the distribution of delete calls per second):

The delete calls from the GC controller merely set deletionTimestamp on the pod (which the kubelets observe, and then start shutting the pods down); the delete calls from the kubelet actually delete the pod object from etcd.
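For reference, here's a minimal client-go sketch of those two kinds of DELETE (current client-go signatures are assumed; the pod name, namespace, and grace period are placeholders, not values from the test):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// GC-controller-style delete: with a non-zero grace period the apiserver
	// only sets metadata.deletionTimestamp; the pod object stays in etcd while
	// the kubelet shuts the containers down.
	grace := int64(30)
	_ = client.CoreV1().Pods("default").Delete(ctx, "example-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &grace})

	// Kubelet-style final delete: grace period zero removes the pod object
	// from etcd once the containers are gone - these are the calls that spike.
	zero := int64(0)
	_ = client.CoreV1().Pods("default").Delete(ctx, "example-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &zero})
}
```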

[Screenshot: DELETE pod call latency (ms) over the three-minute window]

[Screenshot: distribution of delete calls per second over the same window]

There were multiple spikes throughout the run (corresponding to deletions of 3000-pod replication controllers). Here's a graph of delete calls that took more than one second over the whole run:
[Screenshot: delete calls taking more than one second over the whole run]

This looks like a kubelet-induced thundering herd when deleting massive numbers of pods across massive numbers of nodes. Kubemark's GC is stubbed out since there are no real containers, so the problem isn't observed there.

Original issue description follows:

From the 5k-node density test run (no. 26) last Friday:

I0901 10:11:29.395] Sep  1 10:11:29.395: INFO: WARNING Top latency metric: {Resource:pods Subresource: Verb:LIST Latency:{Perc50:66.805ms Perc90:375.804ms Perc99:8.111217s Perc100:0s} Count:11911}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: Top latency metric: {Resource:nodes Subresource: Verb:LIST Latency:{Perc50:417.103ms Perc90:1.090847s Perc99:2.336106s Perc100:0s} Count:1261}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: WARNING Top latency metric: {Resource:nodes Subresource:status Verb:PATCH Latency:{Perc50:5.633ms Perc90:279.15ms Perc99:1.04109s Perc100:0s} Count:5543730}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: Top latency metric: {Resource:pods Subresource:binding Verb:POST Latency:{Perc50:1.442ms Perc90:15.171ms Perc99:921.523ms Perc100:0s} Count:155096}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: Top latency metric: {Resource:replicationcontrollers Subresource: Verb:POST Latency:{Perc50:3.13ms Perc90:218.472ms Perc99:886.065ms Perc100:0s} Count:5049}

And from the last healthy run (no. 23) of the test:

I0826 03:49:44.301] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:pods Subresource: Verb:LIST Latency:{Perc50:830.411ms Perc90:1.468316s Perc99:2.627309s Perc100:0s} Count:10615}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:nodes Subresource: Verb:LIST Latency:{Perc50:337.423ms Perc90:427.443ms Perc99:998.426ms Perc100:0s} Count:1800}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:apiservices Subresource: Verb:PUT Latency:{Perc50:954µs Perc90:5.726ms Perc99:138.528ms Perc100:0s} Count:840}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:namespaces Subresource: Verb:GET Latency:{Perc50:445µs Perc90:2.641ms Perc99:83.57ms Perc100:0s} Count:1608}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:nodes Subresource:status Verb:PATCH Latency:{Perc50:3.22ms Perc90:9.449ms Perc99:55.588ms Perc100:0s} Count:7033941}

These are huge increases:
LIST pods : 2.6s -> 8.1s
LIST nodes: 1s -> 2.3s
PATCH node-status: 56ms -> 1s
...

cc @kubernetes/sig-api-machinery-bugs @kubernetes/sig-scalability-misc @smarterclayton @wojtek-t @gmarek

@shyamjvs shyamjvs added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-blocker labels Sep 4, 2017
@shyamjvs shyamjvs added this to the v1.8 milestone Sep 4, 2017
@shyamjvs shyamjvs self-assigned this Sep 4, 2017
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Sep 4, 2017
@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

I'm trying to find the offending PR(s), but the diff across those runs is too large.
Luckily, this is seen even on relatively smaller clusters, e.g. kubemark-500 - https://k8s-testgrid.appspot.com/sig-scalability#kubemark-500

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

My strong feeling is that it may be related to pagination.

If I'm right, #51876 should hopefully fix the problem.

@smarterclayton

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

But this is mostly a guess and it may also be something different.

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

@kubernetes/test-infra-maintainers We don't have any logs (except the build log from Jenkins) for run 26 of ci-kubernetes-e2e-gce-scale-performance, because Jenkins crashed midway through the test:

ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2820)

Such failures (especially right at the end of a run) hit us really hard when it comes to debugging.

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

@wojtek-t We're also seeing an increase in the PATCH latencies, so there's probably another regression too?

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

Until run 8132, kubemark-500 seemed fine:

    {
      "data": {
        "Perc50": 31.303,
        "Perc90": 39.326,
        "Perc99": 59.415
      },
      "unit": "ms",
      "labels": {
        "Count": "43",
        "Resource": "nodes",
        "Subresource": "",
        "Verb": "LIST"
      }
    },

We had some startup failures in between, and the latency seems to have shot up starting from run 8141:

    {
      "data": {
        "Perc50": 213.34,
        "Perc90": 608.223,
        "Perc99": 1014.937
      },
      "unit": "ms",
      "labels": {
        "Count": "53",
        "Resource": "nodes",
        "Subresource": "",
        "Verb": "LIST"
      }
    },

Looking at the diff right now.

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

Another possibility (which you mentioned to me offline) is #48287.

I initially doubted it, but now that I think about it, maybe it's about memory allocations (I didn't look carefully, but maybe allocation is much higher with json-iter)?

@thockin - FYI (still just my guess)

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

Anyway - @shyamjvs, we need more data about what other metrics have changed (CPU usage, allocations, number of API calls, ...).

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

After eliminating trivial and unrelated PRs, the following are left:

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

Anyway - @shyamjvs, we need more data about what other metrics have changed (CPU usage, allocations, number of API calls, ...).

Yup - I'm doing some local experiments to verify this. Also merged #51892 to start getting kubemark master logs.

@gmarek
Contributor

gmarek commented Sep 4, 2017

Can you grab a CPU profile from the master while running the test? It should help you figure out the cause.
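For reference, a minimal sketch of grabbing such a profile, assuming the apiserver's profiling handlers are enabled and reachable on the master's local insecure port (the port and output path here are placeholders):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// /debug/pprof/profile returns a CPU profile; "seconds" controls the window.
	resp, err := http.Get("http://localhost:8080/debug/pprof/profile?seconds=30")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("apiserver-cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Save the raw profile so it can be inspected with: go tool pprof apiserver-cpu.pprof
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```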

@gmarek
Contributor

gmarek commented Sep 4, 2017

Adding release team: @jdumars @calebamiles @spiffxp

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

So I fetched the CPU & memory usage for the load test from the last 200+ runs of kubemark-500, and I don't see any noticeable increase. Actually, it appears to have improved:

[Graph: kubemark-500 master CPU usage across runs]

[Graph: kubemark-500 master memory usage across runs]

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

There are some consecutive failed runs (due to test-infra bugs / regressions) leading to missing points or temporary spikes - ignore that noise.

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

I ran the tests locally on kubemark-500 against head (commit ffed1d3) and it went through fine, with low latency values like before. I'm re-running to see if it's flaky, and will continue digging into this tomorrow.

@smarterclayton
Contributor

smarterclayton commented Sep 5, 2017 via email

@shyamjvs
Member Author

shyamjvs commented Sep 5, 2017

OK, so the reason for the kubemark failure is different. As @wojtek-t pointed out, the kubemark master is not sized correctly.
I recently made some changes in test-infra around auto-calculation of the kubemark master size, and it seems they didn't take effect.

@shyamjvs
Member Author

shyamjvs commented Sep 5, 2017

Btw, that also explains why the failure wasn't reproducing locally for me: because I bypassed all the test-infra logic by starting the clusters manually, the master size was calculated correctly.

@shyamjvs
Member Author

shyamjvs commented Sep 5, 2017

@smarterclayton Thanks for the explanation. I checked that neither kubemark nor our real cluster tests are using pagination (?limit=1 still returns all the results), so the real 5k-node test failures seem unrelated to it.
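For reference, that check can be sketched with client-go's ListOptions.Limit (current client-go signatures assumed; the default kubeconfig location is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{Limit: 1})
	if err != nil {
		panic(err)
	}
	// Pagination honored: one item plus a non-empty continue token.
	// Pagination ignored (as observed here): every pod, empty continue token.
	fmt.Printf("items=%d continue=%q\n", len(pods.Items), pods.Continue)
}
```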

@liggitt
Member

liggitt commented Sep 28, 2017

cc @dashpole @vishh @dchen1107
from the introducing PR (#50350)

@dashpole
Contributor

I can post a PR to revert the key portion of #50350.

@jdumars
Member

jdumars commented Sep 28, 2017

@dashpole thank you!

@shyamjvs
Member Author

I guess we should also wait until the fix is tested (add ~2 more hours for that).

@dashpole
Contributor

PR posted: #53210

@dchen1107
Member

@dashpole and I had an offline discussion, and we plan to revert a portion of #50350. The original PR was introduced because, without that change, the kubelet disk eviction manager might delete more containers than necessary.

We should improve disk management with more intelligence so it takes proactive action, instead of relying on the periodic GC.

@jdumars
Member

jdumars commented Sep 28, 2017

While this is in the 1.9 milestone, there is every expectation that this will be resolved at the earliest responsible moment in a 1.8 patch release. The full decision-making process around this can be viewed at https://youtu.be/r6D5DNel2l8

@jpbetz
Contributor

jpbetz commented Oct 2, 2017

Any updates on this?

@wojtek-t
Member

wojtek-t commented Oct 2, 2017

We've already identified the offending PR - the PR that fixes this is #53233.

k8s-github-robot pushed a commit that referenced this issue Oct 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51765, 53053, 52771, 52860, 53284). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Add audit-logging, feature-gates & few admission plugins to kubemark

To make kubemark match real cluster settings. Also includes a few other settings like request-timeout, etcd-quorum, etc.

Fixes #53021
Related #51899 #44701

cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @smarterclayton
k8s-github-robot pushed a commit that referenced this issue Oct 4, 2017
Automatic merge from submit-queue (batch tested with PRs 53403, 53233). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Remove containers from deleted pods once containers have exited

Issue #51899 
Since container deletion is currently done through periodic garbage collection every 30 seconds, it takes a long time for pods to be deleted, and causes the kubelet to send all delete pod requests at the same time, which has performance issues.  This PR makes the kubelet actively remove containers of deleted pods rather than wait for them to be removed in periodic garbage collection.

/release-note-none
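For reference, a purely illustrative sketch of the before/after behavior described in that commit message - not the actual kubelet code, and all names here are hypothetical:

```go
package kubeletsketch

import "time"

// containerRuntime is a stand-in for the kubelet's container runtime interface.
type containerRuntime interface {
	// RemoveContainersFor removes the exited containers of a deleted pod; once
	// they are gone, the kubelet issues its final DELETE for the pod object.
	RemoveContainersFor(podUID string)
}

// periodicGC models the old behavior: cleanup piggybacks on the 30-second GC
// tick, so every pod deleted in the same window produces its final DELETE pod
// API call in one burst.
func periodicGC(rt containerRuntime, deletedPods <-chan string) {
	tick := time.NewTicker(30 * time.Second)
	defer tick.Stop()
	var pending []string
	for {
		select {
		case uid := <-deletedPods:
			pending = append(pending, uid)
		case <-tick.C:
			for _, uid := range pending {
				rt.RemoveContainersFor(uid)
			}
			pending = nil
		}
	}
}

// eventDriven models the new behavior (roughly what #53233 does): remove
// containers as soon as a pod is deleted and its containers have exited, so
// the final DELETE calls are spread out instead of batched.
func eventDriven(rt containerRuntime, deletedPods <-chan string) {
	for uid := range deletedPods {
		rt.RemoveContainersFor(uid)
	}
}
```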
@liggitt
Member

liggitt commented Oct 4, 2017

#53233 is merged to master. The cherry-pick for 1.8.1 is open at #53422.

k8s-github-robot pushed a commit that referenced this issue Oct 4, 2017
…3-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #53233

Fixes #51899

Cherry pick of #53233 on release-1.8.

#53233: remove containers of deleted pods once all containers have

```release-note
Fixes a performance issue (#51899) identified in large-scale clusters when deleting thousands of pods simultaneously across hundreds of nodes, by actively removing containers of deleted pods, rather than waiting for periodic garbage collection and batching resulting pod API deletion requests.
```
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Approval

@dashpole @shyamjvs @kubernetes/sig-api-machinery-bugs @kubernetes/sig-node-bugs @kubernetes/sig-scalability-bugs

Action required: This issue must have the status/approved-for-milestone label applied by a SIG maintainer.

Issue Labels
  • sig/api-machinery sig/node sig/scalability: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@dashpole
Contributor

dashpole commented Oct 5, 2017

/close
Closed via #53422. The fix will be in 1.8.1.

sttts pushed a commit to sttts/apiserver that referenced this issue Oct 13, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Modify traces in deletion handler

Ref kubernetes/kubernetes#51899 (comment)

cc @kubernetes/sig-release-members @jdumars @dims Can we get this into 1.8?

Kubernetes-commit: 5952e932e9a1b593876c1504df5d1cb3fd72299d