[test flakes] master-scalability suites #60589

Closed
3 tasks done
krzyzacy opened this issue Feb 28, 2018 · 164 comments · Fixed by #61504
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Milestone

Comments

@krzyzacy
Member

krzyzacy commented Feb 28, 2018

Failing release-blocking suites:

All three suites have been flaking a lot recently - mind triaging?

/sig scalability
/priority failing-test
/kind bug
/status approved-for-milestone

cc @jdumars @jberkus
/assign @shyamjvs @wojtek-t

@k8s-ci-robot k8s-ci-robot added status/approved-for-milestone sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/bug Categorizes issue or PR as related to a bug. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 28, 2018
@shyamjvs
Member

shyamjvs commented Feb 28, 2018

I'll try getting to them sometime this week (I'm short on free cycles atm).

@krzyzacy
Member Author

krzyzacy commented Mar 2, 2018

@shyamjvs is there any update for this issue?

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-correctness

I took a brief look into that. Either some test(s) are extremely slow or something is hanging somewhere. Part of the logs from the last run:

62571 I0301 23:01:31.360] Mar  1 23:01:31.348: INFO: Running AfterSuite actions on all node
62572 I0301 23:01:31.360]
62573 W0302 07:32:00.441] 2018/03/02 07:32:00 process.go:191: Abort after 9h30m0s timeout during ./hack/ginkgo-e2e.sh --ginkgo.flakeAttempts=2 --ginkgo.skip=\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]|\[DisabledForLargeClusters\] --allowed-not-ready-nodes=50 --node-schedulable-timeout=90m --minStartupPods=8 --gather-resource-usage=master --gather-metrics-at-teardown=master --logexporter-gcs-path=gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/80/artifacts --report-dir=/workspace/_artifacts --disable-log-dump=true --cluster-ip-range=10.64.0.0/11. Will terminate in another 15m
62574 W0302 07:32:00.445] SIGABRT: abort

No test finished within 8h30m

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-performance

Indeed seems like a regression. I think the regression happened somewhere between runs:
105 (which was still ok)
108 (which had visibly higher startup time)

We can try looking into kubemark-5000 to see if it is visible there too.

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

Kubemark-5000 is pretty stable. The 99th percentile is shown on this graph (maybe the regression happened even earlier, but I think it is somewhere between runs 105 and 108):

[screenshot from 2018-03-02 14-36-54: 99th-percentile pod startup latency graph]

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

Regarding the correctness tests - gce-large-correctness is also failing.
Maybe some test that is extremely long was added at that time?

@shyamjvs
Member

shyamjvs commented Mar 2, 2018

Thanks a lot for looking, @wojtek-t. Wrt the performance job - I too strongly suspect there's a regression (though I couldn't find the time to look into it properly).

Maybe some test that is extremely long was added at that time?

I was looking into this a while ago. And there were 2 suspicious changes I found:

@shyamjvs
Member

shyamjvs commented Mar 2, 2018

cc @kubernetes/sig-storage-bugs

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Mar 2, 2018
@msau42
Member

msau42 commented Mar 2, 2018

/assign

Some of the local storage tests will try to use every node in the cluster, thinking that cluster sizes are not that big. I'll add a fix to cap the max number of nodes.
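
For illustration, a minimal sketch (with hypothetical names, not the actual e2e storage change) of what capping the node count in such a test could look like:

```go
package main

import "fmt"

// maxNodesForLocalStorageTest is a hypothetical cap: instead of scheduling
// test pods onto every node of a (possibly 5000-node) cluster, the test only
// uses the first few nodes it discovers.
const maxNodesForLocalStorageTest = 10

// pickTestNodes returns at most maxNodesForLocalStorageTest node names.
func pickTestNodes(allNodes []string) []string {
	if len(allNodes) > maxNodesForLocalStorageTest {
		return allNodes[:maxNodesForLocalStorageTest]
	}
	return allNodes
}

func main() {
	nodes := make([]string, 5000)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("node-%d", i)
	}
	fmt.Println(len(pickTestNodes(nodes))) // 10
}
```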

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

Some of the local storage tests will try to use every node in the cluster, thinking that cluster sizes are not that big. I'll add a fix to cap the max number of nodes.

Thanks @msau42 - that would be great.

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

Going back to https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-performance suite

I took a closer look into runs up to 105 and runs 108 and later.
The biggest difference wrt pod startup time seems to appear at this step:

10% worst watch latencies: 

[the name of it is misleading - will explain below]

Up to run 105, it generally looked like this:

I0129 21:17:43.450] Jan 29 21:17:43.356: INFO: perc50: 1.041233793s, perc90: 1.685463015s, perc99: 2.350747103s

Starting with run 108, it looks more like:

Feb 12 10:08:57.123: INFO: perc50: 1.122693874s, perc90: 1.934670461s, perc99: 3.195883331s

That basically means a ~0.85s increase, and this is roughly what we observe in the end result.

Now - what that "watch lag" is.
It is basically the time between "kubelet observed that the pod is running" and "the test observed the pod update setting its state to running".
There are a couple of possibilities for where we could have regressed:

  • kubelet is slow in reporting status
  • kubelet is qps-starved (and thus slower in reporting status)
  • apiserver is slower (e.g. cpu-starved) and thus processing requests slower (either write, watch, or both)
  • test is cpu-starved and thus processing incoming events slower

Since we don't really observe a difference between "schedule -> start" of a pod, that suggests that it most probably isn't the apiserver (because processing requests and watch is on that path too), and it's most probably not a slow kubelet either (because it starts the pod).

So I think the most likely hypotheses are:

  • kubelet is qps-starved (or sth that prevents it from quickly sending status update)
  • test is cpu-starved (or sth like that)

Test didn't change at all around that time. So I think it's most probably the first one.

That said, I went through PRs merged between 105 and 108 runs and didn't find anything useful so far.
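
For reference, a small sketch of how a "watch lag" percentile like the perc50/perc90/perc99 numbers above could be computed from two sets of timestamps (fabricated sample values; the real numbers come from the density test's own measurements):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile sorts the durations in place and returns the p-th percentile
// (0 <= p <= 1) using the nearest-rank method.
func percentile(lags []time.Duration, p float64) time.Duration {
	sort.Slice(lags, func(i, j int) bool { return lags[i] < lags[j] })
	return lags[int(float64(len(lags)-1)*p)]
}

func main() {
	base := time.Now()
	// "kubelet observed the pod running" vs. "test observed the status update"
	// - fabricated sample timestamps, just to show the computation.
	kubeletObserved := []time.Time{base, base.Add(1 * time.Second), base.Add(2 * time.Second)}
	testObserved := []time.Time{base.Add(900 * time.Millisecond), base.Add(2200 * time.Millisecond), base.Add(5 * time.Second)}

	lags := make([]time.Duration, len(kubeletObserved))
	for i := range lags {
		lags[i] = testObserved[i].Sub(kubeletObserved[i])
	}
	fmt.Printf("perc50: %v, perc90: %v, perc99: %v\n",
		percentile(lags, 0.50), percentile(lags, 0.90), percentile(lags, 0.99))
}
```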

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

I think the next step is to:

  • look into the slowest pods (there seems to be an O(1s) difference between the slowest ones too) and see whether the difference is "before" or "after" the update status request was sent

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

So I looked into example pods. And I'm already seeing this:

I0209 10:01:19.960823       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (1.615907ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
...
I0209 10:01:22.464046       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (279.153µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
I0209 10:01:23.468353       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (218.216µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
I0209 10:01:24.470944       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (1.42987ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
I0209 09:57:01.559125       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (1.836423ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
...
I0209 09:57:04.452830       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (231.2µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
I0209 09:57:05.454274       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (213.872µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
I0209 09:57:06.458831       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (2.13295ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
I0209 10:01:53.063575       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (1.410064ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:3391
...
I0209 10:01:55.259949       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (10.4894ms) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]
I0209 10:01:56.266377       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (233.931µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]
I0209 10:01:57.269427       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (182.035µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]
I0209 10:01:58.290456       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (13.44863ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]

So it seems pretty clear that the problem is related to "429"s.
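
A quick, rough way to quantify that is to count response codes per kubelet for the PUT .../status calls in the apiserver log. A sketch (with shortened sample lines for illustration; in practice you would feed it the real kube-apiserver log):

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// statusLine matches apiserver log entries of the shape pasted above and
// captures the request path, the latency, the HTTP code, and the client
// (kubelet) IP.
var statusLine = regexp.MustCompile(`PUT (\S+/pods/\S+/status): \(([^)]+)\) (\d{3}) .* (\d+\.\d+\.\d+\.\d+):\d+`)

func main() {
	// Shortened sample lines; read the real kube-apiserver log in practice.
	logs := `I0209 10:01:22.464046 1 wrap.go:42] PUT /api/v1/namespaces/ns-1/pods/pod-1/status: (279.153µs) 429 [[kubelet/v1.10.0] 35.196.200.5:37414]
I0209 10:01:24.470944 1 wrap.go:42] PUT /api/v1/namespaces/ns-1/pods/pod-1/status: (1.42987ms) 200 [[kubelet/v1.10.0] 35.196.200.5:37414]`

	codes := map[string]map[string]int{} // kubelet IP -> HTTP code -> count
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if m := statusLine.FindStringSubmatch(sc.Text()); m != nil {
			ip, code := m[4], m[3]
			if codes[ip] == nil {
				codes[ip] = map[string]int{}
			}
			codes[ip][code]++
		}
	}
	fmt.Println(codes) // map[35.196.200.5:map[200:1 429:1]]
}
```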

@jdumars
Member

jdumars commented Mar 2, 2018

Are those throttled API calls due to a quota on the owner account?

@wojtek-t
Member

wojtek-t commented Mar 2, 2018

Are those throttled API calls due to a quota on the owner account?

This isn't throttling as I thought initially. These are 429s from the apiserver (the reason may be either the apiserver being slower for some reason, or more requests coming to the apiserver).
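
For context, apiserver 429s typically come from its max-in-flight limiting (kube-apiserver's `--max-requests-inflight` / `--max-mutating-requests-inflight` flags): once too many requests are being served concurrently, additional ones are rejected with 429 and a Retry-After header. A toy sketch of that mechanism (not kube-apiserver's actual code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// withMaxInFlight rejects requests with 429 once more than `limit` requests
// are being served concurrently - a toy version of max-in-flight limiting.
func withMaxInFlight(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", withMaxInFlight(400, backend)))
}
```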

@jdumars
Member

jdumars commented Mar 2, 2018

Oh, ok. That's not great news.

@jdumars jdumars added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 2, 2018
@Random-Liu
Member

@jdumars np~ @yujuhong helped me a lot!

@yujuhong
Contributor

But we do want to ask what PodScheduled means for pods which are not scheduled by the scheduler. Do we really need this condition for those pods? /cc @kubernetes/sig-autoscaling-bugs, because @yujuhong told me that PodScheduled is used by autoscaling now.

I still think letting kubelet set the PodScheduled condition is somewhat strange (as I noted in the original PR). Even if kubelet didn't set this condition, it wouldn't affect the cluster autoscaler, since the autoscaler ignores pods without that specific condition. Anyway, the fix we eventually came up with has a very minor footprint and maintains the current behavior (i.e., always setting the PodScheduled condition), so we'll go with that.
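
To illustrate the general shape of such a fix (a hedged sketch using k8s.io/api types and a hypothetical helper name, not the actual change in #61504): set the PodScheduled condition only when it is missing or not yet True, so repeated status syncs for static pods stop generating updates that differ only in that condition.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ensurePodScheduled (hypothetical helper) sets the PodScheduled condition to
// True only if it is missing or not already True, and reports whether the
// status actually changed. When it returns false, the kubelet-side sync can
// skip sending another status update to the apiserver, avoiding the
// continuous PUT .../status churn for static pods described above.
func ensurePodScheduled(status *v1.PodStatus) bool {
	for i := range status.Conditions {
		if status.Conditions[i].Type == v1.PodScheduled {
			if status.Conditions[i].Status == v1.ConditionTrue {
				return false // already set; nothing to update
			}
			status.Conditions[i].Status = v1.ConditionTrue
			status.Conditions[i].LastTransitionTime = metav1.Now()
			return true
		}
	}
	status.Conditions = append(status.Conditions, v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	})
	return true
}

func main() {
	s := v1.PodStatus{}
	fmt.Println(ensurePodScheduled(&s)) // true: condition added
	fmt.Println(ensurePodScheduled(&s)) // false: no further update needed
}
```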

@yujuhong
Contributor

Also, revived the really old issue for adding tests for steady-state pod update rate #14391

@wojtek-t
Member

Anyway, the fix we eventually came up has very minor footprint, and would maintain the current behavior (i.e., always setting the PodScheduled condition), so we'll go with that.

@yujuhong - are you talking about this one: #61504 (or do I misunderstand it)?

@wasylkowski @shyamjvs - can you please run 5000-node tests with that PR patched locally (before we merge it) to ensure that this really helps?

@shyamjvs
Member

I ran the test against 1.10 HEAD + #61504, and the pod-startup latency seems to be fine:

INFO: perc50: 2.594635536s, perc90: 3.483550118s, perc99: 4.327417676s

Will re-run once more to confirm.

@wojtek-t
Member

@shyamjvs - thanks a lot!

mfojtik pushed a commit to mfojtik/kubernetes that referenced this issue Mar 22, 2018
…-phases-as-metrics

Automatic merge from submit-queue (batch tested with PRs 61378, 60915, 61499, 61507, 61478). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Capture pod startup phases as metrics

Learning from kubernetes#60589, we should also start collecting and graphing sub-parts of pod-startup latency.

/sig scalability
/kind feature
/priority important-soon
/cc @wojtek-t 

```release-note
NONE
```
@shyamjvs
Member

Second run also seems good:

INFO: perc50: 2.583489146s, perc90: 3.466873901s, perc99: 4.380595534s

I'm fairly confident now that the fix did the trick. Let's get it into 1.10 ASAP.

@wojtek-t
Member

Thanks @shyamjvs

As we talked offline - I think we had one more regression in the last month or so, but that one shouldn't block the release.

@yujuhong
Contributor

@yujuhong - are you talking about this one: #61504 (or do I misunderstand it)?

Yep. The current fix in that PR is not among the options proposed initially in #60589 (comment).

k8s-github-robot pushed a commit that referenced this issue Mar 24, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fix `PodScheduled` bug for static pod.

Fixes #60589.

This is an implementation of option 2 in #60589 (comment).
I've validated this in my own cluster, and there are no longer continuous status updates for static pods.

Signed-off-by: Lantao Liu <lantaol@google.com>



**Release note**:

```release-note
none
```
@jberkus

jberkus commented Mar 25, 2018

re-opening until we have a good performance test result.

@jberkus jberkus reopened this Mar 25, 2018
@jdumars
Member

jdumars commented Mar 26, 2018

@yujuhong @krzyzacy @shyamjvs @wojtek-t @Random-Liu @wasylkowski-a any updates on this? This is still blocking 1.10 at the moment.

@shyamjvs
Member

So the only part of this bug that was blocking the release is the 5k-node performance job. Unfortunately, we lost our run from today due to a different reason (ref: #61190 (comment))

That said, we're fairly confident the fix works based on my manual runs (results pasted in #60589 (comment)). So IMHO we don't need to block release on it (the next run's going to be on wed).

@wojtek-t
Member

wojtek-t commented Mar 26, 2018

+1
@jdumars - I think we can treat this as non-blocker.

@wojtek-t
Member

Sorry, I edited my post above. I meant that we should treat it as "non-blocker".

@jdumars
Member

jdumars commented Mar 26, 2018

Ok, thank you very much. This conclusion represents a tremendous number of hours you have invested, and I cannot possibly thank you all enough for the work you have done. While we talk in the abstract about "community" and "contributors", you and the others who have worked on this issue represent it in concrete terms. You are the very heart and soul of this project, and I know I speak for everyone involved when I say it is an honor to work alongside such passion, commitment, and professionalism.

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process

@krzyzacy @msau42 @shyamjvs @wojtek-t

Issue Labels
  • sig/api-machinery sig/autoscaling sig/node sig/scalability sig/scheduling sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

k8s-github-robot pushed a commit that referenced this issue Apr 12, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Update default etcd server to 3.2 for kubernetes 1.11

Reapply #59836 but with the latest etcd 3.2 patch version (3.2.18, which includes the mvcc fix and the leader-election timeout fix) and default `--snapshot-count` to 10k, to resolve the performance regression seen in the previous etcd 3.2 server upgrade attempt (#60589 (comment)).

See #60589 (comment) for details on the root cause of the performance regression and scalability test results of setting `--snapshot-count` to 10k.

```release-note
Upgrade the default etcd server version to 3.2.18
```
@gyuho @shyamjvs @jdumars @timothysc
k8s-publishing-bot added a commit to kubernetes/sample-apiserver that referenced this issue Apr 12, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Update default etcd server to 3.2 for kubernetes 1.11

Reapply #59836 but with the latest etcd 3.2 patch version (3.2.18, which includes the mvcc fix and the leader-election timeout fix) and default `--snapshot-count` to 10k, to resolve the performance regression seen in the previous etcd 3.2 server upgrade attempt (kubernetes/kubernetes#60589 (comment)).

See kubernetes/kubernetes#60589 (comment) for details on the root cause of the performance regression and scalability test results of setting `--snapshot-count` to 10k.

```release-note
Upgrade the default etcd server version to 3.2.18
```
@gyuho @shyamjvs @jdumars @timothysc

Kubernetes-commit: 9816b431886e356dfc96873392dd3ac66b5f0fe7
k8s-publishing-bot added a commit to kubernetes/apiextensions-apiserver that referenced this issue Apr 12, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Update default etcd server to 3.2 for kubernetes 1.11

Reapply #59836 but with the latest etcd 3.2 patch version (3.2.18, which includes the mvcc fix and the leader-election timeout fix) and default `--snapshot-count` to 10k, to resolve the performance regression seen in the previous etcd 3.2 server upgrade attempt (kubernetes/kubernetes#60589 (comment)).

See kubernetes/kubernetes#60589 (comment) for details on the root cause of the performance regression and scalability test results of setting `--snapshot-count` to 10k.

```release-note
Upgrade the default etcd server version to 3.2.18
```
@gyuho @shyamjvs @jdumars @timothysc

Kubernetes-commit: 9816b431886e356dfc96873392dd3ac66b5f0fe7
@shyamjvs
Member

This issue was resolved with the relevant fixes in place for 1.10.
For 1.11, we're tracking the failures under #63030.

/close

prameshj pushed a commit to prameshj/kubernetes that referenced this issue Jun 1, 2018
openshift-publish-robot pushed a commit to openshift/kubernetes-sample-apiserver that referenced this issue Jan 14, 2019
Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Rollback etcd server version to 3.1.11 due to #60589

Ref kubernetes/kubernetes#60589 (comment)

The dependencies were a bit complex (so many things rely on it), plus the version had been updated to 3.2.16 on top of the original bump.
So I mostly had to make manual reverting changes on a case-by-case basis - so they're likely to have errors :)

/cc @wojtek-t @jpbetz 

```release-note
Downgrade default etcd server version to 3.1.11 due to #60589
```

(I'm not sure if we should instead remove release-notes of the original PRs)

Kubernetes-commit: 56195fd1d329e8fb6c3c6cba59e1bc1eb4a2c998
openshift-publish-robot pushed a commit to openshift/kubernetes-sample-apiserver that referenced this issue Jan 14, 2019
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Update default etcd server to 3.2 for kubernetes 1.11

Reapply #59836 but with the latest etcd 3.2 patch version (3.2.18, which includes the mvcc fix and the leader-election timeout fix) and default `--snapshot-count` to 10k, to resolve the performance regression seen in the previous etcd 3.2 server upgrade attempt (kubernetes/kubernetes#60589 (comment)).

See kubernetes/kubernetes#60589 (comment) for details on the root cause of the performance regression and scalability test results of setting `--snapshot-count` to 10k.

```release-note
Upgrade the default etcd server version to 3.2.18
```
@gyuho @shyamjvs @jdumars @timothysc

Kubernetes-commit: 9816b431886e356dfc96873392dd3ac66b5f0fe7