Optimize build cluster performance #3890

Open · 2 of 7 tasks
howardjohn opened this issue Feb 28, 2022 · 2 comments

howardjohn (Member) commented Feb 28, 2022

This issue aims to track all things related to optimizing our build cluster performance.

We have done a lot of work to reduce test flakes, but we still see them relatively often. In a large number of cases, these appear to occur when things that should always succeed fail for reasons unrelated to poorly written tests or buggy Istio code; for example, simple HTTP requests time out after many seconds.

We have had two similar issues in the past:

  • Prow cluster resource leak #1988 was caused by not properly cleaning up resources, leading to a large number of leftover resources accumulating in the cluster over time. This was fixed by ensuring we clean up (through many different mechanisms).
  • Test stability regression istio#32985: jobs suddenly hung a lot, with echo taking over 60s in some cases. This was triggered by a node upgrade in GKE; we switched from Ubuntu to COS to mitigate it. The root cause is unknown to date.

Current state:

  • Tests often fail for reasons that are likely explained by node performance (i.e. a trivial command is throttled heavily for N seconds, and the test is not robust against this). While we expect our tests to be robust against this to some degree, N sometimes appears to be extremely large. For example, we have a lot of tests that send 5 requests and expect all 5 to succeed, with many retries, under a 30s timeout; a sketch of this pattern follows below. These fail relatively often.
  • We have a metric that captures the time it takes to run echo. On a healthy machine, this should, of course, take near 0ms. We often see it spike, correlated with increased CPU usage; a sketch of such a probe also follows below.
    (screenshots: echo-duration graphs captured 2022-02-28)

The top graph is grouped by node type; the bottom shows all nodes. You can see spikes up to 2.5s. Note: the node-type graph is likely misleading; we have a small fixed number of n2/t2d nodes but a large, dynamic number of e2 nodes. This means there are more samples for e2 and it has more cache misses.
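
To make the first bullet concrete, here is a minimal, illustrative sketch of that request pattern in Go: send five requests, require all of them to succeed, and retry the whole check within a 30s budget. The helper names (`sendRequest`, `checkFiveRequests`) and the URL are hypothetical, and this is not the actual Istio test framework code; the point is that a node stall of a few seconds per request can exhaust the budget even when nothing is functionally broken.

```go
// Illustrative sketch only (not the actual Istio test framework): a check that
// sends 5 requests and requires all of them to succeed, retried under a 30s
// budget. A node-level stall that adds a few seconds per request can burn
// through the whole budget even though nothing is functionally broken.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// sendRequest is a stand-in for a single request made by a test.
func sendRequest(ctx context.Context, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// checkFiveRequests retries "5/5 requests succeed" until the 30s budget expires.
func checkFiveRequests(url string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	var lastErr error
	for ctx.Err() == nil {
		lastErr = nil
		for i := 0; i < 5; i++ {
			if err := sendRequest(ctx, url); err != nil {
				lastErr = err
				break
			}
		}
		if lastErr == nil {
			return nil
		}
		time.Sleep(time.Second) // back off before the next attempt
	}
	return fmt.Errorf("check did not pass within 30s: %w", lastErr)
}

func main() {
	// Hypothetical echo endpoint, for illustration only.
	if err := checkFiveRequests("http://echo.test.svc.cluster.local:8080"); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("PASS")
}
```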

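And a rough sketch of how the echo-duration metric from the second bullet could be measured: time a trivial command with a wall clock and export the result. The metric name and the reporting here are made up for illustration; the actual probe and export pipeline differ.

```go
// Rough sketch of a node-health probe that times a trivial command.
// The real metric pipeline differs; this only shows the measurement idea.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	for {
		start := time.Now()
		// `echo` should complete in well under a millisecond on a healthy node.
		if err := exec.Command("echo", "ok").Run(); err != nil {
			log.Printf("echo failed: %v", err)
		}
		elapsed := time.Since(start)
		// In the real setup this would be exported as a metric; here we just log it.
		log.Printf("echo_duration_ms=%d", elapsed.Milliseconds())
		time.Sleep(15 * time.Second)
	}
}
```
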
Things to try:

  • Setting CPU limits: 9dadd37. No tangible improvements in any metric.
  • Guaranteed QoS test pods (a superset of CPU limits); see the sketch after this list.
  • kubelet static CPU policy (a superset of Guaranteed QoS)
  • Running other node types (n2, t2d). Currently trialing this. No conclusive data.
  • Using local SSDs. Currently we run 512/256 GB pd-ssd. There is evidence we are IO bound in some portion of tests - graphs show our bandwidth is often at the cap, and we do see up to 8 MB/s of write throttling. However, there is no evidence that removing the bottleneck would change test results; most of our tests are not IO bound. kind etcd runs in tmpfs and should be unimpacted. Local SSDs are actually cheaper and far faster; however, they require n2 nodes.
  • Increasing CPU requests on some jobs. d28ae63 and 3a0765c put the most expensive jobs at 15 CPUs, ensuring dedicated nodes. Since this change, unit test runtime has dropped substantially, but there is not yet strong evidence that it affects the flakiness of other tests.
  • Build once, test in many places. Currently we build all docker images N times, and some test binaries N times. This is fairly expensive even with a cache. It would be ideal to build once - possibly on some giant nodes - and then just run the tests locally. This is likely a massive effort.
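
For the Guaranteed QoS item above, a hedged sketch of what the pod resources would have to look like, written against the Kubernetes Go API types: every container sets requests equal to limits for both CPU and memory. The 15 CPU / 16Gi values are illustrative (loosely mirroring the "most expensive jobs at 15 CPUs" change), not the actual prow job config. Note that the kubelet static CPU manager policy additionally requires Guaranteed QoS plus integer CPU requests before a pod gets exclusive cores.

```go
// Illustrative only: a pod lands in the Guaranteed QoS class when every
// container sets CPU and memory requests equal to its limits. Values are
// examples, not the real prow job configuration.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func guaranteedResources() corev1.ResourceRequirements {
	rl := corev1.ResourceList{
		// An integer CPU request also satisfies the static CPU manager policy.
		corev1.ResourceCPU:    resource.MustParse("15"),
		corev1.ResourceMemory: resource.MustParse("16Gi"),
	}
	// Requests == Limits for all resources => Guaranteed QoS.
	return corev1.ResourceRequirements{Requests: rl, Limits: rl}
}

func main() {
	rr := guaranteedResources()
	fmt.Println("cpu:", rr.Requests.Cpu().String(), "memory:", rr.Requests.Memory().String())
}
```
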
@howardjohn (Member, Author)

Starting to look at co-located jobs during flakes


Flake: https://prow.istio.io/view/gs/istio-prow/logs/integ-k8s-120_istio_postsubmit/1498955294602956800 - a mysterious exit code 141 (OOM?) at 2022-03-02T09:40:43.788454Z.

Co-located with integ-pilot (started at the same time) and integ-assertion (deep into its run).

Total memory used by all three jobs is only 12 GB, which is not very concerning.

https://pantheon.corp.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22kubernetes.io%2Fcontainer%2Fcpu%2Fcore_usage_time%5C%22%20resource.type%3D%5C%22k8s_container%5C%22%20metadata.system_labels.%5C%22node_name%5C%22%3D%5C%22gke-prow-istio-test-pool-cos-1104557f-xmm6%5C%22%20resource.label.%5C%22cluster_name%5C%22%3D%5C%22prow%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22metadata.user_labels.%5C%22prow.k8s.io%2Fid%5C%22%22,%22metadata.user_labels.%5C%22prow.k8s.io%2Fjob%5C%22%22%5D%7D,%7B%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%221w%22%7D,%22xZoomDomain%22:%7B%22start%22:%222022-03-02T05:26:15.292Z%22,%22end%22:%222022-03-02T12:34:36.558Z%22%7D%7D&project=istio-prow-build
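
For reference, the Pantheon link above is essentially a filter on kubernetes.io/container/cpu/core_usage_time for containers on a specific node, summed per prow job. A rough sketch of pulling similar data with the Cloud Monitoring Go client is below; the node name and time window are placeholders, and the per-job grouping/alignment from the console query is not reproduced.

```go
// Rough sketch: list per-container CPU usage time series for one node in the
// istio-prow-build project over a recent window. The filter mirrors the console
// link above; the aggregation/grouping from the console query is omitted.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	ctx := context.Background()
	client, err := monitoring.NewMetricClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	end := time.Now()
	req := &monitoringpb.ListTimeSeriesRequest{
		Name: "projects/istio-prow-build",
		Filter: `metric.type="kubernetes.io/container/cpu/core_usage_time"` +
			` resource.type="k8s_container"` +
			` metadata.system_labels."node_name"="gke-prow-istio-test-pool-cos-1104557f-xmm6"`,
		Interval: &monitoringpb.TimeInterval{
			StartTime: timestamppb.New(end.Add(-6 * time.Hour)),
			EndTime:   timestamppb.New(end),
		},
		View: monitoringpb.ListTimeSeriesRequest_FULL,
	}

	it := client.ListTimeSeries(ctx, req)
	for {
		ts, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(ts.GetResource().GetLabels()["container_name"], len(ts.GetPoints()), "points")
	}
}
```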


https://prow.istio.io/view/gs/istio-prow/logs/integ-security-multicluster_istio_postsubmit/1498866650949095424

TestReachability/global-plaintext/b_in_primary/tcp_to_headless:tcp_positive failure at 2022-03-02T04:10:57.231356Z

Co-located with integ-k8s-119, which started at the same time. It was using near-zero CPU at the time of the test failure; it was literally doing nothing (a bug of its own).

https://pantheon.corp.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22kubernetes.io%2Fcontainer%2Fcpu%2Fcore_usage_time%5C%22%20resource.type%3D%5C%22k8s_container%5C%22%20metadata.system_labels.%5C%22node_name%5C%22%3D%5C%22gke-prow-istio-test-pool-cos-1104557f-qnxm%5C%22%20resource.label.%5C%22cluster_name%5C%22%3D%5C%22prow%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22metadata.user_labels.%5C%22prow.k8s.io%2Fid%5C%22%22,%22metadata.user_labels.%5C%22prow.k8s.io%2Fjob%5C%22%22%5D%7D,%7B%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%221d%22%7D,%22xZoomDomain%22:%7B%22start%22:%222022-03-02T03:42:26.942Z%22,%22end%22:%222022-03-02T05:25:18.371Z%22%7D%7D&project=istio-prow-build

@howardjohn (Member, Author)

https://prow.istio.io/view/gs/istio-prow/logs/integ-security-multicluster_istio_postsubmit/1498793275824279552 - a double failure!

TestReachability/beta-mtls-permissive/b_in_primary/tcp_to_b:tcp_positive at 2022-03-01T23:10:08.013392Z
TestMtlsStrictK8sCA/global-mtls-on-no-dr/b_in_remote/tcp_to_a:tcp_positive at 2022-03-01T23:18:19.563414Z

Co-scheduled with a distroless job that started after. The distroless job peaks in CPU from 23:00 but is done by 23:05, well before the failures.

Also co-scheduled with a Helm test, which runs from 23:10 to 23:16, so it really shouldn't overlap with either of the failures, though it is close.

So in the three cases I looked at, co-scheduling doesn't seem to be related.
