
Testgrid for integration tests is broken #2097

Closed
alculquicondor opened this issue Apr 29, 2024 · 11 comments · Fixed by #2102

@alculquicondor
Contributor

What happened:

The testgrid shows an error for Overall and doesn't show the individual tests.

Other testgrids (E2E, unit) look fine.

What you expected to happen:

A line for every test.

How to reproduce it (as minimally and precisely as possible):

https://testgrid.k8s.io/sig-scheduling#pull-kueue-test-integration-main&width=20

Anything else we need to know?:

We have lost the history from the last time it worked.

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label Apr 29, 2024
@alculquicondor
Contributor Author

The only difference in the presubmit configuration is that main is running on Go 1.22.

@alculquicondor
Contributor Author

/assign @gabesaba

@k8s-ci-robot
Contributor

@alculquicondor: GitHub didn't allow me to assign the following users: gabesaba.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @gabesaba

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabesaba
Contributor

Probably because the junit output is too large: there's an error message saying it is malformed because it is over 100MB.

We could either bump this limit or reduce verbosity.

@gabesaba
Contributor

/assign @gabesaba

@gabesaba
Contributor

After removing the -v ginkgo flag from the test-integration target, junit.xml went from 204MB to 88MB. This would fix the issue, but we're still close to the limit.

Looking next into any particularly spammy logs.

@gabesaba
Contributor

Much of the size can be attributed to a few tests. Below are the tests with output > 1MB (after HTML unescaping, so the actual size is larger):

| size | test name | suite name |
| --- | --- | --- |
| 40.906763MB | Scheduler when Queueing with StrictFIFO Should report pending workloads properly when blocked | Scheduler Suite |
| 26.427274MB | Scheduler when Queueing with StrictFIFO Should allow mutating the requeueingStrategy | Scheduler Suite |
| 16.314269MB | Scheduler when Queueing with StrictFIFO Should schedule workloads by their priority strictly | Scheduler Suite |
| 10.625324MB | Preemption In a cohort with StrictFIFO Should reclaim from cohort even if another CQ has pending workloads | Scheduler Suite |
| 10.117538MB | Scheduler when Preemption is enabled Admits workloads respecting fair share | Scheduler Fair Sharing Suite |
| 7.241802MB | Scheduler when Using cohorts for sharing unused resources Should start workloads that are under min quota before borrowing | Scheduler Suite |
| 3.692636MB | Scheduler when Queueing with StrictFIFO Pending workload with StrictFIFO doesn't block other CQ from borrowing from a third CQ | Scheduler Suite |
| 2.651555MB | Preemption In a ClusterQueue that is part of a cohort Should preempt all necessary workloads in concurrent scheduling with different priorities | Scheduler Suite |
| 2.404417MB | Preemption In a single ClusterQueue Should preempt Workloads with lower priority when there is not enough quota | Scheduler Suite |
| 2.389974MB | Preemption When lending limit enabled Should be able to preempt when lending limit enabled | Scheduler Suite |
| 2.365368MB | Preemption In a single ClusterQueue Should preempt newer Workloads with the same priority when there is not enough quota | Scheduler Suite |
| 2.359989MB | Scheduler when Using cohorts for sharing unused resources Should preempt before try next flavor | Scheduler Suite |
| 2.338912MB | Scheduler when Scheduling workloads on clusterQueues Should admit workloads when resources are dynamically reclaimed | Scheduler Suite |
| 2.167817MB | Preemption In a ClusterQueue that is part of a cohort Should preempt all necessary workloads in concurrent scheduling with the same priority | Scheduler Suite |
| 2.062887MB | Preemption When most quota is in a shared ClusterQueue in a cohort should allow preempting workloads while borrowing | Scheduler Suite |
| 1.989364MB | Preemption In a ClusterQueue that is part of a cohort Should preempt Workloads in the cohort borrowing quota when the ClusterQueue is using less than nominal quota | |
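
A minimal sketch of how per-test sizes like the ones above could be computed from the JUnit report, assuming the usual layout where per-test output lands in `<system-out>`/`<system-err>` elements; the struct layout, file name, and threshold are illustrative, not the tooling actually used here:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// Minimal structs for the parts of a JUnit report needed here.
type junitReport struct {
	Suites []struct {
		Name  string `xml:"name,attr"`
		Cases []struct {
			Name      string `xml:"name,attr"`
			SystemOut string `xml:"system-out"`
			SystemErr string `xml:"system-err"`
		} `xml:"testcase"`
	} `xml:"testsuite"`
}

func main() {
	data, err := os.ReadFile("junit.xml")
	if err != nil {
		panic(err)
	}
	var report junitReport
	if err := xml.Unmarshal(data, &report); err != nil {
		panic(err)
	}
	// xml.Unmarshal already unescapes entities, so the sizes reflect the
	// unescaped output, as in the table above.
	for _, s := range report.Suites {
		for _, c := range s.Cases {
			size := len(c.SystemOut) + len(c.SystemErr)
			if size > 1_000_000 { // only report tests with more than ~1MB of output
				fmt.Printf("%.6fMB\t%s\t%s\n", float64(size)/1_000_000, c.Name, s.Name)
			}
		}
	}
}
```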

@gabesaba
Contributor

gabesaba commented Apr 30, 2024

Within those tests, attributing the output to specific logging call sites:

| size | #lines | line |
| --- | --- | --- |
| 10MB | 86649 | queue/manager.go:475 |
| 32MB | 93902 | scheduler/logging.go:40 |
| 38MB | 93985 | recorder/recorder.go:104 |
| 27MB | 93826 | scheduler/scheduler.go:617 |
| 5MB | 20762 | preemption/preemption.go:175 |
| 23MB | 62530 | scheduler/scheduler.go:262 |

We're repeatedly reconciling unschedulable workloads without any backoff. Should there be a backoff here? I imagine that a backoff of even a fraction of a second would drastically reduce the logging output.
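
A rough sketch of one way such a per-call-site breakdown could be produced, assuming log lines carry a `pkg/file.go:line` caller reference as in the table above; this is illustrative, not the analysis script actually used:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Group log output read from stdin by the "pkg/file.go:line" caller reference
// embedded in each line, and total the bytes and line count per call site.
func main() {
	callerRE := regexp.MustCompile(`[\w./-]+\.go:\d+`)
	bytesPer := map[string]int{}
	linesPer := map[string]int{}

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // tolerate long log lines
	for scanner.Scan() {
		line := scanner.Text()
		if caller := callerRE.FindString(line); caller != "" {
			bytesPer[caller] += len(line) + 1 // +1 for the newline
			linesPer[caller]++
		}
	}
	for caller, size := range bytesPer {
		fmt.Printf("%dMB\t%d\t%s\n", size/1_000_000, linesPer[caller], caller)
	}
}
```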

@gabesaba
Contributor

From 212MB to 15MB after changing scheduler.go:128 to 10ms:

$ wc -c before.xml fix.xml 
211955097 before.xml
 14650361 fix.xml

@alculquicondor
Contributor Author

alculquicondor commented Apr 30, 2024

Hmm, maybe we can use a NewItemExponentialFailureRateLimiter to apply a backoff when we couldn't admit any workload in that iteration, and clear the backoff anytime we successfully admit something.
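
A minimal sketch of that idea using client-go's workqueue rate limiter; the wrapper type, names, and delay values are illustrative and not necessarily what the eventual fix does:

```go
package scheduler

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// admissionBackoff wraps an exponential per-item rate limiter: each scheduling
// iteration that fails to admit a workload waits a bit longer before requeueing
// it, and a successful admission resets the backoff.
type admissionBackoff struct {
	limiter workqueue.RateLimiter
}

func newAdmissionBackoff() *admissionBackoff {
	// Base and cap are placeholders; per the experiment above, even ~10ms
	// already cuts the repeated log output dramatically.
	return &admissionBackoff{
		limiter: workqueue.NewItemExponentialFailureRateLimiter(10*time.Millisecond, 5*time.Second),
	}
}

// failed returns how long to wait before retrying a workload that could not be
// admitted in this iteration.
func (b *admissionBackoff) failed(workloadKey string) time.Duration {
	return b.limiter.When(workloadKey)
}

// admitted clears any accumulated backoff for the workload.
func (b *admissionBackoff) admitted(workloadKey string) {
	b.limiter.Forget(workloadKey)
}
```

The scheduler would then requeue an unschedulable workload after `failed(key)` instead of immediately, which matches the observation above that even a small delay removes most of the repeated log lines.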
