
Testgrid for integration tests is broken #2097

Closed
alculquicondor opened this issue Apr 29, 2024 · 11 comments · Fixed by #2102

@alculquicondor
Contributor

What happened:

The testgrid shows an error for Overall and doesn't show the individual tests.

Other testgrids (E2E, unit) look fine.

What you expected to happen:

A line for every test.

How to reproduce it (as minimally and precisely as possible):

https://testgrid.k8s.io/sig-scheduling#pull-kueue-test-integration-main&width=20

Anything else we need to know?:

We have lost the history from the last time it worked.

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label Apr 29, 2024
@alculquicondor
Contributor Author

The only difference in the presubmit configuration is that main is running on Go 1.22.

@alculquicondor
Contributor Author

/assign @gabesaba

@k8s-ci-robot
Contributor

@alculquicondor: GitHub didn't allow me to assign the following users: gabesaba.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @gabesaba

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabesaba
Contributor

Probably because the junit output is too large: there's an error message saying it is malformed because it is over 100MB.

We could either bump this limit or reduce verbosity.

@gabesaba
Contributor

/assign @gabesaba

@gabesaba
Contributor

After removing the -v ginkgo flag from the test-integration target, junit.xml went from 204MB to 88MB. This would fix the issue, but we're still close to the limit.

Looking next into any particularly spammy logs.

@gabesaba
Contributor

Much of the size can be attributed to a few tests. Below are the tests with output > 1MB (after HTML unescaping, so the actual size is larger):

| size | test name | suite name |
| --- | --- | --- |
| 40.906763MB | Scheduler when Queueing with StrictFIFO Should report pending workloads properly when blocked | Scheduler Suite |
| 26.427274MB | Scheduler when Queueing with StrictFIFO Should allow mutating the requeueingStrategy | Scheduler Suite |
| 16.314269MB | Scheduler when Queueing with StrictFIFO Should schedule workloads by their priority strictly | Scheduler Suite |
| 10.625324MB | Preemption In a cohort with StrictFIFO Should reclaim from cohort even if another CQ has pending workloads | Scheduler Suite |
| 10.117538MB | Scheduler when Preemption is enabled Admits workloads respecting fair share | Scheduler Fair Sharing Suite |
| 7.241802MB | Scheduler when Using cohorts for sharing unused resources Should start workloads that are under min quota before borrowing | Scheduler Suite |
| 3.692636MB | Scheduler when Queueing with StrictFIFO Pending workload with StrictFIFO doesn't block other CQ from borrowing from a third CQ | Scheduler Suite |
| 2.651555MB | Preemption In a ClusterQueue that is part of a cohort Should preempt all necessary workloads in concurrent scheduling with different priorities | Scheduler Suite |
| 2.404417MB | Preemption In a single ClusterQueue Should preempt Workloads with lower priority when there is not enough quota | Scheduler Suite |
| 2.389974MB | Preemption When lending limit enabled Should be able to preempt when lending limit enabled | Scheduler Suite |
| 2.365368MB | Preemption In a single ClusterQueue Should preempt newer Workloads with the same priority when there is not enough quota | Scheduler Suite |
| 2.359989MB | Scheduler when Using cohorts for sharing unused resources Should preempt before try next flavor | Scheduler Suite |
| 2.338912MB | Scheduler when Scheduling workloads on clusterQueues Should admit workloads when resources are dynamically reclaimed | Scheduler Suite |
| 2.167817MB | Preemption In a ClusterQueue that is part of a cohort Should preempt all necessary workloads in concurrent scheduling with the same priority | Scheduler Suite |
| 2.062887MB | Preemption When most quota is in a shared ClusterQueue in a cohort should allow preempting workloads while borrowing | Scheduler Suite |
| 1.989364MB | Preemption In a ClusterQueue that is part of a cohort Should preempt Workloads in the cohort borrowing quota when the ClusterQueue is using less than nominal quota | |
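
A minimal sketch of how per-test sizes like the ones above could be computed from the JUnit report, assuming the usual layout where per-test output lands in `<system-out>`/`<system-err>` elements; the struct layout, file name, and threshold are illustrative, not the tooling actually used here:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// Minimal structs for the parts of a JUnit report needed here.
type junitReport struct {
	Suites []struct {
		Name  string `xml:"name,attr"`
		Cases []struct {
			Name      string `xml:"name,attr"`
			SystemOut string `xml:"system-out"`
			SystemErr string `xml:"system-err"`
		} `xml:"testcase"`
	} `xml:"testsuite"`
}

func main() {
	data, err := os.ReadFile("junit.xml")
	if err != nil {
		panic(err)
	}
	var report junitReport
	if err := xml.Unmarshal(data, &report); err != nil {
		panic(err)
	}
	// xml.Unmarshal already unescapes entities, so the sizes reflect the
	// unescaped output, as in the table above.
	for _, s := range report.Suites {
		for _, c := range s.Cases {
			size := len(c.SystemOut) + len(c.SystemErr)
			if size > 1_000_000 { // only report tests with more than ~1MB of output
				fmt.Printf("%.6fMB\t%s\t%s\n", float64(size)/1_000_000, c.Name, s.Name)
			}
		}
	}
}
```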

@gabesaba
Contributor

gabesaba commented Apr 30, 2024

Within those tests, attributing the output to specific logging call sites:

| size | #lines | line |
| --- | --- | --- |
| 10MB | 86649 | queue/manager.go:475 |
| 32MB | 93902 | scheduler/logging.go:40 |
| 38MB | 93985 | recorder/recorder.go:104 |
| 27MB | 93826 | scheduler/scheduler.go:617 |
| 5MB | 20762 | preemption/preemption.go:175 |
| 23MB | 62530 | scheduler/scheduler.go:262 |

We're repeatedly reconciling unschedulable workloads without any backoff. Should there be a backoff here? I imagine that a backoff of even a fraction of a second would drastically reduce the logging output.
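
A rough sketch of one way such a per-call-site breakdown could be produced, assuming log lines carry a `pkg/file.go:line` caller reference as in the table above; this is illustrative, not the analysis script actually used:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Group log output read from stdin by the "pkg/file.go:line" caller reference
// embedded in each line, and total the bytes and line count per call site.
func main() {
	callerRE := regexp.MustCompile(`[\w./-]+\.go:\d+`)
	bytesPer := map[string]int{}
	linesPer := map[string]int{}

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // tolerate long log lines
	for scanner.Scan() {
		line := scanner.Text()
		if caller := callerRE.FindString(line); caller != "" {
			bytesPer[caller] += len(line) + 1 // +1 for the newline
			linesPer[caller]++
		}
	}
	for caller, size := range bytesPer {
		fmt.Printf("%dMB\t%d\t%s\n", size/1_000_000, linesPer[caller], caller)
	}
}
```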

@gabesaba
Contributor

From 212MB to 15MB after changing scheduler.go:128 to 10ms:

$ wc -c before.xml fix.xml 
211955097 before.xml
 14650361 fix.xml

@alculquicondor
Contributor Author

alculquicondor commented Apr 30, 2024

Hmm, maybe we can use a NewItemExponentialFailureRateLimiter to apply a backoff when we couldn't admit any workload in that iteration, and clear the backoff anytime we successfully admit something.
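
A minimal sketch of that idea using client-go's workqueue rate limiter; the wrapper type, names, and delay values are illustrative and not necessarily what the eventual fix does:

```go
package scheduler

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// admissionBackoff wraps an exponential per-item rate limiter: each scheduling
// iteration that fails to admit a workload waits a bit longer before requeueing
// it, and a successful admission resets the backoff.
type admissionBackoff struct {
	limiter workqueue.RateLimiter
}

func newAdmissionBackoff() *admissionBackoff {
	// Base and cap are placeholders; per the experiment above, even ~10ms
	// already cuts the repeated log output dramatically.
	return &admissionBackoff{
		limiter: workqueue.NewItemExponentialFailureRateLimiter(10*time.Millisecond, 5*time.Second),
	}
}

// failed returns how long to wait before retrying a workload that could not be
// admitted in this iteration.
func (b *admissionBackoff) failed(workloadKey string) time.Duration {
	return b.limiter.When(workloadKey)
}

// admitted clears any accumulated backoff for the workload.
func (b *admissionBackoff) admitted(workloadKey string) {
	b.limiter.Forget(workloadKey)
}
```

The scheduler would then requeue an unschedulable workload after `failed(key)` instead of immediately, which matches the observation above that even a small delay removes most of the repeated log lines.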
