CI Failure (Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point) in CpuStressInjectionTest.test_stress_fibers_ms
#13701
Labels
area/storage
ci-failure
ci-rca/test
CI Root Cause Analysis - Test Issue
sev/low
Bugs which are non-functional paper cuts, e.g. typos, issues in log messages
Comments
I see the stress fiber started, but no response was received by rpk:
Maybe @andrwng could take a look.
VladLazar added the sev/low label Sep 27, 2023
StephanDollberg added a commit that referenced this issue Jan 24, 2024
When used in a coroutine context, `ss::coroutine::maybe_yield` saves a couple hundred instructions over `ss::maybe_yield` because the `future` machinery is not involved. This can be seen in the following godbolt: https://godbolt.org/z/Wdb441PYE

There is another difference: `ss::coroutine::maybe_yield` causes only one loop through the task queue, while `ss::maybe_yield` requires two. This is because `ss::maybe_yield` is implemented by waiting for an empty task to resolve, which requires one run through the task queue for the empty task to run and then another to yield back to the original task. The coroutine version yields back directly.

The tighter yield loop caused the `cpu_stress_injection_test` to fail reliably (it was already unstable before). The problem is that the stress fiber runs in the admin scheduling group, which starves the actual work of sending replies on the admin API. To prevent that, we move the stress fiber to the main group, which is more realistic anyway and allows it to be configured when used manually.

Fixes #13701

Co-authored-by: Travis Downs <travis.downs@redpanda.com>
StephanDollberg added a commit that referenced this issue Jan 25, 2024
The `cpu_stress_injection_test` was occasionally flaky. This was because the stress fiber runs in the admin scheduling group (as that is where it is started from). This starves the admin server itself, so it can time out while sending the actual API response.

To work around that issue, we move the stress fibers to run in the main scheduling group when invoked from the admin API. To do that, we extend the stress fiber API so that the scheduling group can be specified.

Fixes #13701
StephanDollberg added a commit that referenced this issue Jan 25, 2024
When used in a coroutine context, `ss::coroutine::maybe_yield` saves a couple hundred instructions over `ss::maybe_yield` because the `future` machinery is not involved. This can be seen in the following godbolt: https://godbolt.org/z/Wdb441PYE

There is another difference: `ss::coroutine::maybe_yield` causes only one loop through the task queue, while `ss::maybe_yield` requires two. This is because `ss::maybe_yield` is implemented by waiting for an empty task to resolve. That requires one run through the task queue for the empty task to run (which marks the future the original task waits on as ready and then enqueues the original task back into the task queue) and a second run to yield back to the original task. The coroutine version yields back directly, which makes it a little bit tighter.

Fixes #13701

Co-authored-by: Travis Downs <travis.downs@redpanda.com>
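The one-loop versus two-loop difference described in the commit message can be illustrated with a toy model. This is only a sketch: it is not Seastar's reactor, it ignores scheduling groups and preemption checks, and the function name `trips_until_resumed`, the task labels, and the FIFO queue are all assumptions made for illustration. It counts how many times work on behalf of the yielding task `A` has to wait at the back of the queue before `A` runs again.

```python
from collections import deque


def trips_until_resumed(style, background_tasks=3):
    """Toy FIFO run queue (hypothetical model, not Seastar code).

    Returns (trips, dispatched): how many times something queued on
    behalf of task A went to the back of the queue, and how many tasks
    were dispatched up to and including the moment A resumes.
    """
    # Other tasks that stay runnable and re-enqueue themselves.
    queue = deque(f"bg{i}" for i in range(background_tasks))
    dispatched = 0

    if style == "coroutine":
        # Coroutine-style yield: A itself is put back on the queue.
        queue.append("A")
    elif style == "future":
        # Future-style yield: an empty task is queued; A waits on its future.
        queue.append("empty")
    else:
        raise ValueError(f"unknown yield style: {style}")
    trips = 1

    while True:
        task = queue.popleft()
        dispatched += 1
        if task == "A":
            return trips, dispatched
        if task == "empty":
            # The empty task resolves the future, which enqueues A again.
            queue.append("A")
            trips += 1
        else:
            queue.append(task)


print(trips_until_resumed("coroutine"))
print(trips_until_resumed("future"))
```

In this model the future-style yield needs a second trip through the queue, so with the same background load the yielding task waits roughly twice as long before it resumes.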
dotnwat added kind/bug, ci-disabled-test, ci-ignore, sev/low and removed kind/bug, ci-disabled-test, ci-ignore, sev/low labels Apr 4, 2024
andrwng added a commit to andrwng/redpanda that referenced this issue May 14, 2024
The test could previously fail after enabling stress fibers because the admin endpoint could become unresponsive with heavy stress. This commit attempts to fix this in a couple ways:
- ignoring errors when enabling stress fibers: the tests condition on seeing a specific log line to ensure stress is enabled so the response from the HTTP endpoint doesn't matter for the correctness of the test
- retrying on failure when trying to stop stress fibers

Fixes redpanda-data#13701
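The two mitigations listed in this commit message could look roughly like the sketch below. It is not the code from the commit: `start_fn`, `stop_fn`, and `log_has_line` are hypothetical callables standing in for the admin API requests and the log check, and the attempt count, timeout, and backoff values are arbitrary assumptions.

```python
import time


def start_stress_ignoring_errors(start_fn, log_has_line,
                                 timeout_s=30.0, poll_s=0.5,
                                 sleep=time.sleep):
    """Fire the start request, ignore its outcome, then trust the log.

    start_fn and log_has_line are hypothetical callables supplied by the
    test; the log line is treated as the source of truth.
    """
    try:
        start_fn()
    except Exception:
        # The HTTP response does not matter for correctness here.
        pass
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if log_has_line():
            return True
        sleep(poll_s)
    return log_has_line()


def stop_stress_with_retries(stop_fn, attempts=5, backoff_s=1.0,
                             sleep=time.sleep):
    """Retry the stop request; return the attempt number that succeeded."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            stop_fn()
            return attempt
        except Exception as err:  # e.g. a timeout while the node is starved
            last_error = err
            if attempt < attempts:
                sleep(backoff_s)
    raise RuntimeError(f"stop failed after {attempts} attempts") from last_error
```

Injecting `sleep` keeps the helpers easy to exercise without real delays; a test would pass its own admin client calls as `start_fn` and `stop_fn`.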
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 14, 2024
The test could previously fail after enabling stress fibers because the admin endpoint could become unresponsive with heavy stress. This commit attempts to fix this in a couple ways:
- ignoring errors when enabling stress fibers: the tests condition on seeing a specific log line to ensure stress is enabled so the response from the HTTP endpoint doesn't matter for the correctness of the test
- retrying on failure when trying to stop stress fibers

Fixes redpanda-data#13701

(cherry picked from commit f27cc4c)
https://buildkite.com/redpanda/redpanda/builds/37677