CI Failure (Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point) in CpuStressInjectionTest.test_stress_fibers_ms #13701

Closed
bharathv opened this issue Sep 27, 2023 · 12 comments · Fixed by #18463
Labels: area/storage, ci-failure, ci-rca/test (CI Root Cause Analysis - Test Issue), sev/low (Bugs which are non-functional paper cuts, e.g. typos, issues in log messages)

Comments

@bharathv
Contributor

bharathv commented Sep 27, 2023

https://buildkite.com/redpanda/redpanda/builds/37677

Module: rptest.tests.cpu_stress_injection_test
Class: CpuStressInjectionTest
Method: test_stress_fibers_ms
test_id:    CpuStressInjectionTest.test_stress_fibers_ms
status:     FAIL
run time:   97.586 seconds

ConnectionError(MaxRetryError('HTTPConnectionPool(host=\'docker-rp-19\', port=9644): Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point=30&max_ms_per_scheduling_point=300 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'docker-rp-19\', port=9644): Read timed out. (read timeout=30)"))'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='docker-rp-19', port=9644): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='docker-rp-19', port=9644): Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point=30&max_ms_per_scheduling_point=300 (Caused by ReadTimeoutError("HTTPConnectionPool(host='docker-rp-19', port=9644): Read timed out. (read timeout=30)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cpu_stress_injection_test.py", line 34, in test_stress_fibers_ms
    admin.stress_fiber_start(node,
  File "/root/tests/rptest/services/admin.py", line 980, in stress_fiber_start
    return self._request("PUT",
  File "/root/tests/rptest/services/admin.py", line 334, in _request
    r = self._session.request(verb, url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-19', port=9644): Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point=30&max_ms_per_scheduling_point=300 (Caused by ReadTimeoutError("HTTPConnectionPool(host='docker-rp-19', port=9644): Read timed out. (read timeout=30)"))


JIRA Link: [CORE-1469](https://redpandadata.atlassian.net/browse/CORE-1469)

@VladLazar
Contributor

I see the stress fiber started, but no response was received by rpk:

INFO  2023-09-26 16:53:54,989 [shard 0:admi] admin_api_server - admin_server.cc:4261 - Started stress fiber...

Maybe @andrwng could take a look.

@VladLazar added the sev/low label Sep 27, 2023
StephanDollberg added a commit that referenced this issue Jan 24, 2024
When used in a coroutine context, `ss::coroutine::maybe_yield` saves a couple hundred instructions over `ss::maybe_yield`, as the whole `future` machinery doesn't get involved.

This can be seen in the following godbolt: https://godbolt.org/z/Wdb441PYE

There is another difference: `ss::coroutine::maybe_yield` causes only one loop through the task queue, while `ss::maybe_yield` requires two. This is because `ss::maybe_yield` is implemented by waiting for an empty task to resolve, which requires one run through the task queue for the empty task to run and then another to yield back to the original task. The coroutine version yields back directly.

The yield loop being tighter caused the `cpu_stress_injection_test` to fail reliably (it was already unstable before). The problem there is that the stress fiber runs in the admin scheduling group, which starves the actual work of sending the replies on the admin API.

To prevent that issue we move the stress fiber to the main group, which is more realistic anyway and also allows the group to be configured when the stress fibers are used manually.

Fixes #13701

Co-authored-by: Travis Downs <travis.downs@redpanda.com>
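
For illustration, here is a minimal sketch of the two yield styles contrasted above. This is not Redpanda source: `busy_work` is a hypothetical placeholder for the CPU burn, and the exact Seastar include paths vary between versions.

```cpp
#include <seastar/core/coroutine.hh>        // coroutine support for ss::future
#include <seastar/core/future-util.hh>      // ss::maybe_yield, ss::do_with, ss::do_until (paths vary)
#include <seastar/coroutine/maybe_yield.hh> // ss::coroutine::maybe_yield

namespace ss = seastar;

void busy_work(); // hypothetical placeholder for the CPU-burning step

// Coroutine style: resumes directly from the scheduler, so each yield costs
// a single pass through the task queue.
ss::future<> spin_coroutine(long iters) {
    for (long i = 0; i < iters; ++i) {
        busy_work();
        co_await ss::coroutine::maybe_yield();
    }
}

// Future-chaining style: ss::maybe_yield() waits on an empty task, so each
// yield costs two passes through the task queue (one for the empty task to
// run, another to resume this loop).
ss::future<> spin_futures(long iters) {
    return ss::do_with(long(0), [iters](long& i) {
        return ss::do_until(
          [&i, iters] { return i >= iters; },
          [&i] {
              busy_work();
              ++i;
              return ss::maybe_yield();
          });
    });
}
```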
StephanDollberg added a commit that referenced this issue Jan 25, 2024
The `cpu_stress_injection_test` test was occasionally flaky. This was because the stress fiber runs in the admin scheduling group (as that is where it is started from). This, however, starves the admin server itself, so it can time out sending the actual API response.

To work around that issue we move the stress fibers to run in the main scheduling group when invoked from the admin API.

To do that we extend the stress fiber API so that the scheduling group can be specified.

Fixes #13701
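
As a rough sketch of that change (not the actual patch; `stress_loop` and the handler shape here are illustrative, and include paths may differ by Seastar version):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future-util.hh>      // ss::with_scheduling_group (path varies)
#include <seastar/core/scheduling.hh>       // ss::scheduling_group
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Illustrative stress fiber: burn CPU, yielding at each scheduling point.
ss::future<> stress_loop(long iters) {
    for (long i = 0; i < iters; ++i) {
        // ... burn CPU for the configured ms per scheduling point ...
        co_await ss::coroutine::maybe_yield();
    }
}

// The admin HTTP handler itself runs in the admin scheduling group. Starting
// the fiber under a caller-specified group (e.g. the main/default group)
// keeps it from starving the admin server, which still needs CPU time to
// write out the HTTP response.
ss::future<> start_stress_fiber(ss::scheduling_group sg, long iters) {
    return ss::with_scheduling_group(sg, [iters] {
        return stress_loop(iters);
    });
}
```

With the fiber out of the admin group, the reply that was timing out in the failures above can still be scheduled promptly.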
StephanDollberg added a commit that referenced this issue Jan 25, 2024
When used in a coroutine context, `ss::coroutine::maybe_yield` saves a couple hundred instructions over `ss::maybe_yield`, as the whole `future` machinery doesn't get involved.

This can be seen in the following godbolt: https://godbolt.org/z/Wdb441PYE

There is another difference: `ss::coroutine::maybe_yield` causes only one loop through the task queue, while `ss::maybe_yield` requires two. This is because `ss::maybe_yield` is implemented by waiting for an empty task to resolve, which requires one run through the task queue for the empty task to run (marking the future the original task waits on as ready and enqueuing the original task back into the task queue) and a second to then yield back to the original task. The coroutine version yields back directly, which makes it a little tighter.

Fixes #13701

Co-authored-by: Travis Downs <travis.downs@redpanda.com>
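
The empty-task mechanism described above can be pictured with a small sketch (same caveats as before; on older Seastar versions `ss::yield()` is spelled `ss::later()`, and its header path varies):

```cpp
#include <seastar/util/later.hh> // ss::yield() / ss::later(); assumed path

namespace ss = seastar;

// ss::yield() returns a future that an "empty task" makes ready:
//   pass 1 through the task queue: the empty task runs, marks the future
//   ready, and re-queues the waiting continuation;
//   pass 2: the continuation below finally runs.
ss::future<> resume_after_two_passes() {
    return ss::yield().then([] {
        // Reached only after two trips through the task queue, which is why
        // the coroutine version (one trip) is a little tighter.
    });
}
```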
@dotnwat added and then removed the labels kind/bug, ci-disabled-test, ci-ignore (Automatic ci analysis tools ignore this issue), and sev/low Apr 4, 2024
@bharathv mentioned this issue May 14, 2024
andrwng added a commit to andrwng/redpanda that referenced this issue May 14, 2024
The test could previously fail after enabling stress fibers because the admin endpoint could become unresponsive under heavy stress. This commit attempts to fix this in a couple of ways:
- ignoring errors when enabling stress fibers: the tests condition on seeing a specific log line to ensure stress is enabled, so the response from the HTTP endpoint doesn't matter for the correctness of the test
- retrying on failure when trying to stop stress fibers

Fixes redpanda-data#13701
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 14, 2024
(cherry picked from commit f27cc4c)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 14, 2024
(cherry picked from commit f27cc4c)
@andrwng added the ci-rca/test label (CI Root Cause Analysis - Test Issue) May 16, 2024