CI Failure (Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point) in CpuStressInjectionTest.test_stress_fibers_ms #13701

Closed
bharathv opened this issue Sep 27, 2023 · 12 comments · Fixed by #18463
Labels: area/storage, ci-failure, ci-rca/test (CI Root Cause Analysis - Test Issue), sev/low (Bugs which are non-functional paper cuts, e.g. typos, issues in log messages)

Comments

@bharathv
Contributor

bharathv commented Sep 27, 2023

https://buildkite.com/redpanda/redpanda/builds/37677

Module: rptest.tests.cpu_stress_injection_test
Class: CpuStressInjectionTest
Method: test_stress_fibers_ms
test_id:    CpuStressInjectionTest.test_stress_fibers_ms
status:     FAIL
run time:   97.586 seconds

ConnectionError(MaxRetryError('HTTPConnectionPool(host=\'docker-rp-19\', port=9644): Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point=30&max_ms_per_scheduling_point=300 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'docker-rp-19\', port=9644): Read timed out. (read timeout=30)"))'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='docker-rp-19', port=9644): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='docker-rp-19', port=9644): Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point=30&max_ms_per_scheduling_point=300 (Caused by ReadTimeoutError("HTTPConnectionPool(host='docker-rp-19', port=9644): Read timed out. (read timeout=30)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cpu_stress_injection_test.py", line 34, in test_stress_fibers_ms
    admin.stress_fiber_start(node,
  File "/root/tests/rptest/services/admin.py", line 980, in stress_fiber_start
    return self._request("PUT",
  File "/root/tests/rptest/services/admin.py", line 334, in _request
    r = self._session.request(verb, url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-19', port=9644): Max retries exceeded with url: /v1/debug/stress_fiber_start?num_fibers=10&min_ms_per_scheduling_point=30&max_ms_per_scheduling_point=300 (Caused by ReadTimeoutError("HTTPConnectionPool(host='docker-rp-19', port=9644): Read timed out. (read timeout=30)"))


JIRA Link: [CORE-1469](https://redpandadata.atlassian.net/browse/CORE-1469)

@VladLazar
Contributor

I see the stress fiber started, but no response was received by rpk:

INFO  2023-09-26 16:53:54,989 [shard 0:admi] admin_api_server - admin_server.cc:4261 - Started stress fiber...

Maybe @andrwng could take a look.

@VladLazar added the sev/low label Sep 27, 2023
StephanDollberg added a commit that referenced this issue Jan 24, 2024
When used in a coroutine context, `ss::coroutine::maybe_yield` saves a couple hundred instructions over `ss::maybe_yield`, as the whole `future` machinery doesn't get involved.

This can be seen in the following godbolt: https://godbolt.org/z/Wdb441PYE

There is another difference: `ss::coroutine::maybe_yield` causes only one loop through the task queue, while `ss::maybe_yield` requires two. This is because `ss::maybe_yield` is implemented by waiting for an empty task to resolve, which requires one run through the task queue for the empty task to run and then another to yield back to the original task. The coroutine version yields back directly.

The yield loop being tighter caused the `cpu_stress_injection_test` to fail reliably (it was already unstable before). The problem there is that the stress fiber runs in the admin scheduling group, which starves the actual work of sending the replies on the admin API.

To prevent that issue we move the stress fiber to the main group, which is more realistic anyway and also allows the group to be configured when the stress fibers are used manually.

Fixes #13701

Co-authored-by: Travis Downs <travis.downs@redpanda.com>
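
For illustration, here is a minimal sketch of the two yield styles contrasted above. This is not Redpanda source: `busy_work` is a hypothetical placeholder for the CPU burn, and the exact Seastar include paths vary between versions.

```cpp
#include <seastar/core/coroutine.hh>        // coroutine support for ss::future
#include <seastar/core/future-util.hh>      // ss::maybe_yield, ss::do_with, ss::do_until (paths vary)
#include <seastar/coroutine/maybe_yield.hh> // ss::coroutine::maybe_yield

namespace ss = seastar;

void busy_work(); // hypothetical placeholder for the CPU-burning step

// Coroutine style: resumes directly from the scheduler, so each yield costs
// a single pass through the task queue.
ss::future<> spin_coroutine(long iters) {
    for (long i = 0; i < iters; ++i) {
        busy_work();
        co_await ss::coroutine::maybe_yield();
    }
}

// Future-chaining style: ss::maybe_yield() waits on an empty task, so each
// yield costs two passes through the task queue (one for the empty task to
// run, another to resume this loop).
ss::future<> spin_futures(long iters) {
    return ss::do_with(long(0), [iters](long& i) {
        return ss::do_until(
          [&i, iters] { return i >= iters; },
          [&i] {
              busy_work();
              ++i;
              return ss::maybe_yield();
          });
    });
}
```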
StephanDollberg added a commit that referenced this issue Jan 25, 2024
The `cpu_stress_injection_test` test was occasionally flaky. This was because the stress fiber runs in the admin scheduling group (as that is where it is started from). This, however, starves the admin server itself, so it can time out sending the actual API response.

To work around that issue we move the stress fibers to run in the main scheduling group when invoked from the admin API.

To do that we extend the stress fiber API so that the scheduling group can be specified.

Fixes #13701
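
As a rough sketch of that change (not the actual patch; `stress_loop` and the handler shape here are illustrative, and include paths may differ by Seastar version):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future-util.hh>      // ss::with_scheduling_group (path varies)
#include <seastar/core/scheduling.hh>       // ss::scheduling_group
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Illustrative stress fiber: burn CPU, yielding at each scheduling point.
ss::future<> stress_loop(long iters) {
    for (long i = 0; i < iters; ++i) {
        // ... burn CPU for the configured ms per scheduling point ...
        co_await ss::coroutine::maybe_yield();
    }
}

// The admin HTTP handler itself runs in the admin scheduling group. Starting
// the fiber under a caller-specified group (e.g. the main/default group)
// keeps it from starving the admin server, which still needs CPU time to
// write out the HTTP response.
ss::future<> start_stress_fiber(ss::scheduling_group sg, long iters) {
    return ss::with_scheduling_group(sg, [iters] {
        return stress_loop(iters);
    });
}
```

With the fiber out of the admin group, the reply that was timing out in the failures above can still be scheduled promptly.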
StephanDollberg added a commit that referenced this issue Jan 25, 2024
When used in a coroutine context, `ss::coroutine::maybe_yield` saves a couple hundred instructions over `ss::maybe_yield`, as the whole `future` machinery doesn't get involved.

This can be seen in the following godbolt: https://godbolt.org/z/Wdb441PYE

There is another difference: `ss::coroutine::maybe_yield` causes only one loop through the task queue, while `ss::maybe_yield` requires two. This is because `ss::maybe_yield` is implemented by waiting for an empty task to resolve, which requires one run through the task queue for the empty task to run (marking the future the original task waits on as ready and enqueuing the original task back into the task queue) and a second to then yield back to the original task. The coroutine version yields back directly, which makes it a little tighter.

Fixes #13701

Co-authored-by: Travis Downs <travis.downs@redpanda.com>
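
The empty-task mechanism described above can be pictured with a small sketch (same caveats as before; on older Seastar versions `ss::yield()` is spelled `ss::later()`, and its header path varies):

```cpp
#include <seastar/util/later.hh> // ss::yield() / ss::later(); assumed path

namespace ss = seastar;

// ss::yield() returns a future that an "empty task" makes ready:
//   pass 1 through the task queue: the empty task runs, marks the future
//   ready, and re-queues the waiting continuation;
//   pass 2: the continuation below finally runs.
ss::future<> resume_after_two_passes() {
    return ss::yield().then([] {
        // Reached only after two trips through the task queue, which is why
        // the coroutine version (one trip) is a little tighter.
    });
}
```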
@dotnwat added and then removed the labels kind/bug, ci-disabled-test, ci-ignore (Automatic ci analysis tools ignore this issue), and sev/low Apr 4, 2024
@bharathv mentioned this issue May 14, 2024
andrwng added a commit to andrwng/redpanda that referenced this issue May 14, 2024
The test could previously fail after enabling stress fibers because the admin endpoint could become unresponsive under heavy stress. This commit attempts to fix this in a couple of ways:
- ignoring errors when enabling stress fibers: the tests condition on seeing a specific log line to ensure stress is enabled, so the response from the HTTP endpoint doesn't matter for the correctness of the test
- retrying on failure when trying to stop stress fibers

Fixes redpanda-data#13701
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 14, 2024
(cherry picked from commit f27cc4c)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 14, 2024
(cherry picked from commit f27cc4c)
@andrwng added the ci-rca/test label (CI Root Cause Analysis - Test Issue) May 16, 2024