Endless errors on gateway because the writer is full #13018

Closed
Zelldon opened this issue Jun 8, 2023 · 3 comments · Fixed by #18548
Assignees
Labels
component/gateway
component/zeebe: Related to the Zeebe component/team
good first issue: Marks an issue as simple enough for first time contributors
kind/bug: Categorizes an issue or PR as a bug
scope/gateway: Marks an issue or PR to appear in the gateway section of the changelog
severity/low: Marks a bug as having little to no noticeable impact for the user
version:8.2.28: Label that represents issues released on version 8.2.28
version:8.3.12: Label that represents issues released on version 8.3.12
version:8.4.8: Label that represents issues released on version 8.4.8
version:8.5.2: Label that represents issues released on version 8.5.2
version:8.6.0-alpha2: Marks an issue as being completely or in parts released in 8.6.0-alpha2

Comments

@Zelldon
Member

Zelldon commented Jun 8, 2023

Describe the bug

We see several new errors in the gateway, but the reason is not 100% clear to me (maybe a leader change?). They seem to have been introduced with https://github.com/camunda/zeebe/pull/12910/files

Error group: https://console.cloud.google.com/errors/detail/CKC158H9-Ia22wE;service=zeebe;time=P7D?project=camunda-saas-int-chaos

To Reproduce

Run recent main qa

Expected behavior
If it is a valid error, add more context and clarify what the operator has to do. If it is not an error, reduce the log level.

Log/Stacktrace

Full LOG

io.camunda.zeebe.gateway.cmd.BrokerErrorException: Received error from broker (INTERNAL_ERROR): Failed to write client request to partition '3', because the writer is full.

at io.camunda.zeebe.gateway.impl.broker.BrokerRequestManager.handleResponse(BrokerRequestManager.java:194)
at io.camunda.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequestInternal$2(BrokerRequestManager.java:143)
at io.camunda.zeebe.scheduler.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:28)
at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:94)
at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45)
at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119)
at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106)
at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87)
at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198)

Environment:

@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug scope/gateway Marks an issue or PR to appear in the gateway section of the changelog component/gateway labels Jun 8, 2023
@megglos
Contributor

megglos commented Jun 8, 2023

ZDP-Triage:

  • these errors were ignored in the past, the recent change made them visible in the logs
  • it might be caused by high load
  • in that case the writer is writing to the sequencer, which appears to be full
  • might be related to the dispatcher refactoring
  • worth checking the request error rate as we expect the request to fail with an error
  • the good news is, these errors should be transparent to the client and allow retrying, previously they would have caused timeouts
  • Severity: low as requests can be retried to eventually succeed, however this is likely a bad experience for the user
  • Focus: we would expect backpressure to kick in before, could be caused by a too small sequencer buffer, we should investigate this
  • we should investigate across our QA and benchmarks how often we see it, we may at least map this error to "resource_exhausted"
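The triage notes that these errors allow the client to retry. A minimal sketch of what such a retry could look like on the client side, assuming the failure surfaces with a `RESOURCE_EXHAUSTED` status. `RetryOnExhausted`, `RequestFailedException`, and `callWithBackoff` are hypothetical names for illustration, not part of the Zeebe client API:

```java
import java.util.function.Supplier;

// Hypothetical names throughout: this is a sketch, not the Zeebe client API.
final class RetryOnExhausted {

  /** Stand-in for a gRPC-style failure that carries a status code name. */
  static final class RequestFailedException extends RuntimeException {
    final String statusCode;

    RequestFailedException(String statusCode, String message) {
      super(message);
      this.statusCode = statusCode;
    }
  }

  /**
   * Runs the request and retries with exponential backoff while it fails with
   * RESOURCE_EXHAUSTED. Any other failure is rethrown immediately.
   */
  static <T> T callWithBackoff(Supplier<T> request, int maxAttempts, long initialDelayMs) {
    long delayMs = initialDelayMs;
    for (int attempt = 1; ; attempt++) {
      try {
        return request.get();
      } catch (RequestFailedException e) {
        boolean retryable = "RESOURCE_EXHAUSTED".equals(e.statusCode);
        if (!retryable || attempt >= maxAttempts) {
          throw e;
        }
        sleep(delayMs);
        delayMs *= 2; // wait longer each time so the client sends less load
      }
    }
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while backing off", e);
    }
  }
}
```

Doubling the delay is one possible policy. The point is only that the client reduces its request rate instead of retrying immediately against a full writer.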

@megglos megglos added the severity/low Marks a bug as having little to no noticeable impact for the user label Jun 8, 2023
@deepthidevaki
Contributor

@npepinpe
Member

npepinpe commented May 8, 2024

We should return RESOURCE_EXHAUSTED and not INTERNAL_ERROR, as the error should be handled by reducing the load from the client. This won't solve every case of course (e.g. high internal load), but it should help, and in some cases completely resolve the issue.
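The proposed mapping can be sketched roughly as follows. This is an illustration under assumptions, not the gateway's actual code: `BrokerErrorMapping`, both enums, and the `WRITER_FULL` code are hypothetical stand-ins, and how the broker really signals a full writer is not shown in this issue.

```java
// Hypothetical types: a sketch of the mapping idea, not the gateway's real classes.
final class BrokerErrorMapping {

  /** Simplified stand-in for broker error codes; WRITER_FULL is an assumed name. */
  enum BrokerErrorCode { INTERNAL_ERROR, WRITER_FULL }

  /** Simplified stand-in for the gRPC status codes discussed in this issue. */
  enum StatusCode { INTERNAL, RESOURCE_EXHAUSTED }

  /**
   * Maps a full writer to RESOURCE_EXHAUSTED so that clients can tell they
   * should reduce load and retry; everything else stays an internal error.
   */
  static StatusCode toStatus(BrokerErrorCode code) {
    if (code == BrokerErrorCode.WRITER_FULL) {
      return StatusCode.RESOURCE_EXHAUSTED;
    }
    return StatusCode.INTERNAL;
  }
}
```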

@npepinpe npepinpe added the good first issue Marks an issue as simple enough for first time contributors label May 8, 2024
github-merge-queue bot pushed a commit that referenced this issue May 21, 2024
## Description

This PR doesn't fix the root cause, where the writer buffer is full. It
only maps the error to `RESOURCE_EXHAUSTED` so that users know the
request can be retried.

## Related issues

closes #13018
github-merge-queue bot pushed a commit that referenced this issue May 21, 2024
…uffer is full (#18668)

# Description
Backport of #18548 to `stable/8.5`.

relates to #13018
original author: @deepthidevaki
github-merge-queue bot pushed a commit that referenced this issue May 22, 2024
…uffer is full (#18675)

## Description

Backport of #18548

## Related issues

closes #13018
github-merge-queue bot pushed a commit that referenced this issue May 22, 2024
…usted when sequencer buffer is full (#18698)

# Description
Backport of #18675 to `stable/8.2`.

relates to #18548 #13018
original author: @deepthidevaki
github-merge-queue bot pushed a commit that referenced this issue May 22, 2024
…usted when sequencer buffer is full (#18699)

# Description
Backport of #18675 to `stable/8.3`.

relates to #18548 #13018
original author: @deepthidevaki
@github-actions github-actions bot added version:8.5.2 Label that represents issues released on version 8.5.2 version:8.3.12 Label that represents issues released on version 8.3.12 version:8.2.28 Label that represents issues released on version 8.2.28 version:8.4.8 Label that represents issues released on version 8.4.8 labels Jun 4, 2024
@korthout korthout added the version:8.6.0-alpha2 Marks an issue as being completely or in parts released in 8.6.0-alpha2 label Jun 4, 2024

6 participants