Context.canceled handling changes for slo and receiver shim #3505

Open · wants to merge 2 commits into main
Conversation

@mdisibio (Contributor) commented Mar 19, 2024

What this PR does:

Addresses two places where a client disconnect could cause unexpected behavior and false alarms:

  • SLO calculation: if a client disconnects while calling an API, the request is now counted as within SLO.
  • Pushing traces: if a client disconnects while pushing traces, the error is now ignored rather than propagated upstream as a 5xx. It is still logged to aid troubleshooting. (A minimal sketch of this behavior follows below.)
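
A minimal, self-contained sketch of the push-path behavior described above. handlePushError is a hypothetical helper for illustration only; the actual change lives in the receiver shim's ConsumeTraces (see the hunk below):

```go
package main

import (
	"context"
	"errors"
	"log"
)

// handlePushError sketches the PR's behavior: a client disconnect
// (context.Canceled) is logged but not propagated, so no 5xx status
// reaches the caller. Any other error passes through unchanged.
func handlePushError(err error) error {
	if err == nil {
		return nil
	}
	log.Printf("pusher failed to consume trace data: %v", err)
	if errors.Is(err, context.Canceled) {
		return nil // client disconnected: swallow the error
	}
	return err
}

func main() {
	// Logs the error, then returns <nil>: the disconnect is not propagated.
	log.Println(handlePushError(context.Canceled))
}
```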

Which issue(s) this PR fixes:
Fixes n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

```go
@@ -342,6 +343,12 @@ func (r *receiversShim) ConsumeTraces(ctx context.Context, td ptrace.Traces) err
	metricPushDuration.Observe(time.Since(start).Seconds())
	if err != nil {
		r.logger.Log("msg", "pusher failed to consume trace data", "err", err)

		// Client disconnects are logged but not propagated back.
		if errors.Is(err, context.Canceled) {
```
Member commented on this diff:

This brings to mind a difficulty I have on the read path: it's impossible to tell where this context canceled came from. Is it further up in the otel receiver code due to a client disconnect? Or deeper down in the distributor code?

For instance, if we fail to write to 2+ ingesters due to this timeout, I think that would bubble up as a context canceled as well:

```go
localCtx, cancel := context.WithTimeout(ctx, d.clientCfg.RemoteTimeout)
defer cancel()
```

context.WithCancelCause was added in Go 1.20 to allow the reason for a cancellation to be communicated:

https://pkg.go.dev/context#WithCancelCause

but I don't know if the cause is set correctly in the gRPC server, and it's definitely not set in our own code. Maybe we set it in our code and assume that if there is no cause it's due to a client disconnect?

We unfortunately cancel contexts in a lot of places and don't have good patterns for when, why, or what is communicated when we do. As is, I think this change would mask timeouts to the ingesters.
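
A minimal, self-contained sketch of the WithCancelCause pattern discussed above (Go 1.20+). errIngesterTimeout is a hypothetical sentinel for illustration, not Tempo code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errIngesterTimeout is a hypothetical sentinel used only for this sketch.
var errIngesterTimeout = errors.New("ingester write timed out")

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())

	// Somewhere deep in the write path, cancel with an explicit cause.
	cancel(errIngesterTimeout)
	<-ctx.Done()

	// ctx.Err() is always context.Canceled here, so it cannot distinguish
	// a client disconnect from our own internal timeout.
	fmt.Println(ctx.Err()) // context canceled

	// context.Cause can distinguish, when a cause was recorded. If cancel
	// was called with no cause, Cause returns context.Canceled, matching
	// the "no cause means client disconnect" assumption suggested above.
	if cause := context.Cause(ctx); !errors.Is(cause, context.Canceled) {
		fmt.Println("canceled by our own code:", cause)
	}
}
```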

```go
// However these errors are considered within SLO:
// * grpc resource exhausted error (429)
// * context canceled (client disconnected or canceled)
if status.Code(err) == codes.ResourceExhausted || errors.Is(err, context.Canceled) {
```
Member commented on this diff:

Same thoughts here. Maybe we just log the cancel cause if one exists and see if it's populated?
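
A sketch of that suggestion, assuming the google.golang.org/grpc status and codes packages used in the diff above; withinSLO and the log call shape are illustrative, not the project's actual code:

```go
package main

import (
	"context"
	"errors"
	"log"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// withinSLO sketches the check from the diff above, extended per the
// suggestion: when the error is a cancellation, also log context.Cause
// to see whether anything populated it.
func withinSLO(ctx context.Context, err error) bool {
	if status.Code(err) == codes.ResourceExhausted {
		return true
	}
	if errors.Is(err, context.Canceled) {
		// Cause returns context.Canceled when no explicit cause was set,
		// which (per the discussion above) would suggest a client disconnect.
		log.Printf("request canceled, cause: %v", context.Cause(ctx))
		return true
	}
	return false
}

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errors.New("ingester timeout")) // hypothetical internal cause
	log.Println(withinSLO(ctx, context.Canceled))
}
```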

github-actions bot commented:

This PR has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. This pull request will be closed in 15 days if there is no new activity.
Please apply the keepalive label to exempt this pull request.

@github-actions bot added the stale label (Used for stale issues / PRs) on May 19, 2024
Labels: stale (Used for stale issues / PRs)
Projects: none yet
2 participants