Context.canceled handling changes for slo and receiver shim #3505
base: main
Conversation
@@ -342,6 +343,12 @@ func (r *receiversShim) ConsumeTraces(ctx context.Context, td ptrace.Traces) error
 	metricPushDuration.Observe(time.Since(start).Seconds())
 	if err != nil {
 		r.logger.Log("msg", "pusher failed to consume trace data", "err", err)
+
+		// Client disconnects are logged but not propagated back.
+		if errors.Is(err, context.Canceled) {
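The excerpt truncates at the new branch. As a hedged sketch of the pattern it introduces (all names below are stand-ins, not the shim's actual fields), the branch presumably completes by swallowing the error for client disconnects:

```go
package receiver

import (
	"context"
	"errors"
	"time"
)

// consumeTraces sketches the pattern from the diff above: push the trace
// data, record the push duration, log any failure, and swallow
// context.Canceled so a client disconnect is not propagated back to the
// OTel receiver. The push/observe/logErr callbacks are stand-ins for the
// shim's real pusher, metric, and logger.
func consumeTraces(
	ctx context.Context,
	push func(context.Context) error,
	observe func(seconds float64),
	logErr func(err error),
) error {
	start := time.Now()
	err := push(ctx)
	observe(time.Since(start).Seconds())
	if err != nil {
		logErr(err)

		// Client disconnects are logged but not propagated back.
		if errors.Is(err, context.Canceled) {
			return nil // assumed: the truncated branch returns nil here
		}
	}
	return err
}
```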
This brings to mind a difficulty I have on the read path: it's impossible to tell where this context canceled came from. Is it further up in the OTel receiver code due to a client disconnect? Or deeper down in the distributor code?
For instance, if we fail to write to 2+ ingesters due to this timeout, I think that would bubble up as a context canceled as well:
tempo/modules/distributor/distributor.go
Lines 392 to 393 in c41b078

	localCtx, cancel := context.WithTimeout(ctx, d.clientCfg.RemoteTimeout)
	defer cancel()
context.WithCancelCause was added in Go 1.20 (https://pkg.go.dev/context#WithCancelCause) to allow the reason for a cancellation to be communicated, but I don't know if this is set correctly in the gRPC server. It's definitely not set in our own code. Maybe we set it in our code and assume that if there is no cause, it's due to a client disconnect?
We unfortunately cancel context in a lot of places and don't have good patterns for when, why, or what is communicated when we do. As is, I think this would mask timeouts to the ingesters.
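A minimal sketch of that suggestion, assuming Go 1.20+ and a hypothetical sentinel error (none of this is existing Tempo code): attach an explicit cause when our own remote timeout cancels, so downstream code can tell a masked timeout apart from a client disconnect.

```go
package distributor

import (
	"context"
	"errors"
	"time"
)

// errIngesterTimeout is a hypothetical sentinel: the cause attached when
// our own remote timeout cancels the context.
var errIngesterTimeout = errors.New("ingester push timed out")

// pushWithTimeout sketches the WithCancelCause idea (Go 1.20+) against the
// distributor's remote-timeout pattern. On Go 1.21+, context.WithTimeoutCause
// collapses the WithCancelCause + AfterFunc pair into a single call.
func pushWithTimeout(ctx context.Context, timeout time.Duration, push func(context.Context) error) error {
	localCtx, cancel := context.WithCancelCause(ctx)
	defer cancel(nil)

	// Cancel with an explicit cause when our own timeout fires.
	timer := time.AfterFunc(timeout, func() { cancel(errIngesterTimeout) })
	defer timer.Stop()

	err := push(localCtx)
	if errors.Is(err, context.Canceled) && errors.Is(context.Cause(localCtx), errIngesterTimeout) {
		// This cancellation was our own timeout, not a client disconnect.
		return errIngesterTimeout
	}
	return err
}
```

If no explicit cause was ever set, context.Cause falls back to the context's own error (context.Canceled), which matches the "no cause means client disconnect" assumption above.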
// However these errors are considered within SLO:
//  * grpc resource exhausted error (429)
//  * context canceled (client disconnected or canceled)
if status.Code(err) == codes.ResourceExhausted || errors.Is(err, context.Canceled) {
Same thoughts here. Maybe we just log the cancel cause if one exists and see if that's populated?
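As a sketch of that idea (withinSLO and the logf callback are hypothetical, not the actual SLO-recording code):

```go
package distributor

import (
	"context"
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// withinSLO is a hypothetical classifier mirroring the snippet above: 429s
// and context cancellations count as within SLO, but the cancel cause is
// logged when one exists, so we can see whether causes are ever populated.
func withinSLO(ctx context.Context, err error, logf func(msg string, keyvals ...any)) bool {
	if status.Code(err) == codes.ResourceExhausted {
		return true
	}
	if errors.Is(err, context.Canceled) {
		// context.Cause returns the cause passed to a CancelCauseFunc, or
		// the context's own error when it was canceled without a cause.
		if cause := context.Cause(ctx); cause != nil && !errors.Is(cause, context.Canceled) {
			logf("context canceled with explicit cause", "cause", cause)
			// An explicit cause (e.g. a masked ingester timeout) is not a
			// client disconnect, so arguably it should not count as within SLO.
			return false
		}
		return true
	}
	return false
}
```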
This PR has been automatically marked as stale because it has not had any activity in the past 60 days.
What this PR does:
Addresses two places where a client disconnect could cause unexpected behavior and false alarms:
- In the receiver shim, context.Canceled errors from client disconnects are logged but no longer propagated back to the OTel receiver.
- In the SLO recording, context canceled (client disconnected or canceled) errors, like gRPC resource exhausted (429) errors, are counted as within SLO.
Which issue(s) this PR fixes:
Fixes n/a
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]