Context.canceled handling changes for slo and receiver shim #3505

Open · wants to merge 2 commits into main
Conversation

@mdisibio (Contributor) commented Mar 19, 2024

What this PR does:

Addresses two places where a client disconnect could cause unexpected behavior and false alarms:

  • SLO calculation: if a client disconnects while calling an API, the request is now counted as within SLO.
  • Pushing traces: if a client disconnects while pushing traces, the error is now ignored rather than propagated upstream as a 5xx. It is still logged to aid troubleshooting. (A minimal sketch of this behavior follows below.)
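
A minimal, self-contained sketch of the push-path behavior described above. handlePushError is a hypothetical helper for illustration only; the actual change lives in the receiver shim's ConsumeTraces (see the hunk below):

```go
package main

import (
	"context"
	"errors"
	"log"
)

// handlePushError sketches the PR's behavior: a client disconnect
// (context.Canceled) is logged but not propagated, so no 5xx status
// reaches the caller. Any other error passes through unchanged.
func handlePushError(err error) error {
	if err == nil {
		return nil
	}
	log.Printf("pusher failed to consume trace data: %v", err)
	if errors.Is(err, context.Canceled) {
		return nil // client disconnected: swallow the error
	}
	return err
}

func main() {
	// Logs the error, then returns <nil>: the disconnect is not propagated.
	log.Println(handlePushError(context.Canceled))
}
```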

Which issue(s) this PR fixes:
Fixes n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

```go
@@ -342,6 +343,12 @@ func (r *receiversShim) ConsumeTraces(ctx context.Context, td ptrace.Traces) err
	metricPushDuration.Observe(time.Since(start).Seconds())
	if err != nil {
		r.logger.Log("msg", "pusher failed to consume trace data", "err", err)

		// Client disconnects are logged but not propagated back.
		if errors.Is(err, context.Canceled) {
```
Member commented on this diff:

This brings to mind a difficulty I have on the read path: it's impossible to tell where this context canceled came from. Is it further up in the otel receiver code due to a client disconnect? Or deeper down in the distributor code?

For instance, if we fail to write to 2+ ingesters due to this timeout, I think that would bubble up as a context canceled as well:

```go
localCtx, cancel := context.WithTimeout(ctx, d.clientCfg.RemoteTimeout)
defer cancel()
```

context.WithCancelCause was added in Go 1.20 to allow the reason for a cancellation to be communicated:

https://pkg.go.dev/context#WithCancelCause

but I don't know if the cause is set correctly in the gRPC server, and it's definitely not set in our own code. Maybe we set it in our code and assume that if there is no cause it's due to a client disconnect?

We unfortunately cancel contexts in a lot of places and don't have good patterns for when, why, or what is communicated when we do. As is, I think this change would mask timeouts to the ingesters.
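
A minimal, self-contained sketch of the WithCancelCause pattern discussed above (Go 1.20+). errIngesterTimeout is a hypothetical sentinel for illustration, not Tempo code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errIngesterTimeout is a hypothetical sentinel used only for this sketch.
var errIngesterTimeout = errors.New("ingester write timed out")

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())

	// Somewhere deep in the write path, cancel with an explicit cause.
	cancel(errIngesterTimeout)
	<-ctx.Done()

	// ctx.Err() is always context.Canceled here, so it cannot distinguish
	// a client disconnect from our own internal timeout.
	fmt.Println(ctx.Err()) // context canceled

	// context.Cause can distinguish, when a cause was recorded. If cancel
	// was called with no cause, Cause returns context.Canceled, matching
	// the "no cause means client disconnect" assumption suggested above.
	if cause := context.Cause(ctx); !errors.Is(cause, context.Canceled) {
		fmt.Println("canceled by our own code:", cause)
	}
}
```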

```go
// However these errors are considered within SLO:
// * grpc resource exhausted error (429)
// * context canceled (client disconnected or canceled)
if status.Code(err) == codes.ResourceExhausted || errors.Is(err, context.Canceled) {
```
Member commented on this diff:

Same thoughts here. Maybe we just log the cancel cause if one exists and see if it's populated?
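
A sketch of that suggestion, assuming the google.golang.org/grpc status and codes packages used in the diff above; withinSLO and the log call shape are illustrative, not the project's actual code:

```go
package main

import (
	"context"
	"errors"
	"log"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// withinSLO sketches the check from the diff above, extended per the
// suggestion: when the error is a cancellation, also log context.Cause
// to see whether anything populated it.
func withinSLO(ctx context.Context, err error) bool {
	if status.Code(err) == codes.ResourceExhausted {
		return true
	}
	if errors.Is(err, context.Canceled) {
		// Cause returns context.Canceled when no explicit cause was set,
		// which (per the discussion above) would suggest a client disconnect.
		log.Printf("request canceled, cause: %v", context.Cause(ctx))
		return true
	}
	return false
}

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errors.New("ingester timeout")) // hypothetical internal cause
	log.Println(withinSLO(ctx, context.Canceled))
}
```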

github-actions bot commented:

This PR has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. This pull request will be closed in 15 days if there is no new activity.
Please apply the keepalive label to exempt this pull request.

@github-actions bot added the stale label (Used for stale issues / PRs) on May 19, 2024
Labels: stale (Used for stale issues / PRs)
Projects: none yet
2 participants