
Continuous span errors while tracing tempo #3645

Open
madaraszg-tulip opened this issue May 3, 2024 · 3 comments
Comments

@madaraszg-tulip
Contributor

Describe the bug
We are tracing our entire monitoring stack, including tempo. We also generate a service graph, which shows that a significant portion of tempo-distributor to tempo-ingester calls are errors, but those are only "context cancelled" calls and do not appear to be actual errors.

To Reproduce
Steps to reproduce the behavior:

  1. Configure tempo 2.4.1 to trace itself (we do this through Alloy, which also does tail sampling)
  2. Configure service graph generation (again, we do this in Alloy; a config sketch follows this list)
  3. See the red section on the tempo-ingester service graph node.
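
For reference, a minimal Alloy pipeline for the service graph part looks roughly like the sketch below. Component labels, endpoints, and the remote-write URL are placeholders, not our exact configuration:

```alloy
// Minimal sketch: receive Tempo's own spans over OTLP and build
// service graph metrics from them. Labels and endpoints are placeholders.
otelcol.receiver.otlp "tempo_self" {
  grpc {}

  output {
    traces = [otelcol.connector.servicegraph.default.input]
  }
}

// Edges between tempo-distributor and tempo-ingester are derived from
// matching client/server span pairs; spans with status ERROR are counted
// as failed requests, which is what turns the edge red.
otelcol.connector.servicegraph "default" {
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.metrics.receiver]
}

prometheus.remote_write "metrics" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}
```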

Expected behavior
I would not expect to see continuous errors reported in our tempo installation.

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: helm

Additional Context


Basically every trace shows the distributor doing PushBytesV2 against 3 ingesters; when 2 ingesters respond, the third call is cancelled on the distributor. Either this is the intended behavior, in which case the cancelled call should not be marked as an error on the span, or it is an actual issue and needs to be fixed.

We are doing tail sampling of traces, primarily percentage-based, but we also forward all traces that contain errors. This means that practically all traces from the distributor are sampled, because they all have errors.
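
The tail-sampling policies look roughly like this (the sampling percentage and the Tempo endpoint are placeholders rather than our exact values); the error policy is what ends up keeping every distributor trace, because each one contains the cancelled, error-flagged PushBytesV2 span:

```alloy
otelcol.processor.tail_sampling "default" {
  // Keep every trace that contains at least one span with status ERROR.
  // Because the cancelled third PushBytesV2 call is flagged as an error,
  // this matches practically every trace coming from the distributor.
  policy {
    name = "keep-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  // Keep a fixed percentage of everything else (placeholder value).
  policy {
    name = "keep-percentage"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 10
    }
  }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// Forward sampled traces back into Tempo (placeholder endpoint).
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo-distributor:4317"
  }
}
```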

@joe-elliott
Member

Tempo does return success as soon as two of three writes to ingesters succeed, but it shouldn't be cancelling the third. It would be interesting to review metrics to see why this might be occurring.

  • Are all ingesters healthy? This can be viewed using the ingester ring page on distributors.
  • What is the latency of the push endpoint on your ingesters? Are some slower than others? (An example query is sketched after this list.)
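
In a default installation the ingester ring page is typically served by the distributor at `/ingester/ring` on its HTTP port. For comparing push latency per ingester, a query along the following lines can be used; the metric and route names here are assumed from Tempo's standard instrumentation and may need adjusting for a specific setup:

```promql
# Per-ingester 99th percentile latency of the push route
# (metric/label/route names assumed; adjust to your deployment).
histogram_quantile(
  0.99,
  sum by (le, instance) (
    rate(tempo_request_duration_seconds_bucket{route=~".*PushBytes.*"}[5m])
  )
)
```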

@madaraszg-tulip
Contributor Author

All ingesters are healthy. This happens in all three environments that we have (prod, staging, testing).
Latency is uniform and stable across all ingesters in all environments: the dashboard shows 2.5 ms at the median and 4.95 ms at the 99th percentile. They run in AWS EKS, and the cluster is healthy.

@madaraszg-tulip
Contributor Author

Some additional information, focusing on our testing instance now, as it is the smallest and has the lowest load. All the tempo pods run on a single node dedicated to this tempo instance. There is more than enough CPU and memory (c6g.large: 2 cores, 4 GB). We run one distributor and three ingesters. Sustained load on the distributor is 50 spans/second.


Span error rate on the distributor is 0.7/sec


Forwarder pushes are about 0.95/sec.

