
Continuous span errors while tracing tempo #3645

Open
madaraszg-tulip opened this issue May 3, 2024 · 3 comments
Comments

@madaraszg-tulip
Contributor

Describe the bug
We are tracing our entire monitoring stack, including tempo. We also generate a service graph, which shows that a significant portion of tempo-distributor to tempo-ingester calls are errors, but those are only "context cancelled" calls and do not appear to be actual errors.

To Reproduce
Steps to reproduce the behavior:

  1. Configure tempo 2.4.1 to trace itself (we do this through Alloy, which also does tail sampling)
  2. Configure service graph generation (again, we do this in Alloy; a config sketch follows this list)
  3. See the red section on the tempo-ingester service graph node.
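
For reference, a minimal Alloy pipeline for the service graph part looks roughly like the sketch below. Component labels, endpoints, and the remote-write URL are placeholders, not our exact configuration:

```alloy
// Minimal sketch: receive Tempo's own spans over OTLP and build
// service graph metrics from them. Labels and endpoints are placeholders.
otelcol.receiver.otlp "tempo_self" {
  grpc {}

  output {
    traces = [otelcol.connector.servicegraph.default.input]
  }
}

// Edges between tempo-distributor and tempo-ingester are derived from
// matching client/server span pairs; spans with status ERROR are counted
// as failed requests, which is what turns the edge red.
otelcol.connector.servicegraph "default" {
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.metrics.receiver]
}

prometheus.remote_write "metrics" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}
```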

Expected behavior
I would not expect to see continuous errors reported in our tempo installation.

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: helm

Additional Context


Basically every trace shows the distributor doing PushBytesV2 against 3 ingesters; when 2 ingesters respond, the third call is cancelled on the distributor. Either this is the intended behavior, in which case the cancelled call should not be marked as an error on the span, or it is an actual issue and needs to be fixed.

We are doing tail sampling of traces, primarily percentage-based, but we also forward all traces that contain errors. This means that practically all traces from the distributor are sampled, because they all have errors.
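
The tail-sampling policies look roughly like this (the sampling percentage and the Tempo endpoint are placeholders rather than our exact values); the error policy is what ends up keeping every distributor trace, because each one contains the cancelled, error-flagged PushBytesV2 span:

```alloy
otelcol.processor.tail_sampling "default" {
  // Keep every trace that contains at least one span with status ERROR.
  // Because the cancelled third PushBytesV2 call is flagged as an error,
  // this matches practically every trace coming from the distributor.
  policy {
    name = "keep-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  // Keep a fixed percentage of everything else (placeholder value).
  policy {
    name = "keep-percentage"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 10
    }
  }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// Forward sampled traces back into Tempo (placeholder endpoint).
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo-distributor:4317"
  }
}
```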

@joe-elliott
Member

Tempo does return success as soon as two of three writes to ingesters succeed, but it shouldn't be cancelling the third. It would be interesting to review metrics to see why this might be occurring.

  • Are all ingesters healthy? This can be viewed using the ingester ring page on distributors.
  • What is the latency of the push endpoint on your ingesters? Are some slower than others? (An example query is sketched after this list.)
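
In a default installation the ingester ring page is typically served by the distributor at `/ingester/ring` on its HTTP port. For comparing push latency per ingester, a query along the following lines can be used; the metric and route names here are assumed from Tempo's standard instrumentation and may need adjusting for a specific setup:

```promql
# Per-ingester 99th percentile latency of the push route
# (metric/label/route names assumed; adjust to your deployment).
histogram_quantile(
  0.99,
  sum by (le, instance) (
    rate(tempo_request_duration_seconds_bucket{route=~".*PushBytes.*"}[5m])
  )
)
```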

@madaraszg-tulip
Contributor Author

All ingesters are healthy. This happens in all three environments that we have (prod, staging, testing).
Latency is uniform and stable across all ingesters in all environments: the dashboard shows 2.5 ms at the median and 4.95 ms at the 99th percentile. They run in AWS EKS, and the cluster is healthy.

@madaraszg-tulip
Contributor Author

Some additional information, focusing on our testing instance now, as it is the smallest and has the lowest load. All the tempo pods run on a single node dedicated to this tempo instance. There is more than enough CPU and memory (c6g.large: 2 cores, 4 GB). We run one distributor and three ingesters. Sustained load on the distributor is 50 spans/second.


Span error rate on the distributor is 0.7/sec


Forwarder pushes are about 0.95/sec.

