Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate from OpenTracing to OpenTelemetry instrumentation #3646

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

andreasgerstmayr
Copy link
Contributor

@andreasgerstmayr andreasgerstmayr commented May 3, 2024

What this PR does:

Which issue(s) this PR fixes:
Fixes #2224

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
@andreasgerstmayr
Copy link
Contributor Author

This is a first draft to replace OpenTracing with OpenTelemetry SDK.

Some open questions

Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
@zalegrala
Copy link
Contributor

Thanks for the PR, this looks like its on the right track to me.

  • Creating a new tracer per package does seem like the way to go for the transition to the OTEL SDK.
  • unit64 -> int64 attributes is likely fine. I don't think traceql knows the difference.
  • I don't worry too much about backwards compatibility here since I think the number of users who will care is small, but we need to mark it as a breaking change. This will impact us internally and we will need to make preparations for this to land in main.

One concern I have is whether we lose the server spans. I see the dskit change linked above, but I'm not clear this addresses the issue. Currently, many spans are generated on the dskit server, which connects various components since the server is used for all components for grpc and http, and these spans are using the opentracing SDK. Are you able to confirm the impact here? I can test this PR out once you give the nod that you're ready.

I will mention that there has been some effort in the past to align this between all databases, and since dskit is in the mix, we may not be able to make this change on our own. I've had a coupe branches of my own which look similar, but the concern above blocked forward movement. I expect that we can have updated information here next quarter, but the complexity of moving all databases gave us some pause previously, and so some coordination may be prudent.

@andreasgerstmayr
Copy link
Contributor Author

One concern I have is whether we lose the server spans. I see the dskit change linked above, but I'm not clear this addresses the issue. Currently, many spans are generated on the dskit server, which connects various components since the server is used for all components for grpc and http, and these spans are using the opentracing SDK. Are you able to confirm the impact here? I can test this PR out once you give the nod that you're ready.

So far my testing consisted of running two tempo instances (monolithic mode), one with this PR and one without, setting up two data sources in Grafana and opening the explore page in separate tabs, and then comparing random samples of traces. Initially the context propagation of the HTTP GET - api_search trace was broken for example, but this is fixed with grafana/dskit#518.

The HTTP <method> - <routename> spans are created in dskit using OpenTracing SDK, and they are correctly linked with the ones from the OTEL SDK by the OpenTracing bridge (@kvrhdn set up the bridge in #842, I didn't change this part).

Yes, you can test this PR already.
I didn't test the microservices mode yet, I'll do that next (therefore the PR is still in draft mode, and I want to compare more sample traces).

I will mention that there has been some effort in the past to align this between all databases, and since dskit is in the mix, we may not be able to make this change on our own. I've had a coupe branches of my own which look similar, but the concern above blocked forward movement. I expect that we can have updated information here next quarter, but the complexity of moving all databases gave us some pause previously, and so some coordination may be prudent.

I think it should work fine if dskit is instrumented with OpenTracing, as long as we don't remove the OpenTracing-OpenTelemetry bridge.

@andreasgerstmayr
Copy link
Contributor Author

I tested it in microservices mode today, looks good so far.
Please let me know if you find any broken traces.

@knylander-grafana
Copy link
Contributor

Greetings. Wondering if we'll need to update the docs for your PR changes. Will this migration impact users? For example, are there changes to configuration files? IF there are updates, we can create a doc issue for your work or you can add the doc updates to this or a separate PR.

@zalegrala
Copy link
Contributor

I took some time to deploy this, and it looks pretty good to me. Nice work. Here is a 6 hour window, and you can't really tell when the deployment happened.
image

Looking at alloy for the same period, you can see the transition in receiver protocol, but not much else.

image

It might be nice to include an option for using GRPC. Additionally, to @knylander-grafana 's point, we could call out the change in environment variables where it makes sense. I changed OTEL_EXPORTER_JAEGER_ENDPOINT -> OTEL_EXPORTER_OTLP_TRACES_ENDPOINT which was enough.

@andreasgerstmayr
Copy link
Contributor Author

@knylander-grafana Yes, there is impact for some users. I'll update the docs in this PR (this section: https://grafana.com/docs/tempo/latest/operations/monitor/#traces).

I'll also add gRPC support and update the env vars next week.

@zalegrala do you have a metric for the (average) number of spans per trace in your deployment? I'm mainly concerned about broken traces (broken context propagation), this should be visible if the number of spans per trace is different before and after applying this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

migrate from jaeger-sdk to otel-sdk
3 participants