Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenTelemetry Instrumentation #124

Merged
merged 7 commits into from Feb 26, 2024
Merged

Conversation

Cooksauce
Copy link

@Cooksauce Cooksauce commented Sep 1, 2023

Implements: #53
Background for OpenTelemetry in a messaging context: https://youtu.be/WsTYClGrOVI

As of now, this PR contains a rough implementation approach which is working in the sandbox. See below for how to run this.

I have this set as a draft at the moment b/c I would like a bit of feedback on the general approach before moving forward and cleaning everything up well. Specifically, I'd like to know if the maintainers here approve of adding an Activity object to all of the commands. This was the best approach I came up with due to the pipelining nature of the actual msg publishing.
Note that the objects should be null if no tracing listeners have been wired up.

Todo:

  • Propagation of trace context across publish --> subscribe
  • Create publisher span (i.e. Activity) on pub
  • Create consumer span (i.e. Activity) on sub
  • Complete implementation on all commands
  • Add jetstream receive activities
  • Surface "metrics"... counters, etc (tabled for follow up work)

Open Questions

  1. Should jetstream control messages be surfaced or hidden? (maybe behind a different activity source?)
  2. Should higher level messages (e.g. kv put, etc) be surfaced as such or as their underlying pub / sub commands?

How to run in sandbox:

  1. docker-compose up from ./sandbox directory
  2. run Example.Core.SubscribeRaw project
  3. run Example.Core.PublishModel project
  4. view traces in Jaeger UI at http://localhost:16686

Example trace at current state:

image

return builder;
}

internal class Constants
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to agree on constant values. These probably should not change after release, b/c they could break monitoring rules & alerts users have set up.

@mtmk
Copy link
Collaborator

mtmk commented Sep 1, 2023

@Cooksauce thank you for the contribution!

@mtmk
Copy link
Collaborator

mtmk commented Sep 30, 2023

@Cooksauce btw if you're not already on https://slack.nats.io #dotnet channel we'd love to have you there as well 💯

@mtmk
Copy link
Collaborator

mtmk commented Dec 6, 2023

A few updates:

  • should we have the activity attached on the Header? would that help implementing something like HTTP trace headers as well e.g. in request-reply?
  • There is also some work to create an Aspire hosting add-on. Tracing and metrics are desirable there as well.
  • what about metrics?

@Cooksauce
Copy link
Author

Cooksauce commented Jan 8, 2024

Rebased this on top of the pipelines-send branch from PR #303 since there are quite a bit of overlapping changes. So there's not actually 80 files changed here. There are some build errors to fix, tests, and a fair bit of cleanup, etc to do after that merges. I'm assuming that IO.Pipelines approach has been decided on and will be merged at some point


  • should we have the activity attached on the Header? would that help implementing something like HTTP trace headers as well e.g. in request-reply?

I may be missing understanding you here, but this is how the trace is currently propagated in the implementation. Specifically, here after rebase and changes. Are there other headers you're thinking to add?


  • There is also some work to create an Aspire hosting add-on. Tracing and metrics are desirable there as well.

Aspire OTEL is a direct implementation of the standard, so this plugs right with the standard OpenTelemetry exporter. Aspire just sets the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
(screenshot of Aspire dashboard with latest changes from here below)


  • what about metrics?

I think OpenTelemetry metrics export is a good idea. That said, the implementation for logging, tracing, and metrics is all pretty orthogonal in dotnet, so my opinion would be that be a separate, follow-up PR.


Aspire Trace View

image

@mtmk
Copy link
Collaborator

mtmk commented Jan 8, 2024

I may be missing understanding you here, but this is how the trace is currently propagated in the implementation. Specifically, here after rebase and changes. Are there other headers you're thinking to add?

I was referring to passing activity down in headers instead of NatsMsg i.e. https://github.com/nats-io/nats.net.v2/pull/124/files#diff-82509a9a86e1e07fe3c25e4c89639eac80bb91a7dab2d1fafb46f786ae10232dR124

We also have an effort to minimize NatsMsg footprint since it's a struct in #282

I think OpenTelemetry metrics export is a good idea. That said, the implementation for logging, tracing, and metrics is all pretty orthogonal in dotnet, so my opinion would be that be a separate, follow-up PR.

yep, that makes sense.
edit: opened #316 for metrics

Aspire OTEL is a direct implementation of the standard, so this plugs right with the standard OpenTelemetry exporter

excellent! good to know. just wanted to bring that to your attention. the PR is here dotnet/aspire#1175

@mtmk
Copy link
Collaborator

mtmk commented Jan 8, 2024

@Cooksauce I also noticed @stebet is working on rabbitmq implementation rabbitmq/rabbitmq-dotnet-client#1261

@stebet
Copy link
Contributor

stebet commented Jan 10, 2024

@Cooksauce I also noticed @stebet is working on rabbitmq implementation rabbitmq/rabbitmq-dotnet-client#1261

Yeah, getting close to being done, although there's some questions around Baggage propagation since the baggage functionality for .NET's Activity is not in-spec with OTel Baggage.

@martinjt
Copy link

Just been pointed here, love the work...

Noticed a point on dependencies. We're not planning on taking anything in Contrib for .NET unless there's a big reason to in terms of extra functionality needed. We advise that you create a .OpenTelemetry package in your own namespace for that.

Be careful taking a dependency on the semantic conventions package, we're introducing some change there are some point. Not to say you shouldn't, it just might be better not to take that dependency if you don't need to.

I'm more than happy to help with any questions you have around the spec and how to implement in the best way like I did with a few other OSS projects like Rabbit.

@martinjt
Copy link

Had a quick look... couple of small tweaks you can make...

check ActivitySource.HasListeners instead of getting the Activity and checking for null.

Also, the comment around propagation is yes, it should propagate the Activity.Current properties, regardless of whether your instrumentation is active.

Finally, when creating/starting activities, add as many tags as you can inside the create method, it better perf (upto 8 and it uses a different faster approach) and it also provides people with the ability to Sample better as the parameters in the create/start are all you get access to.

I do open office hours, feel free to book some time if I can help move it along!

@mtmk
Copy link
Collaborator

mtmk commented Jan 27, 2024

Thank you @martinjt we're eager to learn from you and we'd love to take you up on your kind offer. I'm still learning, @Cooksauce is our expert here. btw I noticed we're a bit behind. I'll get the branch up to date as soon as I can.

Edit: just merged main. should be up to date now.

@mtmk
Copy link
Collaborator

mtmk commented Jan 27, 2024

Noticed a point on dependencies. We're not planning on taking anything in Contrib for .NET unless there's a big reason to in terms of extra functionality needed. We advise that you create a .OpenTelemetry package in your own namespace for that.

Just created NATS.Net.OpenTelemetry package (empty shell at the moment). Let me know if you all happy with the package naming or prefer something else @Cooksauce @stebet @martinjt @larsfjerm @caleblloyd

@Cooksauce
Copy link
Author

Reworked this quite a bit to address the discussion here, and to make it play more nicely with jetstream & the higher level features built on top of that. I think it's ready for a real review now with deeper scrutiny.

  • activity context should propagate even when Nats telemetry is off
  • activity is moved into headers to reduce msg size (note that this has the effect of allocating NatsHeaders if they were otherwise not present)

check ActivitySource.HasListeners instead of getting the Activity and checking for null.

@martinjt I'm curious on this one, is there a reason you mention this other than the allocation of parameters such as tags
in .StartActivity(...)?

Finally, when creating/starting activities, add as many tags as you can inside the create method, it better perf (upto 8 and it uses a different faster approach) and it also provides people with the ability to Sample better as the parameters in the create/start are all you get access to.

@martinjt Thanks for this tip! The source definitely looks more optimized.


@mtmk Decisions to make before moving forward:

  1. Are the Nats folks open to tweaking NatsMsg<T> to expose NatsConnection so that these couple things can be addressed? https://github.com/Cooksauce/nats.net.v2/blob/611093cf416d6fde1360d94191e806b2360a0843/src/NATS.Client.Core/NatsMsg.cs#L174
  2. Are we okay leaving out OTEL Baggage support similar to RabbitMQ? - github.com/rabbitmq/rabbitmq-dotnet-client/pull/1261#issuecomment-1875667625

@Cooksauce Cooksauce marked this pull request as ready for review January 29, 2024 12:47
@mtmk
Copy link
Collaborator

mtmk commented Jan 30, 2024

@mtmk Decisions to make before moving forward:

  1. Are the Nats folks open to tweaking NatsMsg<T> to expose NatsConnection so that these couple things can be addressed? https://github.com/Cooksauce/nats.net.v2/blob/611093cf416d6fde1360d94191e806b2360a0843/src/NATS.Client.Core/NatsMsg.cs#L174

I think we can change the NatsMsg.ctor to take in a concrete NatsConnection since we do that pretty much with everything else (e.g. contexts, NatsJSMsg). For unit testing purposes we have NatsMsg-INatsMsg cast as an option. I don't think anyone will be passing interfaces to NatsMsg for unit testing if so I think it'd be reasonable to ask them to convert to accepting INatsMsg. So I'd say yes we can change NatsMsg to take NatsConnection (rather than INatsConnection) unless someone hollers otherwise cc @caleblloyd @sspates @to11mtm @paulomf

  1. Are we okay leaving out OTEL Baggage support similar to RabbitMQ? - github.com/rabbitmq/rabbitmq-dotnet-client/pull/1261#issuecomment-1875667625

No idea TBH. If it's good enough for Rabbit good for us too I'm guessing 😅
@stebet do you have an objection?

@oising
Copy link

oising commented Jan 30, 2024

  1. Are we okay leaving out OTEL Baggage support similar to RabbitMQ? - github.com/rabbitmq/rabbitmq-dotnet-client/pull/1261#issuecomment-1875667625

No idea TBH. If it's good enough for Rabbit good for us too I'm guessing 😅 @stebet do you have an objection?

Just my 2c, I think it's fine to leave it for another PR. The cooperative story behind OTEL's Baggage.* and Activity.Baggage.* APIs is vague at best, and frankly I could never get them to work reliably in tandem. Microsoft's legacy uses Trace-Context as a carrier, and OTEL use "baggage." Hopefully Microsoft will soon clear up the confusion caused by hiding their OTEL support inside of legacy APIs.

@to11mtm
Copy link
Contributor

to11mtm commented Jan 30, 2024

@mtmk Decisions to make before moving forward:

  1. Are the Nats folks open to tweaking NatsMsg<T> to expose NatsConnection so that these couple things can be addressed? https://github.com/Cooksauce/nats.net.v2/blob/611093cf416d6fde1360d94191e806b2360a0843/src/NATS.Client.Core/NatsMsg.cs#L174

I think we can change the NatsMsg.ctor to take in a concrete NatsConnection since we do that pretty much with everything else (e.g. contexts, NatsJSMsg). For unit testing purposes we have NatsMsg-INatsMsg cast as an option. I don't think anyone will be passing interfaces to NatsMsg for unit testing if so I think it'd be reasonable to ask them to convert to accepting INatsMsg. So I'd say yes we can change NatsMsg to take NatsConnection (rather than INatsConnection) unless someone hollers otherwise cc @caleblloyd @sspates @to11mtm @paulomf

  1. Are we okay leaving out OTEL Baggage support similar to RabbitMQ? - github.com/rabbitmq/rabbitmq-dotnet-client/pull/1261#issuecomment-1875667625

No idea TBH. If it's good enough for Rabbit good for us too I'm guessing 😅 @stebet do you have an objection?

Thanks for the tag here. Thanks to a tangentially related conversation, I actually have two important-ish points/questions.

1: Can we get benchmarks for this before/after? To be clear we don't have to have telemetry actually 'wired up', moreso want to make sure perf impact is minimized for minimal cases.

2: Agree OTEL could be a separate PR if folks really need it (AFAIR it's the 'old thing' anyway) but no strong feelings either way.

@martinjt
Copy link

martinjt commented Jan 30, 2024 via email

@stebet
Copy link
Contributor

stebet commented Jan 30, 2024

Also the plan is not to leave baggage propagation out for the RabbitMQ client. It'll be added very soon, we just wanted to get the core ActivitySource implementation PR done first.

@mtmk
Copy link
Collaborator

mtmk commented Jan 30, 2024

Thanks for the tag here.

Thank you! I appreciate everyone's help here since my OTEL knowledge is quite limited until I can find some time to get up to speed with it all.

Thanks to a tangentially related conversation, I actually have two important-ish points/questions.

1: Can we get benchmarks for this before/after? To be clear we don't have to have telemetry actually 'wired up', moreso want to make sure perf impact is minimized for minimal cases.

My quick publish only benchmarks didn't show anything alarming but I plan to run more comprehensive benchmarks especially to make sure we're not affecting applications not using OTEL. Could you expand a little on what you meant by "we don't have to have it wired up" please?

2: Agree OTEL could be a separate PR if folks really need it (AFAIR it's the 'old thing' anyway) but no strong feelings either way.

👍

@oising
Copy link

oising commented Jan 30, 2024

@martinjt said:
I'm not sure what you mean by OTEL being the old way and leaving that to another PR?

I think they mean "OTEL baggage," though I don't think that changes your sentiment. Baggage isn't an old thing, and is a first-class and orthogonal feature to Tracing.

/// <param name="builder"><see cref="TracerProviderBuilder"/> being configured.</param>
/// <param name="includeInternal">Include traces from internal messaging used for control flow and lower level usage during high level abstractions (e.g. JetStream, KvStore, etc)</param>
/// <returns>The same instance of <see cref="TracerProviderBuilder"/> to chain the calls.</returns>
public static TracerProviderBuilder AddNatsInstrumentation(this TracerProviderBuilder builder, bool includeInternal = false)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this library needed? It's wrapper around calling a method with these 2 strings.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty standard. Providing a second package that provides a strongly typed extension to add the instrumentation.

What we've found is that people don't want the magic strings for AddSource() and they want the semantic meaning of "Adding instrumentation". It's also providing the semantics and documentation around what internal vs standard telemetry is.

Adding a second package means that you don't need pollute the main packages with OpenTelemetry. That said, that will be required for baggage to work properly at some point soon though.

Copy link

@davidfowl davidfowl Feb 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get why but IMO, I wouldn't use this package because it does so little. I'm hoping we can get to a model where this is fully configuration based like logging and metrics (in .NET 8). Hopefully we can pull this off for tracing in later versions.

@to11mtm
Copy link
Contributor

to11mtm commented Feb 25, 2024 via email

@martinjt
Copy link

martinjt commented Feb 25, 2024 via email

@davidfowl
Copy link

davidfowl commented Feb 25, 2024

<rant>

Well .NET did all of this work to make sure that library authors could use core primitives to emit traces, metrics and logs without taking on extra dependencies. Except these core libraries still need an extra companion package to call an extension method to turn on the telemetry. The story is incomplete, and we'll get to a point where that isn't needed. I hope these glue packages go away.

</rant>

@martinjt
Copy link

martinjt commented Feb 25, 2024 via email

# Conflicts:
#	sandbox/BlazorWasm/Server/Program.cs
#	src/NATS.Client.Core/NatsConnection.cs
#	src/NATS.Client.JetStream/NatsJSContext.Consumers.cs
#	src/NATS.Client.JetStream/NatsJSContext.cs
#	src/NATS.Client.ObjectStore/NatsObjStore.cs
@mtmk mtmk removed the in-progress label Feb 26, 2024
@mtmk mtmk changed the base branch from main to stage-otel February 26, 2024 16:52
Copy link
Collaborator

@mtmk mtmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @Cooksauce and everyone for your input.
We will merge this into a staging branch to tidy up examples etc.

@mtmk mtmk merged commit 9adf2d7 into nats-io:stage-otel Feb 26, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants