Add separate activities for Hub methods #55439

BrennanConroy · 2024-04-30T18:24:21Z

First part of #51557

Follows conventions from https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md

Haven't added server.address yet since it looks like we'd need to copy a chunk of code from Kestrel to do it properly:

aspnetcore/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs

Line 308 in 027c601

    private static void InitializeConnectionTags(ref TagList tags, in ConnectionMetricsContext metricsContext)

Hub methods are separate spans (screenshot not included).

Streaming has events (screenshot not included).

davidfowl · 2024-04-30T18:48:45Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

+        if (serviceProvider.GetService<ActivitySource>() is ActivitySource activitySource
+            && activitySource.HasListeners())


Feels like SignalR should have its own activity source?

Yes. Each component should have its own source, so they can be enabled/disabled on a per-component basis.

The name should probably have "server" in it somewhere. There might be a SignalR client activity source in the future, and they'd have different names.
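The per-component idea above can be sketched with plain System.Diagnostics types. The source names below are hypothetical, picked only for illustration, and are not taken from this PR. A listener that subscribes by name enables the server source while leaving a (possible future) client source off:

```csharp
using System;
using System.Diagnostics;

// Hypothetical source names, used only for this sketch.
var serverSource = new ActivitySource("Example.SignalR.Server");
var clientSource = new ActivitySource("Example.SignalR.Client");

using var listener = new ActivityListener
{
    // Enable only the server component.
    ShouldListenTo = source => source.Name == "Example.SignalR.Server",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

// StartActivity returns null when nothing listens to the source.
using var serverActivity = serverSource.StartActivity("Chat/Send", ActivityKind.Server);
using var clientActivity = clientSource.StartActivity("Chat/Send", ActivityKind.Client);

Console.WriteLine(serverActivity is not null); // True
Console.WriteLine(clientActivity is null);     // True
```

Because the two sources have different names, a consumer can opt in to one without paying for the other.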

davidfowl · 2024-04-30T18:52:01Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

+                    var tags = new ActivityTagsCollection(
+                        [new("rpc.message.type", "SENT"), new("rpc.message.id", count)]);
+                    var @event = new ActivityEvent("rpc.message", tags: tags);
+                    activity.AddEvent(@event);


Isn't there a memory concern here? Can this OOM eventually? What if I have a long stream?

I thought someone mentioned it already having a limit. But looking at AddEvent I don't see any limit. So yes, we should add one.

I feel like we should remove this wholesale.

There is some debate in the OTel community about whether we should use events on traces or log messages. Now that traces and logs can be correlated, using logs is good. Maybe this is a place that should be considered? It also means you can set a lowish level and that can be configured on a per-provider basis.

I'm not sure those events are super-helpful even if we eliminate the memory concern. I remember playing with it and it was really hard to correlate these events or make sense of what's going on based on them.

I'd expect users who care about the details to create custom activities on client/server for small operations. And events would be a confusing noise for those who don't care.

I'd just rely on the existing logs (I assume there are some already) for now. OTel will eventually unify span events into logs.

+1

Let's remove events. Nothing prevents us adding new events in the future.

I would suggest changing them to logs, and we can then pick an appropriate verbosity level.

We should remove anything from this code path. We already have verbose logs at other layers.

Will those be fired within the context of this activity, so that they get tagged correctly?

Not sure, but I don't think it's important enough to log every stream item. Though I'd be interested to know if we did something similar in grpc.
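The memory concern in this thread can be illustrated with a small sketch. It assumes, as observed above, that Activity.AddEvent applies no cap of its own; the source name and the item count are made up for the example. Every streamed item adds one more event that stays attached to the activity while it is alive:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

var source = new ActivitySource("Example.Streaming"); // hypothetical name
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

using var activity = source.StartActivity("Hub/StreamItems", ActivityKind.Server)!;

// Simulate a long stream: one event per item, as the removed code did.
for (var count = 0; count < 10_000; count++)
{
    var tags = new ActivityTagsCollection
    {
        ["rpc.message.type"] = "SENT",
        ["rpc.message.id"] = count,
    };
    activity.AddEvent(new ActivityEvent("rpc.message", tags: tags));
}

// All events are retained on the activity until it is collected.
var retained = activity.Events.Count();
Console.WriteLine(retained);
```

For an unbounded stream the retained count grows without limit, which is the reason given above for dropping per-item events.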

lmolkova · 2024-05-01T02:30:09Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

+            {
+                // https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md#server-attributes
+                activity.AddTag("rpc.method", methodName);
+                activity.AddTag("rpc.system", "signalr");


Would it be possible to update the OTel semconv and add a signalr constant to the system enum? LMK if you need any help.

lmolkova

rpc semconv are still experimental, so I'd expect some breaking changes there especially around events (verbosity, memory consumption, usefulness concerns).

So I'd suggest to:

- remove events. Memory and verbosity are good reasons not to follow the spec here:

  > "In the lifetime of an RPC stream, an event for each message sent/received on client and server spans SHOULD be created."

- be prepared to make some (probably) minor breaking changes in rpc.* attribute names in .NET 10+

[Update] Also, it would be great to create an issue in the OTel semconv repo and share the memory concern for events.

JamesNK · 2024-05-01T06:16:25Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

+            var requestContext = Activity.Current?.Context;
+            // Get off the parent span.
+            // This is likely the Http Request span and we want Hub method invocations to not be collected under a long running span.
+            Activity.Current = null;


What happens when the activity created by this method is complete? (or never created at all because no one is listening or sampling)

Is the original request activity restored? I think it should be.

It already should be restored, that's how async locals work. And the link check in the test(s) verifies this.

This feels like the wrong place to do this.

We could do it once in HubConnectionHandler.OnConnectedAsync and store the original activity in a field so we can access it for the linking.

Do we want the linking?

Since we're removing our activities from the parent span, the whole websocket request, it'd be nice to point to the parent via a link so there is some correlation. Also, the default OnConnectedAsync span we were talking about below becomes less useful if you can't see when the request started.

That makes sense.

> It already should be restored, that's how async locals work. And the link check in the test(s) verifies this.

I would expect Activity.Stop() to set Activity.Current to its parent when called, but if we clear out the parent before creating this activity and then call stop, how does it know to restore the old parent?

> how does it know to restore the old parent?

It wouldn't. I'm guessing @BrennanConroy was referring to the implicit restoration of async locals that occurs when an async method returns to its caller? That does work but it feels like a subtle dependency that could easily be broken in some future refactoring. Ideally I'd clear Activity.Current somewhere up the stack to make it obvious what scope of code is executing dissociated from the long-running HTTP server activity.

Changed where we clear the parent as mentioned in #55439 (comment)

This still relies on async local behavior to reset the parent if the user runs code after the SignalR middleware runs, but that seems fine? Unless we want to explicitly reset it when HubConnectionHandler.OnConnectedAsync finishes.
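The behavior discussed in this thread can be sketched in isolation. This is not the PR's code: the source and activity names are hypothetical, and the local async function stands in for the hub invocation path. It shows two things: Activity.Stop() falls back to the stopped activity's own parent (none here), and the caller still sees the request activity because changes to async locals made inside an async method do not flow back out:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

var source = new ActivitySource("Example.Hub"); // hypothetical name
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

// Stands in for the long-running HTTP/WebSocket request activity.
using var request = source.StartActivity("HttpRequest")!;

await InvokeHubMethodAsync();

// The callee's change to Activity.Current did not leak to the caller.
Console.WriteLine(Activity.Current == request); // True

async Task InvokeHubMethodAsync()
{
    var requestContext = Activity.Current!.Context;
    Activity.Current = null; // get off the parent span

    // No parent, but a link back to the request for correlation.
    var links = new[] { new ActivityLink(requestContext) };
    var hubActivity = source.StartActivity(
        "Hub/Method", ActivityKind.Server, parentContext: default, links: links);

    await Task.Yield();
    hubActivity?.Stop();

    // Stop() falls back to the hub activity's parent, which is null, not the request.
    Console.WriteLine(Activity.Current is null); // True
}
```

The restoration seen by the caller is implicit, which is why clearing Activity.Current at an obvious scope boundary was suggested above.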

JamesNK · 2024-05-01T06:19:58Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

+                // See https://github.com/dotnet/aspnetcore/blob/027c60168383421750f01e427e4f749d0684bc02/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs#L308
+                //activity.AddTag("server.address", ...);


Feel free to move this code to a shared file. Kestrel and SignalR can share it.

Looks like we can't properly get this info until #43786 is done.

JamesNK · 2024-05-01T06:21:53Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

        try
        {
            // OnConnectedAsync won't work with client results (ISingleClientProxy.InvokeAsync)
            InitializeHub(hub, connection, invokeAllowed: false);

+            activity = CreateActivity(scope.ServiceProvider, nameof(hub.OnConnectedAsync));


Are the OnConnectedAsync and OnDisconnectedAsync invocations always significant? Would it make sense to only report them if the methods are overridden?

They're significant in the sense that they describe when the connection is usable. So for example if the handshake took a long time you could see the gap between the connection starting and OnConnectedAsync starting where it's not usable yet.

I think we should keep them

davidfowl · 2024-05-03T13:43:52Z

@BrennanConroy can you show an end to end trace with the dashboard with connect/disconnect and multiple invocations?

BrennanConroy · 2024-05-03T15:09:49Z

> @BrennanConroy can you show an end to end trace with the dashboard with connect/disconnect and multiple invocations?

Isn't that what the first picture in the PR description shows?

samsp-msft · 2024-05-03T18:40:54Z

Why are the resource names in your pictures GUIDs? That is more confusing than it should be.

BrennanConroy · 2024-05-03T22:21:42Z

> Why are the resource names in your pictures GUIDs? That is more confusing than it should be.

I was using the preview4 Aspire dashboard docker container. Updating to preview6 cleaned it up.

BrennanConroy · 2024-05-08T16:37:30Z

Any other feedback?

JamesNK · 2024-05-09T00:25:58Z

I looked at what the ASP.NET Core activity does and it calls SetEndTime. Is this necessary or a relic of an older pattern?

aspnetcore/src/Hosting/Hosting/src/Internal/HostingApplicationDiagnostics.cs

Lines 518 to 523 in e925769

    // Stop sets the end time if it was unset, but we want it set before we issue the write
    // so we do it now.
    if (activity.Duration == TimeSpan.Zero)
    {
        activity.SetEndTime(DateTime.UtcNow);
    }

cc @noahfalk @tarekgh

JamesNK · 2024-05-09T00:32:39Z

Should Activity.SetStatus be called when there is an unhandled error and set an error status? I don't see that happening anywhere with the Microsoft.AspNetCore activity, but I'm guessing that OpenTelemetry handles changing the status in its final output based on listening to other ASP.NET Core error events.

We want to avoid needing extra logic in OpenTelemetry so this should probably be built into SignalR.

Edit: Also, error.type attribute should be set to Exception.GetType().FullName on unhandled exception.

noahfalk · 2024-05-09T10:02:53Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

+            // This is likely the Http Request span and we want Hub method invocations to not be collected under a long running span.
+            Activity.Current = null;
+
+            var activity = signalRActivitySource.ActivitySource.CreateActivity($"{_fullHubName}/{methodName}", ActivityKind.Server, parentId: null,


Does every activity created by this method correspond to some distinct incoming SignalR message that is being processed? If yes then all good but just wanted to check.

noahfalk · 2024-05-09T10:22:04Z

> I'm guessing that OpenTelemetry handles changing the status in its final output based on listening to other ASP.NET Core error events

Yep, it does. This is based on the DiagnosticSource exception notifications.

noahfalk · 2024-05-09T10:25:00Z

> Is this necessary or a relic of an older pattern?

I think that was to preserve an invariant that the Activity Duration and EndTime would be set prior to delivering the DiagnosticListener callback. Since this code isn't doing any diagnostic listener callbacks (which is just fine) that SetEndTime() should be unnecessary. Activity.Stop() will take care of it.

BrennanConroy · 2024-05-13T17:27:57Z

> Should Activity.SetStatus be called when there is an unhandled error and set an error status? I don't see that happening anywhere with the Microsoft.AspNetCore activity, but I'm guessing that OpenTelemetry handles changing the status in its final output based on listening to other ASP.NET Core error events.
>
> We want to avoid needing extra logic in OpenTelemetry so this should probably be built into SignalR.
>
> Edit: Also, error.type attribute should be set to Exception.GetType().FullName on unhandled exception.

Updated to set the Error status when an exception occurs, and to set the error.type tag.
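A minimal sketch of what that update describes, assuming a helper shaped like the SetActivityError mentioned elsewhere in this PR. The helper body, the source name, and the thrown exception are assumptions for illustration, not the merged code:

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("Example.Hub"); // hypothetical name
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

var activity = source.StartActivity("Hub/Throwing", ActivityKind.Server);
try
{
    // Stands in for a hub method that fails.
    throw new InvalidOperationException("boom");
}
catch (Exception ex)
{
    SetActivityError(activity, ex);
}
finally
{
    activity?.Stop();
}

Console.WriteLine(activity?.Status);                   // Error
Console.WriteLine(activity?.GetTagItem("error.type")); // System.InvalidOperationException

// Assumed shape of the helper: tag the exception type and mark the span as failed.
static void SetActivityError(Activity? activity, Exception exception)
{
    activity?.SetTag("error.type", exception.GetType().FullName);
    activity?.SetStatus(ActivityStatusCode.Error);
}
```

Setting both values on the activity itself means a consumer does not need extra logic to infer failure from other error events.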

davidfowl · 2024-05-28T05:29:23Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

        try
        {
            InitializeHub(hub, connection);

+            activity = StartActivity(connection, scope.ServiceProvider, nameof(hub.OnDisconnectedAsync));


Can we make sure this doesn't allocate?

What's allocating?

Ah nothing should be.

davidfowl · 2024-05-28T06:36:01Z

src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs

+    // Make sure to call Activity.Stop() once the Hub method completes, and consider calling SetActivityError on exception.
+    private static Activity? StartActivity(HubConnectionContext connectionContext, IServiceProvider serviceProvider, string methodName)
+    {
+        if (serviceProvider.GetService<SignalRActivitySource>() is SignalRActivitySource signalRActivitySource


Can we do this once? Does this need to be per invocation? (We can do this change as a follow up).

We don't have a service provider until this point. We could do the whole _activitySource ??= serviceProvider.GetService<SignalRActivitySource>() pattern to only do it once, but that relies on SignalRActivitySource always being a singleton (at least with MEDI, idk how other containers work). This is true today, but if we ever make it public that would be a concern.
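The singleton caveat can be illustrated with a small sketch. To stay self-contained it uses the BCL's ServiceContainer as the IServiceProvider and a plain ActivitySource as the service, both stand-ins rather than what SignalR uses. Once the first lookup is cached with `??=`, later changes to the registration are never observed:

```csharp
using System;
using System.ComponentModel.Design;
using System.Diagnostics;

// ServiceContainer is a simple BCL IServiceProvider, used here as a stand-in.
var container = new ServiceContainer();
container.AddService(typeof(ActivitySource), new ActivitySource("Example.First"));

ActivitySource? cached = null;

// Resolve once and reuse the cached instance on later calls.
ActivitySource? Resolve(IServiceProvider serviceProvider) =>
    cached ??= (ActivitySource?)serviceProvider.GetService(typeof(ActivitySource));

var first = Resolve(container);

// Swap the registration, as a non-singleton lifetime effectively would.
container.RemoveService(typeof(ActivitySource));
container.AddService(typeof(ActivitySource), new ActivitySource("Example.Second"));

var second = Resolve(container);

Console.WriteLine(ReferenceEquals(first, second)); // True
Console.WriteLine(second?.Name);                   // Example.First
```

The cached field saves a lookup per invocation, at the cost of pinning whichever instance was resolved first.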

BrennanConroy added 2 commits April 29, 2024 16:00

Add Activity when Hub method is called

4bdd389

streaming response and fix link

99ec12c

BrennanConroy added the area-signalr Includes: SignalR clients and servers label Apr 30, 2024

BrennanConroy requested review from davidfowl, JamesNK and samsp-msft April 30, 2024 18:24

BrennanConroy requested a review from halter73 as a code owner April 30, 2024 18:24

davidfowl reviewed Apr 30, 2024

lmolkova reviewed May 1, 2024

JamesNK reviewed May 1, 2024

fb

4e5d267

BrennanConroy mentioned this pull request May 4, 2024

Consider adding Links section when viewing traces/spans dotnet/aspire#4085

Closed

fix test

261f691

noahfalk reviewed May 9, 2024

davidfowl reviewed May 9, 2024

fb and error

25538d1

dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label May 21, 2024

davidfowl reviewed May 22, 2024

BrennanConroy added 2 commits May 27, 2024 17:35

move setting

1d0f24f

le sigh

7573553

davidfowl reviewed May 28, 2024

davidfowl approved these changes May 28, 2024

BrennanConroy merged commit 83aa6b1 into main May 28, 2024
26 checks passed

BrennanConroy deleted the brecon/activity branch May 28, 2024 20:24

dotnet-policy-service bot added this to the 9.0-preview6 milestone May 28, 2024
