Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add separate activities for Hub methods #55439

Merged
merged 7 commits into from
May 28, 2024
Merged

Add separate activities for Hub methods #55439

merged 7 commits into from
May 28, 2024

Conversation

BrennanConroy
Copy link
Member

@BrennanConroy BrennanConroy commented Apr 30, 2024

First part of #51557

Follows conventions from https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md

Haven't added server.address yet since it looks like we'd need to copy a chunk of code from Kestrel to do it properly:

private static void InitializeConnectionTags(ref TagList tags, in ConnectionMetricsContext metricsContext)


Hub methods are separate spans:

image


Streaming has events:

image

@BrennanConroy BrennanConroy added the area-signalr Includes: SignalR clients and servers label Apr 30, 2024
Comment on lines 782 to 783
if (serviceProvider.GetService<ActivitySource>() is ActivitySource activitySource
&& activitySource.HasListeners())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like SignalR should have its own activity source?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Each component should have its own source, so they can be enabled/disabled on a per-component basis.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name should probably have "server" in it somewhere. There might be a SignalR client activity source in the future, and they'd have different names.

Comment on lines 523 to 526
var tags = new ActivityTagsCollection(
[new("rpc.message.type", "SENT"), new("rpc.message.id", count)]);
var @event = new ActivityEvent("rpc.message", tags: tags);
activity.AddEvent(@event);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a memory concern here? Can this OOM eventually? What if I have a long stream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought someone mentioned it already having a limit. But looking at AddEvent I don't see any limit. So yes, we should add one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should remove this wholesale.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some debate in the OTel community about whether we should use events on traces or log messages. Now that traces and logs can be correlated, using logs is good. Maybe this is a place that should be considered? It also means you can set a lowish level and that can be configured on a per-provider basis.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure those events are super-helpful even if we eliminate memory concern. I remember playing with it and it was really hard to correlate these events or make sense what's going on based on them.

I'd expect users who care about the details to create custom activities on client/server for small operations. And events would be a confusing noise for those who don't care.

I'd just rely on the existing logs (I assume there are some already) for now. OTel will eventually unify span events into logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Let's remove events. Nothing prevents us adding new events in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest changing them to logs, and we can then pick an appropriate verbosity level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove anything from this code path. We already have verbose logs at other layers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will those be fired within the context of this activity, so that they get tagged correctly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but I don't think it's important enough to log every stream item. Though I'd be interested to know if we did something similar in grpc.

{
// https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md#server-attributes
activity.AddTag("rpc.method", methodName);
activity.AddTag("rpc.system", "signalr");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to update otel semconv and add signalr constant to system enum? LMK if you need any help.

Copy link

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rpc semconv are still experimental, so I'd expect some breaking changes there especially around events (verbosity, memory consumption, usefulness concerns).

So I'd suggest to

  • remove events - memory and verbosity are a good concerns to not follow the spec here

"In the lifetime of an RPC stream, an event for each message sent/received on
client and server spans SHOULD be created."

  • be prepared to make some (probably) minor breaking changes in rpc.* attribute names in .NET 10+

[Update] Also, it would be great to create an issue in otel semconv repo and share memory concern for events.

var requestContext = Activity.Current?.Context;
// Get off the parent span.
// This is likely the Http Request span and we want Hub method invocations to not be collected under a long running span.
Activity.Current = null;
Copy link
Member

@JamesNK JamesNK May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the activity created by this method is complete? (or never created at all because no one is listening or sampling)

Is the original request activity restored? I think it should be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already should be restored, that's how async locals work. And the link check in the test(s) verifies this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like the wrong place to do this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do it once in HubConnectionHandler.OnConnectedAsync and store the original activity in a field so we can access it for the linking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the linking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're removing our activities from the parent span, the whole websocket request, it'd be nice to point to the parent via a link so there is some correlation. Also, the default OnConnectedAsync span we were talking about below becomes less useful if you can't see when the request started.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already should be restored, that's how async locals work. And the link check in the test(s) verifies this.

I would expect Activity.Stop() to set Activity.Current to its parent when called, but if we clear out the parent before creating this activity and then call stop, how does it know to restore the old parent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does it know to restore the old parent?

It wouldn't. I'm guessing @BrennanConroy was referring to the implicit restoration of async locals that occurs when an async method returns to its caller? That does work but it feels like a subtle dependency that could easily be broken in some future refactoring. Ideally I'd clear Activity.Current somewhere up the stack to make it obvious what scope of code is executing dissociated from the long-running HTTP server activity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed where we clear the parent as mentioned in #55439 (comment)

This still relies on async local behavior to reset the parent if the user runs code after the SignalR middleware runs, but that seems fine? Unless we want to explicitly reset it when HubConnectionHandler.OnConnectedAsync finishes.

Comment on lines 800 to 801
// See https://github.com/dotnet/aspnetcore/blob/027c60168383421750f01e427e4f749d0684bc02/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs#L308
//activity.AddTag("server.address", ...);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to move this code to a shared file. Kestrel and SignalR can share it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can't properly get this info until #43786 is done.

try
{
// OnConnectedAsync won't work with client results (ISingleClientProxy.InvokeAsync)
InitializeHub(hub, connection, invokeAllowed: false);

activity = CreateActivity(scope.ServiceProvider, nameof(hub.OnConnectedAsync));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the OnConnectedAsync and disconnected invocations always significant? Would it make sense to only report them if the methods are overridden?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're significant in the sense that they describe when the connection is usable. So for example if the handshake took a long time you could see the gap between the connection starting and OnConnectedAsync starting where it's not usable yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep them

@davidfowl
Copy link
Member

@BrennanConroy can you show an end to end trace with the dashboard with connect/disconnect and multiple invocations?

@BrennanConroy
Copy link
Member Author

@BrennanConroy can you show an end to end trace with the dashboard with connect/disconnect and multiple invocations?

Isn't that what the first picture in the PR description shows?

@samsp-msft
Copy link
Member

Why are the resource names in your pictures GUIDs? That is more confusing than it should be.

@BrennanConroy
Copy link
Member Author

BrennanConroy commented May 3, 2024

Why are the resource names in your pictures GUIDs? That is more confusing than it should be.

I was using the preview4 Aspire dashboard docker container. Updating to preview6 cleaned it up.

@BrennanConroy
Copy link
Member Author

Any other feedback?

@JamesNK
Copy link
Member

JamesNK commented May 9, 2024

I looked at what the ASP.NET Core activity does and it calls SetEndTime. Is this necessary or a relic of an older pattern?

// Stop sets the end time if it was unset, but we want it set before we issue the write
// so we do it now.
if (activity.Duration == TimeSpan.Zero)
{
activity.SetEndTime(DateTime.UtcNow);
}

cc @noahfalk @tarekgh

@JamesNK
Copy link
Member

JamesNK commented May 9, 2024

Should Activity.SetStatus be called when there is an unhandled error and set an error status? I don't see that happening anywhere with the Microsoft.AspNetCore activity, but I'm guessing that OpenTelemetry handles changing the status in its final output based on listening to other ASP.NET Core error events.

We want to avoid needing extra logic in OpenTelemetry so this should probably be built into SignalR.

Edit: Also, error.type attribute should be set to Exception.GetType().FullName on unhandled exception.

// This is likely the Http Request span and we want Hub method invocations to not be collected under a long running span.
Activity.Current = null;

var activity = signalRActivitySource.ActivitySource.CreateActivity($"{_fullHubName}/{methodName}", ActivityKind.Server, parentId: null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does every activity created by this method correspond to some distinct incoming SignalR message that is being processed? If yes then all good but just wanted to check.

@noahfalk
Copy link
Member

noahfalk commented May 9, 2024

I'm guessing that OpenTelemetry handles changing the status in its final output based on listening to other ASP.NET Core error events

Yep, it does. This is based on the DiagnosticSource exception notifications.

@noahfalk
Copy link
Member

noahfalk commented May 9, 2024

Is this necessary or a relic of an older pattern?

I think that was to preserve an invariant that the Activity Duration and EndTime would be set prior to delivering the DiagnosticListener callback. Since this code isn't doing any diagnostic listener callbacks (which is just fine) that SetEndTime() should be unnecessary. Activity.Stop() will take care of it.

@BrennanConroy
Copy link
Member Author

Should Activity.SetStatus be called when there is an unhandled error and set an error status? I don't see that happening anywhere with the Microsoft.AspNetCore activity, but I'm guessing that OpenTelemetry handles changing the status in its final output based on listening to other ASP.NET Core error events.

We want to avoid needing extra logic in OpenTelemetry so this should probably be built into SignalR.

Edit: Also, error.type attribute should be set to Exception.GetType().FullName on unhandled exception.

Updated to set the Error status when an exception occurs, and to set the error.type tag.

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label May 21, 2024
try
{
InitializeHub(hub, connection);

activity = StartActivity(connection, scope.ServiceProvider, nameof(hub.OnDisconnectedAsync));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure this doesn't allocate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's allocating?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nothing should be.

// Make sure to call Activity.Stop() once the Hub method completes, and consider calling SetActivityError on exception.
private static Activity? StartActivity(HubConnectionContext connectionContext, IServiceProvider serviceProvider, string methodName)
{
if (serviceProvider.GetService<SignalRActivitySource>() is SignalRActivitySource signalRActivitySource
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this once? Does this need to be per invocation? (We can do this change as a follow up).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a service provider until this point. We could do the whole _activitySource ??= serviceProvider.GetService<SignalRActivitySource>() pattern to only do it once, but that relies on SignalRActivitySource always being a singleton (at least with MEDI, idk how other containers work). This is true today, but if we ever make it public that would be a concern.

@BrennanConroy BrennanConroy merged commit 83aa6b1 into main May 28, 2024
26 checks passed
@BrennanConroy BrennanConroy deleted the brecon/activity branch May 28, 2024 20:24
@dotnet-policy-service dotnet-policy-service bot added this to the 9.0-preview6 milestone May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-signalr Includes: SignalR clients and servers pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants