Distributed Tracing: Adds Request charge and Payload size Threshold options #4433

Merged
merged 14 commits into from
May 2, 2024
24 changes: 18 additions & 6 deletions Microsoft.Azure.Cosmos/src/CosmosClientTelemetryOptions.cs
@@ -10,17 +10,30 @@ namespace Microsoft.Azure.Cosmos
public class CosmosClientTelemetryOptions
{
/// <summary>
/// Disable sending telemetry to service, <see cref="Microsoft.Azure.Cosmos.CosmosThresholdOptions"/> is not applicable to this as of now.
/// Disable sending telemetry data to Microsoft, <see cref="Microsoft.Azure.Cosmos.CosmosThresholdOptions"/> is not applicable for this.
/// </summary>
/// <remarks>This option will disable sending telemetry to service.even it is opt-in from portal.</remarks>
/// <remarks>This feature has to be enabled at 2 places:
/// <list type="bullet">
/// <item>Opt-in from portal to subscribe for this feature.</item>
/// <item>Setting this property to false, to enable it for a particular client instance.</item>
/// </list>
/// </remarks>
/// <value>true</value>
public bool DisableSendingMetricsToService { get; set; } = true;

/// <summary>
/// This method enable/disable generation of operation level <see cref="System.Diagnostics.Activity"/> if listener is subscribed to the Source Name "Azure.Cosmos.Operation".
/// Enables or disables generation of operation-level <see cref="System.Diagnostics.Activity"/> when a listener is subscribed to the source name <i>"Azure.Cosmos.Operation"</i> (to capture operation-level traces)
/// and <i>"Azure-Cosmos-Operation-Request-Diagnostics"</i> (to capture events with request diagnostics JSON).
/// </summary>
/// <value>false</value>
/// <remarks> Please Refer https://opentelemetry.io/docs/instrumentation/net/exporters/ to know more about open telemetry exporters</remarks>
/// <remarks>
/// You can set different threshold values through <see cref="Microsoft.Azure.Cosmos.CosmosThresholdOptions"/>.
/// Events with request diagnostics JSON are generated when any configured threshold is crossed; for failed requests they are always generated.
/// Emitting the more detailed diagnostics has some overhead, so the recommendation is to choose thresholds that reduce the noise level
/// and emit detailed diagnostics only when there is real business impact.<br></br>
/// See <a href="https://opentelemetry.io/docs/instrumentation/net/exporters/">OpenTelemetry .NET exporters</a> to learn more about the available OpenTelemetry exporters.<br></br>
/// See <a href="https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/sdk-observability?tabs=dotnet">Azure Cosmos DB SDK observability</a> to learn more about this feature.
/// </remarks>
public bool DisableDistributedTracing { get; set; } =
#if PREVIEW
false;
@@ -30,9 +43,8 @@ public class CosmosClientTelemetryOptions

/// <summary>
/// Threshold values for Distributed Tracing.
/// These values decides whether to generate operation level <see cref="System.Diagnostics.Tracing.EventSource"/> with request diagnostics or not.
/// These values decide whether to generate an <see cref="System.Diagnostics.Tracing.EventSource"/> with request diagnostics or not.
/// </summary>
public CosmosThresholdOptions CosmosThresholdOptions { get; set; } = new CosmosThresholdOptions();

}
}
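The options in this file are set per client instance. A minimal sketch of how a caller might wire them up is shown below, assuming the `CosmosClientOptions.CosmosClientTelemetryOptions` property of the Microsoft.Azure.Cosmos package; the class name `TelemetryOptionsExample` and every threshold value are hypothetical and chosen only for illustration.

```csharp
using System;
using Microsoft.Azure.Cosmos;

// Hypothetical helper; not part of this PR.
public static class TelemetryOptionsExample
{
    public static CosmosClientOptions BuildClientOptions()
    {
        return new CosmosClientOptions
        {
            CosmosClientTelemetryOptions = new CosmosClientTelemetryOptions
            {
                // Turn operation-level tracing on for this client instance.
                DisableDistributedTracing = false,

                // Illustrative values only; pick thresholds that match your workload.
                CosmosThresholdOptions = new CosmosThresholdOptions
                {
                    PointOperationLatencyThreshold = TimeSpan.FromMilliseconds(500),
                    NonPointOperationLatencyThreshold = TimeSpan.FromSeconds(2),
                    RequestChargeThreshold = 100.0,
                    PayloadSizeThresholdInBytes = 64 * 1024,
                },
            },
        };
    }
}
```

The returned options object would then be passed to the `CosmosClient` constructor together with the account endpoint and credentials.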
34 changes: 31 additions & 3 deletions Microsoft.Azure.Cosmos/src/CosmosThresholdOptions.cs
@@ -7,20 +7,48 @@ namespace Microsoft.Azure.Cosmos
using System;

/// <summary>
/// Threshold values for Distributed Tracing
/// This class describes the thresholds at which more detailed diagnostics events are emitted, if subscribed, for an operation due to high latency,
/// high RU consumption or high payload sizes.
/// </summary>
public class CosmosThresholdOptions
{
/// <summary>
/// Latency Threshold for non point operations i.e. Query
/// Can be used to define custom latency thresholds. When the latency threshold is exceeded more detailed
/// diagnostics will be emitted (including the request diagnostics). There is some overhead of emitting the
/// more detailed diagnostics - so recommendation is to choose latency thresholds that reduce the noise level
/// and only emit detailed diagnostics when there is really business impact seen.
/// The default value for the non-point operation latency threshold is 3 seconds.
/// Non-point operations are all operations except ReadItem, CreateItem, UpsertItem, ReplaceItem, PatchItem and DeleteItem.
/// </summary>
/// <value>3 seconds</value>
public TimeSpan NonPointOperationLatencyThreshold { get; set; } = TimeSpan.FromSeconds(3);

/// <summary>
/// Latency Threshold for point operations i.e operation other than Query
/// Can be used to define custom latency thresholds. When the latency threshold is exceeded more detailed
/// diagnostics will be emitted (including the request diagnostics). There is some overhead of emitting the
/// more detailed diagnostics - so recommendation is to choose latency thresholds that reduce the noise level
/// and only emit detailed diagnostics when there is really business impact seen.
/// The default value for the point operation latency threshold is 1 second.
/// Point Operations are: (ReadItem, CreateItem, UpsertItem, ReplaceItem, PatchItem or DeleteItem)
/// </summary>
/// <value>1 second</value>
public TimeSpan PointOperationLatencyThreshold { get; set; } = TimeSpan.FromSeconds(1);

/// <summary>
/// Can be used to define a custom RU (request charge) threshold. When the threshold is exceeded more detailed
/// diagnostics will be emitted (including the request diagnostics). There is some overhead of emitting the
/// more detailed diagnostics - so recommendation is to choose a request charge threshold that reduces the noise
/// level and only emits detailed diagnostics when the request charge is significantly higher than expected.
/// </summary>
public double? RequestChargeThreshold { get; set; } = null;

/// <summary>
/// Can be used to define a payload size threshold. When the threshold is exceeded for either request or
/// response payloads more detailed diagnostics will be emitted (including the request diagnostics).
/// There is some overhead of emitting the more detailed diagnostics - so recommendation is to choose a
/// payload size threshold that reduces the noise level and only emits detailed diagnostics when the payload size
/// is significantly higher than expected.
/// </summary>
public int? PayloadSizeThresholdInBytes { get; set; } = null;
}
}
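How the four thresholds combine can be sketched without the SDK. The types below (`ThresholdSketch`, `ThresholdSketchEvaluator`) are hypothetical stand-ins, not the SDK classes; they only mirror the defaults documented above and the comparison operators used in this PR (latency strictly greater than its threshold, request charge and payload size at or above theirs, and a null threshold meaning "not configured").

```csharp
using System;

// Hypothetical stand-in for CosmosThresholdOptions with the same defaults.
public sealed class ThresholdSketch
{
    public TimeSpan PointOperationLatencyThreshold { get; set; } = TimeSpan.FromSeconds(1);
    public TimeSpan NonPointOperationLatencyThreshold { get; set; } = TimeSpan.FromSeconds(3);
    public double? RequestChargeThreshold { get; set; } = null;
    public int? PayloadSizeThresholdInBytes { get; set; } = null;
}

public static class ThresholdSketchEvaluator
{
    // Returns true when any configured threshold is crossed for a successful request.
    public static bool ShouldEmitDiagnostics(
        ThresholdSketch config,
        bool isPointOperation,
        TimeSpan latency,
        double requestCharge,
        int payloadBytes)
    {
        TimeSpan latencyLimit = isPointOperation
            ? config.PointOperationLatencyThreshold
            : config.NonPointOperationLatencyThreshold;

        return latency > latencyLimit
            || (config.RequestChargeThreshold is not null
                && config.RequestChargeThreshold <= requestCharge)
            || (config.PayloadSizeThresholdInBytes is not null
                && config.PayloadSizeThresholdInBytes <= payloadBytes);
    }
}
```

With the defaults, only latency can trigger detailed diagnostics; the request charge and payload size checks take part only once a value is assigned.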
@@ -15,7 +15,7 @@ namespace Microsoft.Azure.Cosmos.Telemetry
internal sealed class CosmosDbEventSource : AzureEventSource
{
internal const string EventSourceName = "Azure-Cosmos-Operation-Request-Diagnostics";

private static CosmosDbEventSource Singleton { get; } = new CosmosDbEventSource();

private CosmosDbEventSource()
@@ -35,17 +35,26 @@ public static bool IsEnabled(EventLevel level)
Documents.OperationType operationType,
OpenTelemetryAttributes response)
{
if (!DiagnosticsFilterHelper.IsSuccessfulResponse(
response.StatusCode, response.SubStatusCode) && CosmosDbEventSource.IsEnabled(EventLevel.Warning))
{
CosmosDbEventSource.Singleton.FailedRequest(response.Diagnostics.ToString());
}
else if (DiagnosticsFilterHelper.IsLatencyThresholdCrossed(
config: config,
operationType: operationType,
response: response) && CosmosDbEventSource.IsEnabled(EventLevel.Warning))
if (CosmosDbEventSource.IsEnabled(EventLevel.Warning))
{
CosmosDbEventSource.Singleton.LatencyOverThreshold(response.Diagnostics.ToString());
if (!DiagnosticsFilterHelper.IsSuccessfulResponse(
response.StatusCode, response.SubStatusCode))
{
CosmosDbEventSource.Singleton.FailedRequest(response.Diagnostics.ToString());
}
else if (DiagnosticsFilterHelper.IsLatencyThresholdCrossed(
config: config,
operationType: operationType,
response: response) ||
(config.RequestChargeThreshold is not null &&
config.RequestChargeThreshold <= response.RequestCharge) ||
(config.PayloadSizeThresholdInBytes is not null &&
DiagnosticsFilterHelper.IsPayloadSizeThresholdCrossed(
config: config,
response: response)))
{
CosmosDbEventSource.Singleton.ThresholdViolation(response.Diagnostics.ToString());
}
}
}

@@ -65,7 +74,7 @@ private void Exception(string message)
}

[Event(2, Level = EventLevel.Warning)]
private void LatencyOverThreshold(string message)
private void ThresholdViolation(string message)
{
this.WriteEvent(2, message);
}
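The events written above are only produced when something listens at Warning level. One way to consume them in-process is a plain `System.Diagnostics.Tracing.EventListener`, sketched below. The class name `CosmosDiagnosticsListener` and the callback shape are hypothetical; the sketch assumes the event source is registered under the `EventSourceName` constant shown in this diff and that the diagnostics string is the first payload item.

```csharp
using System;
using System.Diagnostics.Tracing;

// Hypothetical listener; not part of this PR.
public sealed class CosmosDiagnosticsListener : EventListener
{
    public const string SourceName = "Azure-Cosmos-Operation-Request-Diagnostics";

    private readonly Action<string, string> onEvent;

    public CosmosDiagnosticsListener(Action<string, string> onEvent)
    {
        this.onEvent = onEvent;
    }

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // May run inside the base constructor, so it must not touch instance fields.
        if (eventSource.Name == SourceName)
        {
            this.EnableEvents(eventSource, EventLevel.Warning);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventSource.Name != SourceName)
        {
            return;
        }

        // Assumption: the request diagnostics JSON is the first payload item.
        string message = eventData.Payload is { Count: > 0 }
            ? eventData.Payload[0]?.ToString()
            : null;

        this.onEvent?.Invoke(eventData.EventName, message);
    }
}
```

Keeping an instance alive for the lifetime of the application (and disposing it on shutdown) is enough to start receiving the failed-request and threshold events.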
@@ -21,28 +21,59 @@ internal static class DiagnosticsFilterHelper
OperationType operationType,
OpenTelemetryAttributes response)
{
return response.Diagnostics.GetClientElapsedTime() > DiagnosticsFilterHelper.DefaultThreshold(operationType, config);
return response.Diagnostics.GetClientElapsedTime() > DiagnosticsFilterHelper.DefaultLatencyThreshold(operationType, config);
}

/// <summary>
/// Returns true only when the request or response payload size is at least the configured threshold
/// </summary>
/// <returns>true or false</returns>
public static bool IsPayloadSizeThresholdCrossed(
CosmosThresholdOptions config,
OpenTelemetryAttributes response)
{
int requestContentLength = 0;
int responseContentLength = 0;
try
{
requestContentLength = Convert.ToInt32(response.RequestContentLength);
}
catch (Exception)
{
// Ignore, if this conversion fails for any reason.
}

try
{
responseContentLength = Convert.ToInt32(response.ResponseContentLength);
}
catch (Exception)
{
// Ignore, if this conversion fails for any reason.
}

return config.PayloadSizeThresholdInBytes <= Math.Max(requestContentLength, responseContentLength);
}

/// <summary>
/// Check if response HTTP status code is returning successful
/// </summary>
/// <returns>true or false</returns>
public static bool IsSuccessfulResponse(HttpStatusCode statusCode, int substatusCode)
public static bool IsSuccessfulResponse(HttpStatusCode statusCode, int subStatusCode)
{
return statusCode.IsSuccess()
|| (statusCode == System.Net.HttpStatusCode.NotFound && substatusCode == 0)
|| (statusCode == System.Net.HttpStatusCode.NotModified && substatusCode == 0)
|| (statusCode == System.Net.HttpStatusCode.Conflict && substatusCode == 0)
|| (statusCode == System.Net.HttpStatusCode.PreconditionFailed && substatusCode == 0);
|| (statusCode == System.Net.HttpStatusCode.NotFound && subStatusCode == 0)
|| (statusCode == System.Net.HttpStatusCode.NotModified && subStatusCode == 0)
|| (statusCode == System.Net.HttpStatusCode.Conflict && subStatusCode == 0)
|| (statusCode == System.Net.HttpStatusCode.PreconditionFailed && subStatusCode == 0);
}

/// <summary>
/// Get default threshold value based on operation type
/// Get default Latency threshold value based on operation type
/// </summary>
/// <param name="operationType"></param>
/// <param name="config"></param>
internal static TimeSpan DefaultThreshold(OperationType operationType, CosmosThresholdOptions config)
internal static TimeSpan DefaultLatencyThreshold(OperationType operationType, CosmosThresholdOptions config)
{
config ??= DiagnosticsFilterHelper.defaultThresholdOptions;
return DiagnosticsFilterHelper.IsPointOperation(operationType) ?
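The payload-size check in `IsPayloadSizeThresholdCrossed` above parses each content length inside a try/catch and falls back to 0. A possible alternative, sketched here and not what the PR does, is `int.TryParse`, which also leaves the value at 0 on failure but avoids exceptions as control flow. The sketch assumes the content lengths arrive as strings, which this diff does not confirm; `PayloadThresholdSketch` is a hypothetical name.

```csharp
using System;

// Hypothetical standalone variant of the payload-size check.
public static class PayloadThresholdSketch
{
    public static bool IsPayloadSizeThresholdCrossed(
        int? thresholdInBytes,
        string requestContentLength,
        string responseContentLength)
    {
        // A null threshold means the check is not configured.
        if (thresholdInBytes is null)
        {
            return false;
        }

        // TryParse sets the out value to 0 when parsing fails (null, empty, non-numeric).
        int.TryParse(requestContentLength, out int requestBytes);
        int.TryParse(responseContentLength, out int responseBytes);

        return thresholdInBytes <= Math.Max(requestBytes, responseBytes);
    }
}
```

As in the PR, the larger of the two payloads is compared against the threshold, so either a large request or a large response is enough to trigger detailed diagnostics.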