Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turn off spans from a scope while retaining downstream spans and without breaking traces #3867

Closed
jack-berg opened this issue Feb 7, 2024 · 16 comments · Fixed by #3877
Closed
Assignees
Labels
spec:trace Related to the specification/trace directory triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal

Comments

@jack-berg
Copy link
Member

Suppose you have a span hierarchy of ({tracerName}:{spanKind}:{spanName}):

tracerA:server:parent -> tracerB:internal:child -> tracerC:client:grandchild

And you decide that internal spans from tracerB are too noisy. You want to turn them off while:

  1. Not losing downstream spans. You want to keep tracerC:client:grandchild because it contains important information about calls to downstream systems.
  2. Not producing broken traces. tracerC:client:grandchild's parent should be tracerA:server:parent, not tracerB:internal:child.

How can you accomplish this? Well, you pretty much can't. A couple of problems prevent it:

  • Samplers dont have access to scope, so there's no SDK configuration mechanism which can do this.
  • Even if Samplers could make decisions based on scope, the sampling decision options don't make the desired behavior easy.
    • If a sampler returns DROP for tracerB and delegates to ParentBased{root=AlwaysOn} for other spans: Then all descendants of tracerB:internal:child will be dropped as well. Not what we want.
    • If a sampler returns DROP for tracerB and delegates to ParentBased{root=AlwaysOn,localParentNotSampled=AlwaysOn} for other spans: Then we end up with a broken trace since tracerC:client:grandchild will assign its parent to be tracerB:internal:child, which is dropped. Not what we want.
    • If a sampler returns DROP for tracerB and delegates to ParentBased{root=AlwaysOn,localParentNotSampled=AlwaysOn} and instrumentation is aware and only sets spans in context if Span.isRecording()=true: Then we achieve the desired affect. tracerB:internal:child is dropped and tracerC:client:grandchild will assign its parent to be tracerA:server:parent, so we have a complete trace.
  • Even if Samplers could make decisions based on scope, its cumbersome to configure a sampler to have special rules for a scope. It's simpler to think of a sampling policy as orthogonal to the decision of which scopes / instrumentation libraries should be enabled.

So there is no way to get the behavior we want today. And even if samplers were given access to scope, we'd still need: a complex / non-standard sampling policy and coordination with instrumentation (i.e. checking Span.isRecording()==true). Ouch.

This is related to #530 and #1588, but the problem framing here is a little different so I think it warrants a separate issue.

Originally opened as open-telemetry/opentelemetry-java#6197 but think we should provide a spec level solution for this.

@jack-berg jack-berg added the spec:trace Related to the specification/trace directory label Feb 7, 2024
@svrnm
Copy link
Member

svrnm commented Feb 7, 2024

@jack-berg, do you think this issue has some relationship with this one? #3205 -- both issues seem to look for a way to remove noise intermediate spans without loosing the span tree.

@jack-berg
Copy link
Member Author

I propose we solve this problem by starting to more fully realize the potential of scope as a concept. Today, scope is just a piece of metadata attached to records because there is no SDK level configuration surface area involving scopes (besides views) which allow you to select or do anything special with a scope.

I propose we introduce the the notion of scope level configuration for all three signals:

  • Give each provider (TracerProvider, MeterProvider, LoggerProvider) the ability to register configuration which applies scope configuration to some selection of matching scopes. Let scopes be selected using a combination of scope name, version, schema, and attributes.
  • When evaluating the scope config for a Tracer / Meter / Logger, use the first matching scope config.
  • Define each signal with a unique definition of the types of things which are configurable at the scope level for the respective Tracer, Meter, Logger, but introduce uniformity where it makes sense. This reflects that there will be a mixture of overlapping and signal specific scope configuration options.
  • To start, give each signal's scope config the ability specify that the scope is disabled. In the future, we might consider introducing scope specific samplers for tracing, or scope specific log level thresholds for logs, etc.
  • When a scope is disabled, the tracer / meter / logger uses the behavior of the relevant noop tracer / meter / logger.
  • For tracing, the noop tracer returns Span.wrap(parentContext). This means that disabling a tracer is an effective solution to the broken trace problem discussed in this issue.

The more I think about #1588, the more I think that extending sampler to include access to resource and scope is a leaky abstraction. Sure a sampler could be used to change the behavior of scopes, but if you play that out, we'll end up with samplers with very complicated rules. It seems much cleaner to separate concerns, with a global sampler policy applicable to all scopes, and separate scope level configuration.

We really need to start to realize the potential of the scope concept. This proposal would lay the groundwork for that by introducing the simplest possible scope level configuration (enabled | disabled) while solving a very concrete use case.

@jack-berg
Copy link
Member Author

@jack-berg, do you think this issue has some relationship with this one? #3205 -- both issues seem to look for a way to remove noise intermediate spans without loosing the span tree.

@svrnm yes I do. I think right now we only have a hammer (i.e. sampler) so everything looks like a nail. If we had a scope level configuration, it would be a no brainer to add a severity level threshold option for logs, and we probably wouldn't nearly as interested in log sampling.

@svrnm
Copy link
Member

svrnm commented Feb 7, 2024

Thanks, @jack-berg! Utilizing scope to accomplish that kind of "span-level sampling" (for a lack of a better word) is really an interesting concept. It would solve the issue(s) discussed in #3205 and probably a few more.

I just try to wrap my head around with an example:

  • Service A uses HTTP library H to call service B.
  • In Service A there are 2 scopes, one for the "app" and one for the http library, in service B there is one scope.
  • So similar to your example above there are now five spans
    Server Span 1 in scope A -> Internal Span 2 in scope A -> Internal Span 3 in scope H -> Client Span 4 in scope H -> Server Span 5 in scope B
    
  • I do not care about span 2 + 3 (which may be multiple internal spans)
  • If I follow what you have outlined above the current scoping A, H, B would not work so I would need to put those internal spans into their own scope (e.g. "internal A scope" and "internal H scope").

Is this a correct understanding?

@jack-berg
Copy link
Member Author

Yeah you're correct. In my experience most instrumentation produces a single tier of spans. I know the idea of a single instrumentation scope producing both server + internal spans, or internal + client spans exists, but maybe those instrumentations would rethink that position if the SDK gave them tools that gave them an incentive to do so. Or maybe there is a need for additional scope level configuration beyond just the enabled or disabled.

I think adding the ability to enable / disable scopes is doing the simplest possible thing while addressing a glaring omission in SDK configuration. Its the low hanging fruit. After that, we can take stock and determine what else is missing.

@svrnm
Copy link
Member

svrnm commented Feb 8, 2024

Or maybe there is a need for additional scope level configuration beyond just the enabled or disabled.

I think disable/enable is a great starting point, and I see another usecase: there are plenty of questions around disabling individual instrumentation libraries. Of course one could argue that not loading them in the first place is what you should be doing, but if you start with a Java Agent, or a Node JS metapackage that has ALL the instrumentation libraries in it, it can be easier to disable those you don't care about. Now, if an instrumentation library has a well-defined scope, you could simply turn that off, and you are done.

The screenshot in open-telemetry/opentelemetry-demo#659 is a good example of this use case. Of course boiling this down to the 3 spans you "really" need is a good solution, but what if I suspect an issue in Slim... or Psr... or I am a library developer and want to get that level of insight, if I can turn those scopes on/off, I have that level of flexibility without compromising on span creation.

@pyohannes
Copy link
Contributor

pyohannes commented Feb 8, 2024

It would be worth getting @lmolkova's input, she has been (and is) thinking about this problems. The problems she highlights are mostly focusing on nested client spans (a client span parenting another client span), but I think the solution proposed here could also apply to those cases.

Using the scope mechanism is a much more promising approach then inventing a new mechanism, as proposed in the OTEP linked above.

@lmolkova
Copy link
Contributor

lmolkova commented Feb 8, 2024

I'd like to add that .NET OTel has an existing mechanism for it.

Let's say instrumentation creates a scope (via ActivitySource aka tracer) called My.Source.
When you want to enable this instrumentation, you enable a source via traceProvider.AddSource("My.Source").

Let's say library also creates a scope called My.Source.LowLevelVerboseStuff. Users may (but don't need to) enable it as well, but separately.

If I want to enable all sources from a library, I can do it with traceProvider.AddSource("My.Source*") (which introduces some small but solvable problems like this one)

The same mechanism is used for meters.

Effectively source name acts like a logger name where you can control loggers by their namespaces.

The changes in the spec that I'd like to see to generalize this approach

  • OTel SDK should allow enabling/disabling specific instrumentation scopes (by name?) for traces, metrics, and logs
  • Instrumentation scope within one library/instrumentation should be granular enough to allow controlling scopes independently
    • Instrumentations MUST document the scopes they emit
    • Instrumentations SHOULD use the same scopes names across signals for the same instrumentation points (e.g. http spans and http metrics)
    • [Update] It could be a good idea for semconv to recommend a specific set of scopes
  • If instrumentation scope is disabled
    • OTel SDK MUST return non-recording span (null, or other language-specific convention can be used)
    • MUST NOT create new span context
    • MUST NOT change active/current context
  • OTel API SHOULD provide a way for instrumentations to check if scope is enabled before collecting pre-sampling attributes. This is necessary to allow writing verbose layers that also affect performance.

Nice-to-haves

  • I'd be great to introduce SamplingResult.Drop decision as well (plus pass instrumentation scope to samplers) - it should result in the same behavior as if the whole scope was disabled but enable more granular per-span suppression

Per-scope enablement vs verbosity

As I mentioned before per-scope enablement is like a logger configuration, but there are two modes: on and off.
I think we can add more modes later by introducing a new property on the source (verbosity or similar) with some default value and adding config options to enable all scopes >= x verbosity.
But this would be an incremental improvement.

In the meantime we can just ask instrumentations to document their stuff and users to read docs. Distros would also need to pick good defaults.

@lmolkova
Copy link
Contributor

lmolkova commented Feb 8, 2024

One more example where this could be extremely useful is messaging where we have

  • per-message span to inject unique context and make each message traceable independently
  • send span tracing send operation for a batch of messages.

This creates problems for some users:

  • Many applications always send one message in a batch even though internally SDK treats everything as a batch.
  • Or they don't really need per-message tracing (when doing stream processing/aggregation)
  • Or their batches are so big that per-message tracing becomes prohibitively expensive (perf and volume of spans)

The suppression mechanism could allow users to decide if they want per-message tracing and enable it then.
Or if they are fine if all messages share the same span context (in which they were sent).

This brings one more small requirement (which is probably already covered, but maybe worth mentioning anyway):

  • instrumentations SHOULD be able to access and inject the context (theirs or their parent's) even if the span they created is suppressed/disabled

@svrnm
Copy link
Member

svrnm commented Feb 9, 2024

Thank you @lmolkova for this detailed plan on how to tackle this issue. A question, it looks like this supersedes #3205? For the sake of having less issues, I am happy to close it then.

@LikeTheSalad
Copy link

I think disable/enable is a great starting point, and I see another usecase: there are plenty of questions around disabling individual instrumentation libraries.

This is a use case I'm dealing with right now where it could be very useful to disable/enable a specific library instrumentation dynamically. My use case involves remote configuration for Android applications where users can decide what to disable/enable after they've deployed their app, and since library instrumentations are added at compile time in Android, unloading an instrumentation at runtime is not an option, so I think this proposal can help addressing that use case in a clean way too.

@lmolkova
Copy link
Contributor

lmolkova commented Feb 9, 2024

Thank you @lmolkova for this detailed plan on how to tackle this issue. A question, it looks like this supersedes #3205? For the sake of having less issues, I am happy to close it then.

@svrnm
I believe this issue only covers on/off, but not the formal definition of verbosity per se. If this one is resolved, #3205 would become much more straightforward, but still not covered.
So I'd rather keep #3205 open (unless you feel that on/off is good enough).

@tsloughter
Copy link
Member

I think this is also related to some stuff I'd been kicking around for years now,

#2555
#3103
open-telemetry/oteps#186

I planned to attempt again to find a solution in a rebooted Convenience SIG/WG. At least I think it would fall under that SIG/WG.

Is there any interest for tackling this there?

@jack-berg jack-berg added the triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal label Feb 14, 2024
@jmacd
Copy link
Contributor

jmacd commented Feb 14, 2024

This approach sounds good to me. Note that this approach does not exclude Sampler-related mechanisms that might solve the problem of broken traces due to sampled-out intermediate spans.

@jack-berg
Copy link
Member Author

@tsloughter my immediate interest is to add the simplest possible scope level config mechanism across signals, by just providing a way to enable / disable scopes. I think that once a scope level config mechanism exists, its more straight forward to pursue something like #2555, but also, the need diminishes because at least part of the motivation is already accommodated by disabling scopes.

I think that #3103 is somewhat related, but should be separately pursued.

As for open-telemetry/oteps#186, there's definitely overlapping motivation, but I question the value of having all of the tracerprovider configuration options configurable at the tracer level. I think it makes sense to consider each property on a case by case basis. This will be easier to do once a scope level config mechanism exists.

@jack-berg
Copy link
Member Author

No sure if folks have seen this, but opened PR #3877 (with java prototype here) which largely reflects @lmolkova's proposal. Please review if you're interested in seeing this move forward. Thanks!

jack-berg added a commit that referenced this issue Apr 8, 2024
Fixes #3867.

This introduces the concept of scope-level configuration for Tracer,
Meter, and Logger. Notes:

- TracerProvider, MeterProvider, and LoggerProvider have a scope config
provider function, which accepts an `InstrumentationScope` and returns
the relevant scope config. This function is invoked when a Tracer,
Meter, Logger is created, and the resulting scope config dictates the
behavior of resulting Tracers, Meters, Loggers. Computing the config on
scope creation avoids doing extra work on the hot path.
- If TracerProvider, MeterProvider, LoggerProvider supports updating
configuration, the function can be updated, at which point the scope
config of outstanding Tracer, Meter, Logger will be re-computed. This
accommodates emerging dynamic config use cases.
- A function as the mechanism to determine relevant scope config to
maximize flexibility. However, implementations are encourages to provide
syntactic sugar / helpers for accommodating popular use cases: select
one or more scopes by name or with pattern matching, disable one or more
selected scopes, disable all scopes by default and enable one or more
selective scopes.
- Scope config is simple at the moment, consisting of on the ability to
enable / disable a particular Tracer, Meter, Logger. When a disabled,
use the respective noop Tracer, Meter, Logger.
- Scope config parameters are expected to evolve over time. When it
makes sense, there should be consistency across signals.
- Java prototype available for this proposal:
open-telemetry/opentelemetry-java#6201

Leaving as a draft for now because its not ready to merge, but I am
actively soliciting feedback. If / when we agree on direction, I'll go
back and add the appropriate experimental indications.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:trace Related to the specification/trace directory triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants