
Contexts#wrap(ExecutorService) doesn't propagate OTEL span #8722

Closed
boris-senapt opened this issue May 3, 2024 · 7 comments · Fixed by #8742

@boris-senapt

Using Helidon SE without the Java agent, I tried to propagate context information to an async task using io.helidon.common.context.Contexts#wrap(java.util.concurrent.ExecutorService), but the OTEL span information was dropped.

This resulted in the async task's traces being uncorrelated with the original request. Double wrapping resolves the immediate issue, but it is counter-intuitive and breaks the abstraction Helidon Tracing provides:

Contexts.wrap(Context.taskWrapping(executor))
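
For reference, a more complete version of that double wrap looks roughly like this (a minimal sketch; the plain thread pool and variable names are assumptions, not code taken from the application):

    import io.helidon.common.context.Contexts;
    import io.opentelemetry.context.Context;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Plain executor used for the async work.
    ExecutorService executor = Executors.newFixedThreadPool(2);

    // The outer wrap propagates the Helidon Context (and its DataPropagationProviders);
    // the inner OTEL wrap is the extra step that keeps the current span across the thread hop.
    ExecutorService wrapped = Contexts.wrap(Context.taskWrapping(executor));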

Environment Details

  • Helidon Version: 4.0.7
  • Helidon SE
  • JDK version: GraalVM CE 22+36.1 (build 22+36-jvmci-b02)
  • OS: macOS Sonoma 14.4.1
  • Docker version (if applicable): N/A

Problem Description

I have a request that spawns a long-running async task on an ExecutorService. When the request comes in, I call executor.submit(Runnable).

I expect this async task to have the same traceId as the original request, so I wrap the ExecutorService in Contexts.wrap. This claims to propagate context across threads, which I take to mean the tracing context as well as other contexts.
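
A minimal sketch of that expected, Helidon-only usage (the executor setup and task body here are illustrative, not taken from the application):

    import io.helidon.common.context.Contexts;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Wrap once with Helidon; the expectation is that this alone carries the
    // request's context - including tracing - onto the worker thread.
    ExecutorService executor = Contexts.wrap(Executors.newFixedThreadPool(2));

    // Submitted while handling the request; the task should share the request's traceId.
    executor.submit(() -> {
        // ... long-running async work ...
    });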

However, in the traces and in the logs the trace id is not the same - a new trace is generated. On investigation, there is only one implementation of io.helidon.common.context.spi.DataPropagationProvider, and it propagates only the Log4j2 MDC.

Adding a second wrapping with the OTEL API - io.opentelemetry.context.Context#taskWrapping(java.util.concurrent.ExecutorService) - resolves the issue, but it requires direct usage of the OTEL API, breaking the Helidon Tracing abstraction layer.
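
For illustration only - this is not the eventual Helidon fix - bridging that gap without touching application code would mean another DataPropagationProvider that carries the OTEL context, roughly along these lines (a sketch that assumes the SPI contract of data() on the submitting thread and propagateData()/clearData() around the task on the worker thread, plus registration through the standard ServiceLoader mechanism):

    import io.helidon.common.context.spi.DataPropagationProvider;
    import io.opentelemetry.context.Context;
    import io.opentelemetry.context.Scope;

    public class OtelContextPropagator implements DataPropagationProvider<OtelContextPropagator.Captured> {

        // Pairs the OTEL context captured on the submitting thread with the scope
        // opened on the worker thread so it can be closed again afterwards.
        public static class Captured {
            private final Context context;
            private Scope scope;

            Captured(Context context) {
                this.context = context;
            }
        }

        @Override
        public Captured data() {
            // Called on the submitting thread: capture the current OTEL context.
            return new Captured(Context.current());
        }

        @Override
        public void propagateData(Captured data) {
            // Called on the worker thread before the task runs: make the captured context current.
            data.scope = data.context.makeCurrent();
        }

        @Override
        public void clearData(Captured data) {
            // Called after the task completes: close the scope to restore the previous context.
            if (data.scope != null) {
                data.scope.close();
            }
        }
    }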

@github-actions github-actions bot added this to Triage in Backlog May 3, 2024
@barchetta barchetta added tracing 4.x Version 4.x labels May 6, 2024
@tjquinno (Member) commented May 7, 2024

@boris-senapt Thanks for the report.

Can you provide a simple reproducer project that shows this? If not, I'll start creating one, but we'd prefer one from you so we are sure we are looking at the specific problem you are seeing.

@boris-senapt (Author)

Sure thing - should be able to get you a reproducer tomorrow.

@boris-senapt (Author)

@tjquinno I have pushed a recreate up, apologies for the delay

https://github.com/boris-senapt/helidon-trace-propagation-recreate

@tjquinno (Member) commented May 9, 2024

Thanks very much for the reproducer.

I am seeing the same logging output that you are, but it looks to me as if the span context is actually being propagated properly.

I made the following small change to the pingAsync method in PingService to explicitly display the span and trace ID from the current context's span context, if present. (There are two added imports not shown here):

@@ -24,7 +26,11 @@ public class PingService implements HttpService {
     private void pingAsync(final ServerRequest serverRequest, final ServerResponse serverResponse) {
         LOGGER.info("Async ping request received");
         executor.submit(() -> {
-            LOGGER.info("Running async ping request");
+            Optional<SpanContext> spanContext = Contexts.context().flatMap(c -> c.get(SpanContext.class));
+            String msg = String.format("In async ping request: span_id=%s, trace_id=%s",
+                                       spanContext.map(SpanContext::spanId).orElse("n/a"),
+                                       spanContext.map(SpanContext::traceId).orElse("n/a"));
+            LOGGER.info("Running async ping request - " +  msg);
         });
         serverResponse.send("pong");
     }

and here is the output I see:

12:52:33.165 [[0x6b79605e 0x1163ba20] WebServer socket] INFO  uk.co.borismorris.PingService {span_id=3baae8fd9de6851b, trace_flags=01, trace_id=8d901da0b6aaa22dbc47a5b01c30a126} - Async ping request received
12:52:33.167 [] INFO  uk.co.borismorris.PingService {} - Running async ping request - In async ping request: span_id=3baae8fd9de6851b, trace_id=8d901da0b6aaa22dbc47a5b01c30a126
12:52:33.170 [[0x6b79605e 0x1163ba20] WebServer socket] INFO  io.opentelemetry.exporter.logging.LoggingSpanExporter {span_id=3baae8fd9de6851b, trace_flags=01, trace_id=8d901da0b6aaa22dbc47a5b01c30a126} - 'content-write' : 8d901da0b6aaa22dbc47a5b01c30a126 5b2216a5b39de3c1 INTERNAL [tracer: helidon-service:] {}
12:52:33.174 [[0x6b79605e 0x1163ba20] WebServer socket] INFO  io.opentelemetry.exporter.logging.LoggingSpanExporter {span_id=3baae8fd9de6851b, trace_flags=01, trace_id=8d901da0b6aaa22dbc47a5b01c30a126} - 'HTTP Request' : 8d901da0b6aaa22dbc47a5b01c30a126 3baae8fd9de6851b SERVER [tracer: helidon-service:] AttributesMap{data={http.url=http://localhost:8080/ping-async, http.status_code=200, http.method=GET, http.version=1.1, component=helidon-webserver}, capacity=128, totalAddedValues=5}

Although the log message prefix from the async code does not contain the span and trace IDs, the message built using the retrieved span context does show those IDs, and they match the IDs logged by the code that receives the request.

Maybe the logging/MDC is not working quite correctly with the span context in this scenario.

Based on what I've seen, I'm going to recategorize this as a logging issue rather than a tracing one. Someone else might carry on from here, as that's not an area I'm as familiar with.

@tjquinno tjquinno removed their assignment May 9, 2024
@tjquinno tjquinno added logging and removed tracing labels May 9, 2024
@boris-senapt (Author)

Interesting - so it might be that the Helidon Tracing context has kept the span in "scope" but the underlying OTEL framework hasn't; I suppose some battle of thread locals is happening.

The logging uses opentelemetry-log4j-context-data-2.17-autoconfigure, whose provider implementation is https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/log4j/log4j-context-data/log4j-context-data-2.17/library-autoconfigure/src/main/java/io/opentelemetry/instrumentation/log4j/contextdata/v2_17/OpenTelemetryContextDataProvider.java

This uses io.opentelemetry.context.Context.current() to get the context - which is presumably returning a context with no current span.
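
That is easy to confirm from inside the submitted task by comparing the two views (a sketch; executor here stands for the wrapped executor, and the printed messages are illustrative):

    import io.helidon.common.context.Contexts;
    import io.helidon.tracing.SpanContext;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.context.Context;

    executor.submit(() -> {
        // Helidon's own context still carries the span context across Contexts.wrap...
        Contexts.context()
                .flatMap(c -> c.get(SpanContext.class))
                .ifPresentOrElse(
                        sc -> System.out.println("Helidon trace_id=" + sc.traceId()),
                        () -> System.out.println("no Helidon SpanContext"));

        // ...but the OTEL current context has no valid span, which is what the Log4j
        // context data provider (and any OTEL-instrumented client such as OkHttp) sees.
        boolean otelSpanValid = Span.fromContext(Context.current()).getSpanContext().isValid();
        System.out.println("OTEL current span valid: " + otelSpanValid);
    });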


The implications are a bit deeper than just logging, however - the reason I came across the issue is that we use OkHttp to make calls to other services, and those calls were also losing their traces.

i.e. whilst Helidon seems to keep its SpanContext, OTEL has lost its own, and so all of the OTEL instrumentation stops working.

@tjquinno (Member) commented May 9, 2024

Your note sent me back to the revised test project and, while Contexts.context().get().get(SpanContext.class) in the async code does find the correct span context there, there is no current span in scope - which is undoubtedly where the logging looks, and probably why the propagation chain is not maintained from there forward.

More digging to do on my end.

@tjquinno tjquinno self-assigned this May 9, 2024
@tjquinno tjquinno added tracing bug Something isn't working P2 and removed logging labels May 9, 2024
@tjquinno tjquinno moved this from Triage to Sprint Scope in Backlog May 9, 2024
@tjquinno tjquinno added this to the 4.0.9 milestone May 9, 2024
@tjquinno (Member) commented May 9, 2024

Definitely a bug in tracing. PR forthcoming.

Backlog automation moved this from Sprint Scope to Closed May 17, 2024