Experimental support for Apollo tracing over OTLP #4982

Open: wants to merge 91 commits into base: dev

Commits (91):
5cf5750
Add a new configuration for setting the tracing protocol for Apollo S…
timbotnik Apr 18, 2024
d1024dc
Modify some exports that we’ll need
timbotnik Apr 18, 2024
360f8fb
Add the OTLP path
timbotnik Apr 18, 2024
376f480
Fix up some loose ends, review notes
timbotnik Apr 18, 2024
f4973e4
fixes
bnjjj Apr 19, 2024
2d5c59b
Turn the ApollOtlpExporter into a SpanExporter, try direct ref instea…
timbotnik Apr 23, 2024
738e2b3
Use interior mutability with Arcs to improve lifetimes
timbotnik Apr 23, 2024
0728090
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik Apr 23, 2024
e35b241
Add shutdown handling
timbotnik Apr 24, 2024
2004a6e
Code cleanup
timbotnik Apr 24, 2024
0730668
Fix an issue where we’d be stealing the spans away from the Apollo ex…
timbotnik Apr 24, 2024
2ce4d1a
Prepare SpanData’s during collection, prevents making additional copi…
timbotnik Apr 24, 2024
9a5fa87
Introduce “OTLP” only option, refactor to support peek vs. pop on the…
timbotnik Apr 25, 2024
51c43bf
Run cargo fmt
timbotnik Apr 25, 2024
e7ac75d
Run xtask lint --fmt
timbotnik Apr 25, 2024
6b08eb8
Manual lint fixes
timbotnik Apr 25, 2024
c418e2a
Clippy is sometimes wrong
timbotnik Apr 25, 2024
02246e8
Stop filtering spans for now, can revisit this later.
timbotnik Apr 25, 2024
8837262
Turn off compression temporarily till we enable on the collector
timbotnik Apr 25, 2024
c487ffa
Move ROUTER_ID to a higher order crate
timbotnik Apr 26, 2024
cd3328f
Add attribute-level filtering support for OTel spans.
timbotnik May 1, 2024
9ae6b1c
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 17, 2024
b40e3f4
Review notes: use parking lot mutex
timbotnik May 17, 2024
c6730d7
Add “subscribe” span to the allow list.
timbotnik May 17, 2024
89a6abe
Review notes: use reference instead of clone in shutdown
timbotnik May 17, 2024
ae0b3c3
Review notes: use expect instead of unwrap
timbotnik May 17, 2024
08d3eb7
Lint
timbotnik May 17, 2024
2b838c9
cargo fmt
timbotnik May 17, 2024
73e31b7
Update the experimental config to use a percentage rollout style flag
timbotnik May 17, 2024
13d93a7
Use more consistent naming
timbotnik May 19, 2024
3b573b2
Update snapshot test
timbotnik May 19, 2024
d352ed4
Simplify some of our boolean logic; also, don’t send if there is not…
timbotnik May 20, 2024
59a1b3e
Add config option for OTLP tracing protocol (HTTP v. GRPC)
timbotnik May 20, 2024
ebc9b83
Add integration tests for Otel traces
timbotnik May 20, 2024
b6eb5b5
Updated with new snapshots
timbotnik May 20, 2024
588509c
Formatting
timbotnik May 20, 2024
a1bb06a
Formatting
timbotnik May 20, 2024
03b45e1
Exclude snapshots from gitleaks
timbotnik May 20, 2024
be3b24c
Attempt to redact some high entropy values in snapshots
timbotnik May 21, 2024
0b0656c
Formatting
timbotnik May 21, 2024
a9d41ae
Try awaiting the task abort to ensure that the address is unbound.
timbotnik May 21, 2024
68da5df
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 21, 2024
445e1a3
Workaround: redact all attribute values in OTel snapshots
timbotnik May 22, 2024
99c7bbc
Refactor: remove features only needed for “both” scenario
timbotnik May 22, 2024
654a637
Feature: redact errors if needed in OTel path; also track error count…
timbotnik May 22, 2024
12aaca6
Formatting
timbotnik May 22, 2024
03a7c85
Gitleaks: ignore commits
timbotnik May 22, 2024
2d83349
Formatting
timbotnik May 22, 2024
8395532
Another gitleaks attempt
timbotnik May 22, 2024
93e9b50
Clean up some TODOs
timbotnik May 23, 2024
450b521
Remove an unnecessary clone of attributes
timbotnik May 23, 2024
b7197f7
Track the original span status through the cache as well
timbotnik May 23, 2024
ce3a0ee
Add more span export translation logic for subscriptions and errors
timbotnik May 23, 2024
2a1ef25
fix fp secret detection?
peakematt May 23, 2024
52d29a7
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 23, 2024
0f552ea
Merge branch 'timbotnik/apollo-otlp/initial-support' of github.com:ap…
timbotnik May 23, 2024
86b22fa
Update snapshots with code
timbotnik May 23, 2024
2fd1ecd
Fix http headers for reports
timbotnik May 24, 2024
ee4b8ec
Move back to dynamic port for integration tests
timbotnik May 24, 2024
d6d9cb2
Implement synthetic spans for subscription events
timbotnik May 24, 2024
4949652
Include more span names
timbotnik May 24, 2024
46cf2d5
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 24, 2024
28f64eb
Merge branch 'dev' of github.com:apollographql/router into timbotnik/…
bnjjj May 24, 2024
08000bd
Refactor: now that we only send one or the other, we don’t need to “j…
timbotnik May 27, 2024
91780a8
Analytics: measure the # of traces sent via OTLP vs. Apollo reporting
timbotnik May 27, 2024
a006989
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 27, 2024
5a74d6d
Drop SUBSCRIPTION_EVENT spans for now.
timbotnik May 27, 2024
51a720f
Add a changeset
timbotnik May 27, 2024
1c4fb92
Merge branch 'timbotnik/apollo-otlp/initial-support' of github.com:ap…
bnjjj May 27, 2024
ba92780
fix snapshot redactions
bnjjj May 27, 2024
bad69e3
Mimic trace filtering of operations without a signature attribute.
timbotnik May 28, 2024
5f80df4
Merge branch 'timbotnik/apollo-otlp/initial-support' of github.com:ap…
timbotnik May 28, 2024
a130b5e
Integration tests: solidify redactions
timbotnik May 28, 2024
d2f4b05
Refactor: move duplicate tracing fixtures to a tracing_common module
timbotnik May 28, 2024
08bb07c
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 28, 2024
2027c95
TBD cleanups
timbotnik May 28, 2024
b2b66c2
fix gzip compression issue
bnjjj May 29, 2024
5d87f3f
add comment
bnjjj May 29, 2024
99bb0c0
Allow apollo.telemetry metrics to be sent
timbotnik May 30, 2024
81a7e75
Merge branch 'timbotnik/apollo-otlp/initial-support' of github.com:ap…
timbotnik May 30, 2024
6e15217
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 30, 2024
2b6739b
Rename operation subtype attribute to apollo_private.operation.subtype
timbotnik May 30, 2024
3cb48ec
Promote some span names to be exported from their respective usage sites
timbotnik May 30, 2024
86ec930
Discard subscription events for now.
timbotnik May 30, 2024
671fcca
Clean up unit tests
timbotnik May 30, 2024
9c6debb
Remove superfluous enum ApolloTracingProtocol now that we are using a…
timbotnik May 30, 2024
389b27e
Exclude experimental_otlp_tracing_protocol from json schema
timbotnik May 30, 2024
cc911db
Removing the last TBD since so far we haven’t seen a need for another…
timbotnik May 30, 2024
940c166
Merge remote-tracking branch 'origin/dev' into timbotnik/apollo-otlp/…
timbotnik May 30, 2024
e19ed23
Revert "Exclude experimental_otlp_tracing_protocol from json schema"
timbotnik May 31, 2024
71a03f6
Merge branch 'dev' into timbotnik/apollo-otlp/initial-support
bnjjj Jun 3, 2024
41 changes: 41 additions & 0 deletions .changesets/feat_timbotnik_experimental_studio_otlp_tracing.md
@@ -0,0 +1,41 @@
### Add experimental support for sending traces to Studio via OTLP ([PR #4982](https://github.com/apollographql/router/pull/4982))

As the ecosystem around OpenTelemetry (OTel) has been expanding rapidly, we are evaluating a migration of Apollo's internal
tracing system to use an OTel-based protocol.

In the short-term, benefits include:

- A comprehensive way to visualize the Router execution path in Studio.
- Additional spans that were previously not included in Studio traces, such as query parsing, planning, execution, and more.
- Additional metadata such as subgraph fetch details, Router idle / busy timing, and more.

Long-term, we see this as a strategic enhancement that consolidates these two disparate tracing systems.
This will pave the way for future enhancements to more easily plug into the Studio trace visualizer.

#### Configuration

This change adds a new configuration option `experimental_otlp_tracing_sampler`. This can be used to send
a percentage of traces via OTLP instead of the native Apollo Usage Reporting protocol. Supported values:

- `always_off` (default): send all traces via Apollo Usage Reporting protocol.
- `always_on`: send all traces via OTLP.
- `0.0 - 1.0`: the ratio of traces to send via OTLP (0.5 = 50 / 50).

Note that this sampler is applied only _after_ the common tracing sampler, so it selects among traces that the common sampler has already kept. For example:

#### Sample 1% of traces, send all traces via OTLP:

```yaml
telemetry:
  apollo:
    # Send all traces via OTLP
    experimental_otlp_tracing_sampler: always_on

  exporters:
    tracing:
      common:
        # Sample traces at 1% of all traffic
        sampler: 0.01
```
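
The two-stage decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Router's implementation: `OtlpSampler`, `Protocol`, and `choose_protocol` are hypothetical names, and the two `*_draw` arguments stand in for independent uniform random values in `[0, 1)` supplied by the caller.

```rust
/// Hypothetical stand-in for the values accepted by
/// `experimental_otlp_tracing_sampler`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum OtlpSampler {
    AlwaysOff,
    AlwaysOn,
    Ratio(f64),
}

/// Hypothetical label for the protocol a sampled trace is sent with.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Protocol {
    ApolloUsageReporting,
    Otlp,
}

/// Returns `None` when the common sampler drops the trace, otherwise the
/// protocol chosen for it. The two draws are assumed to be independent, so
/// the OTLP ratio applies to the traces that survived the first stage.
fn choose_protocol(
    common_draw: f64,
    otlp_draw: f64,
    common_ratio: f64,
    otlp: OtlpSampler,
) -> Option<Protocol> {
    // Stage 1: the common tracing sampler decides whether the trace is kept.
    if common_draw >= common_ratio {
        return None;
    }
    // Stage 2: the OTLP sampler picks the protocol for a kept trace.
    let use_otlp = match otlp {
        OtlpSampler::AlwaysOff => false,
        OtlpSampler::AlwaysOn => true,
        OtlpSampler::Ratio(ratio) => otlp_draw < ratio,
    };
    Some(if use_otlp {
        Protocol::Otlp
    } else {
        Protocol::ApolloUsageReporting
    })
}
```

With `common_ratio = 0.01` and `OtlpSampler::AlwaysOn`, roughly 1% of requests produce a trace and every one of those is sent via OTLP, which matches the configuration shown above.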

By [@timbotnik](https://github.com/timbotnik) in https://github.com/apollographql/router/pull/4982
3 changes: 2 additions & 1 deletion .gitleaks.toml
@@ -14,6 +14,7 @@

paths = [
    '''^apollo-router\/src\/.+\/testdata\/.+''',
    '''^apollo-router\/tests\/snapshots\/apollo_otel_traces__.+\.snap$''',
]

[[ rules ]]
@@ -97,4 +98,4 @@
paths = [
    '''^apollo-router\/src\/.+\/testdata\/.+''',
    '''^examples/telemetry/jaeger-collector.router.yaml$'''
- ]
+ ]
1 change: 1 addition & 0 deletions apollo-router/Cargo.toml
@@ -169,6 +169,7 @@ opentelemetry-otlp = { version = "0.13.0", default-features = false, features =
    "http-proto",
    "metrics",
    "reqwest-client",
    "trace"
] }
opentelemetry-semantic-conventions = "0.12.0"
opentelemetry-zipkin = { version = "0.18.0", default-features = false, features = [
Original file line number Diff line number Diff line change
@@ -1579,6 +1579,14 @@ expression: "&schema"
  "description": "The Apollo Studio endpoint for exporting traces and metrics.",
  "type": "string"
},
"experimental_otlp_tracing_protocol": {
  "$ref": "#/definitions/Protocol",
  "description": "#/definitions/Protocol"
},
"experimental_otlp_tracing_sampler": {
  "$ref": "#/definitions/SamplerOption",
  "description": "#/definitions/SamplerOption"
},
"field_level_instrumentation_sampler": {
  "$ref": "#/definitions/SamplerOption",
  "description": "#/definitions/SamplerOption"
2 changes: 1 addition & 1 deletion apollo-router/src/metrics/filter.rs
@@ -94,7 +94,7 @@ impl FilterMeterProvider {
    .delegate(delegate)
    .allow(
        Regex::new(
-           r"apollo\.(graphos\.cloud|router\.(operations?|lifecycle|config|schema|query|query_planning))(\..*|$)",
+           r"apollo\.(graphos\.cloud|router\.(operations?|lifecycle|config|schema|query|query_planning|telemetry))(\..*|$)",
        )
        .expect("regex should have been valid"),
    )
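
To illustrate what the widened allow list admits, here is a hand-rolled sketch of the same check without the `regex` crate. It is an approximation under stated assumptions: `ALLOWED_PREFIXES` and `is_allowed` are hypothetical names, and the match is anchored at the start of the metric name, whereas the pattern in the diff has no leading anchor.

```rust
/// Every prefix the updated pattern accepts, with `operations?` expanded
/// into its two spellings. `apollo.router.telemetry` is the new entry.
const ALLOWED_PREFIXES: [&str; 9] = [
    "apollo.graphos.cloud",
    "apollo.router.operation",
    "apollo.router.operations",
    "apollo.router.lifecycle",
    "apollo.router.config",
    "apollo.router.schema",
    "apollo.router.query",
    "apollo.router.query_planning",
    "apollo.router.telemetry",
];

/// Mirrors the trailing `(\..*|$)` group: after an allowed prefix, the name
/// must either end or continue with a `.`-separated suffix.
fn is_allowed(name: &str) -> bool {
    ALLOWED_PREFIXES.iter().any(|prefix| {
        name.strip_prefix(prefix)
            .map_or(false, |rest| rest.is_empty() || rest.starts_with('.'))
    })
}
```

Under this sketch, `apollo.router.telemetry.studio.reports` (the metric recorded in `apollo_exporter.rs`) passes the filter, while a name such as `apollo.router.telemetryx` does not.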
20 changes: 20 additions & 0 deletions apollo-router/src/plugins/telemetry/apollo.rs
@@ -3,6 +3,7 @@ use std::collections::HashMap;
use std::fmt::Display;
use std::num::NonZeroUsize;
use std::ops::AddAssign;
use std::sync::OnceLock;
use std::time::SystemTime;

use http::header::HeaderName;
@@ -12,10 +13,13 @@ use serde::ser::SerializeMap;
use serde::Deserialize;
use serde::Serialize;
use url::Url;
use uuid::Uuid;

use super::config::Sampler;
use super::metrics::apollo::studio::ContextualizedStats;
use super::metrics::apollo::studio::SingleStats;
use super::metrics::apollo::studio::SingleStatsReport;
use super::otlp::Protocol;
use super::tracing::apollo::TracesReport;
use crate::plugin::serde::deserialize_header_name;
use crate::plugin::serde::deserialize_vec_header_name;
@@ -34,6 +38,9 @@ pub(crate) const ENDPOINT_DEFAULT: &str =

pub(crate) const OTLP_ENDPOINT_DEFAULT: &str = "https://usage-reporting.api.apollographql.com";

// Random unique UUID for the Router. This doesn't actually identify the router, it just allows disambiguation between multiple routers with the same metadata.
pub(crate) static ROUTER_ID: OnceLock<Uuid> = OnceLock::new();

#[derive(Clone, Deserialize, JsonSchema, Debug)]
#[serde(deny_unknown_fields, default)]
pub(crate) struct Config {
@@ -69,6 +76,13 @@ pub(crate) struct Config {
    /// Field level instrumentation for subgraphs via ftv1. ftv1 tracing can cause performance issues as it is transmitted in band with subgraph responses.
    pub(crate) field_level_instrumentation_sampler: SamplerOption,

    /// Percentage of traces to send via the OTel protocol when sending to Apollo Studio.
    pub(crate) experimental_otlp_tracing_sampler: SamplerOption,

    /// OTLP protocol used for OTel traces.
    /// Note this only applies if OTel traces are enabled and is only intended for use in tests.
    pub(crate) experimental_otlp_tracing_protocol: Protocol,

    /// To configure which request header names and values are included in trace data that's sent to Apollo Studio.
    pub(crate) send_headers: ForwardHeaders,
    /// To configure which GraphQL variable values are included in trace data that's sent to Apollo Studio
@@ -134,6 +148,10 @@ const fn default_field_level_instrumentation_sampler() -> SamplerOption {
    SamplerOption::TraceIdRatioBased(0.01)
}

const fn default_experimental_otlp_tracing_sampler() -> SamplerOption {
    SamplerOption::Always(Sampler::AlwaysOff)
}

fn endpoint_default() -> Url {
    Url::parse(ENDPOINT_DEFAULT).expect("must be valid url")
}
@@ -167,13 +185,15 @@ impl Default for Config {
        Self {
            endpoint: endpoint_default(),
            experimental_otlp_endpoint: otlp_endpoint_default(),
            experimental_otlp_tracing_protocol: Protocol::default(),
            apollo_key: apollo_key(),
            apollo_graph_ref: apollo_graph_reference(),
            client_name_header: client_name_header_default(),
            client_version_header: client_version_header_default(),
            schema_id: "<no_schema_id>".to_string(),
            buffer_size: default_buffer_size(),
            field_level_instrumentation_sampler: default_field_level_instrumentation_sampler(),
            experimental_otlp_tracing_sampler: default_experimental_otlp_tracing_sampler(),
            send_headers: ForwardHeaders::None,
            send_variable_values: ForwardValues::None,
            batch_processor: BatchProcessorConfig::default(),
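
The `ROUTER_ID` static added in this file relies on `std::sync::OnceLock`, which stores a value that is initialized at most once and then shared. The sketch below shows that behavior in isolation. It assumes a `RouterId(u128)` newtype as a stand-in for `Uuid` (to avoid depending on the `uuid` crate), and the `router_id` helper is hypothetical, not part of the PR.

```rust
use std::sync::OnceLock;

/// Stand-in for `uuid::Uuid`, used here only to keep the sketch
/// dependency-free.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct RouterId(u128);

/// Mirrors the shape of the static in the diff: empty until first use.
static ROUTER_ID: OnceLock<RouterId> = OnceLock::new();

/// Returns the process-wide ID, running `generate` only if no ID has been
/// stored yet. Later callers get the first value, whatever closure they pass.
fn router_id(generate: impl FnOnce() -> RouterId) -> RouterId {
    *ROUTER_ID.get_or_init(generate)
}
```

Because the first stored value wins, every exporter that reads the ID during the lifetime of the process sees the same one, which is what lets it disambiguate routers that share the same metadata.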
7 changes: 5 additions & 2 deletions apollo-router/src/plugins/telemetry/apollo_exporter.rs
@@ -36,7 +38,9 @@ use crate::plugins::telemetry::tracing::BatchProcessorConfig;

const BACKOFF_INCREMENT: Duration = Duration::from_millis(50);
const ROUTER_REPORT_TYPE_METRICS: &str = "metrics";
- const ROUTER_REPORT_TYPE_TRACES: &str = "traces";
+ pub(crate) const ROUTER_REPORT_TYPE_TRACES: &str = "traces";
+ const ROUTER_TRACING_PROTOCOL_APOLLO: &str = "apollo";
+ pub(crate) const ROUTER_TRACING_PROTOCOL_OTLP: &str = "otlp";

#[derive(thiserror::Error, Debug)]
pub(crate) enum ApolloExportError {
@@ -301,7 +303,8 @@ impl ApolloExporter {
    "apollo.router.telemetry.studio.reports",
    "The number of reports submitted to Studio by the Router",
    1,
-   report.type = report_type
+   report.type = report_type,
+   report.protocol = ROUTER_TRACING_PROTOCOL_APOLLO
);
if has_traces && !self.strip_traces.load(Ordering::SeqCst) {
    // If we had traces then maybe disable sending traces from this exporter based on the response.
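
The new `report.protocol` attribute splits the `apollo.router.telemetry.studio.reports` counter by protocol, so reports sent via OTLP can be compared with those sent via Apollo Usage Reporting. The sketch below models that split with a plain `HashMap` rather than the Router's metrics facilities. `ReportCounter` and its methods are hypothetical, and the constants only mirror the values in the diff.

```rust
use std::collections::HashMap;

// Values mirrored from the constants in the diff above.
const REPORT_TYPE_TRACES: &str = "traces";
const PROTOCOL_APOLLO: &str = "apollo";
const PROTOCOL_OTLP: &str = "otlp";

/// Hypothetical in-memory counter keyed by (`report.type`, `report.protocol`).
#[derive(Default)]
struct ReportCounter {
    counts: HashMap<(&'static str, &'static str), u64>,
}

impl ReportCounter {
    /// Adds one report for the given attribute pair.
    fn record(&mut self, report_type: &'static str, protocol: &'static str) {
        *self.counts.entry((report_type, protocol)).or_insert(0) += 1;
    }

    /// Number of reports seen for the given attribute pair (0 if none).
    fn get(&self, report_type: &'static str, protocol: &'static str) -> u64 {
        self.counts
            .get(&(report_type, protocol))
            .copied()
            .unwrap_or(0)
    }
}
```

Recording two OTLP trace reports and one Apollo trace report leaves two separate series under the same metric name, which is the comparison the commit "Analytics: measure the # of traces sent via OTLP vs. Apollo reporting" is after.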