Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: experiment to support Apollo tracing over OTLP #4982

Draft
wants to merge 21 commits into
base: dev
Choose a base branch
from

Conversation

timbotnik
Copy link

@timbotnik timbotnik commented Apr 18, 2024

Very early in the process here. Just trying to get some feedback on the approach and figure out why there is some complaints about a threading issue 👀

Fixes #issue_number


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

Copy link
Contributor

@timbotnik, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@timbotnik timbotnik changed the title Timbotnik/apollo otlp/initial support WIP: initial support for Apollo over OTLP Apr 18, 2024
@timbotnik timbotnik changed the title WIP: initial support for Apollo over OTLP WIP: experiment to support for Apollo over OTLP Apr 18, 2024
@timbotnik timbotnik changed the title WIP: experiment to support for Apollo over OTLP WIP: experiment to support Apollo tracing over OTLP Apr 18, 2024
@timbotnik timbotnik requested a review from bnjjj April 18, 2024 13:38
@router-perf
Copy link

router-perf bot commented Apr 18, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

Comment on lines 140 to 142
// TBD(tim): maybe we should move this?
// Also, maybe we can just use the actual SpanData instead of the light one?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main idea of having this LightSpanData is to consume less memory in the LRU cache. it was a good improvement when introduced

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah another approach would be grouping by trace and attributes to get rid of the duplications, which would be nice for the wire protocol as well. Probably something to visit in the next phase when we feel good about moving entirely to OTLP.

@timbotnik timbotnik force-pushed the timbotnik/apollo-otlp/initial-support branch from ea8c4a2 to b0755b2 Compare April 18, 2024 15:32
@timbotnik timbotnik force-pushed the timbotnik/apollo-otlp/initial-support branch from b0755b2 to 376f480 Compare April 18, 2024 15:41
// Use only the Apollo usage reporting protobuf over http
Apollo,
// Use both the Apollo usage reporting protobuf AND Apollo OTLP
ApolloAndOtlp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor but there's a pretty common pattern here of using New/Legacy/Both as the enum names (though maybe we should use Otlp/Apollo/Both instead?) https://docs.google.com/document/d/1Zk6PJyCwgDIMTRBSt3UoCKPfefM25saKCDn_AtEHQnc/edit#heading=h.am5dyhwon4gw

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah cool - happy to use a convention if there is one established 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could you change testing by experimental to be consistent with what we currently have

Copy link
Contributor

@bnjjj bnjjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing integration and unit tests and don't use unwrap, if you're 100% sure it's ok to unwrap then use expect() instead with your message inside explaining why it's safe to do this

Comment on lines +111 to +129
const REPORTS_INCLUDE_ATTRS: [Key; 17] = [
APOLLO_PRIVATE_REQUEST,
APOLLO_PRIVATE_DURATION_NS_KEY,
APOLLO_PRIVATE_SENT_TIME_OFFSET,
APOLLO_PRIVATE_GRAPHQL_VARIABLES,
APOLLO_PRIVATE_HTTP_REQUEST_HEADERS,
APOLLO_PRIVATE_HTTP_RESPONSE_HEADERS,
APOLLO_PRIVATE_OPERATION_SIGNATURE,
APOLLO_PRIVATE_FTV1,
PATH,
SUBGRAPH_NAME,
CLIENT_NAME_KEY,
CLIENT_VERSION_KEY,
DEPENDS,
LABEL,
CONDITION,
OPERATION_NAME,
OPERATION_TYPE,
];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this make sure to use an HashSet at the end because perfs are better when you're checking if it contains an element. Here is an example here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think you already did it. Just make sure it's done that way

// Use only the Apollo usage reporting protobuf over http
Apollo,
// Use both the Apollo usage reporting protobuf AND Apollo OTLP
ApolloAndOtlp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could you change testing by experimental to be consistent with what we currently have

Comment on lines +103 to +104
// TBD(tim): do we need another batch processor for this?
// Seems like we've already set up a batcher earlier in the pipe but not quite sure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @BrynCooke you might be the best one to answer this

resource_template: Resource,
intrumentation_library: InstrumentationLibrary,
#[derivative(Debug = "ignore")]
otlp_exporter: Arc<Mutex<opentelemetry_otlp::SpanExporter>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use https://github.com/Amanieu/parking_lot mutex ? It's already in our deps and usually more perfs

exporter.export(spans)
}

pub(crate) fn shutdown(&self) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a shutdown and we don't use the exporter itself after maybe we should take ownership ? Just a suggestion I don't know if it's valid

Suggested change
pub(crate) fn shutdown(&self) {
pub(crate) fn shutdown(self) {

];

/// Additional attributes to include when sending to the OTLP protocol.
const OTLP_EXT_INCLUDE_ATTRS: [Key; 6] = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for hashset

// https://github.com/open-telemetry/opentelemetry-rust/blob/943bb7a03f9cd17a0b6b53c2eb12acf77764c122/opentelemetry-sdk/CHANGELOG.md?plain=1#L157-L159
let max_attr_len = std::cmp::min(attr_names.len(), value.attributes.len());
let mut new_attrs = EvictedHashMap::new(max_attr_len.try_into().unwrap(), max_attr_len);
value.attributes.iter().for_each(|(key, value)| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to avoid copies then use into_iter() instead and then in the loop you'll be able to delete the clone

Suggested change
value.attributes.iter().for_each(|(key, value)| {
value.attributes.into_iter().for_each(|(key, value)| {

@@ -229,6 +321,13 @@ impl Exporter {
errors_configuration: errors_configuration.clone(),
use_legacy_request_span: use_legacy_request_span.unwrap_or_default(),
include_span_names: INCLUDE_SPANS.into(),
include_attr_names: match apollo_tracing_protocol {
// TBD(tim): consider filtering the current implementation for memory purposes to HashSet::from(REPORTS_INCLUDE_ATTRS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this comment

@@ -322,6 +421,63 @@ impl Exporter {
Ok(results)
}

fn init_spans_for_tree(&self, root_span: &LightSpanData) -> Vec<SpanData> {
let exporter = self.otlp_exporter.as_ref().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use unwrap


fn shutdown(&mut self) {
// currently only handled in the OTLP case
if let Some(exporter) = self.otlp_exporter.clone() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if let Some(exporter) = self.otlp_exporter.clone() {
if let Some(exporter) = &self.otlp_exporter {

or you can .take() it if you don't need it anymore ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants