Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support of chunk responses for selectors and instrument in telemetry #5022

Merged
merged 23 commits into from May 14, 2024

Conversation

bnjjj
Copy link
Contributor

@bnjjj bnjjj commented Apr 25, 2024

Giving the ability to compute new attributes everytime we receive a new event in supergraph response. Would be really helpful to create observability for subscriptions and defer.

I also added the support of on_error for selectors and especially error selector I added for every services. Which will let you create some metrics,events or span attributes containing error message. By adding this we will now have to ability to create a counter of request timed out for subgraphs for example:

telemetry:
  instrumentation:
    instruments:
      subgraph:
        requests.timeout:
          value: unit
          type: counter
          unit: request
          description: "subgraph requests containing subgraph timeout"
          attributes:
            subgraph.name: true
          condition:
            eq:
              - "request timed out"
              - error: reason

And thanks to the event support for selectors you'll be able to fetch data directly from the supergraph response body:

telemetry:
  instrumentation:
    instruments:
      acme.request.on_graphql_error:
        value: event_unit
        type: counter
        unit: error
        description: my description
        condition:
          eq:
          - MY_ERROR_CODE
          - response_errors: "$.[0].extensions.code"
        attributes:
          response_errors:
            response_errors: "$.*"

Fixes #5027 #4830


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
@bnjjj bnjjj requested a review from BrynCooke April 25, 2024 15:39
@bnjjj bnjjj marked this pull request as draft April 25, 2024 15:40

This comment has been minimized.

@router-perf
Copy link

router-perf bot commented Apr 25, 2024

CI performance tests

  • step - Basic stress test that steps up the number of users over time
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • large-request - Stress test with a 1 MB request payload
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xlarge-request - Stress test with 10 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • no-graphos - Basic stress test, no GraphOS.
  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • const - Basic stress test that runs with a constant number of users

Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
@bnjjj bnjjj marked this pull request as ready for review April 26, 2024 12:33
@bnjjj bnjjj requested a review from a team as a code owner April 26, 2024 12:33
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
apollo-router/src/plugins/telemetry/config_new/cost/mod.rs Outdated Show resolved Hide resolved
apollo-router/src/plugins/telemetry/config_new/mod.rs Outdated Show resolved Hide resolved
"Event_for_RouterSelector": {
"oneOf": [
{
"description": "For every chunk sent through http multipart connection",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This throws me off a bit, intuitively I read that as multipart file upload chunks. Also does this actually apply only to http multipart /responses/ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not really, I don't know how to explain it better because it's highly related to our internals. It's basically event coming in the supergraph response stream. Maybe a better description could be For every supergraph response payload (including subscription events and defer events) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @shorgi WDYT?

I'm going to preapprove the PR and let Edward give his opinion / suggest something :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnjjj I like your updated suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated with my suggestion and enable auto merge in order to have this in the next release. Once we have feedback from @shorgi I'll open a new PR to change the description

Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
bnjjj added 5 commits May 13, 2024 14:48
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
@bnjjj bnjjj enabled auto-merge (squash) May 14, 2024 12:49
@bnjjj bnjjj merged commit 1f336a8 into dev May 14, 2024
13 of 14 checks passed
@bnjjj bnjjj deleted the bnjjj/feat_chunk_resp_selector branch May 14, 2024 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

telemetry: add selectors for errors to be able to match on a specific error
3 participants