
Add guidance for adding new metrics #5116

Open · wants to merge 6 commits into base: dev
Conversation

BrynCooke (Contributor)

Until now there has been little guidance on adding new metrics to the router. This PR expands the dev doc to include this.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Tests added and passing³
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.


github-actions bot commented May 8, 2024

@BrynCooke, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.


router-perf bot commented May 8, 2024

CI performance tests

  • step - Basic stress test that steps up the number of users over time
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • large-request - Stress test with a 1 MB request payload
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • xxlarge-request - Stress test with a 100 MB request payload
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xlarge-request - Stress test with a 10 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • no-graphos - Basic stress test, no GraphOS.
  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • const - Basic stress test that runs with a constant number of users

@BrynCooke BrynCooke changed the title from "Add guidance for adding mew metrics" to "Add guidance for adding new metrics" May 8, 2024
@BrynCooke BrynCooke requested review from Geal and bnjjj May 8, 2024 09:18
Review thread on dev-docs/metrics.md (outdated, resolved)
@BrynCooke BrynCooke requested a review from abernix May 13, 2024 12:33
@Geal (Contributor) left a comment:

a lot of what is encoded in this document is unclear to me, I think we should discuss it a bit more

## Adding new metrics
There are different types of metrics.

* Static - Used by us to monitor feature usage.
Suggested change:
- * Static - Used by us to monitor feature usage.
+ * Static - Used by Router developers to monitor feature usage.


let's assume router users will end up looking at the dev docs

Comment on lines +191 to +193
> Why are static metrics no longer recommended for users to use directly?
>
> They can, but usually it'll be only a starting point for them. We can't predict the things that users will want to monitor, and if we tried we would blow up the cardinality of our metrics resulting in high costs for our users via their APMs.

that is not clear to me. What do we mean by "users using static metrics directly?" Is it when they would add that in their custom plugin? (which would not increase cardinality for all users) Or asking us to add a new metric to the router?


### Static metrics
When adding a new feature to the Router you must also add static metrics to monitor that feature's usage. These metrics are always on: users cannot disable or change them.
They must be low cardinality and must not leak any sensitive information. They exist primarily so that we can see how our features are used and inform future development.

a lot of static metrics actually monitor standard router operations and are not for us to collect data, but for users to observe the router.
If we want this to be the defining point, let's maybe not call them static vs dynamic metrics, but internal vs monitoring or user metrics, something like that?
I'd prefer we keep the distinction between static metrics as defined directly with tracing, and dynamic metrics as the ones defined by custom instruments that can be activated with runtime conditions, and have another clear separation between the metrics used for internal reporting (as with the `apollo.router.operations` and `apollo.router.config` prefixes) and the user-facing ones.

* Look at the [OTel semantic conventions](https://opentelemetry.io/docs/specs/semconv/general/metrics/)
* Notify `#proj-router-analytics` channel in Slack.
* Add the metrics to the spreadsheet linked in the `#proj-router-analytics` channel in Slack.
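To make the distinction concrete, a static metric is typically recorded inline at the call site. The sketch below assumes a `u64_counter!`-style metrics macro like the ones used in the router codebase; the metric name, description, and attribute are hypothetical examples, not taken from this PR:

```rust
// Illustrative sketch only: assumes a `u64_counter!` metrics macro as used
// elsewhere in the apollo-router codebase. The metric name and the
// `my_feature.mode` attribute are hypothetical.
u64_counter!(
    // Static, low-cardinality name following the
    // `apollo.router.operations.<feature>` convention.
    "apollo.router.operations.my_feature",
    // Description recorded with the instrument.
    "Number of requests that used my_feature",
    // Increment by one for this occurrence.
    1,
    // Keep attributes low cardinality: a small fixed set of values,
    // never user-supplied strings such as operation names.
    my_feature.mode = "enabled"
);
```

A dynamic metric, by contrast, is not hard-coded at a call site: the user defines it through custom instruments in telemetry configuration, with runtime conditions controlling when it is recorded.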


code example of what is a static metric, to be sure which is which between static and dynamic?


When defining new operation metrics, use the following conventions:

**Name:** `apollo.router.operations.<feature>` - (counter)

some of the apollo.router.operations metrics are actually monitored by users. What is the strategy here? Do we keep them available for users?
