Enhancing metric observability #2791

Open
CodeDrivenMitch opened this issue Aug 1, 2023 · 1 comment
Labels
Priority 1: Must (Highest priority. A release cannot be made if this issue isn’t resolved.)
Type: Enhancement (Use to signal an issue enhances an already existing feature of the project.)

Comments

@CodeDrivenMitch
Member

After two separate instances of providing training and on-site support for performance in AF applications, I believe there are things we can do to improve the observability.

We should separate the observability into two different categories:

  1. Capacity of system - for alerting and scaling
  2. Timings - for investigation

The reason for this separation is that timings are very hard to alert on. They can fluctuate heavily, which is why we express them as percentiles, e.g. 95% of requests complete within 200 ms.
However, this value fluctuating is not a problem, as long as it doesn't cause the capacity to be reached. When you reach the capacity (or want to optimize) you start investigating.
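
To make the percentile wording concrete, here is a minimal sketch of a nearest-rank percentile over a set of recorded durations. The class `Percentiles` and its method are hypothetical helpers for illustration only, not part of Axon Framework or any metrics library.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not an Axon Framework API.
final class Percentiles {

    // Nearest-rank percentile: the smallest recorded value such that
    // at least p percent of all values are less than or equal to it.
    static long percentile(long[] durationsMillis, double p) {
        if (durationsMillis.length == 0) {
            throw new IllegalArgumentException("no samples recorded");
        }
        long[] sorted = durationsMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

A single slow request is enough to move a high percentile a lot while the median barely changes, which is why such values make poor alert triggers on their own.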

Capacity

The current way to measure capacity for commands and events is the capacity metric. This is the number of threads that were busy (on average) over the last 10 minutes. There are a few problems with this metric:

  • If a message takes very long, the MonitorCallback is not called so the time taken is not registered
  • It is an average. You can still have spikes and thus capacity problems and not see them (this has been improved in 4.8.0 by setting the period to 1 second - but still)
  • It is not a relative metric. I have 10 threads active! Of how many? 100? 1000? We don't know! So we cannot scale or detect capacity problems
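
The last two points can be illustrated with a small sketch. It assumes we have a series of "busy thread" samples and know the pool size; `CapacitySamples` and its methods are hypothetical names, not existing Axon Framework code.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not an Axon Framework API.
final class CapacitySamples {

    // Average number of busy threads over the sampled window.
    static double average(int[] busyThreads) {
        return Arrays.stream(busyThreads).average().orElse(0.0);
    }

    // Highest number of busy threads seen in the sampled window.
    static int peak(int[] busyThreads) {
        return Arrays.stream(busyThreads).max().orElse(0);
    }

    // Relative usage: busy threads as a fraction of the pool size.
    static double utilization(double busyThreads, int poolSize) {
        return busyThreads / poolSize;
    }
}
```

With samples such as `{1, 1, 1, 1, 10, 1, 1, 1, 1, 1}` and a pool of 10 threads, the average suggests the pool is mostly idle, while the peak shows it was fully saturated for one sample. Only the value relative to the pool size tells you how close you are to the limit.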

I want to propose to:

  1. Remove/deprecate this somewhat misleading metric
  2. Expose the thread pool itself to metrics via an abstraction. This will measure how many threads are busy and if there are tasks queuing
    1. You can then alert on the queue! You have > 1 pending tasks? Alert/scale
  3. Measure the time it takes for a message to be picked up. > 0? Alert/scale
  4. Maybe: Measure the time it takes for a command to reach the localSegment bus.
    1. This would mean adding timestamps to queries and events - might be controversial
    2. But very useful. Also includes network and AS routing
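
As a rough sketch of points 2 and 3, the wrapper below exposes busy threads, pool size and queue depth of a plain `java.util.concurrent.ThreadPoolExecutor`, and records how long a task waited before a thread picked it up. `ObservedExecutor` and its method names are assumptions made for this example; how the real abstraction would look in the framework is still open.

```java
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper for illustration; not an Axon Framework API.
final class ObservedExecutor {

    private final ThreadPoolExecutor executor;
    private final AtomicLong maxQueueWaitNanos = new AtomicLong();

    ObservedExecutor(int threads) {
        // Fixed-size pool with an unbounded queue, so excess work queues up.
        this.executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    Future<?> submit(Runnable task) {
        long enqueuedAt = System.nanoTime();
        return executor.submit(() -> {
            // Time between submission and pick-up by a worker thread.
            long waited = System.nanoTime() - enqueuedAt;
            maxQueueWaitNanos.accumulateAndGet(waited, Math::max);
            task.run();
        });
    }

    int busyThreads() { return executor.getActiveCount(); }       // approximate
    int poolSize() { return executor.getMaximumPoolSize(); }
    int queuedTasks() { return executor.getQueue().size(); }
    long maxQueueWaitNanos() { return maxQueueWaitNanos.get(); }

    void shutdown() { executor.shutdown(); }
}
```

These accessors could be registered as gauges in whatever metrics library is in use. `queuedTasks()` staying above zero, or the pick-up time growing, would be the signal to alert or scale.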

Capacity monitoring for event processors is good (using eventprocessor latency). We lack any autoscaling capabilities though! And I would like monitoring on the PSEP thread pool, just like in the buses.
I have an idea for the autoscaling. Expect a blog soon.

Timings

The timings are already very good. There are some things we can improve there:

  • Measure the message response time (command/query) from the sending side
  • Measure time spent on gRPC calls (appendEvent / listAggregateEvents)
  • Measure time taken to load aggregate
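
For the first point, a minimal sketch of sender-side timing could look like the following. It assumes the dispatch call returns a `CompletableFuture` (as an asynchronous gateway send typically would); `SenderSideTimer`, `timed` and the `recordNanos` callback are hypothetical names for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not an Axon Framework API.
final class SenderSideTimer {

    // Starts the clock before dispatching and reports the elapsed time
    // when the response (or failure) arrives back at the sender.
    static <R> CompletableFuture<R> timed(Supplier<CompletableFuture<R>> dispatch,
                                          LongConsumer recordNanos) {
        long start = System.nanoTime();
        return dispatch.get()
                .whenComplete((result, error) -> recordNanos.accept(System.nanoTime() - start));
    }
}
```

Measured this way, the duration includes network and routing time as well as the handler itself, which is exactly what the handler-side timers cannot see.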
CodeDrivenMitch added the Type: Enhancement label on Aug 1, 2023
CodeDrivenMitch added this to the Release 4.9.0 milestone on Aug 1, 2023
CodeDrivenMitch self-assigned this on Aug 1, 2023
smcvb added the Priority 1: Must label on Aug 1, 2023
smcvb removed this from the Release 4.9.0 milestone on Oct 16, 2023
@smcvb
Member

smcvb commented Oct 16, 2023

I have removed this issue from milestone 4.9.0 so that 4.9.0 can be released in a timely manner.
