Enhancing metric observability #2791

CodeDrivenMitch · 2023-08-01T10:55:18Z

After two separate instances of providing training and on-site support for performance in AF applications, I believe there are things we can do to improve the observability.

We should separate the observability into two different categories:

Capacity of system - for alerting and scaling
Timings - for investigation

The reason for this separation is that timings can be very hard to alert on. They can fluctuate heavily, which is why we talk about them in percentiles, e.g. 95% of requests are done within 200ms.
However, this value fluctuating is not a problem, as long as it doesn't cause the capacity to be reached. When you reach the capacity (or want to optimize) you start investigating.

Capacity

The current way to measure capacity for commands and events is the capacity metric. This is the number of threads that were busy (on average) over the last 10 minutes. There are a few problems with this metric:

If a message takes very long, the MonitorCallback is not called so the time taken is not registered
It is an average. You can still have spikes and thus capacity problems and not see them (this has been improved in 4.8.0 by setting the period to 1 second - but still)
It is not a relative metric. I have 10 threads active! Of how many? 100? 1000? We don't know! So we cannot scale or detect capacity problems
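
To make the second and third problem concrete, here is a small worked example. It is a simplified model, not how the capacity metric is actually computed: it assumes one sample of the busy-thread count per second over a 10-minute window and a pool of 10 threads. `CapacityWindow` and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of any framework API. */
final class CapacityWindow {

    /** Average number of busy threads across all samples in the window. */
    static double averageBusy(int[] busySamples) {
        return Arrays.stream(busySamples).average().orElse(0.0);
    }

    /** Highest busy/poolSize ratio seen in the window; 1.0 means fully saturated. */
    static double peakUtilization(int[] busySamples, int poolSize) {
        return (double) Arrays.stream(busySamples).max().orElse(0) / poolSize;
    }

    public static void main(String[] args) {
        int poolSize = 10;                   // assumed pool size
        int[] samples = new int[600];        // one sample per second for 10 minutes
        Arrays.fill(samples, 1);             // mostly a single busy thread
        Arrays.fill(samples, 300, 310, 10);  // a 10-second spike that uses every thread

        System.out.printf("average busy threads: %.2f%n", averageBusy(samples));
        System.out.printf("peak utilization: %.0f%%%n", 100 * peakUtilization(samples, poolSize));
    }
}
```

Under these assumptions the average works out to (590 × 1 + 10 × 10) / 600 = 1.15 busy threads, while the pool was completely saturated for 10 seconds. The average alone shows neither the spike nor how close 1.15 is to the limit.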

I want to propose to:

Remove/deprecate this somewhat misleading metric
Expose the thread pool itself to metrics via an abstraction. This will measure how many threads are busy and if there are tasks queuing
    1. You can then alert on the queue! You have > 1 pending tasks? Alert/scale
Measure the time it takes for a message to be picked up. > 0? Alert/scale
Maybe: Measure the time it takes for a command to reach the localSegment bus.
    1. This would mean adding timestamps to queries and events - might be controversial
    2. But very useful. Also includes network and AS routing
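
A minimal sketch of what such an abstraction could expose, using only `java.util.concurrent`. `ObservedPool` and all of its method names are hypothetical and not an existing framework API; a real implementation would register these values as gauges in whichever metrics library is in use. It reports busy threads relative to the pool size, the number of queued tasks, and the longest time a task waited before a thread picked it up.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical wrapper for illustration; names are assumptions, not a framework API. */
final class ObservedPool {
    private final ThreadPoolExecutor executor;
    private final AtomicLong maxPickupDelayNanos = new AtomicLong();

    ObservedPool(int threads) {
        // Fixed-size pool with an unbounded queue, so excess work shows up as queued tasks.
        this.executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    /** Submits a task and records how long it waited before a thread started it. */
    void submit(Runnable task) {
        long enqueuedAt = System.nanoTime();
        executor.execute(() -> {
            long waited = System.nanoTime() - enqueuedAt;
            maxPickupDelayNanos.accumulateAndGet(waited, Math::max);
            task.run();
        });
    }

    /** Approximate number of threads currently executing a task. */
    int busyThreads() { return executor.getActiveCount(); }

    int maxThreads() { return executor.getMaximumPoolSize(); }

    /** Relative capacity: 1.0 means every thread is busy. */
    double utilization() { return (double) busyThreads() / maxThreads(); }

    /** Tasks waiting for a free thread; anything above zero is a candidate for alerting. */
    int queuedTasks() { return executor.getQueue().size(); }

    long maxPickupDelayMillis() {
        return TimeUnit.NANOSECONDS.toMillis(maxPickupDelayNanos.get());
    }

    void shutdown() { executor.shutdown(); }

    boolean awaitTermination(long seconds) throws InterruptedException {
        return executor.awaitTermination(seconds, TimeUnit.SECONDS);
    }
}
```

With a pool of 2 threads and 3 long-running tasks, `utilization()` reports 1.0 and `queuedTasks()` reports 1, which answers the "of how many?" question directly. The pickup delay here is measured inside a single JVM; measuring it across the network is the part that would need timestamps on the messages.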

Capacity monitoring for event processors is good (using eventprocessor latency). We lack any autoscaling capabilities though! And I would like monitoring on the PSEP thread pool, just like in the buses.
I have an idea for the autoscaling. Expect a blog soon.

Timings

The timings are already very good. There are some things we can improve there:

Measure the message response time (command/query) from the sending side
Measure time spent on gRPC calls (appendEvent / listAggregateEvents)
Measure time taken to load aggregate
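
As a rough sketch of the first item, sender-side response time can be measured by wrapping any dispatch call that returns a `CompletableFuture`. `SenderSideTimer` is a hypothetical helper, not an existing API, and it assumes the dispatch call reports completion through the returned future. A real implementation would record into a timer from the metrics library rather than plain counters.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not an existing framework class. */
final class SenderSideTimer {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong maxNanos = new AtomicLong();

    /**
     * Runs the dispatch and records the time until its future completes,
     * whether it completes normally or exceptionally.
     */
    <R> CompletableFuture<R> time(Supplier<CompletableFuture<R>> dispatch) {
        long start = System.nanoTime();
        return dispatch.get()
                .whenComplete((result, error) -> record(System.nanoTime() - start));
    }

    private void record(long nanos) {
        count.incrementAndGet();
        totalNanos.addAndGet(nanos);
        maxNanos.accumulateAndGet(nanos, Math::max);
    }

    long count() { return count.get(); }

    long meanMillis() {
        long n = count.get();
        return n == 0 ? 0 : TimeUnit.NANOSECONDS.toMillis(totalNanos.get() / n);
    }

    long maxMillis() { return TimeUnit.NANOSECONDS.toMillis(maxNanos.get()); }
}
```

Usage would look like `timer.time(() -> gateway.send(command))`, where `gateway` stands for whatever component dispatches the message. Because the clock starts before dispatch and stops when the response arrives, the measurement includes network and routing time, which handler-side timings cannot see.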


smcvb · 2023-10-16T08:02:32Z

I have removed this issue from milestone 4.9.0 so that 4.9.0 can be released in a timely manner.

CodeDrivenMitch added the Type: Enhancement label Aug 1, 2023

CodeDrivenMitch added this to the Release 4.9.0 milestone Aug 1, 2023

CodeDrivenMitch self-assigned this Aug 1, 2023

smcvb added the Priority 1: Must label Aug 1, 2023

smcvb removed this from the Release 4.9.0 milestone Oct 16, 2023
