Update Metrics information with image and list
mikegorman-nf committed Aug 31, 2022
1 parent 800b451 commit e72e874
Showing 3 changed files with 122 additions and 3 deletions.
Binary file added docfx_project/ziti/metrics/MetricsReference.png
54 changes: 51 additions & 3 deletions docfx_project/ziti/metrics/metric-types.md
@@ -5,7 +5,32 @@ Ziti is instrumenting more code and adding additional metrics all of the time. T
A gauge of a single value. The value is the current metric value, and can go up and down over time

## Histogram
Standard histogram with:
Histogram metrics use the Go metrics module and are configured as a 128-sample, exponentially decaying bucket with an alpha value of .015. This matters when interpreting the values, especially the minimum and maximum: the bucket is sample bound, not time bound. In practice, a maximum or minimum value will often carry on for several time samples; this is expected behavior. The histogram implementation allows for extremely fast and memory-efficient data collection. Because some of the metrics are multiplied by multiple levels of cardinality, this efficiency is critical to keeping the software operating well.

An exponentially decaying histogram means that as samples age across the 128-sample window, they are weighted less than newer samples. This lets functions such as the commonly used mean respond more quickly to changes than a straight sliding window would. An alpha value of .015 means that the sample weights range from 1 (the newest sample) to approximately .93. When calculating the mean, the oldest sample in the window is therefore weighted at 93%, reducing its contribution to the result.

A simple weighting exercise: given three samples (10, 5, and 5), how do the weighting and the order affect the mean? In each table the newest sample is listed first. An unweighted mean would be 6.67 in both cases.

When the 10 is the newest sample:

| Sample | Weight | Weighted Value |
|--------|-------------|----------------|
| 10 | 1.0 | 10.0 |
| 5 | .95 | 4.75 |
| 5 | .90 | 4.5 |
| Sum | 2.85 | 19.25 |
| Mean | 19.25/2.85 | 6.75 |

When the 10 is the oldest sample:

| Sample | Weight | Weighted Value |
|--------|-------------|----------------|
| 5 | 1.0 | 5.0 |
| 5 | .95 | 4.75 |
| 10 | .90 | 9.0 |
| Sum | 2.85 | 18.75 |
| Mean | 18.75/2.85 | 6.58 |
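
The arithmetic in the two tables above can be reproduced with a few lines of Python. This is only a sketch of the weighted-mean calculation; the function and variable names are made up for illustration, and the weights are the simplified values from the exercise, not the ones the histogram actually computes.

```python
def weighted_mean(samples, weights):
    """Return sum(sample * weight) / sum(weight)."""
    weighted_sum = sum(s * w for s, w in zip(samples, weights))
    return weighted_sum / sum(weights)

# Illustrative weights from the exercise, newest sample first.
weights = [1.0, 0.95, 0.90]

newest_is_10 = weighted_mean([10, 5, 5], weights)  # the 10 is the newest sample
oldest_is_10 = weighted_mean([5, 5, 10], weights)  # the 10 is the oldest sample

print(round(newest_is_10, 2))  # 6.75
print(round(oldest_is_10, 2))  # 6.58
```

The same three values produce a higher mean when the large value is the most recent one, which is the behavior the decay is meant to provide.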



Standard histograms provide:
* min
* max
* mean
@@ -19,8 +44,12 @@ Standard histogram with:
* p999
* p9999

## Timer
Timer metric with:
Note that with a sample size of 128, the more specific percentiles will draw on the same underlying values and may be repetitive.
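
To see why the high percentiles repeat, consider a sketch that uses the simple nearest-rank rule on 128 samples. The rule and the helper name `nearest_rank` are assumptions chosen for illustration; the Go metrics module may interpolate differently, but the effect of the small sample count is the same.

```python
import math

def nearest_rank(sorted_samples, quantile):
    """Return the nearest-rank percentile of an ascending list."""
    rank = math.ceil(quantile * len(sorted_samples))  # 1-based rank
    return sorted_samples[max(rank, 1) - 1]

# 128 hypothetical sample values: 1, 2, ..., 128.
samples = sorted(range(1, 129))

for name, q in [("p50", 0.50), ("p99", 0.99), ("p999", 0.999), ("p9999", 0.9999)]:
    print(name, nearest_rank(samples, q))
```

With only 128 samples, p999 and p9999 both resolve to the single largest sample, so the two series report the same value.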

## Meter
Meters are used for rate measurements: how much of something happened per unit of time. The samples are exponentially decayed, similar to the histogram; however, the values are bound to specific time intervals, such as 1, 5, and 15 minutes. They can also provide statistical values similar to those of histograms.

Meter metric with:
* count
* m1_rate
* m5_rate
@@ -37,3 +66,22 @@ Timer metric with:
* p99
* p999
* p9999
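
As a rough illustration of time-bound, exponentially decayed rates, the sketch below uses a common exponentially weighted moving average (EWMA) formulation. It is not taken from the Ziti source: the class name `EwmaRate`, the 5-second tick, and the `m15_rate` key (named by analogy with `m1_rate` and `m5_rate`) are all assumptions.

```python
import math

class EwmaRate:
    """Exponentially weighted moving average of an event rate (events/second)."""

    def __init__(self, window_minutes, tick_seconds=5.0):
        # tick_seconds is an assumed update interval, not a documented Ziti setting.
        self.tick_seconds = tick_seconds
        self.alpha = 1.0 - math.exp(-tick_seconds / (60.0 * window_minutes))
        self.pending = 0   # events seen since the last tick
        self.rate = None   # current smoothed rate; None until the first tick

    def mark(self, count=1):
        """Record that `count` events happened."""
        self.pending += count

    def tick(self):
        """Fold the events seen since the last tick into the smoothed rate."""
        instant_rate = self.pending / self.tick_seconds
        self.pending = 0
        if self.rate is None:
            self.rate = instant_rate
        else:
            self.rate += self.alpha * (instant_rate - self.rate)

meters = {"m1_rate": EwmaRate(1), "m5_rate": EwmaRate(5), "m15_rate": EwmaRate(15)}

# Ten minutes of steady traffic at 10 events per second...
for _ in range(120):
    for meter in meters.values():
        meter.mark(50)
        meter.tick()

# ...followed by one idle minute.
for _ in range(12):
    for meter in meters.values():
        meter.tick()

for name, meter in meters.items():
    print(name, round(meter.rate, 2))
```

After the idle minute, the 1-minute rate has fallen much further from 10 than the 5- and 15-minute rates, which is why the shorter windows are better for spotting recent changes and the longer ones for trends.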

## Timer
Timers provide statistical samples of timed events.

* min
* max
* mean
* std_dev
* variance
* percentiles
* p50
* p75
* p95
* p99
* p999
* p9999
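
Conceptually, a timer records the duration of each event and then summarizes those durations. The following is a minimal sketch of that idea, not the Ziti implementation: the `Timer` class and its methods are hypothetical, and it keeps every duration rather than a decaying sample.

```python
import statistics
import time

class Timer:
    """Collects event durations (in seconds) and summarizes them."""

    def __init__(self):
        self.durations = []

    def record(self, seconds):
        """Store an already-measured duration."""
        self.durations.append(seconds)

    def time(self, func, *args):
        """Run func(*args), record how long it took, and return its result."""
        start = time.perf_counter()
        try:
            return func(*args)
        finally:
            self.record(time.perf_counter() - start)

    def summary(self):
        d = self.durations
        return {
            "min": min(d),
            "max": max(d),
            "mean": statistics.fmean(d),
            "std_dev": statistics.pstdev(d),
            "variance": statistics.pvariance(d),
        }

timer = Timer()
for seconds in (0.010, 0.020, 0.030):  # hypothetical measured durations
    timer.record(seconds)

stats = timer.summary()
print(stats["min"], stats["max"], round(stats["mean"], 3))
```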

## Gauge
Gauges present a point in time measurement of a metric. For example, the number of open database transactions at a given moment.
71 changes: 71 additions & 0 deletions docfx_project/ziti/metrics/metrics.md
@@ -0,0 +1,71 @@
# Metrics

OpenZiti provides a wide range of metrics for monitoring network services, endpoints, and processes. Some of these metrics are visualized below to show where they fall and what they measure in a network instance. Most of the remaining metrics measure processes within the control plane rather than network operation.

![Metrics Reference Image](./MetricsReference.png)

## Available Metrics
Metrics are reported to log files, located in /var/log/ziti by default. There are two primary log files for metrics: utilization-metrics.log and utilization-usage.log. These logs may be shipped to various reporting systems for easier visibility and monitoring.

| Metric | Type | Source | Description|
|------------------------|-----------|------------|-----------------------------------------------------------------------------------------------------|
|api-session.create | Histogram | controller | Time to create api sessions|
|api.session.enforcer.run | Timer | controller | How long it takes the api session policy enforcer to run|
|bolt.open_read_txs | Gauge | controller | Current number of open bbolt read transactions|
|ctrl.latency | Histogram | controller | Per control channel latency|
|ctrl.queue_time | Histogram | controller | Per control channel queue time (between send and write to wire)|
|ctrl.rx.bytesrate | Meter | controller | Per control channel receive data rate|
|ctrl.rx.msgrate | Meter | controller | Per control channel receive message rate|
|ctrl.rx.msgsize | Histogram | controller | Per control channel receive message size distribution|
|ctrl.tx.bytesrate | Meter | controller | Per control channel send data rate|
|ctrl.tx.msgrate | Meter | controller | Per control channel send message rate|
|ctrl.tx.msgsize | Histogram | controller | Per control channel send message size distribution|
|edge.invalid_api_tokens | Meter | router | Number of invalid api session tokens encountered|
|edge.invalid_api_tokens_during_sync | Meter | router | Number of invalid api session tokens encountered while a sync is in progress|
|egress.rx.bytesrate | Meter | router | Data rate of data received via xgress, originating from terminators. Per router.|
|egress.rx.msgrate | Meter | router | Message rate of data received via xgress, originating from terminators. Per router.|
|egress.rx.msgsize | Histogram | router | Message size distribution of data received via xgress, originating from terminators. Per router.|
|egress.tx.bytesrate | Meter | router | Data rate of data sent via xgress, originating from terminators. Per router.|
|egress.tx.msgrate | Meter | router | Message rate of data sent via xgress, originating from terminators. Per router.|
|egress.tx.msgsize | Histogram | router | Message size distribution of data sent via xgress, originating from terminators. Per router.|
|eventual.events | Gauge | controller | Number of background events pending processing|
|fabric.rx.bytesrate | Meter | router | Data rate of data received from fabric links|
|fabric.rx.msgrate | Meter | router | Message rate of data received from fabric links|
|fabric.rx.msgsize | Histogram | router | Message size distribution of data received from fabric links|
|fabric.tx.bytesrate | Meter | router | Data rate of data sent on fabric links|
|fabric.tx.msgrate | Meter | router | Message rate of data sent on fabric links|
|fabric.tx.msgsize | Histogram | router | Message size distribution of data sent on fabric links|
|identity.refresh | Meter | controller | How often an identity is marked as needing a full refresh of its service list|
|identity.update-sdk-info | Histogram | controller | Time to update identity sdk info|
|ingress.rx.bytesrate | Meter | router | Data rate of data received via xgress, originating from initiators. Per router.|
|ingress.rx.msgrate | Meter | router | Message rate of data received via xgress, originating from initiators. Per router.|
|ingress.rx.msgsize | Histogram | router | Message size distribution of data received via xgress, originating from initiators. Per router.|
|ingress.tx.bytesrate | Meter | router | Data rate of data sent via xgress, originating from initiators. Per router.|
|ingress.tx.msgrate | Meter | router | Message rate of data sent via xgress, originating from initiators. Per router.|
|ingress.tx.msgsize | Histogram | router | Message size distribution of data sent via xgress, originating from initiators. Per router.|
|link.latency | Histogram | controller | Per link latency in nanoseconds|
|link.queue_time | Histogram | controller | Per link queue time (between send and write to wire)|
|link.rx.bytesrate | Meter | controller | Per link receive data rate|
|link.rx.msgrate | Meter | controller | Per link receive message rate|
|link.rx.msgsize | Histogram | controller | Per link receive message size distribution|
|link.tx.bytesrate | Meter | controller | Per link send data rate|
|link.tx.msgrate | Meter | controller | Per link send message rate|
|link.tx.msgsize | Histogram | controller | Per link send message size distribution|
|service.policy.enforcer.run | Timer | controller | How long it takes the service policy enforcer to run|
|service.policy.enforcer.run.deletes | Meter | controller | How many sessions are deleted by the service policy enforcer|
|services.list | Histogram | controller | Time to list services|
|session.create | Histogram | controller | Time to create a session|
|xgress.ack_duplicates | Meter | router | Number of duplicate acks received. Indicates over-eager retransmission|
|xgress.ack_failures | Meter | router | Number of failures sending acks|
|xgress.acks.queue_size | Gauge | router | Number of acks queued to send|
|xgress.blocked_by_local_window | Gauge | router | Number of xgress instances blocked because the windowing threshold has been exceeded locally|
|xgress.blocked_by_remote_window | Gauge | router | Number of xgress instances blocked because the windowing threshold has been exceeded remotely|
|xgress.dropped_payloads | Meter | router | Number of payloads dropped because the xgress receiver side couldn't keep up|
|xgress.retransmission_failures | Meter | router | Number of retransmission send failures|
|xgress.retransmissions | Meter | router | Number of payloads retransmitted|
|xgress.retransmits.queue_size | Gauge | router | Number of payloads queued for retransmission|
|xgress.rx.acks | Meter | router | Number of acks received|
|xgress.tx.acks | Meter | router | Number of acks sent|
|xgress.tx_unacked_payload_bytes | Gauge | router | Total payload data size that has been buffered but not acked yet|
|xgress.tx_unacked_payloads | Gauge | router | Number of payload messages that have been buffered but not yet acked|
|xgress.tx_write_time | Timer | router | Time to write payloads to the xgress receiver|
