add support for publishing percentile time series for the histogram m… (#1689)

* add support for publishing percentile time series for the histogram metrics cql-requests, cql-messages and throttling delay.

Motivation:

Histogram metrics generate too many time series and overload the Prometheus servers. If an application has 500 VMs
and 1000 Cassandra nodes, the histogram metrics generate 100*500*1000 = 50,000,000 time series every 30 seconds.
That is simply too many. Suppose we instead generate a 95th-percentile time series for every Cassandra node;
then we only have 1*500 = 500 metrics, and on the application side we can ignore the _bucket time series. This
leaves far fewer metrics.

Modifications:
Add configurable pre-defined percentiles through Micrometer's Timer.Builder.publishPercentiles. This change applies to
cql-requests, cql-messages and throttling delay.

Result:
Based on the configuration, we will see additional quantile time series for the cql-requests, cql-messages and throttling delay
histogram metrics.

* using helper method as suggested in review

* fixes as per review comments

* add configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on]

* updating java doc

* rename method to publishPercentilesIfDefined

* rename method

---------

Co-authored-by: Nagappa Paraddi <nparaddi@walmartlabs.com>
nparaddi-walmart and Nagappa Paraddi committed Sep 11, 2023
1 parent 1849812 commit 06947df
Showing 10 changed files with 311 additions and 21 deletions.
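
The arithmetic in the motivation above can be reproduced with a small sketch. The figures (about 100 bucket series per timer, 500 VMs, 1000 Cassandra nodes) are the commit message's own estimates; the class and method names are hypothetical, and the last figure rests on an assumption that the commit message does not make.

```java
final class SeriesCountSketch {
  // Number of time series when each of `vms` clients exports
  // `seriesPerTimer` series for each of `nodes` Cassandra nodes.
  static long seriesCount(long seriesPerTimer, long vms, long nodes) {
    return seriesPerTimer * vms * nodes;
  }

  public static void main(String[] args) {
    // Figures from the commit message: ~100 bucket series per timer, 500 VMs, 1000 nodes.
    long bucketSeries = seriesCount(100, 500, 1000);
    // One published percentile (e.g. p95) per VM, as counted in the commit message (1*500).
    long percentilePerVm = seriesCount(1, 500, 1);
    // Assumption (not in the commit message): if the percentile series is also
    // tagged per node, the count is multiplied by the number of nodes.
    long percentilePerNode = seriesCount(1, 500, 1000);

    System.out.println(bucketSeries); // 50000000
    System.out.println(percentilePerVm); // 500
    System.out.println(percentilePerNode); // 500000
  }
}
```

Either way, a single percentile series is a factor of 100 smaller than the bucket series for the same timer, which is the saving the commit is after.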
@@ -288,6 +288,34 @@ public enum DseDriverOption implements DriverOption {
* <p>Value-type: {@link java.time.Duration Duration}
*/
METRICS_NODE_GRAPH_MESSAGES_SLO("advanced.metrics.node.graph-messages.slo"),
/**
* Optional list of percentiles to publish for graph-requests metric. Produces an additional time
* series for each requested percentile. This percentile is computed locally, and so can't be
* aggregated with percentiles computed across other dimensions (e.g. in a different instance).
*
* <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
*/
METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES(
"advanced.metrics.session.graph-requests.publish-percentiles"),
/**
* Optional list of percentiles to publish for node graph-messages metric. Produces an additional
* time series for each requested percentile. This percentile is computed locally, and so can't be
* aggregated with percentiles computed across other dimensions (e.g. in a different instance).
*
* <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
*/
METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES(
"advanced.metrics.node.graph-messages.publish-percentiles"),
/**
* Optional list of percentiles to publish for continuous paging requests metric. Produces an
* additional time series for each requested percentile. This percentile is computed locally, and
* so can't be aggregated with percentiles computed across other dimensions (e.g. in a different
* instance).
*
* <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
*/
CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES(
"advanced.metrics.session.continuous-cql-requests.publish-percentiles"),
;

private final String path;
@@ -939,6 +939,41 @@ public enum DefaultDriverOption implements DriverOption {
* <p>Value-type: List of {@link String}
*/
METADATA_SCHEMA_CHANGE_LISTENER_CLASSES("advanced.schema-change-listener.classes"),
/**
* Optional list of percentiles to publish for cql-requests metric. Produces an additional time
* series for each requested percentile. This percentile is computed locally, and so can't be
* aggregated with percentiles computed across other dimensions (e.g. in a different instance).
*
* <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
*/
METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES(
"advanced.metrics.session.cql-requests.publish-percentiles"),
/**
* Optional list of percentiles to publish for node cql-messages metric. Produces an additional
* time series for each requested percentile. This percentile is computed locally, and so can't be
* aggregated with percentiles computed across other dimensions (e.g. in a different instance).
*
* <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
*/
METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES(
"advanced.metrics.node.cql-messages.publish-percentiles"),
/**
* Optional list of percentiles to publish for throttling delay metric.Produces an additional time
* series for each requested percentile. This percentile is computed locally, and so can't be
* aggregated with percentiles computed across other dimensions (e.g. in a different instance).
*
* <p>Value type: {@link java.util.List List}&#60;{@link Double}&#62;
*/
METRICS_SESSION_THROTTLING_PUBLISH_PERCENTILES(
"advanced.metrics.session.throttling.delay.publish-percentiles"),
/**
* Adds histogram buckets used to generate aggregable percentile approximations in monitoring
* systems that have query facilities to do so (e.g. Prometheus histogram_quantile, Atlas
* percentiles).
*
* <p>Value-type: boolean
*/
METRICS_GENERATE_AGGREGABLE_HISTOGRAMS("advanced.metrics.histograms.generate-aggregable"),
;

private final String path;
@@ -378,6 +378,7 @@ protected static void fillWithDriverDefaults(OptionsMap map) {
map.put(TypedDriverOption.COALESCER_INTERVAL, Duration.of(10, ChronoUnit.MICROS));
map.put(TypedDriverOption.LOAD_BALANCING_DC_FAILOVER_MAX_NODES_PER_REMOTE_DC, 0);
map.put(TypedDriverOption.LOAD_BALANCING_DC_FAILOVER_ALLOW_FOR_LOCAL_CONSISTENCY_LEVELS, false);
map.put(TypedDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS, true);
}

@Immutable
@@ -388,6 +388,10 @@ public String toString() {
/** The consistency level to use for trace queries. */
public static final TypedDriverOption<String> REQUEST_TRACE_CONSISTENCY =
new TypedDriverOption<>(DefaultDriverOption.REQUEST_TRACE_CONSISTENCY, GenericType.STRING);
/** Whether or not to publish aggregable histogram for metrics */
public static final TypedDriverOption<Boolean> METRICS_GENERATE_AGGREGABLE_HISTOGRAMS =
new TypedDriverOption<>(
DefaultDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS, GenericType.BOOLEAN);
/** List of enabled session-level metrics. */
public static final TypedDriverOption<List<String>> METRICS_SESSION_ENABLED =
new TypedDriverOption<>(
@@ -409,6 +413,12 @@ public String toString() {
new TypedDriverOption<>(
DefaultDriverOption.METRICS_SESSION_CQL_REQUESTS_SLO,
GenericType.listOf(GenericType.DURATION));
/** Optional pre-defined percentile of cql requests to publish, as a list of percentiles . */
public static final TypedDriverOption<List<Double>>
METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES =
new TypedDriverOption<>(
DefaultDriverOption.METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES,
GenericType.listOf(GenericType.DOUBLE));
/**
* The number of significant decimal digits to which internal structures will maintain for
* requests.
@@ -433,6 +443,12 @@ public String toString() {
new TypedDriverOption<>(
DefaultDriverOption.METRICS_SESSION_THROTTLING_SLO,
GenericType.listOf(GenericType.DURATION));
/** Optional pre-defined percentile of throttling delay to publish, as a list of percentiles . */
public static final TypedDriverOption<List<Double>>
METRICS_SESSION_THROTTLING_PUBLISH_PERCENTILES =
new TypedDriverOption<>(
DefaultDriverOption.METRICS_SESSION_THROTTLING_PUBLISH_PERCENTILES,
GenericType.listOf(GenericType.DOUBLE));
/**
* The number of significant decimal digits to which internal structures will maintain for
* throttling.
@@ -457,6 +473,12 @@ public String toString() {
new TypedDriverOption<>(
DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_SLO,
GenericType.listOf(GenericType.DURATION));
/** Optional pre-defined percentile of node cql messages to publish, as a list of percentiles . */
public static final TypedDriverOption<List<Double>>
METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES =
new TypedDriverOption<>(
DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES,
GenericType.listOf(GenericType.DOUBLE));
/**
* The number of significant decimal digits to which internal structures will maintain for
* requests.
@@ -700,6 +722,15 @@ public String toString() {
new TypedDriverOption<>(
DseDriverOption.CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_SLO,
GenericType.listOf(GenericType.DURATION));
/**
* Optional pre-defined percentile of continuous paging cql requests to publish, as a list of
* percentiles .
*/
public static final TypedDriverOption<List<Double>>
CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES =
new TypedDriverOption<>(
DseDriverOption.CONTINUOUS_PAGING_METRICS_SESSION_CQL_REQUESTS_PUBLISH_PERCENTILES,
GenericType.listOf(GenericType.DOUBLE));
/**
* The number of significant decimal digits to which internal structures will maintain for
* continuous requests.
@@ -774,6 +805,12 @@ public String toString() {
new TypedDriverOption<>(
DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_SLO,
GenericType.listOf(GenericType.DURATION));
/** Optional pre-defined percentile of graph requests to publish, as a list of percentiles . */
public static final TypedDriverOption<List<Double>>
METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES =
new TypedDriverOption<>(
DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES,
GenericType.listOf(GenericType.DOUBLE));
/**
* The number of significant decimal digits to which internal structures will maintain for graph
* requests.
@@ -798,6 +835,14 @@ public String toString() {
new TypedDriverOption<>(
DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_SLO,
GenericType.listOf(GenericType.DURATION));
/**
* Optional pre-defined percentile of node graph requests to publish, as a list of percentiles .
*/
public static final TypedDriverOption<List<Double>>
METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES =
new TypedDriverOption<>(
DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES,
GenericType.listOf(GenericType.DOUBLE));
/**
* The number of significant decimal digits to which internal structures will maintain for graph
* requests.
25 changes: 22 additions & 3 deletions core/src/main/resources/reference.conf
@@ -1434,6 +1434,16 @@ datastax-java-driver {
// prefix = "cassandra"
}

histograms {
# Adds histogram buckets used to generate aggregable percentile approximations in monitoring
# systems that have query facilities to do so (e.g. Prometheus histogram_quantile, Atlas percentiles).
#
# Required: no
# Modifiable at runtime: no
# Overridable in a profile: no
generate-aggregable = true
}

# The session-level metrics (all disabled by default).
#
# Required: yes
@@ -1526,7 +1536,7 @@ datastax-java-driver {
# Modifiable at runtime: no
# Overridable in a profile: no
cql-requests {

# The largest latency that we expect to record.
#
# This should be slightly higher than request.timeout (in theory, readings can't be higher
@@ -1569,15 +1579,19 @@ datastax-java-driver {
# time).
# Valid for: Dropwizard.
refresh-interval = 5 minutes

# An optional list of latencies to track as part of the application's service-level
# objectives (SLOs).
#
# If defined, the histogram is guaranteed to contain these boundaries alongside other
# buckets used to generate aggregable percentile approximations.
# Valid for: Micrometer.
// slo = [ 100 milliseconds, 500 milliseconds, 1 second ]


# An optional list of percentiles to be published by Micrometer. Produces an additional time series for each requested percentile.
# This percentile is computed locally, and so can't be aggregated with percentiles computed across other dimensions (e.g. in a different instance)
# Valid for: Micrometer.
// publish-percentiles = [ 0.75, 0.95, 0.99 ]
}

# Required: if the 'throttling.delay' metric is enabled, and Dropwizard or Micrometer is used.
@@ -1589,6 +1603,7 @@ datastax-java-driver {
significant-digits = 3
refresh-interval = 5 minutes
// slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
// publish-percentiles = [ 0.75, 0.95, 0.99 ]
}

# Required: if the 'continuous-cql-requests' metric is enabled, and Dropwizard or Micrometer
@@ -1601,6 +1616,7 @@ datastax-java-driver {
significant-digits = 3
refresh-interval = 5 minutes
// slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
// publish-percentiles = [ 0.75, 0.95, 0.99 ]
}

# Required: if the 'graph-requests' metric is enabled, and Dropwizard or Micrometer is used.
@@ -1612,6 +1628,7 @@ datastax-java-driver {
significant-digits = 3
refresh-interval = 5 minutes
// slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
// publish-percentiles = [ 0.75, 0.95, 0.99 ]
}
}
# The node-level metrics (all disabled by default).
@@ -1776,6 +1793,7 @@ datastax-java-driver {
significant-digits = 3
refresh-interval = 5 minutes
// slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
// publish-percentiles = [ 0.75, 0.95, 0.99 ]
}

# See graph-requests in the `session` section
@@ -1789,6 +1807,7 @@ datastax-java-driver {
significant-digits = 3
refresh-interval = 5 minutes
// slo = [ 100 milliseconds, 500 milliseconds, 1 second ]
// publish-percentiles = [ 0.75, 0.95, 0.99 ]
}

# The time after which the node level metrics will be evicted.
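
As an illustration of the options this file adds, an application-side override might look like the fragment below. The option paths are taken from this diff; publishing only the 0.95 percentile and turning the aggregable histograms off are example choices, not recommendations from the commit. The fragment also assumes the corresponding metrics are already enabled and that Micrometer is the metrics backend, since the comments above mark these options as valid for Micrometer only.

```
datastax-java-driver.advanced.metrics {
  # Stop publishing the histogram buckets (the default added by this commit is true).
  histograms.generate-aggregable = false

  session {
    # One extra time series per listed percentile, computed locally.
    cql-requests.publish-percentiles = [ 0.95 ]
    throttling.delay.publish-percentiles = [ 0.95 ]
  }

  node {
    cql-messages.publish-percentiles = [ 0.95 ]
  }
}
```

Because these percentiles are computed inside each client, they cannot be combined across instances afterwards; keeping generate-aggregable at true remains the option for server-side aggregation.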
@@ -15,7 +15,9 @@
  */
 package com.datastax.oss.driver.internal.metrics.micrometer;

+import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
 import com.datastax.oss.driver.api.core.config.DriverExecutionProfile;
+import com.datastax.oss.driver.api.core.config.DriverOption;
 import com.datastax.oss.driver.internal.core.context.InternalDriverContext;
 import com.datastax.oss.driver.internal.core.metrics.AbstractMetricUpdater;
 import com.datastax.oss.driver.internal.core.metrics.MetricId;
@@ -27,6 +29,7 @@
 import io.micrometer.core.instrument.MeterRegistry;
 import io.micrometer.core.instrument.Tag;
 import io.micrometer.core.instrument.Timer;
+import java.util.List;
 import java.util.Set;
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.ConcurrentMap;
@@ -151,12 +154,31 @@ protected Timer getOrCreateTimerFor(MetricT metric) {
   }

   protected Timer.Builder configureTimer(Timer.Builder builder, MetricT metric, MetricId id) {
-    return builder.publishPercentileHistogram();
+    DriverExecutionProfile profile = context.getConfig().getDefaultProfile();
+    if (profile.getBoolean(DefaultDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS)) {
+      builder.publishPercentileHistogram();
+    }
+    return builder;
   }

   @SuppressWarnings("unused")
   protected DistributionSummary.Builder configureDistributionSummary(
       DistributionSummary.Builder builder, MetricT metric, MetricId id) {
-    return builder.publishPercentileHistogram();
+    DriverExecutionProfile profile = context.getConfig().getDefaultProfile();
+    if (profile.getBoolean(DefaultDriverOption.METRICS_GENERATE_AGGREGABLE_HISTOGRAMS)) {
+      builder.publishPercentileHistogram();
+    }
+    return builder;
   }
+
+  static double[] toDoubleArray(List<Double> doubleList) {
+    return doubleList.stream().mapToDouble(Double::doubleValue).toArray();
+  }
+
+  static void configurePercentilesPublishIfDefined(
+      Timer.Builder builder, DriverExecutionProfile profile, DriverOption driverOption) {
+    if (profile.isDefined(driverOption)) {
+      builder.publishPercentiles(toDoubleArray(profile.getDoubleList(driverOption)));
+    }
+  }
}
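
The pattern behind the two helpers added above (convert the configured list to a primitive array, and only do so when the option is defined) can be tried outside the driver. In this minimal sketch, toDoubleArray is copied from the diff, while the Map-based lookup and the percentilesIfDefined method are hypothetical stand-ins for DriverExecutionProfile.isDefined and getDoubleList; no Micrometer or driver classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class PercentileConfigSketch {
  // Same conversion as the toDoubleArray helper in the diff.
  static double[] toDoubleArray(List<Double> doubleList) {
    return doubleList.stream().mapToDouble(Double::doubleValue).toArray();
  }

  // Hypothetical stand-in for the profile lookup: returns the percentiles
  // only when the option path is present in the configuration.
  static Optional<double[]> percentilesIfDefined(
      Map<String, List<Double>> config, String path) {
    List<Double> values = config.get(path);
    return values == null ? Optional.empty() : Optional.of(toDoubleArray(values));
  }

  public static void main(String[] args) {
    String path = "advanced.metrics.session.cql-requests.publish-percentiles";
    Map<String, List<Double>> config = Map.of(path, List.of(0.75, 0.95, 0.99));

    // Defined option: the array would be handed to Timer.Builder.publishPercentiles.
    percentilesIfDefined(config, path)
        .ifPresent(p -> System.out.println(Arrays.toString(p))); // [0.75, 0.95, 0.99]

    // Undefined option: nothing is configured, so the builder is left untouched.
    System.out.println(percentilesIfDefined(config, "not.defined").isPresent()); // false
  }
}
```

Leaving the builder untouched when the option is absent is what keeps the new behavior opt-in: no percentile series appear unless a publish-percentiles list is configured.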
@@ -96,9 +96,9 @@ protected void cancelMetricsExpirationTimeout() {
   @Override
   protected Timer.Builder configureTimer(Timer.Builder builder, NodeMetric metric, MetricId id) {
     DriverExecutionProfile profile = context.getConfig().getDefaultProfile();
+    super.configureTimer(builder, metric, id);
     if (metric == DefaultNodeMetric.CQL_MESSAGES) {
-      return builder
-          .publishPercentileHistogram()
+      builder
           .minimumExpectedValue(
               profile.getDuration(DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_LOWEST))
           .maximumExpectedValue(
@@ -111,9 +111,11 @@ protected Timer.Builder configureTimer(Timer.Builder builder, NodeMetric metric,
                   : null)
           .percentilePrecision(
               profile.getInt(DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_DIGITS));
+
+      configurePercentilesPublishIfDefined(
+          builder, profile, DefaultDriverOption.METRICS_NODE_CQL_MESSAGES_PUBLISH_PERCENTILES);
     } else if (metric == DseNodeMetric.GRAPH_MESSAGES) {
-      return builder
-          .publishPercentileHistogram()
+      builder
           .minimumExpectedValue(
               profile.getDuration(DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_LOWEST))
           .maximumExpectedValue(
@@ -125,7 +127,10 @@ protected Timer.Builder configureTimer(Timer.Builder builder, NodeMetric metric,
                   .toArray(new Duration[0])
                   : null)
           .percentilePrecision(profile.getInt(DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_DIGITS));
+
+      configurePercentilesPublishIfDefined(
+          builder, profile, DseDriverOption.METRICS_NODE_GRAPH_MESSAGES_PUBLISH_PERCENTILES);
     }
-    return super.configureTimer(builder, metric, id);
+    return builder;
   }
 }
