This repository has been archived by the owner on Apr 18, 2023. It is now read-only.

docs inconsistency: Histograms doesn't include grpc_code #75

Open
NBR41 opened this issue Apr 9, 2019 · 8 comments

NBR41 commented Apr 9, 2019

Hi,

Unlike what the README documentation says, it appears that grpc_code is not included as a label in the histogram.

grpc_server_handling_seconds_bucket - contains the counts of RPCs by status and method in respective handling-time buckets. These buckets can be used by Prometheus to estimate SLAs (see here)
// EnableHandlingTimeHistogram enables histograms being registered when
// registering the ServerMetrics on a Prometheus registry. Histograms can be
// expensive on Prometheus servers. It takes options to configure histogram
// options such as the defined buckets.
func (m *ServerMetrics) EnableHandlingTimeHistogram(opts ...HistogramOption) {
	for _, o := range opts {
		o(&m.serverHandledHistogramOpts)
	}
	if !m.serverHandledHistogramEnabled {
		m.serverHandledHistogram = prom.NewHistogramVec(
			m.serverHandledHistogramOpts,
			[]string{"grpc_type", "grpc_service", "grpc_method"},
		)
	}
	m.serverHandledHistogramEnabled = true
}

The grpc_code is not among the predefined labels.

Is there a way to add grpc_code, or maybe it would be nice to correct the README?

Thanks
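
For reference, a possible workaround (only a sketch using the plain prometheus and grpc-go APIs, not something this library provides) would be to register a separate histogram that does carry a grpc_code label and fill it from a custom interceptor. The metric name below is made up for illustration.

package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// handlingSecondsByCode is a user-defined histogram that includes grpc_code,
// registered alongside (not inside) the library's ServerMetrics.
var handlingSecondsByCode = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "myapp_grpc_server_handling_seconds", // hypothetical metric name
		Help:    "Handling time of gRPC calls, labelled by method and status code.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"grpc_method", "grpc_code"},
)

func init() {
	prometheus.MustRegister(handlingSecondsByCode)
}

// UnaryCodeHistogramInterceptor observes per-code handling time for unary RPCs.
func UnaryCodeHistogramInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	code := status.Code(err) // codes.OK when err is nil
	handlingSecondsByCode.
		WithLabelValues(info.FullMethod, code.String()).
		Observe(time.Since(start).Seconds())
	return resp, err
}

It could be chained via grpc.UnaryInterceptor (or grpc.ChainUnaryInterceptor) together with this library's own interceptor.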

brancz (Collaborator) commented Apr 9, 2019

It'd be nice to hear @bwplotka's opinion as well, but I feel having the grpc_code dimension here would be too high cardinality for too little gain.

bwplotka (Collaborator) commented Apr 9, 2019

Mixed feelings here as well, as cardinality might be a problem for almost no gain. On the other hand, IMO knowing the latency of error requests is useful, but I'm not sure whether the exact timing differences between, say, 202 and 200, or 504 and 500, matter... Probably they matter in some specific cases (:

Also, adding grpc_code here would duplicate things like grpc_server_handled_total, since the histogram would then give you that as well.

Do you need this @NBR41, or is the mismatch between docs and implementation what you want to fix? I think a PR to fix the docs is a must-have anyway.

brancz (Collaborator) commented Apr 9, 2019

Do you need this @NBR41, or is the mismatch between docs and implementation what you want to fix? I think a PR to fix the docs is a must-have anyway.

Agreed.

NBR41 (Author) commented Apr 10, 2019

No, I don't have this particular need for now.
I was interested in testing it because it's in the doc.
My point was just to correct the doc to avoid potential future issues on this subject :)

bwplotka added the bug label and removed the enhancement label Apr 10, 2019
bwplotka (Collaborator) commented Apr 10, 2019

Cool, let's correct the docs for now, thank you for finding this! Marking as a docs bug.

bwplotka changed the title from "Histograms doesn't include grpc_code" to "docs inconsistency: Histograms doesn't include grpc_code" Apr 10, 2019
mkmik commented Sep 7, 2020

I think it's not unreasonable for succeeded and failed calls to behave quite differently; in other words, failed calls are often outliers, latency-wise. For example, when a deadline is exceeded somewhere, failed calls will tend to be vastly slower than regular calls. In other cases, errors will be detected early (e.g. bad requests, failed preconditions), thus artificially skewing the distribution towards a fat head.

A more specific example: if I see that the 95th percentile duration for a call is 1s, does this mean that it's sometimes slow but serving OK, or that those slow requests are the ones that hit some deadline-exceeded errors downstream?

Looking at the error ratio with grpc_server_handled_total can give me a hint, but it alone cannot fully answer that, since the method could perform multiple downstream calls, some of which could have a very short deadline and thus cause the outer method to be counted in a small bucket with its grpc_code="deadline_exceeded".

EDIT: I'm not very concerned about the cardinality increase; after all, the number of possible gRPC error codes is bounded (unlike the method names, for which we already have a label). But knowing whether it's "OK" vs "other" would already be useful.

ashu82492 commented Apr 13, 2022

IMO it should have a label with SUCCESS/FAILURE values, at least on grpc_server_handling_seconds_bucket. It would allow tracking the latencies of SUCCESS calls, as in the sketch below.
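
For illustration, a helper along these lines (a sketch using grpc-go's status and codes packages, not part of this library) could feed such a label from a custom interceptor instead of the full grpc_code, keeping the extra cardinality at exactly two values.

package main

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// outcomeLabel collapses any gRPC status into the two suggested label values.
func outcomeLabel(err error) string {
	if status.Code(err) == codes.OK { // status.Code(nil) == codes.OK
		return "SUCCESS"
	}
	return "FAILURE"
}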

gmolau commented May 19, 2022

I'm also a bit surprised by this issue; for golden signals it is usually recommended to analyze latency separately for ok/error:

It’s important to distinguish between the latency of successful requests and the latency of failed requests.

The cardinality argument doesn't seem very strong given that there are only a small number of status codes. I would appreciate it if this could be fixed.
