This repository has been archived by the owner on Apr 18, 2023. It is now read-only.

docs inconsistency: Histograms doesn't include grpc_code #75

Open
NBR41 opened this issue Apr 9, 2019 · 8 comments

NBR41 commented Apr 9, 2019

Hi,

Unlike what the README documentation says, it appears that grpc_code is not included as a label in the histogram.

grpc_server_handling_seconds_bucket - contains the counts of RPCs by status and method in respective handling-time buckets. These buckets can be used by Prometheus to estimate SLAs (see here)
// EnableHandlingTimeHistogram enables histograms being registered when
// registering the ServerMetrics on a Prometheus registry. Histograms can be
// expensive on Prometheus servers. It takes options to configure histogram
// options such as the defined buckets.
func (m *ServerMetrics) EnableHandlingTimeHistogram(opts ...HistogramOption) {
	for _, o := range opts {
		o(&m.serverHandledHistogramOpts)
	}
	if !m.serverHandledHistogramEnabled {
		m.serverHandledHistogram = prom.NewHistogramVec(
			m.serverHandledHistogramOpts,
			[]string{"grpc_type", "grpc_service", "grpc_method"},
		)
	}
	m.serverHandledHistogramEnabled = true
}

The grpc_code is not among the predefined labels.

Is there a way to add grpc_code, or maybe it would be nice to correct the README?

Thanks
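
For reference, a possible workaround (only a sketch using the plain prometheus and grpc-go APIs, not something this library provides) would be to register a separate histogram that does carry a grpc_code label and fill it from a custom interceptor. The metric name below is made up for illustration.

package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// handlingSecondsByCode is a user-defined histogram that includes grpc_code,
// registered alongside (not inside) the library's ServerMetrics.
var handlingSecondsByCode = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "myapp_grpc_server_handling_seconds", // hypothetical metric name
		Help:    "Handling time of gRPC calls, labelled by method and status code.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"grpc_method", "grpc_code"},
)

func init() {
	prometheus.MustRegister(handlingSecondsByCode)
}

// UnaryCodeHistogramInterceptor observes per-code handling time for unary RPCs.
func UnaryCodeHistogramInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	code := status.Code(err) // codes.OK when err is nil
	handlingSecondsByCode.
		WithLabelValues(info.FullMethod, code.String()).
		Observe(time.Since(start).Seconds())
	return resp, err
}

It could be chained via grpc.UnaryInterceptor (or grpc.ChainUnaryInterceptor) together with this library's own interceptor.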

brancz (Collaborator) commented Apr 9, 2019

It'd be nice to hear @bwplotka's opinion as well, but I feel having the grpc_code dimension here would be too high cardinality for too little gain.

bwplotka (Collaborator) commented Apr 9, 2019

Mixed feelings here as well, as cardinality might be a problem for almost no gain. On the other hand, IMO knowing the latency of error requests is useful, but I'm not sure whether the exact timing differences between, say, 202 and 200, or 504 and 500, matter... Probably they matter in some specific cases (:

Also, adding grpc_code here would duplicate things like grpc_server_handled_total, since the histogram would then give you that as well.

Do you need this @NBR41, or is the mismatch between docs and implementation what you want to fix? I think a PR to fix the docs is a must-have anyway.

brancz (Collaborator) commented Apr 9, 2019

Do you need this @NBR41, or is the mismatch between docs and implementation what you want to fix? I think a PR to fix the docs is a must-have anyway.

Agreed.

NBR41 (Author) commented Apr 10, 2019

No, I don't have this particular need for now.
I was interested in testing it because it's in the doc.
My point was just to correct the doc to avoid potential future issues on this subject :)

bwplotka added the bug label and removed the enhancement label Apr 10, 2019
bwplotka (Collaborator) commented Apr 10, 2019

Cool, let's correct the docs for now, thank you for finding this! Marking as a docs bug.

bwplotka changed the title from "Histograms doesn't include grpc_code" to "docs inconsistency: Histograms doesn't include grpc_code" Apr 10, 2019
mkmik commented Sep 7, 2020

I think it's not unreasonable for succeeded and failed calls to behave quite differently; in other words, failed calls are often outliers, latency-wise. For example, when a deadline is exceeded somewhere, failed calls will tend to be vastly slower than regular calls. In other cases, errors will be detected early (e.g. bad requests, failed preconditions), thus artificially skewing the distribution towards a fat head.

A more specific example: if I see that the 95th percentile duration for a call is 1s, does this mean that it's sometimes slow but serving OK, or that those slow requests are the ones that hit some deadline-exceeded errors downstream?

Looking at the error ratio with grpc_server_handled_total can give me a hint, but it alone cannot fully answer that, since the method could perform multiple downstream calls, some of which could have a very short deadline and thus cause the outer method to be counted in a small bucket with its grpc_code="deadline_exceeded".

EDIT: I'm not very concerned about the cardinality increase; after all, the number of possible gRPC error codes is bounded (unlike the method names, for which we already have a label). But knowing whether it's "OK" vs "other" would already be useful.

ashu82492 commented Apr 13, 2022

IMO it should have a label with SUCCESS/FAILURE values, at least on grpc_server_handling_seconds_bucket. It would allow tracking the latencies of SUCCESS calls, as in the sketch below.
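
For illustration, a helper along these lines (a sketch using grpc-go's status and codes packages, not part of this library) could feed such a label from a custom interceptor instead of the full grpc_code, keeping the extra cardinality at exactly two values.

package main

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// outcomeLabel collapses any gRPC status into the two suggested label values.
func outcomeLabel(err error) string {
	if status.Code(err) == codes.OK { // status.Code(nil) == codes.OK
		return "SUCCESS"
	}
	return "FAILURE"
}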

gmolau commented May 19, 2022

I'm also a bit surprised by this issue; for golden signals it is usually recommended to analyze latency separately for ok/error:

It’s important to distinguish between the latency of successful requests and the latency of failed requests.

The cardinality argument doesn't seem very strong given that there are only a small number of status codes. I would appreciate it if this could be fixed.
