Showing metrics deviation(p99, errors) while listing services/endpoints #1102

Open
subintp opened this issue Aug 31, 2021 · 2 comments

@subintp

subintp commented Aug 31, 2021

Use Case

In the microservice world, when a customer reports an issue related to errors, degradation, or latency, we start debugging by asking the following questions:

  • Which services' error rates spiked in the given time range?
  • Which endpoints degraded in the given time range?

We can identify the deviation in errors or latency by going to the respective service or endpoint overview dashboard and checking the patterns in the errors or latency graph. This workflow does not scale to a large number of services and dependencies.

Proposal

Add metric deviations (p99 latency, error rate) to the service and endpoint lists.

Screenshot 2021-09-01 at 1 10 30 AM

@kotharironak

kotharironak commented Nov 8, 2021

I think the requirement here is a column showing the change in latency (or errors) relative to the prior hour when the time-range dropdown is set to 1 hour.

Currently, most of the attributes are calculated at ingestion time. Doing this at ingestion time would be complex because we need information from the prior hour (a predefined window), and our view-gen is currently stateless. It would also be limited to a set of predefined comparison windows (say, 15 or 30 minutes).

So this seems better suited to query time. I think we would need to fire two queries, one per time window (one for the current hour and one for the prior hour), and compute the value of that attribute from the two results. Where should we do this: in the query service or the gateway service?
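To make the two-query idea concrete, here is a minimal sketch in Python. It is not taken from the query service or gateway service code: `fetch` is a hypothetical callable standing in for whichever service runs the aggregation, and the metric names in the result shape are assumptions for illustration only.

```python
from datetime import datetime
from typing import Callable, Dict, Optional

# Assumed result shape: service name -> metric name -> aggregated value.
Metrics = Dict[str, Dict[str, float]]
# Hypothetical query function: (start, end) -> aggregated metrics for that range.
Fetch = Callable[[datetime, datetime], Metrics]


def percent_change(current: float, previous: Optional[float]) -> Optional[float]:
    """Relative change in percent; None when there is no usable baseline."""
    if previous is None or previous == 0:
        return None
    return (current - previous) / previous * 100.0


def metrics_with_deltas(fetch: Fetch, start: datetime, end: datetime) -> Dict[str, Dict[str, dict]]:
    """Run one query for the selected window and one for the window just before it."""
    window = end - start
    current = fetch(start, end)
    previous = fetch(start - window, start)

    result: Dict[str, Dict[str, dict]] = {}
    for service, values in current.items():
        baseline = previous.get(service, {})
        result[service] = {
            name: {
                "value": value,
                "delta_pct": percent_change(value, baseline.get(name)),
            }
            for name, value in values.items()
        }
    return result
```

A service that has no data in the prior window, or whose prior value is zero, gets a delta of `None` rather than a misleading number; how the UI should render that case is left open here.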

Do we also have to support order by on such a column? @jayesh, can you think of any other way to capture this requirement in the UI?

@aaron-steinfeld do you have any thoughts on this?

@aaron-steinfeld
Contributor

> Currently, most of the attributes are calculated at ingestion time. Doing this at ingestion time will be complex as we need the information of the prior hour (predefined window) and currently, our view-gen is stateless.

Metrics are calculated at read time at a service (or any aggregate) level. Only individual span values are calculated at ingestion time.

The tricky bit is basically what you said: any delta is defined by two time ranges, the current and the comparison. (I think there might be some work going on for deltas elsewhere; @jake-bassett, are you aware of any?) Sometimes the previous window makes sense, but that's really use case driven. For example, if I'm looking at the past hour and this issue has been happening for 2 hours, the prior hour is far less useful to me than the same hour yesterday. So new controls would likely be needed, which introduces more complexity - one of the reasons we've abandoned efforts like this in the past.
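As a rough illustration of the "two time ranges" point, the sketch below derives the comparison range from the current range and a chosen mode. The `Comparison` enum and `comparison_range` function are hypothetical names for this example, not existing controls or APIs in the project.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Tuple


class Comparison(Enum):
    """Hypothetical comparison modes a new UI control might offer."""
    PREVIOUS_WINDOW = "previous_window"
    SAME_WINDOW_YESTERDAY = "same_window_yesterday"


def comparison_range(start: datetime, end: datetime, mode: Comparison) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the range to compare the current range against."""
    if end <= start:
        raise ValueError("end must be after start")
    if mode is Comparison.PREVIOUS_WINDOW:
        shift = end - start          # the window immediately before the current one
    elif mode is Comparison.SAME_WINDOW_YESTERDAY:
        shift = timedelta(days=1)    # the same clock window, one day earlier
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return start - shift, end - shift
```

In the two-hour incident example, looking at 11:00 to 12:00 with `PREVIOUS_WINDOW` compares against 10:00 to 11:00, which is already degraded, while `SAME_WINDOW_YESTERDAY` compares against a presumably healthy baseline.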

As far as order by support - if we compute the delta client side, like I was assuming, we wouldn't have support for order by (we could probably hack it in for the current page of data, but I'd argue against the inconsistency). If we compute the delta server side, that's a more significant change, and I guess the answer there would be - depends on how we introduce that support.
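The inconsistency with sorting only the current page can be shown with a small sketch. The row shape and function names below are made up for illustration; they only contrast sorting after pagination with sorting before it.

```python
from typing import List, Tuple

Row = Tuple[str, float]  # (service name, p99 delta in percent)


def sort_current_page(rows: List[Row], limit: int) -> List[Row]:
    """Client-side hack: the page is fetched in the server's order, then sorted locally."""
    page = rows[:limit]
    return sorted(page, key=lambda row: row[1], reverse=True)


def sort_all_then_page(rows: List[Row], limit: int) -> List[Row]:
    """Server-side order by: sort the full result set, then take the first page."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]
```

With rows `a: 5, b: 80, c: 12, d: 95` in the server's default order and a page size of 2, the client-side version shows `b, a`, while a true order by would show `d, b`. The service with the largest deviation never appears on the first page in the client-side case.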

3 participants