Collect metrics about CHT endpoint performance #60

Closed
jkuester opened this issue Jun 12, 2023 · 8 comments

@jkuester
Collaborator

From @garethbowen's Slack message:

I came across this express.js prometheus middleware that collects response status and duration information which might be a really clean way to monitor performance of certain endpoints (eg: replication) without doing much bespoke development.
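
For illustration, here is a minimal sketch of how middleware like this is typically wired into an Express app (assuming the joao-fontenele package evaluated below; the route and bucket values are made up, not actual medic-api code):

```js
const express = require('express');
const promMid = require('express-prometheus-middleware');

const app = express();

// Records status codes and request durations for every route and
// exposes them in Prometheus format on /metrics.
app.use(promMid({
  metricsPath: '/metrics',
  collectDefaultMetrics: true,              // also export prom-client process/GC metrics
  requestDurationBuckets: [0.1, 0.5, 1, 5], // histogram buckets, in seconds
}));

// Hypothetical endpoint standing in for something like replication.
app.get('/api/example', (req, res) => res.json({ ok: true }));

app.listen(3000);
```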

@kennsippell
Member

kennsippell commented Jun 14, 2023

This does look clean. Is there also a plan to bring log data into Grafana (via Loki or similar)? If that work is already being done, the response_time and status are in the medic-api logs, and it may be straightforward to bring those into Grafana. Does anybody have experience with these tools, or a preference for the approach taken?

@jkuester
Collaborator Author

We have had some long-term discussions about the possibility of bringing in log data (via something like Loki)! It is not at the top of the priority list, but it is definitely something that would be valuable and that folks have already shown interest in.

@garethbowen
Member

We definitely want to improve logging, particularly for self-hosted installs, which may mean Loki+Grafana. The downside is that it exposes PHI where it wasn't exposed before, so it may not be the right approach. I've heard mixed feedback about it, but interestingly MoH Kenya are using it for other services and like it, so they might be the ideal candidate for evaluating our approach.

Looking at one of the demo videos for Loki, it can pull out response times and failure rates for different endpoints, so it would also do what we need. It's definitely worth investigating, along with any other options that do the same thing.

@kennsippell
Member

kennsippell commented Jun 27, 2023

Looking at a few options. I welcome thoughts and feedback on these findings. I'm new to the Grafana stack, so it's quite possible there is more to be considered here.

joao-fontenele/express-prometheus-middleware

https://github.com/joao-fontenele/express-prometheus-middleware

  • I fear the package may be end-of-life. I can't find any official announcement, but the repository has been archived and there have been no updates in 2 years. I asked about the project's future here and await a response.
  • This uses the response-time library for endpoint performance measurement. It records "... the elapsed time from when a request enters this middleware to when the headers are written". I believe this is not the metric we want: the time reported for _changes requests is often the heartbeat time, so this wouldn't meaningfully monitor the performance of offline user replication requests.
  • Monitors endpoints by path (unlike prometheus-api-metrics, which uses the Express route name). Has a useful extraMasks interface for custom grouping and differentiation of endpoints (see the sketch after this list). We need this to differentiate similar routes (like offline _changes requests vs longpoll) and to group requests (like /medic/resources/* or /medic/{uuid}).
  • The importable dashboard is definitely the best available - even just its presentation of the prom-client standard metrics.
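
A minimal sketch of the extraMasks option, going by the package's documented behavior of replacing matched path chunks with a #val placeholder (the regexes are illustrative guesses, not a worked-out CHT config):

```js
const express = require('express');
const promMid = require('express-prometheus-middleware');

const app = express();

// Collapse high-cardinality path segments so /medic/<uuid> requests are
// reported as a single /medic/#val route instead of one route per doc.
app.use(promMid({
  metricsPath: '/metrics',
  extraMasks: [
    /^[0-9a-f]{32}$/,                                               // CouchDB-style doc ids
    /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/, // RFC 4122 UUIDs
  ],
}));
```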

matteodisabatino/express-prometheus-middleware

https://github.com/matteodisabatino/express-prometheus-middleware

The official option listed on the Grafana site. Very basic, with poor extensibility and no segmentation of metrics by endpoint at all. I don't think we should consider this.

prometheus-api-metrics

https://github.com/PayU/prometheus-api-metrics

  • Stewarded by PayU.com, a global payment provider.
  • This measures endpoint performance via the total_time from request to response finish, so it accurately reports offline user _changes requests. I believe this is what we want.
  • Routes are grouped by the Express route (e.g. /+medic/+:docId/{0,}) and not the path (like express-prometheus-middleware). I like it, but there are some issues with CHT requests (e.g. resources are not properly grouped), and there is no support for differentiating routes or custom grouping. I personally feel it is really important to monitor longpoll _changes requests differently from offline user _changes requests during replication.
  • The premade dashboards are decent for GC stuff but seem much worse than joao-fontenele's for request duration. I can try to combine them.
  • Good support for custom metrics through prom-client, so we could have custom metrics like yield or replication yield as first-class citizens in CHT monitoring (see the sketch after this list).
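
For concreteness, a sketch of what adoption could look like, with a hypothetical replication-yield counter registered through prom-client (the metric and route names are made up for illustration, and this assumes a shared prom-client registry):

```js
const express = require('express');
const apiMetrics = require('prometheus-api-metrics');
const client = require('prom-client');

const app = express();

// Per-route request duration (measured to response finish) plus
// default process metrics, exposed on /metrics.
app.use(apiMetrics());

// Hypothetical custom metric: docs returned during replication.
const replicationDocs = new client.Counter({
  name: 'cht_replication_docs_total',
  help: 'Docs returned to offline users during replication',
});

app.get('/example/replication', (req, res) => {
  const docs = [];                  // stand-in for the real doc lookup
  replicationDocs.inc(docs.length); // count docs alongside the built-in metrics
  res.json(docs);
});
```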

prom-client

https://github.com/siimon/prom-client

  • Under the hood, all of these projects use prom-client.
  • It provides a lot of process/GC metrics by default (see the sketch after this list).
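
A minimal sketch of using prom-client directly, which is essentially what a custom implementation (option 2 below) would build on; the metric name and buckets are illustrative:

```js
const express = require('express');
const client = require('prom-client');

const app = express();

// The standard process/GC/event-loop metrics mentioned above.
client.collectDefaultMetrics();

// A custom histogram lives in the same default registry.
const duration = new client.Histogram({
  name: 'cht_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 5, 30],
});

app.get('/example', (req, res) => {
  const end = duration.startTimer({ method: 'GET', route: '/example' });
  res.json({ ok: true });
  end({ status_code: res.statusCode }); // records the observation
});

// Expose the registry for Prometheus to scrape
// (register.metrics() is async in recent prom-client versions).
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});
```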

Recommendation

Short-List of Tooling Options

  1. Use prometheus-api-metrics and contribute a feature to improve route differentiation/grouping
  2. Custom implementation based on prom-client
  3. Implementation based on logs (not described)

Recommendations

  • Look into contributing the needed feature to prometheus-api-metrics.
  • Make CHT-specific dashboards forked from express-prometheus-middleware dashboards (regardless of tooling option chosen).
  • Monitor endpoint performance based on total_time (last byte), not response_time (first byte); a sketch illustrating the difference follows this list.
  • Avoid joao-fontenele: the project appears to be end-of-life, and its endpoint performance measurement is time-to-first-byte only.
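
To make that distinction concrete: the response-time library stops its timer when headers are written (first byte), while listening for the response 'finish' event measures through the last byte. A hand-rolled sketch of the latter (not code from either library):

```js
const express = require('express');
const app = express();

app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  // 'finish' fires when the whole response has been handed off, so a
  // long-running _changes-style request is measured in full, whereas a
  // headers-written timer would stop at the first heartbeat byte.
  res.on('finish', () => {
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    console.log(`${req.method} ${req.originalUrl} took ${seconds.toFixed(3)}s`);
  });
  next();
});
```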

@kennsippell self-assigned this Jun 27, 2023
@garethbowen
Member

Great write up, thanks!

One thing to consider: assuming this requires a cht-core api code change, it'll presumably be released in 4.3.0 and not backported to previous versions. If so, the endpoints to monitor will change completely with the new replication algorithm. This means the _changes endpoint likely no longer needs to be monitored, which in turn means the issues you've cited with longpoll, heartbeat, etc. should not be a problem.

@kennsippell
Member

kennsippell commented Jun 27, 2023

True. My thinking was based on 4.2.

  1. So there is no heartbeat or anything similarly affecting time-to-first-byte in 4.3 replication requests?
  2. How do you feel about adding a dependency on express-prometheus-middleware, an archived and likely unmaintained project?
  3. Is there a preference for monitoring time-to-first-byte vs last-byte?

@dianabarsan
Member

> there is no heartbeat or anything similarly affecting time-to-first-byte in 4.3 replication requests

This is correct: the _changes endpoint will no longer be used as of 4.3. We still need endpoint filtering; even though we're changing the protocol, we will still have at least one resource-intensive endpoint. But these endpoints are less likely to be shared between use-cases (there won't be a "longpoll" variant to be ignored, for example).

I trust PayU; they are very big in Europe.

@kennsippell
Member

kennsippell commented Jun 29, 2023

I like PayU's focus on Apdex. Pretty cool, but it may take some thinking and data to calibrate for us (see the worked example below).
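
For reference, the standard Apdex formula counts requests completing within a threshold T as satisfied, and those within 4T as tolerating at half weight; the numbers below are invented purely to show what calibration involves:

```js
// Apdex_T = (satisfied + tolerating / 2) / total
const T = 0.5;          // threshold in seconds; this is the value we'd have to calibrate
const satisfied = 900;  // requests finishing within T
const tolerating = 80;  // requests finishing within 4 * T
const total = 1000;     // all sampled requests
const apdex = (satisfied + tolerating / 2) / total; // => 0.94
```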

With the PayU lib, we get a bunch of noisy reporting because of our default '*' route. I'm seeing unique routes in Grafana for each unique path handled by this route. The routes _local/{uuid} and resources are the noisiest. It would be better for these to be grouped somehow.

I've made a PR which would group them all under a single route '*'. Here are the paths that I know of that would be grouped. Is there anything here we'd definitely want to look at individually (not all together in this single group)?

  • /medic/
  • /medic/resources/*
  • /medic/branding/*
  • /*-users-meta/
  • /*-users-meta/_local
  • /*-users-meta/_rev_diff
  • /*-users-meta/_design/
  • /*-users-meta/_bulk_docs
  • … probably more I don't know about


I'm also investigating a second issue that is causing a different pattern of noise in the Prometheus routes. This seems less critical, but it would be nice to fix.

The last issue I've found is that requests for static content (service-worker.js, runtime.js, etc.) are grouped together under a route called "N/A" because of our use of app.use(express.static()). I haven't filed a bug for this because I don't know exactly how I'd go about fixing it. I can't find anything else that triggers this N/A case, though it isn't guaranteed to be all static content.
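
A sketch of the likely mechanism (my assumption about how the route label is derived, not confirmed against the library's source): express.static() is plain middleware rather than a route handler, so req.route is never populated for the files it serves, and route-labelled metrics fall back to "N/A":

```js
const express = require('express');
const app = express();

// Files served here never hit a route handler, so req.route stays
// undefined and the metrics label degrades to "N/A".
app.use(express.static('build'));

// A real route handler populates req.route, giving a proper label.
app.get('/api/info', (req, res) => {
  console.log(req.route.path); // "/api/info"
  res.json({ ok: true });
});
```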
