Monitoring Grafana #3302

Closed
jaimegago opened this issue Nov 21, 2015 · 36 comments
Labels
help wanted prio/medium Important over the long term, but may not be staffed and/or may need multiple releases to complete. type/feature-request
Milestone

Comments

@jaimegago
Contributor

It's time to monitor the monitoring! It'd be great to have a /status or /health endpoint that returns Grafana health data as JSON.

Things I'd like to get from a status endpoint are:

  • configured sources are reachable (when I configure a new graphite source I can test the connection, I'd love to have that via the /status API)
  • DB is available
  • configured authorization sources are reachable
  • version

e.g:

/status

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

@torkelo torkelo added type/feature-request prio/medium Important over the long term, but may not be staffed and/or may need multiple releases to complete. help wanted labels Nov 21, 2015
@anryko
Contributor

anryko commented Nov 21, 2015

++

@kjedamzik
Contributor

👍

@torkelo
Member

torkelo commented Dec 8, 2015

Make sure the health URL does not generate sessions.

@mattttt
Contributor

mattttt commented Jan 8, 2016

👍

@williamjoy
Contributor

+1, this would be very useful for running Grafana behind a load balancer: the load balancer would call the /health endpoint to verify that Grafana returns HTTP 200 OK.

@theangryangel
Contributor

I've put together something dead simple, but I'm not particularly happy with it at the moment.

If anyone would like to take a look at current state vs master: master...theangryangel:feature/health_check

It returns something like:

{"current_timestamp":"2016-06-04T18:43:49+01:00","database_ok":true,"session_ok":true,"version":{"built":1464981754,"commit":"v3.0.4+158-g7cbaf06-dirty","version":"3.1.0"}}

For the database check I was originally returning some stats, but I've cut that out. I could switch the query to something much simpler like "select 1" and check that it doesn't error. Not sure if it's worth it.
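
For illustration only, the "select 1" idea can be sketched in a few lines of Python. This assumes an SQLite backend (one of the databases Grafana supports) and is not Grafana code; `database_ok` is a hypothetical helper name.

```python
import sqlite3


def database_ok(conn: sqlite3.Connection) -> bool:
    """Return True if a trivial query succeeds, False on any database error."""
    try:
        # "SELECT 1" touches no tables, so it only exercises the connection.
        conn.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(database_ok(conn))
```

The check reports a boolean rather than raising, which keeps the health handler itself from failing when the database does.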

I'm not particularly happy with the session check either. There doesn't seem to be an easy way to test it without standing up a test macaron server and recover()ing from the panic that it would throw when starting a session provider, or modifying macaron/session to add a test feature to each of the providers. As it is right now, it irritatingly returns a Set-Cookie header, which I don't particularly want. I'd appreciate some input on where to take this from someone more experienced with macaron 😞

Checking for data sources doesn't seem particularly sane to try through this given how grafana is written. Probably more sane to add to your regular monitoring system.

@wpt1313

wpt1313 commented Jun 10, 2016

I was facing the same issue, and as a workaround I use an API call from the load balancer with a dedicated authentication API key. I'm using HAProxy, which has a useful "hidden" feature for setting custom HTTP headers in option httpchk:

option httpchk GET /api/org HTTP/1.0\r\nAccept:\ application/json\r\nContent-Type:\ application/json\r\nAuthorization:\ Bearer\ your_api_key\r\n

(I need to use HTTP/1.0 rather than 1.1, since the latter requires setting Host header and I can't get it dynamically in HAProxy config).

/api/org seems to be the simplest request with little overhead and returns HTTP 200, which is exactly what the load balancer needs -- and does not create any new sessions.
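
For anyone who wants to run the same check from a script rather than from HAProxy, here is a rough Python equivalent. It is only a sketch: the base URL and API key are placeholders, and `build_health_request` and `is_up` are hypothetical helper names.

```python
import http.client
import urllib.request


def build_health_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the same GET /api/org request that the HAProxy check sends."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/org",
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="GET",
    )


def is_up(base_url: str, api_key: str, timeout: float = 2.0) -> bool:
    """Return True only if the instance answers the request with HTTP 200."""
    request = build_health_request(base_url, api_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts and non-2xx responses.
        return False
```

Usage would look like `is_up("http://grafana.example:3000", "your_api_key")`, with both values replaced by your own.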

@iceycake

iceycake commented Jul 7, 2016

Any progress or PR on this issue?

@tuxtek

tuxtek commented Sep 29, 2016

+1

@JorritSalverda

JorritSalverda commented Sep 29, 2016

I would split this into separate /liveness and /readiness endpoints, as is best practice in Kubernetes. /liveness only indicates whether Grafana itself is up and running; /readiness indicates whether it's ready to receive traffic and checks whether its dependencies are reachable.

In Kubernetes the liveness endpoint is probed, and when it fails a number of times to respond with 200 OK the container is killed and replaced with a new one. The readiness endpoint is used to make the container part of a service and send traffic its way, like adding it to and removing it from a load balancer.
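
To make the split concrete, here is a minimal Python sketch of the two handlers as plain functions. It is not Grafana code: the function names, the shape of the `checks` mapping, and the choice of 503 for "not ready" are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    """Liveness only says the process is running; it checks no dependencies."""
    return 200, "alive"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Readiness runs every dependency check and reports each result."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    # Assumption: 503 signals "not ready" when any dependency fails.
    status = 200 if all(results.values()) else 503
    return status, results
```

With this shape a database outage flips readiness to 503 while liveness keeps returning 200, so traffic is rerouted without the container being restarted.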

@marco-hoyer

+1

@bigkraig

bigkraig commented Nov 3, 2016

what about adding a /metrics Prometheus endpoint?

@bergquist bergquist added this to the 4.1.0 milestone Nov 3, 2016
@vinhlh

vinhlh commented Nov 8, 2016

+1

@vinhlh

vinhlh commented Nov 8, 2016

For whoever needs health checks on some services like Amazon ECS:
Use this hack: Path /public/img/grafana_icon.svg, HTTP Code: 200.

@philip-wernersbach

+1

@envintus

envintus commented Dec 5, 2016

In the meantime, if you're only looking for a simple HTTP code 200, then just use /login. My colleague and I just deployed Grafana to a Kubernetes cluster, and using that endpoint worked just fine for the liveness/readiness probes. It also works for the Google Compute Engine load balancer.

@andyfeller

andyfeller commented Dec 5, 2016 via email

@philip-wernersbach

I'd like to add our specific use case: we need a simple HTTP endpoint for checking whether a user can log in and display graphs. I know that we can use the static resources and endpoints such as /login to work around the absence of this, but we really need something that checks that the Grafana internals are running as expected. We don't necessarily need status checks for retrieving data from data sources, as we have separate health checks for those.

@envintus

envintus commented Dec 6, 2016 via email

@torkelo torkelo removed this from the 4.1.0 milestone Dec 14, 2016
@torkelo
Member

torkelo commented Dec 14, 2016

So there is currently in 4.0 a /api/metrics endpoint with some internal metrics.

But the issue requests something like this

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

It would be good to have a more detailed description of what is expected here. Should the health API call do a live check against all data sources in all orgs? Should that be done on the fly as the /health API call is made?
What does "authorization ok" mean?

@andyfeller

@torkelo going to toss out an idea but definitely think /health should allow for both grafana-server as well as installed plugins to register arbitrary things to report on:

{
	"ok": false,
	"items": {
		"datasources": {
			"ok": true
		},
		"database": {
			"ok": false,
			"msg": "Cannot communicate ###.###.###.###/XXXXXXX"
		},
		...
	}
}

By default, health checks perform live checks of everything when the endpoint is called. If people want to isolate health checks to specific things, you can do something like what Elasticsearch does for cluster health. When the thing is an external service (authorization, database, etc.), a connectivity test is done at a minimum, plus any other sanity check that is reasonable for that thing (e.g. SELECT 1 for the database, an LDAP bind test for authorization, etc.).

Having output like this will allow monitoring checks to check holistically for issues while finding specific problems and output accordingly.
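
As a rough illustration of the registration idea, here is a small Python sketch. Nothing in it is an existing Grafana or plugin API: `HealthRegistry`, `register`, and `run` are hypothetical names, and each check is assumed to return an `(ok, msg)` pair.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Each check returns (ok, optional message).
CheckResult = Tuple[bool, Optional[str]]


class HealthRegistry:
    """Lets the server and plugins register arbitrary named health checks."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], CheckResult]] = {}

    def register(self, name: str, check: Callable[[], CheckResult]) -> None:
        self._checks[name] = check

    def run(self, only: Optional[Iterable[str]] = None) -> dict:
        """Run all checks, or only the named ones, and aggregate the result."""
        wanted = None if only is None else set(only)
        items: Dict[str, dict] = {}
        for name, check in self._checks.items():
            if wanted is not None and name not in wanted:
                continue
            try:
                ok, msg = check()
            except Exception as exc:
                ok, msg = False, str(exc)
            item: dict = {"ok": ok}
            if msg:
                item["msg"] = msg
            items[name] = item
        # The top-level "ok" is true only if every item that ran is ok.
        return {"ok": all(i["ok"] for i in items.values()), "items": items}
```

Passing `only=["database"]` to `run` corresponds to isolating the health check to one thing, as suggested above.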

@aseppala

+1

@jaimegago
Contributor Author

@torkelo sorry for the delayed answer, I just saw your questions.

TL;DR
@andyfeller Did a good job in his comment and it's pretty much what I had in mind

The end point (or end points) used to monitor Grafana should answer 2 questions with details:
A) Is this Grafana instance up and ready ?
B) Is this Grafana instance running as expected according to its configuration intents?

"Configuration intents" is key here. What I mean by intent is that when, for example, the admin adds a data source, she expects it to be available regardless of whether or not the saved configuration is right. Thus if a configured data source is not available to Grafana, the monitoring end point should say so and why, in the same fashion as the extremely useful "test" button works.

It helps me think in terms of a plane taking off, first I need to know the plane has finished taking off and is in the air, then I need to know the plane is flying towards its destination as expected (let's not get into what "reaching cruise altitude" means ;-) )

This can somewhat be compared to the /live and /ready endpoints others have pointed out, to /health (1) and /state (2) in the Elasticsearch model, or to /health and /info in Sensu (3).
IMHO one endpoint is enough, but seeing 2 endpoints in most modern tools is kinda changing my mind; let's just say I'm not persuaded yet, as I think B is a subset of A, so I'd make the returned JSON reflect that instead of having 2 end points. Then one day when Grafana can be clustered, a "/cluster_state" can be added.

Now regarding the details of each answer, here are my -non exhaustive- initial thoughts:
A details :

  • Status (e.g. red/yellow/green)
  • Status comment (e.g. "All is good"/"Couldn't start component Foo"/"Starting")
  • Version (e.g. v4.1.1-1)

B details:

  • DB Status (e.g. red/yellow/green)
  • DB details (e.g. "couldn't connect, bad auth", or connection ok to MySQL v4.1 at xxx.yyy.zzz:3306, schema version v34132; yes, SQL schemas should be versioned (4) )
  • Authentication/Authorization (e.g. LDAP connection to xx.xx.xx:389 ok)
  • Data sources (e.g. Datasource 1, type Graphite, status Red, status comment "auth failure"; Datasource 2, type Elasticsearch, status Green, status comment "all good")

There is much more that can go in B which is why breaking the monitoring into 2 end points might make more sense, meh.

As to what happens when the end point is being queried (on the fly, APIs, etc.), I would defer to whoever ends up implementing it.

A couple of (obvious?) pieces of advice, though:

  • be very mindful of resources used to collect monitoring data and be very "protective" with the instrumentation code, help Grafana admins avoid "my monitoring of Grafana took Grafana down" or "Grafana has slowed down by X % since I started monitoring it" situations.

  • be as certain as you can on the provided monitoring data, alert fatigue is a plague

(1) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
(2) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html
(3) https://sensuapp.org/docs/0.23/api/health-and-info-api.html#the-info-api-endpoint
(4) https://blog.codinghorror.com/get-your-database-under-version-control/
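
One way to follow the first piece of advice above (keep the monitoring cheap) is to cache each check result for a short time, so frequent probes cannot hammer the database or data sources. A minimal Python sketch, assuming a hypothetical `CachedCheck` wrapper and an arbitrary time-to-live:

```python
import time
from typing import Callable


class CachedCheck:
    """Wraps a health check so it runs at most once per ttl_seconds."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = None  # None means nothing is cached yet
        self._value = False

    def __call__(self) -> bool:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: run the real (possibly costly) check.
            self._value = bool(self._check())
            self._expires_at = now + self._ttl
        return self._value
```

The trade-off is staleness: a failure can go unreported for up to one TTL, so the TTL should stay well below the alerting interval.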

@dynek

dynek commented Mar 23, 2017

So 4.2.0 just came out and there still is no way to probe the service? (think k8s cluster)

@jaimegago
Contributor Author

@torkelo I think @dynek has a point, this is not optional anymore. Whether it's a new section in the docs dedicated to "how to monitor Grafana", documenting what can be done today with the existing instrumentation (e.g. leveraging the admin or metrics page), or a fully fledged dedicated API like in this proposal, we need something yesterday.
Please don't take this the wrong way, I don't mean to tell you what the priorities should be. It's just that it's a tough sell for an application to be "Enterprise Ready" without a dedicated section on how to monitor it.

@torkelo torkelo added this to the 4.3.0 milestone Mar 27, 2017
@al-joshwilliams

+1

torkelo added a commit that referenced this issue Apr 25, 2017
@torkelo
Member

torkelo commented Apr 25, 2017

Added a simple http endpoint to check grafana health:

GET /api/health 
{
  "commit": "349f3eb",
  "database": "ok",
  "version": "4.1.0"
}

If database (mysql/postgres/sqlite3) is not reachable it will return "failing" in the database field. Grafana will still answer with status code 200. Not sure what is correct in that case.

The most important thing about this endpoint is that it will never cause sessions to be created (Something other api calls might do if you do not call them with an api key or basic auth).
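
Since the status code is 200 either way at this point, a monitoring client has to look at the body. A minimal Python sketch of that check, using the response shape shown above; `database_is_ok` is a hypothetical helper name:

```python
import json


def database_is_ok(body: str) -> bool:
    """Return True only if the /api/health body reports database "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not JSON is treated as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


if __name__ == "__main__":
    sample = '{"commit": "349f3eb", "database": "ok", "version": "4.1.0"}'
    print(database_is_ok(sample))
```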

@torkelo torkelo closed this as completed Apr 25, 2017
@ConorNevin

Wouldn't it be best to return with status code 503 when the database is unreachable?

@adamcstephens

Kubernetes uses:

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

@torkelo
Member

torkelo commented Apr 25, 2017

Yes, I think 503 status code when db status failed is best, will update
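
A small Python sketch of how the two rules fit together: the proposed mapping from database status to HTTP status, and the Kubernetes success rule quoted above. The function names are hypothetical and this is not the actual Grafana handler.

```python
def health_status_code(database: str) -> int:
    """Map the "database" field to an HTTP status: 200 if ok, else 503."""
    return 200 if database == "ok" else 503


def probe_succeeds(status_code: int) -> bool:
    """Kubernetes rule: any code >= 200 and < 400 counts as success."""
    return 200 <= status_code < 400


if __name__ == "__main__":
    for state in ("ok", "failing"):
        code = health_status_code(state)
        print(state, code, probe_succeeds(code))
```

With a 503 the probe fails as soon as the database check fails, which a constant 200 would never allow.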

daniellee added a commit that referenced this issue May 10, 2017
ref #8277, ref #8250, ref #8262, ref #8165, ref #8093, ref #8056, ref #8043, ref #7970, ref #7914, ref #7864, ref #7750, ref #7740, ref #7697, ref #7619, ref #5619, ref #4030, ref #5278, ref #3302, ref #2524
@JorritSalverda

The 503 means the /api/health endpoint is best used only for the readiness check in Kubernetes. If this check is used for liveness, a database issue will lead to all pods getting killed. Is there a query parameter to leave out the database check?

@bedrin
Contributor

bedrin commented Nov 1, 2017

@JorritSalverda you could probably use tcpSocket check in livenessProbe
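
For reference, a tcpSocket-style check only verifies that the port accepts a connection and never touches the database. A minimal Python sketch of the same idea, with `tcp_alive` as a hypothetical helper name; the host and port in the usage note are placeholders for wherever your Grafana instance listens:

```python
import socket


def tcp_alive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and resolution errors all mean "down".
        return False
```

For example, `tcp_alive("127.0.0.1", 3000)` would report whether something is listening on that port, and nothing more.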

@bergquist
Contributor

/metrics will not create sessions or issue a db request.

@micachen

We typically have aggressive readiness checks and relaxed liveness checks: 1 second and 1 failure for readiness, versus 60 seconds, 10 failures and 1 success for liveness. This allows for responsive rerouting when there is an issue, but at the same time, if self-recovery is possible, it prevents unnecessary pod restarts. A persistent DB issue would still cause a restart, which might actually help if it was due to some bad container state.
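
The effect of those thresholds can be sketched with a small Python simulation. It only models consecutive-failure counting and is not Kubernetes code; `first_action_index` is a hypothetical name.

```python
from typing import Iterable, Optional


def first_action_index(results: Iterable[bool], failure_threshold: int) -> Optional[int]:
    """Return the index of the probe at which the threshold is first reached.

    results holds one boolean per probe (True = success). The counter of
    consecutive failures resets on every success. Returns None if the
    threshold is never reached.
    """
    consecutive = 0
    for index, ok in enumerate(results):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= failure_threshold:
            return index
    return None


if __name__ == "__main__":
    outage = [True, False, False, False, True]
    print(first_action_index(outage, 1))   # readiness: threshold of 1 failure
    print(first_action_index(outage, 10))  # liveness: threshold of 10 failures
```

For a short outage like the one in the example, the readiness threshold trips on the first failed probe while the liveness threshold is never reached, so traffic is rerouted without a restart.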

finkr added a commit to finkr/grafana that referenced this issue Jan 25, 2019
Document the health check implemented in grafana#3302 (and grafana#935), see  grafana#3302 (comment)
This was referenced Jan 25, 2019
jschill pushed a commit that referenced this issue Jan 28, 2019
Document the health check implemented in #3302 (and #935), see  #3302 (comment)
dghubble added a commit to poseidon/typhoon that referenced this issue Mar 24, 2019
@suridaddy

@finkr /api/health takes too long to respond with 503. Is there any way to make it respond faster?

@andyfeller

@finkr /api/health takes too long to respond with 503. Is there any way to make it respond faster?

@suridaddy: it might be easier to visit the Grafana community forums or the more interactive support channels, and to bring more information for troubleshooting your problem. This issue tracks a feature request and is closed.
