
Add prometheus metric #57

Open · wants to merge 5 commits into main
Conversation

@tliefheid

No description provided.

@alexliesenfeld (Owner)

Thanks! I'll look into it soon!

@alexliesenfeld (Owner) left a comment

Thanks again for this PR. Your approach seems to be to add the Prometheus-related functionality right into the core library itself. Because Prometheus is only one tool out of many out there, I like the idea of separating this functionality out into its own handler extension. Do you think we could use the ResultWriter interface for this and allow exporting Prometheus-specific health check statuses using the WithResultWriter config option instead?

@tliefheid (Author)

I'm not sure how you imagine this working. Who will register the metrics (gauges/counters)? That's how Prometheus works. If you expect it to send back some details, then it is still up to the user to set the gauges and counters. IMHO, Prometheus is one of the most commonly used metrics tools out there.
I'm happy to update this, don't get me wrong. I'm just not as familiar as you are with the whole concept of the interfaces you have, so I'm getting some more background info and context from you :)

@akhy commented Oct 25, 2022

Agreed, adding a bunch of extra deps to go.mod is not a good idea for this small lib. I also like @alexliesenfeld's idea of having the metrics lib implement the ResultWriter interface in a separate project.

About the metrics format, I don't feel that mapping the discrete up/down/unknown states to a 0/1/2 value is right. I'm still not sure, but I think I'm going to use my own format, which looks more or less like this:

```
# if service1 is up
app_name_health{name="service1", status="up"} 1
app_name_health{name="service1", status="down"} 0
app_name_health{name="service1", status="unknown"} 0

# if service1 is down
app_name_health{name="service1", status="up"} 0
app_name_health{name="service1", status="down"} 1
app_name_health{name="service1", status="unknown"} 0

# if service1 is unknown
app_name_health{name="service1", status="up"} 0
app_name_health{name="service1", status="down"} 0
app_name_health{name="service1", status="unknown"} 1
```

This way, we can create fine-grained and unambiguous Prometheus queries and aggregations.
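The one-hot mapping above can be rendered without any Prometheus client dependency. Here is a minimal, dependency-free Go sketch of the text exposition format; the helper name `statusSeries` and the metric name are illustrative, not part of any library:

```go
package main

import "fmt"

// statusSeries renders one health check as three Prometheus series,
// one per status label. Exactly one series carries the value 1 (one-hot),
// which is what makes queries over the result unambiguous.
func statusSeries(metric, check, current string) string {
	out := ""
	for _, s := range []string{"up", "down", "unknown"} {
		v := 0
		if s == current {
			v = 1
		}
		out += fmt.Sprintf("%s{name=%q, status=%q} %d\n", metric, check, s, v)
	}
	return out
}

func main() {
	fmt.Print(statusSeries("app_name_health", "service1", "down"))
}
```

Running this prints the "service1 is down" triple from the example above.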

@tliefheid (Author)

A format like that results in more time series overall (3 for each check), and you also need to reset all the gauges before you update the correct one, so it's a bit more complex to implement.

I get that you don't want to inject this into the lib itself, but I'm not sure how else to do it.

@alexliesenfeld (Owner) commented Oct 26, 2022

> not sure how you imagine this to work. Cause who will register the metrics (gauges/counters) cause that is how prometheus works. if you expect it so send back some details, then it is still up to the user to set the gauges and counters. imho prometheus is one of the most commonly used metrics tools out there. i'm happy to update this, don't get me wrong. i'm not that familiar as you with the whole concept of the interfaces you have. just getting some more background info and context from you :)

Sure, I can provide some more details. What I was referring to is that there is a ResultWriter interface that the library uses to write a check result in a specific format. For example, the default JSONResultWriter takes the current check status and transforms it into a JSON HTTP response.

health/handler.go, lines 49 to 64 at 09f6eab:

```go
// Write implements ResultWriter.Write.
func (rw *JSONResultWriter) Write(result *CheckerResult, statusCode int, w http.ResponseWriter, r *http.Request) error {
	jsonResp, err := json.Marshal(result)
	if err != nil {
		return fmt.Errorf("cannot marshal response: %w", err)
	}
	w.Header().Set("Content-Type", "application/json; charset=utf-8")
	w.WriteHeader(statusCode)
	_, err = w.Write(jsonResp)
	return err
}

// NewJSONResultWriter creates a new instance of a JSONResultWriter.
func NewJSONResultWriter() *JSONResultWriter {
	return &JSONResultWriter{}
}
```

In the same way, we could probably have a PrometheusResultWriter that takes the overall check result and transforms it into a format that can be processed by Prometheus.

The user can then configure the library to use the PrometheusResultWriter rather than the default one, similar to this example:

```go
handler := health.NewHandler(checker,
	// A result writer writes a check result into an HTTP response.
	// JSONResultWriter is used by default.
	health.WithResultWriter(health.NewJSONResultWriter()),
)
```
@TomL-dev What do you think?

@akhy commented Oct 29, 2022

I've just taken a look at WithResultWriter. It turns out it doesn't seem appropriate for my use case, because it directly produces HTTP responses. I want to be able to mix the health metrics in with metrics from other libraries.

So I created a lib myself: https://github.com/chickenzord/go-health-prometheus

Please let me know what you think, thanks

@alexliesenfeld (Owner) commented Oct 30, 2022

Looks nice! As of now, it seems to only expose the aggregated/overall system status. I wonder if it would help if health.CheckState actually contained all the component check results as well. AFAICS, that way you could also expose each component individually.

@akhy commented Oct 31, 2022

@alexliesenfeld it already shows per-component health...

Actually, I haven't found a way to get the overall status 😅 For the aggregated/overall status, instead of only up/down/unknown, I'd love to add one more degraded status for when any non-critical component is down. But I think that should be out of scope for this lib, because the real implementation can differ for each use case 🤔

@tliefheid (Author)

> @alexliesenfeld it's already showing per-component healths...
>
> actually I haven't found a way to get the overall status 😅 for the aggregated/overall, instead of only up/down/unknown, I'd love to add one more degraded status if any non-critical component is down.. but I think that should be out of the scope of this lib because the real implementation can be different for each use cases 🤔

You can solve this by writing Prometheus queries and generating alerts when the result (a count or similar) drops below some percentage x.

@fr3fou commented Feb 26, 2023

Any updates on this? :D One more thing I think would be useful: keeping track of how long each probe takes to respond.

5 participants