New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[healthcheckextensionv2] Introduce health check extension based on component status reporting #30673
base: main
Are you sure you want to change the base?
Conversation
05ee467
to
75fa8da
Compare
18b4a68
to
c24a335
Compare
8f4ed60
to
fcc2b7c
Compare
Thanks for all this work @mwear!
Can you elaborate on why that's needed? My understanding of the current healthcheck extension is that it doesn't actually work for checking the health of pipelines (see issue: #11780 and the readme warning: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension#health-check). Is the implementation in this PR not already suitable as a replacement in that case? I'm asking to better understanding if the PRs listed are necessary before moving this work forward. |
As you point out, there are currently some issues with the accuracy of the current health check extension. Beyond this, we now have component status reporting as a collector feature, which allows us to report status of individual components. The overall collector health and pipeline health can be derived by looking at the health of the underlying components. All components have a basic level of automated status reporting, but there are nuanced failure and recovery cases that need to be handled on a case by case basis. That is, some components will need to do some level of manual status reporting for their health to be observable. The two PRs I mentioned add manual status reporting to the core exporters based on error conditions they can encounter. They attempt to identify the RecoverableError conditions, when and if they have recovered, and the PermanentError conditions that will require human intervention to fix. Without this information, the health check extension will be able to tell you very little about the health of your exporters. You will be able to know if they are running, or not, at that's all. tl;dr: we need to add status reporting to components that we care about for purposes of collector health. Exporters are the highest priority, then receivers and processors |
.github/CODEOWNERS
Outdated
@@ -93,6 +93,7 @@ extension/encoding/textencodingextension/ @open-telemetry/collect | |||
extension/encoding/zipkinencodingextension/ @open-telemetry/collector-contrib-approvers @MovieStoreGuy @dao-jun | |||
extension/headerssetterextension/ @open-telemetry/collector-contrib-approvers @jpkrohling | |||
extension/healthcheckextension/ @open-telemetry/collector-contrib-approvers @jpkrohling | |||
extension/healthcheckextensionv2/ @open-telemetry/collector-contrib-approvers @mwear |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd much rather this replaces the existing extension, than to add a new extension :)
But i appreciate your comment about making it more difficult to review that way
Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities |
I agree. I suppose we can classify those PRs as nice to have in the short term and must have in the longer term. |
@mwear i've tried your healhcheck-v2 code to build my own otel collector (v0.92.0) build failing with following errors:
here is my build-config.yaml
i am using below builder version help.. |
@vynu just updated your comment to make the error messages more readable :) |
thanks @codeboten i've been trying that .. but for some reason i am seeing error while editing .. |
There were previously different types for CollectorStatus and PipelineStatus that have been unified with a newly introduced AggregateStatus type. It cleaned up a lot of special cases.
Co-authored-by: Alex Boten <alex@boten.ca>
Co-authored-by: Alex Boten <alex@boten.ca>
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
- Words of caution on the readme - Remove prefix on errors - Log details on discarded events
@mwear, thanks for the tremendous work so far, looking forward to the new chunks getting sorted out |
Thanks for your continued interest @gjshao44. I'll keep everyone up to date on the progress of this work. On that note, the skeleton PR merged and I've split out second PR for event aggregation: #32695. |
Description:
This PR introduces a candidate to replace the current health check extension. This extension is based on component status reporting and provides insight into collector health at the pipeline and component level. It includes an HTTP service and a gRPC service. The HTTP service has a status endpoint that provides the ability to probe the overall collector health and the health of individual pipelines. It also provides a config endpoint to obtain the config of the currently running collector.
The gRPC service is an implementation of the grpc_health_v1 service. While not as detailed as the HTTP service, it provides the ability to check the health of the collector overall and the health of individual pipelines via a unary or streaming RPC.
Looking at the README in this PR is a good place to start to understand the changes.
I have temporarily named this extension healthcheckv2. We can discuss the best way to integrate this work going forward, but my reasoning for using a different name is to facilitate the code review (the diff would be hard to follow if we were to overwrite the current extension). If we decide to move forward with this extension and replace the existing one, we can handle the deletion of the old extension and the renaming in a followup PR.
Ultimately this extension will need some form of one of the following before it will be a suitable replacement:
The more components that adopt status reporting, the more useful this extension will be.
Link to tracking Issue: #26661
Testing: Units, Manual
Documentation: README