Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[healthcheckextensionv2] Introduce health check extension based on component status reporting #30673

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

mwear
Copy link
Member

@mwear mwear commented Jan 19, 2024

Description:
This PR introduces a candidate to replace the current health check extension. This extension is based on component status reporting and provides insight into collector health at the pipeline and component level. It includes an HTTP service and a gRPC service. The HTTP service has a status endpoint that provides the ability to probe the overall collector health and the health of individual pipelines. It also provides a config endpoint to obtain the config of the currently running collector.

The gRPC service is an implementation of the grpc_health_v1 service. While not as detailed as the HTTP service, it provides the ability to check the health of the collector overall and the health of individual pipelines via a unary or streaming RPC.

Looking at the README in this PR is a good place to start to understand the changes.

I have temporarily named this extension healthcheckv2. We can discuss the best way to integrate this work going forward, but my reasoning for using a different name is to facilitate the code review (the diff would be hard to follow if we were to overwrite the current extension). If we decide to move forward with this extension and replace the existing one, we can handle the deletion of the old extension and the renaming in a followup PR.

Ultimately this extension will need some form of one of the following before it will be a suitable replacement:

The more components that adopt status reporting, the more useful this extension will be.

Link to tracking Issue: #26661

Testing: Units, Manual

Documentation: README

@github-actions github-actions bot added cmd/otelcontribcol otelcontribcol command extension/healthcheck Health Check Extension labels Jan 19, 2024
@mwear mwear force-pushed the healthcheck-v2 branch 4 times, most recently from 05ee467 to 75fa8da Compare January 19, 2024 04:29
@mwear mwear force-pushed the healthcheck-v2 branch 6 times, most recently from 18b4a68 to c24a335 Compare January 19, 2024 06:31
@mwear mwear marked this pull request as ready for review January 19, 2024 07:14
@mwear mwear requested a review from a team as a code owner January 19, 2024 07:14
@mwear mwear force-pushed the healthcheck-v2 branch 2 times, most recently from 8f4ed60 to fcc2b7c Compare January 19, 2024 15:51
@codeboten
Copy link
Contributor

Thanks for all this work @mwear!

Ultimately this extension will need some form of one of the following before it will be a suitable replacement:

Can you elaborate on why that's needed? My understanding of the current healthcheck extension is that it doesn't actually work for checking the health of pipelines (see issue: #11780 and the readme warning: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension#health-check). Is the implementation in this PR not already suitable as a replacement in that case?

I'm asking to better understanding if the PRs listed are necessary before moving this work forward.

@mwear
Copy link
Member Author

mwear commented Jan 19, 2024

As you point out, there are currently some issues with the accuracy of the current health check extension. Beyond this, we now have component status reporting as a collector feature, which allows us to report status of individual components. The overall collector health and pipeline health can be derived by looking at the health of the underlying components. All components have a basic level of automated status reporting, but there are nuanced failure and recovery cases that need to be handled on a case by case basis. That is, some components will need to do some level of manual status reporting for their health to be observable. The two PRs I mentioned add manual status reporting to the core exporters based on error conditions they can encounter. They attempt to identify the RecoverableError conditions, when and if they have recovered, and the PermanentError conditions that will require human intervention to fix. Without this information, the health check extension will be able to tell you very little about the health of your exporters. You will be able to know if they are running, or not, at that's all. tl;dr: we need to add status reporting to components that we care about for purposes of collector health. Exporters are the highest priority, then receivers and processors

@@ -93,6 +93,7 @@ extension/encoding/textencodingextension/ @open-telemetry/collect
extension/encoding/zipkinencodingextension/ @open-telemetry/collector-contrib-approvers @MovieStoreGuy @dao-jun
extension/headerssetterextension/ @open-telemetry/collector-contrib-approvers @jpkrohling
extension/healthcheckextension/ @open-telemetry/collector-contrib-approvers @jpkrohling
extension/healthcheckextensionv2/ @open-telemetry/collector-contrib-approvers @mwear
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd much rather this replaces the existing extension, than to add a new extension :)

But i appreciate your comment about making it more difficult to review that way

@codeboten
Copy link
Contributor

You will be able to know if they are running, or not, at that's all.

Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities

@mwear
Copy link
Member Author

mwear commented Jan 19, 2024

You will be able to know if they are running, or not, at that's all.

Fair enough, though I would say at this stage, it's more than the current healthcheck extension's capabilities

I agree. I suppose we can classify those PRs as nice to have in the short term and must have in the longer term.

@vynu
Copy link

vynu commented Jan 24, 2024

@mwear i've tried your healhcheck-v2 code to build my own otel collector (v0.92.0) build failing with following errors:

280.3 Error: failed to compile the OpenTelemetry Collector distribution: go subcommand failed with args '[build -trimpath -o otelcol -ldflags=-s -w]': exit status 1. Output:
280.3 # go.opentelemetry.io/collector/exporter/exporterhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:65:55: undefined: obsmetrics.TagKeyExporter
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:231:28: undefined: obsmetrics.ExporterSentSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:232:30: undefined: obsmetrics.ExporterFailedToSendSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:234:28: undefined: obsmetrics.ExporterSentMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:235:30: undefined: obsmetrics.ExporterFailedToSendMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:237:28: undefined: obsmetrics.ExporterSentLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:238:30: undefined: obsmetrics.ExporterFailedToSendLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:289:30: undefined: obsmetrics.ExporterFailedToEnqueueSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:291:30: undefined: obsmetrics.ExporterFailedToEnqueueMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:293:30: undefined: obsmetrics.ExporterFailedToEnqueueLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/obsexporter.go:293:30: too many errors
280.3 # go.opentelemetry.io/collector/processor/processorhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:77:58: undefined: obsmetrics.TagKeyProcessor
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:192:32: undefined: obsmetrics.ProcessorAcceptedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:193:31: undefined: obsmetrics.ProcessorRefusedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:194:31: undefined: obsmetrics.ProcessorDroppedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:196:32: undefined: obsmetrics.ProcessorAcceptedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:197:31: undefined: obsmetrics.ProcessorRefusedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:198:31: undefined: obsmetrics.ProcessorDroppedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:200:32: undefined: obsmetrics.ProcessorAcceptedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:201:31: undefined: obsmetrics.ProcessorRefusedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:202:31: undefined: obsmetrics.ProcessorDroppedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/processor@v0.92.0/processorhelper/obsreport.go:202:31: too many errors
280.3 # go.opentelemetry.io/collector/receiver/receiverhelper
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:76:26: undefined: obsmetrics.TagKeyReceiver
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:77:26: undefined: obsmetrics.TagKeyTransport
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:306:32: undefined: obsmetrics.ReceiverAcceptedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:307:31: undefined: obsmetrics.ReceiverRefusedSpans
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:309:32: undefined: obsmetrics.ReceiverAcceptedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:310:31: undefined: obsmetrics.ReceiverRefusedMetricPoints
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:312:32: undefined: obsmetrics.ReceiverAcceptedLogRecords
280.3 /app/go/pkg/mod/go.opentelemetry.io/collector/receiver@v0.92.0/receiverhelper/obsreport.go:313:31: undefined: obsmetrics.ReceiverRefusedLogRecords
280.3 
------
failed to solve: process "/bin/sh -c builder --config builder-config.yaml" did not complete successfully: exit code: 1

here is my build-config.yaml

dist:
  name: otelcol
  description: Custom OTel Collector distribution
  output_path: ./otelcol-dev
  otelcol_version: 0.92.0

extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextensionv2 v0.92.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.92.0

  - gomod: go.opentelemetry.io/collector/exporter/loggingexporter v0.92.0

  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/fileexporter v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/splunkhecexporter v0.92.0

processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.92.0
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.92.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.92.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.92.0

replaces:
  - github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextensionv2 => ../healthcheckextensionv2

i am using below builder version
go.opentelemetry.io/collector/cmd/builder@v0.92.0

help..

@codeboten
Copy link
Contributor

@vynu just updated your comment to make the error messages more readable :)

@vynu
Copy link

vynu commented Jan 24, 2024

thanks @codeboten i've been trying that .. but for some reason i am seeing error while editing ..

mwear and others added 24 commits April 24, 2024 15:16
There were previously different types for CollectorStatus and PipelineStatus
that have been unified with a newly introduced AggregateStatus type. It cleaned
up a lot of special cases.
Co-authored-by: Alex Boten <alex@boten.ca>
Co-authored-by: Alex Boten <alex@boten.ca>
Support config and responses from current healthcheck extension.
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
- Words of caution on the readme
- Remove prefix on errors
- Log details on discarded events
@gjshao44
Copy link

@mwear, thanks for the tremendous work so far, looking forward to the new chunks getting sorted out

@mwear
Copy link
Member Author

mwear commented Apr 25, 2024

Thanks for your continued interest @gjshao44. I'll keep everyone up to date on the progress of this work. On that note, the skeleton PR merged and I've split out second PR for event aggregation: #32695.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet