Improve the kubelet check error reporting in the output of agent status #6315

Merged
L3n41c merged 3 commits into master from lenaic/kubelet_error_status on Sep 7, 2020

Member

@L3n41c L3n41c commented Sep 3, 2020

What does this PR do?

In combination with DataDog/integrations-core#7495, this change improves the kubelet check error reported in the output of agent status when the agent cannot connect to the kubelet.

Motivation

When the agent cannot connect to the kubelet, the useful details are available only in the logs; the output of the agent status command gives no clue about the cause.

Here is an example of the agent status output in such a case:

$ agent status
[…]

=========
Collector
=========

  Running Checks
  ==============

    kubelet (4.1.1)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 2ms
      Last Execution Date : 2020-09-03 11:40:31.000000 UTC
      Last Successful Execution Date : Never
      Error: Unable to detect the kubelet URL automatically.
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 827, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py", line 297, in check
          raise CheckException("Unable to detect the kubelet URL automatically.")
      datadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically.

Here is what the output becomes with this PR:

$ agent status
[…]

=========
Collector
=========

  Running Checks
  ==============

    kubelet (4.1.1)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 3ms
      Last Execution Date : 2020-09-03 11:14:20.000000 UTC
      Last Successful Execution Date : Never
      Error: Unable to detect the kubelet URL automatically: cannot set a valid kubelet host: cannot connect to kubelet using any of the given hosts: [1.2.3.4] [], Errors: [Get https://1.2.3.4:10250/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) cannot connect: http: "Get http://1.2.3.4:10255/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"]
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 827, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py", line 297, in check
          raise CheckException("Unable to detect the kubelet URL automatically: " + kubelet_conn_info.get('err'))
      datadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically: cannot set a valid kubelet host: cannot connect to kubelet using any of the given hosts: [1.2.3.4] [], Errors: [Get https://1.2.3.4:10250/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) cannot connect: http: "Get http://1.2.3.4:10255/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"]

Additional Notes

This would help the investigation of issues like DataDog/integrations-core#2582.

Describe your test plan

Start the agent in a context where it schedules the kubelet check but cannot connect to the kubelet:

docker run --rm --name datadog-agent -e DD_API_KEY=$DD_API_KEY -e KUBERNETES=yes -e DD_KUBERNETES_KUBELET_HOST=1.2.3.4 datadog/agent:7.23.0

Contributor

@xornivore xornivore left a comment

LGTM, just a small comment on using a type assertion.

log.Errorf("connection to kubelet failed: %v", err)
return nil
if e, ok := err.(*retry.Error); ok {
Contributor

We could probably use errors.As here? https://golang.org/pkg/errors/#example_As

Member Author

Good point! Addressed in 0a3af81.

Contributor

@clamoriniere clamoriniere left a comment

I like the usage of errors.As() !!! 💯

Member

@olivielpeau olivielpeau left a comment

Small question, other than that LGTM 👍

@@ -104,10 +105,10 @@ func (r *Retrier) doTry() *Error {
}
method := r.cfg.AttemptMethod
r.RUnlock()
err := method()
r.lastTryError = method()
Member

Is it safe to write to r.lastTryError while the mutex is unlocked?

@L3n41c L3n41c merged commit 61ae476 into master Sep 7, 2020
@L3n41c L3n41c deleted the lenaic/kubelet_error_status branch September 7, 2020 10:26