Unable to detect the kubelet URL automatically / cannot validate certificate #2582

Closed
jcassee opened this issue Nov 13, 2018 · 73 comments

jcassee (Contributor) commented Nov 13, 2018

Output of the info page

Getting the status from the agent.

==============
Agent (v6.6.0)
==============

  Status date: 2018-11-13 23:10:34.603102 UTC
  Pid: 342
  Python Version: 2.7.15
  Logs:
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 1.461ms
    System UTC time: 2018-11-13 23:10:34.603102 UTC

  Host Info
  =========
    bootTime: 2018-11-08 08:50:28.000000 UTC
    kernelVersion: 4.9.0-7-amd64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: buster/sid
    procs: 70
    uptime: 133h51m42s
    virtualizationRole: host
    virtualizationSystem: kvm

  Hostnames
  =========
    hostname: reverent-kapitsa-1us
    socket-fqdn: datadog-agent-pxkhm
    socket-hostname: datadog-agent-pxkhm
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
        Instance ID: cpu [OK]
        Total Runs: 114
        Metric Samples: 6, Total: 678
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    disk (1.4.0)
    ------------
        Instance ID: disk:e5dffb8bef24336f [OK]
        Total Runs: 114
        Metric Samples: 190, Total: 21,660
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 197ms


    docker
    ------
        Instance ID: docker [OK]
        Total Runs: 113
        Metric Samples: 216, Total: 23,850
        Events: 0, Total: 6
        Service Checks: 1, Total: 113
        Average Execution Time : 203ms


    file_handle
    -----------
        Instance ID: file_handle [OK]
        Total Runs: 114
        Metric Samples: 5, Total: 570
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    io
    --
        Instance ID: io [OK]
        Total Runs: 113
        Metric Samples: 39, Total: 4,380
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    kubelet (2.2.0)
    ---------------
        Instance ID: kubelet:d884b5186b651429 [ERROR]
        Total Runs: 114
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 8ms
        Error: Unable to detect the kubelet URL automatically.
        Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py", line 366, in run
          self.check(copy.deepcopy(self.instances[0]))
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py", line 113, in check
          raise CheckException("Unable to detect the kubelet URL automatically.")
      CheckException: Unable to detect the kubelet URL automatically.

    kubernetes_apiserver
    --------------------
        Instance ID: kubernetes_apiserver [OK]
        Total Runs: 113
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 11ms


    load
    ----
        Instance ID: load [OK]
        Total Runs: 114
        Metric Samples: 6, Total: 684
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 2ms


    memory
    ------
        Instance ID: memory [OK]
        Total Runs: 113
        Metric Samples: 17, Total: 1,921
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    network (1.7.0)
    ---------------
        Instance ID: network:2a218184ebe03606 [OK]
        Total Runs: 114
        Metric Samples: 74, Total: 8,754
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 9ms


    ntp
    ---
        Instance ID: ntp:b4579e02d1981c12 [OK]
        Total Runs: 113
        Metric Samples: 1, Total: 113
        Events: 0, Total: 0
        Service Checks: 1, Total: 113
        Average Execution Time : 2ms


    uptime
    ------
        Instance ID: uptime [OK]
        Total Runs: 114
        Metric Samples: 1, Total: 114
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 2ms

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 113
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 11
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 237
  TimeseriesV1: 113

  API Keys status
  ===============
    API key ending with 1ed66 on endpoint https://app.datadoghq.com: API Key valid

==========
Logs Agent
==========

  container_collect_all
  ---------------------
    Type: docker
    Status: Pending

=========
DogStatsD
=========

  Checks Metric Sample: 65,227
  Event: 7
  Events Flushed: 7
  Number Of Flushes: 113
  Series Flushed: 53,494
  Service Check: 1,478
  Service Checks Flushed: 1,578
  Dogstatsd Metric Sample: 11,877

Additional environment details (Operating System, Cloud provider, etc):

Kubernetes 1.12 cluster on DigitalOcean.

Steps to reproduce the issue:

  1. Deploy the Datadog agent using the provided Kubernetes resources.
  2. View logs

Describe the results you received:

[ AGENT ] 2018-11-13 22:42:31 UTC | ERROR | (kubeutil.go:50 in GetKubeletConnectionInfo) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet
[ AGENT ] 2018-11-13 22:42:31 UTC | ERROR | (runner.go:289 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py\", line 366, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py\", line 113, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically.\")\nCheckException: Unable to detect the kubelet URL automatically.\n"}]
[...]
[ AGENT ] 2018-11-13 22:42:39 UTC | ERROR | (autoconfig.go:608 in collect) | Unable to collect configurations from provider Kubernetes: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"
[ AGENT ] 2018-11-13 22:42:39 UTC | INFO | (autoconfig.go:362 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"

Many dashboard entries remain empty.

Describe the results you expected:

No errors, access to kubelet, functional Kubernetes dashboard.

Additional information you deem important (e.g. issue happens only occasionally):

This seems to be the same problem as #1829; however, that issue is closed. Hosted Kubernetes services like DigitalOcean do not allow editing the kubelet configuration, as far as I know.

stale bot commented Dec 14, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

mjhuber commented Dec 24, 2018

I'm seeing this issue in kubernetes v1.12 on digital ocean as well

jcassee (Contributor, Author) commented Dec 25, 2018

@mjhuber I opened a ticket on the Datadog issue tracker. Advice was to set DD_KUBELET_TLS_VERIFY=false for now. Hopefully DO will start using real certificates for the Kubelet API.
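
For reference, in the daemonset manifest that advice works out to an env var on the agent container, e.g. (a minimal fragment):

  env:
    - name: DD_KUBELET_TLS_VERIFY
      value: "false"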

PHameete (Contributor) commented Jan 11, 2019

I'm running into this same issue on AWS EKS, using the EKS-optimized AMI for a worker node.

Using DD_KUBELET_TLS_VERIFY=false is not a solution for us: the problem seems to be that the kubelet's read-only port is deprecated in recent versions of Kubernetes (I'm on 1.11).
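
For context, the read-only port is controlled by the kubelet's readOnlyPort setting; a KubeletConfiguration fragment that disables it looks roughly like this:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  readOnlyPort: 0   # 0 disables the unauthenticated read-only port (default 10255)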

I suppose the Datadog agent should get stats from kubelet using a different method.

praseodym commented Jan 15, 2019

The recommendation for newer Kubernetes versions is to use kube-state-metrics for cluster-level metrics and use the metrics API (powered by e.g. metrics-server) for node-level and pod-level metrics.

PHameete (Contributor) commented:

@praseodym oh hi Mark ;-) Where did you find this recommendation? Both the integrations page in Datadog and the documentation pointed me towards a 'standard' Kubernetes deployment that uses the kubelet read-only port.

Someone then pointed me to the Helm chart for deploying Datadog, which uses the method you suggest, and that works for me.

aerostitch (Contributor) commented:

@PHameete did you solve your issue using DD_KUBERNETES_HTTPS_KUBELET_PORT or weren't you able to find a solution?

aerostitch (Contributor) commented:

Sorry, it seems I had an old version of the page open.

praseodym commented Jan 15, 2019

@PHameete Sorry for the confusion here: I meant that the Datadog agent itself should be updated to use kube-state-metrics and the metrics API, which should prevent it from needing access to the kubelets directly. This is more of an improvement than an actual bug, though.

Regarding your issue with EKS, you should still be able to connect to the TLS port (10250) if RBAC is configured correctly so that the agent can authenticate to kubelet. We’re running without the read-only port on Kubernetes v1.13.2 and disabling TLS verification in the agent was all we had to do.

Edit: I only now noticed that the Helm chart you linked does mention the agent using kube-state-metrics, so I guess that part is already implemented :)

VinayVanama commented:

I ran into the same problem! The issue is with the new EKS worker-node AMI: with it, Datadog doesn't work properly, and even some CPU- and memory-related metrics are broken. I used ami-0a0b913ef3249b655 and it's working fine.

bendrucker commented Feb 7, 2019

Just worked through this and wanted to share what I understand it would take to avoid setting DD_KUBELET_TLS_VERIFY=false.

We use typhoon, which runs the kubelet via systemd. It disables the read-only port and passes --authentication-token-webhook --authorization-mode=Webhook to enable bearer token auth with the kubelet API. We install the Datadog agent via Helm and found that disabling TLS verification was all we needed to do in order to collect metrics without the read-only port.

I hopped onto a worker and tried curl https://localhost:10250 --cacert /etc/kubernetes/ca.crt, thinking the kubelet API's certs were signed with the same CA used to sign the apiserver's certs. Turns out that's not the case.

https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/#client-and-serving-certificates

By default, the kubelet creates a self-signed key/cert for its server on start. If you specify --tls-private-key-file and --tls-cert-file and provide the CA cert used to sign them to the client (i.e. curl or the Datadog agent), it should work.
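
A quick way to check what signed the kubelet's serving certificate, assuming openssl is available on the node:

  openssl s_client -connect localhost:10250 </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer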

Here's some discussion about addressing this issue in kubeadm:

kubernetes/kubeadm#1223

Given that the datadog role is effectively read-only, we felt the risks of unverified TLS were acceptable until we have an opportunity to look at ways to sign kubelet API certs with a known CA or have the kubelet write its CA cert out to disk.

sridhar81 commented Feb 9, 2019

We are seeing this error even after setting DD_KUBELET_TLS_VERIFY=false. Any help is much appreciated

I have tried running all versions from 6.6.0 to 6.9.0

VinayVanama commented:

> We are seeing this error even after setting DD_KUBELET_TLS_VERIFY=false. Any help is much appreciated
> I have tried running all versions from 6.6.0 to 6.9.0

@sridhar81 have you tried my solution?

sridhar81 commented:

> > We are seeing this error even after setting DD_KUBELET_TLS_VERIFY=false. Any help is much appreciated
> > I have tried running all versions from 6.6.0 to 6.9.0
>
> @sridhar81 have you tried my solution?

@VinayVanama Thanks for the pointer. We are not using EKS; we run our own cluster, so changing the AMI is going to be hard.

Simwar (Contributor) commented Mar 8, 2019

Hi everyone,

There seem to be several problems here.

For @jcassee:
As you mentioned, we previously suggested setting kubelet_tls_verify to false. We understand it's not a great solution, security-wise.
As we can see from the logs, the issue seems to be with the certificate.
If we take a deeper look at this log:
[ AGENT ] 2018-11-13 22:42:39 UTC | INFO | (autoconfig.go:362 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"

The certificate cannot be validated because there is no SAN for the IP address of the node.
The certificate most likely uses the hostname of the node as its Common Name.
What you could do is configure the agent to use the node name instead of the IP to connect to the kubelet, by modifying the daemonset.
Replace:

- name: DD_KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

with:

- name: DD_KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName

We are also in touch with DigitalOcean to suggest adding the node IP as a SAN in the certificate.

For @mjhuber
Could you try the same work-around? Thanks!

For @PHameete and @praseodym
We do try querying the kubelet on the read-only port 10255 to get Kubernetes metrics: https://docs.datadoghq.com/agent/kubernetes/metrics/#kubelet
But only after trying the HTTPS port (10250), which should always be open. I suspect the message you're seeing happens after the HTTPS call fails.
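
To see which path fails, you can probe both ports from inside the agent pod, roughly:

  curl -sk -o /dev/null -w '%{http_code}\n' "https://${DD_KUBERNETES_KUBELET_HOST}:10250/pods"
  curl -s -o /dev/null -w '%{http_code}\n' "http://${DD_KUBERNETES_KUBELET_HOST}:10255/pods"

A 401/403 on 10250 points at authorization (RBAC) rather than TLS.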

We also have the kubernetes_state integration, which queries the KSM pod and gets a different set of metrics: https://docs.datadoghq.com/agent/kubernetes/metrics/#kube-state-metrics

Disabling the TLS verification should not be needed if the correct certificates are used.
If it doesn't work, we would be happy to investigate. Please reach out to our support team if needed: support@datadoghq.com

For @bendrucker
Indeed, if your kubelet configuration doesn't use the certificate /etc/kubernetes/ca.crt, our integration won't work out of the box.
However, if you have access to the certificate and the key, you can mount them in the agent pod, and use these env vars to specify the new paths:
  • DD_KUBELET_CLIENT_CA: the path of the ca.crt
  • DD_KUBELET_CLIENT_CRT: the path of the crt
  • DD_KUBELET_CLIENT_KEY: the path of the key
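
For example, mounted from the node (a sketch; the hostPath and file names depend on your distribution):

  volumes:
    - name: kubelet-pki
      hostPath:
        path: /etc/kubernetes/pki   # illustrative; use wherever your kubelet certs live
  volumeMounts:
    - name: kubelet-pki
      mountPath: /etc/kubelet-pki
      readOnly: true
  env:
    - name: DD_KUBELET_CLIENT_CA
      value: /etc/kubelet-pki/ca.crt
    - name: DD_KUBELET_CLIENT_CRT
      value: /etc/kubelet-pki/kubelet.crt
    - name: DD_KUBELET_CLIENT_KEY
      value: /etc/kubelet-pki/kubelet.key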

Please reach out to our support team if you need further details: support@datadoghq.com

For @sridhar81
If DD_KUBELET_TLS_VERIFY=false doesn't work, it might not be a certificate issue.
Did you use the RBACs provided in our documentation?
https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/#configure-rbac-permissions
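
For reference, the kubelet-related rules in that ClusterRole look roughly like this (a sketch; the linked docs are authoritative):

  rules:
    - apiGroups: [""]
      resources:
        - nodes/metrics
        - nodes/spec
        - nodes/proxy
        - nodes/stats
      verbs: ["get"]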

Please reach out to our support team if more troubleshooting is needed: support@datadoghq.com

jcassee (Contributor, Author) commented Mar 8, 2019

@Simwar I determined that the certificate does not, unfortunately, have the plain hostname as its Common Name:

root@datadog-agent-4tfsc:/# gnutls-cli -p 10250 $DD_KUBERNETES_KUBELET_HOST
Processed 0 CA certificate(s).
Resolving '10.133.96.228:10250'...
Connecting to '10.133.96.228:10250'...
- Certificate type: X.509
- Got a certificate list of 2 certificates.
- Certificate[0] info:
 - subject `CN=priceless-easley-c7ti@1551951286', issuer `CN=priceless-easley-c7ti-ca@1551951286', serial 0x02, RSA key 2048 bits, signed using RSA-SHA256, activated `2019-03-07 08:34:46 UTC', expires `2020-03-06 08:34:46 UTC', pin-sha256="WfR4bITrk3Hh7r4ogdDSmzjFIbMDbf2+jsfg1Xb2QGo="
        Public Key ID:
                sha1:ab0f29c4eb8269623802dbb9e473bfc89ff73a72
                sha256:59f4786c84eb9371e1eebe2881d0d29b38c521b3036dfdbe8ec7e0d576f6406a
        Public Key PIN:
                pin-sha256:WfR4bITrk3Hh7r4ogdDSmzjFIbMDbf2+jsfg1Xb2QGo=

- Certificate[1] info:
 - subject `CN=priceless-easley-c7ti-ca@1551951286', issuer `CN=priceless-easley-c7ti-ca@1551951286', serial 0x01, RSA key 2048 bits, signed using RSA-SHA256, activated `2019-03-07 08:34:46 UTC', expires `2020-03-06 08:34:46 UTC', pin-sha256="F3NkU0M9gT1FID/MzxTFK8eohaAZrMElr1Yz1Bj1Rr4="
- Status: The certificate is NOT trusted. The certificate issuer is unknown. The name in the certificate does not match the expected.
*** PKI verification of server certificate failed...
*** Fatal error: Error in the certificate.

Also, the node hostname cannot be resolved from within the pod:

root@datadog-agent-4tfsc:/# host priceless-easley-c7ti
Host priceless-easley-c7ti not found: 3(NXDOMAIN)

chris-short commented:

Can confirm adding DD_KUBELET_TLS_VERIFY=false does indeed work.

For posterity: https://github.com/chris-short/wingedblade/blob/master/datadog-agent.yaml#L35

bitva77 commented Mar 25, 2019

Confirming that with Kubernetes 1.13 installed via kubeadm DD_KUBELET_TLS_VERIFY=false fixed my issues as well :)

stale bot commented Apr 24, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

Haadka commented Apr 27, 2019

I am running k8s v1.11.5 on DigitalOcean.
It says:

cannot validate certificate for <local-ip> because it doesn't contain any IP SANs
http: "Get http://<local-ip>:10255/pods: dial tcp 10.131.62.180:10255: connect: connection refused"

This confirms the findings of @jcassee.
DD_KUBELET_TLS_VERIFY=false did not work for me; I am getting a different error:

Failed to establish a new connection: [Errno 113] No route to host

along with the warning

Network "" not found, trying bridge IP instead

groodt commented May 6, 2019

We're seeing the same thing on EKS running 1.12. Setting DD_KUBELET_TLS_VERIFY=false does not work.

Anybody got a workaround for this?

Simwar (Contributor) commented May 6, 2019

Hi @groodt,
This may be because the agent is not authorized to reach the kubelet.
Did you use the RBACs provided in our documentation?
https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/#configure-rbac-permissions
Make sure the correct service account name is used in the daemonset.

If it still doesn't work after redeploying the agent with the RBACs provided and the correct service account, feel free to reach out to our support team: support@datadoghq.com

stale bot commented Jun 5, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

apeschel commented:

@jonhoare I tried to confirm what you wrote here:

> It appears that AKS has changed the location of the Kubelet Client CA Cert, at least between AKS 1.16.7 and 1.16.9.

However, I compared a 1.14 AKS cluster to a 1.16 cluster, and the path to the cert appeared identical on both. Both the 1.14 cluster and the 1.16 cluster had a certificate at /etc/kubernetes/certs/kubeletserver.crt, and I could find no evidence of a certificate anywhere else on the 1.14 node.

Setting DD_KUBELET_TLS_VERIFY=false fixes the issue on 1.16, which suggests that the problem is somehow related to TLS, but the location of the certificate doesn't seem to be the problem.

PSanetra commented:

@apeschel according to the config_template.yaml, /var/run/secrets/kubernetes.io/serviceaccount/ca.crt is the default value of kubelet_client_ca. Can you verify that /etc/kubernetes/certs/kubeletserver.crt was signed by that CA in 1.14, while it is self-signed in 1.16?

apeschel commented Aug 14, 2020

@PSanetra It appears that /etc/kubernetes/certs/kubeletserver.crt is self-signed on both 1.14 and 1.16:

(Note: in these examples, I mounted /etc/kubernetes on the base host to /opt/etc/kubernetes in the Datadog agent container)

1.16:

# openssl x509 -in /opt/etc/kubernetes/certs/kubeletserver.crt -noout -subject_hash -issuer_hash
85362e12
85362e12
# openssl x509 -in /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -noout -subject_hash -issuer_hash
56c899cd
56c899cd

1.14:

# openssl x509 -in /opt/etc/kubernetes/certs/kubeletserver.crt -noout -subject_hash -issuer_hash
c7c32b8a
c7c32b8a
# openssl x509 -in /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -noout -subject_hash -issuer_hash
56c899cd
56c899cd

apeschel commented Aug 14, 2020

Further, the proposed cause for this issue was that the file /etc/kubernetes/ca.crt was moved on the base node, but I can find no evidence of this file existing prior to 1.16, which makes me seriously doubt this is actually the cause.

(Note: /etc/kubernetes from the node is mounted to /opt/etc/kubernetes in these examples)

1.14:

# ls /opt/etc/kubernetes/ca.crt
ls: cannot access '/opt/etc/kubernetes/ca.crt': No such file or directory

Here's a comparison of /etc/kubernetes on 1.14 and 1.16. The contents appear identical to me:

1.16:

# find /opt/etc/kubernetes/ | sort
/opt/etc/kubernetes/
/opt/etc/kubernetes/azure.json
/opt/etc/kubernetes/certs
/opt/etc/kubernetes/certs/apiserver.crt
/opt/etc/kubernetes/certs/ca.crt
/opt/etc/kubernetes/certs/client.crt
/opt/etc/kubernetes/certs/client.key
/opt/etc/kubernetes/certs/kubeletserver.crt
/opt/etc/kubernetes/certs/kubeletserver.key
/opt/etc/kubernetes/manifests
/opt/etc/kubernetes/volumeplugins

1.14:

# find /opt/etc/kubernetes/ | sort
/opt/etc/kubernetes/
/opt/etc/kubernetes/azure.json
/opt/etc/kubernetes/certs
/opt/etc/kubernetes/certs/apiserver.crt
/opt/etc/kubernetes/certs/ca.crt
/opt/etc/kubernetes/certs/client.crt
/opt/etc/kubernetes/certs/client.key
/opt/etc/kubernetes/certs/kubeletserver.crt
/opt/etc/kubernetes/certs/kubeletserver.key
/opt/etc/kubernetes/manifests
/opt/etc/kubernetes/volumeplugins

apeschel commented Aug 14, 2020

I dug into the cause that @mopalinski suggested and was able to verify that it is the actual cause of the breakage on AKS. All the discussion about moved CA files and self-signed certificates is incorrect and misleading.

The truth is that Datadog has never worked correctly on AKS and has been silently relying on the unsecured kubelet fallback port this whole time. The removal of this insecure port has only revealed that Datadog has been broken all along.

It appears this unsecured fallback port was removed at some point in the AKS 1.16 line, which is what ultimately exposed the problem with the Datadog agent. It's trivially easy to verify that this is the actual cause:

1.14:

# nc -z -v -w 1 "$DD_KUBERNETES_KUBELET_HOST" 10255
Connection to 10.240.0.9 10255 port [tcp/*] succeeded!

1.16:

# nc -z -v -w 1 "$DD_KUBERNETES_KUBELET_HOST" 10255
nc: connect to 10.240.0.5 port 10255 (tcp) failed: Connection refused

Datadog should hopefully prioritize a fix for this problem on their end, since it actually affects all versions of AKS.

Until then, the simplest workaround is to set DD_KUBELET_TLS_VERIFY=false, or to use the self-signed certificate as its own trusted CA.
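
The latter can be done by extracting the kubelet's serving certificate and handing it to the agent; roughly (a sketch):

  # Grab the kubelet's self-signed serving cert
  openssl s_client -connect "${DD_KUBERNETES_KUBELET_HOST}:10250" </dev/null 2>/dev/null \
    | openssl x509 -outform PEM > kubelet.crt
  # Mount kubelet.crt into the agent pod and point DD_KUBELET_CLIENT_CA at it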

TheHodge1234 commented:

Some more info for you all on this, direct from MS:

Even with AKS version 1.16.x, the kubelet is accessible over HTTP on port 10255 if the cluster was upgraded from a previous version.
If you launch a new node pool on this version, the kubelet is not accessible over port 10255. This is mentioned on the following GitHub Issues page:

The plan to discontinue this has been rolled out and is expected to take full effect in upcoming versions (1.18.x).

I did a repro in my lab environment and found that the new version of AKS does not allow access to the kubelet over plain HTTP, and that port 10255 is discontinued.

I launched a cluster with version 1.17.5:
PS C:\Users\rissing> k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP
aks-nodepool1-20955998-vmss000000 Ready agent 17m v1.17.5 10.240.0.4

Then I tried to access the plain HTTP port 10255 for kubelet:
root@aks-ssh:/# curl http://10.240.0.4:10255/pods
curl: (7) Failed to connect to 10.240.0.4 port 10255: Connection refused


I can confirm that on my node pools upgraded to 1.16.x, the kubelet checks do work if I have DD_KUBELET_TLS_VERIFY=false set. However, on brand-new node pools I can't get any access to the kubelet via Datadog.

ams0 commented Aug 21, 2020

If anyone is looking for how to deploy on AKS with DD_KUBELET_TLS_VERIFY using Helm, here's a handy command line:

 helm upgrade --install dd datadog/datadog  --set datadog.apiKey=<apikey> \
--set agents.containers.agent.env[0].name=DD_KUBELET_TLS_VERIFY  \
--set-string agents.containers.agent.env[0].value="false"

I can see my Kubernetes metrics in DD now!

onpaws commented Aug 21, 2020

For those of you deploying DD with Helm, i.e.
helm upgrade datadog -f values.yaml datadog/datadog

here's a sample you can copy into your values.yaml:

  containers:
    agent:
      ## @param env - list - required
      ## Additional environment variables for the agent container.
      #
      env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"

SychevIgor commented:

As a workaround, disabling TLS is OK, but I'm not sure it's a recommended, production-ready approach; otherwise, why would this TLS verification exist?

lognarly commented:

For AKS version 1.17.9, disabling TLS verify appears to work. The solution provided by @jonhoare appears to work for Linux nodes, but I am not positive it is the same for Windows node pools. I have attempted mounting C:\var\lib\kubelet\pki\kubelet.crt with DD_KUBELET_CLIENT_CA set, and the error still appears. When I use the following config, the kubernetes_state* metrics come in for the Windows node, but are shown under a separate host tagged as host:-<cluster_name>. The kubernetes* metrics still do not come in, though.

  volumes: 
    - name: kubelet-certs
      hostPath:
        path: C:\var\lib\kubelet\pki
        type: ''
    - name: kubelet-ca
      hostPath:
        path: C:\k
        type: ''
  volumeMounts:
    - name: kubelet-certs
      readOnly: true
      mountPath: C:\kubelet_certs
    - name: kubelet-ca
      readOnly: true
      mountPath: C:\kubelet_ca
  env: 
    - name: DD_KUBELET_CLIENT_CRT
      value: C:\kubelet_certs\kubelet.crt
    - name: DD_KUBELET_CLIENT_KEY
      value: C:\kubelet_certs\kubelet.key
    - name: DD_KUBELET_CLIENT_CA
      value: C:\kubelet_ca\ca.crt

josefschabasser commented Sep 17, 2020

Hi! I have 2 clusters here:

  • one upgraded from an earlier release to 1.17.9
  • one created directly on 1.17.9

The solution from @jonhoare works for the upgraded one, but not for the newly created one.
Both initially showed the issue, and both were "fixed" by this solution. But now one always fails, while the other works great.
What's going on there?

Grml...

2020-09-17 11:08:28 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: cannot set a valid kubelet host: cannot connect to kubelet using any of the given hosts: [10.122.0.35] [aks-nodepool1-xxxxx-vmss000001], Errors: [Get https://10.122.0.35:10250/pods: x509: cannot validate certificate for 10.122.0.35 because it doesn't contain any IP SANs Get https://aks-nodepool1-xxxxx-vmss000001:10250/: dial tcp: lookup aks-nodepool1-xxxxx-vmss000001: no such host cannot connect: http: "Get http://10.122.0.35:10255/: dial tcp 10.122.0.35:10255: connect: connection refused" cannot connect: http: "Get http://aks-nodepool1-xxxxx-vmss000001:10255/: dial tcp: lookup aks-nodepool1-xxxxx-vmss000001: no such host"]

EDIT 2:
az aks rotate-certs fixed the issue. How come the certificates don't contain host addresses?
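
For anyone else hitting this, the invocation is roughly:

  az aks rotate-certs --resource-group <resource-group> --name <cluster-name>

(note that rotating certificates can briefly disrupt the cluster, so plan accordingly)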

crielly commented Oct 5, 2020

Just wanted to relate my experience using EKS 1.17/eks.3:

Experienced this issue deploying using the instructions here: https://docs.datadoghq.com/agent/cluster_agent/setup/?tab=secret

I basically did this:

  • Converted all manifests from yaml to hcl (I'm deploying using Terraform, yeah I know)
  • installed kube-state-metrics
  • spun wheels on this error for a while

Eventually I noticed that the pods for both the cluster-agent and the node-agents weren't mounting anything at /var/run/secrets/kubernetes.io/serviceaccount, resulting in a failure to authenticate to the kubelet. The "unable to detect kubelet URL" error was actually a symptom of this problem.

This turned out to be a quirk of the Terraform kubernetes provider; the fix was to specify automount_service_account_token = true for both the cluster-agent deployment and the node-agent daemonset. The agents could then successfully authenticate to the kubelets to get metrics, and the spice began to flow.

Note that I did not have to disable DD_KUBELET_TLS_VERIFY 🎉
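
For anyone else on the Terraform provider, the flag goes on the pod spec; a minimal sketch (resource and names illustrative):

  resource "kubernetes_daemonset" "datadog_agent" {
    # ... metadata, selector, etc.
    spec {
      template {
        spec {
          automount_service_account_token = true
          # ... containers, volumes, etc.
        }
      }
    }
  }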

marciocamurati commented Oct 12, 2020

> (quoting @crielly's comment above)

@crielly Can you share the mount configuration that Terraform generated for the deployment in your EKS cluster?

crielly commented Oct 14, 2020

@marciocamurati

> Can you share the mount configuration that Terraform generated for the deployment in your EKS cluster?

All I did was take the YAML manifest from the link above and run curl manifest-url.yml | k2tf

Then I added automount_service_account_token = true to spec.template.spec {} in the cluster-agent deployment resource and the node-agent daemonset.

discordianfish commented:

I'm the last person to casually disable TLS verification, but in this case, with the connection staying on localhost, it shouldn't be a big deal. Or am I missing something?

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that this will not be automatically closed, but the notification will remind us to investigate why there's been inactivity.

If you would like this issue to remain open:

  1. Verify that you can still reproduce the issue in the latest version of the integration.
  2. Comment that the issue is still reproducible and include updated details if possible.

Thank you for participating in the Datadog open source community!

joelharkes commented:

@crielly does this setup (automount_service_account_token = true) work for Helm 3 as well?

JS-Jake commented Apr 30, 2021

We're using a custom DNS server with private DNS zones on our VNet and have run into a similar issue.
To fix it we:

  • applied the fix mentioned above
  • added a dnsConfig mapping with a search suffix for our private DNS zone:

      dnsConfig:
        searches:
          - foo.bar.com

  • finally, updated DD_KUBERNETES_KUBELET_HOST to fieldPath: spec.nodeName

apeschel commented Jul 20, 2021

This is still an issue, and the root problem is still the same: the method the Datadog image uses for TLS verification is still completely broken, and the most viable workaround at the moment is to just disable TLS verification.

For those using the Datadog Helm chart, you can work around it by setting:

datadog:
  kubelet:
    tlsVerify: false

jmturwy commented Aug 27, 2021

>   - name: DD_KUBERNETES_KUBELET_HOST
>     valueFrom:
>       fieldRef:
>         fieldPath: spec.nodeName

This is the case for my AKS clusters as well. Changing to:

  - name: DD_KUBERNETES_KUBELET_HOST
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

resolved the issue.

nemethloci commented Oct 13, 2021

Dear Datadog team: would it be possible to implement #2582 (comment)?

I've tested with EKS 1.21 and Datadog 2.22.15, and it solved my issue. The solution could be something like this for the node agent (the same applies to the cluster agent):

--- daemonset.yaml	2021-10-07 08:45:08.000000000 +0000
+++ daemonset_new.yaml	2021-10-13 08:41:48.868214040 +0000
@@ -56,6 +56,7 @@
 {{ tpl (toYaml .Values.agents.podAnnotations) . | indent 8 }}
       {{- end }}
     spec:
+      automountServiceAccountToken: true
       {{- if .Values.datadog.securityContext }}
       securityContext:
 {{ toYaml .Values.datadog.securityContext| indent 8 }}

Without the above patch, one either manually modifies the agent daemonset/deployment resources as above, or one has to disable TLS verification, which is by no means a best practice IMHO.

vboulineau (Contributor) commented:

Hello,

Multiple issues were reported over time in this thread, so we've added documentation dedicated to Kubernetes distribution specifics (including AKS spec.nodeName) here:
https://docs.datadoghq.com/agent/kubernetes/distributions/?tab=helm

One note about automountServiceAccountToken: it's true by default since Kubernetes 1.5, which is why it's not included in our Helm chart.
We'll explicitly add it in case some hardened setups change this default to false.

Feel free to open more dedicated issues or contact our support if your issue is not solved.
