Missing datadog metrics for the prodpublick8s AKS cluster #3123

Closed
dduportal opened this issue Sep 9, 2022 · 7 comments

Comments

@dduportal
Contributor

dduportal commented Sep 9, 2022

Service(s)

Azure, Other

Summary

Datadog dashboards are only reporting a partial set of the metrics from the prodpublick8s cluster, one of the 2 Azure Kubernetes clusters, compared to the AWS and Digital Ocean clusters, as you can see in the screenshots below:

Screenshot 2022-09-09 at 09 54 40
Screenshot 2022-09-09 at 09 54 29

Screenshot 2022-09-09 at 09 54 12

In the short term, we have our own Grafana installation to gather metrics for prodpublick8s, but it is running... in this very cluster.

Reproduction steps

No response

dduportal added the triage label Sep 9, 2022
dduportal self-assigned this Sep 9, 2022
dduportal added this to the infra-team-sync-2022-09-13 milestone Sep 9, 2022
dduportal removed the triage label Sep 9, 2022
@dduportal
Contributor Author

dduportal commented Sep 9, 2022

The logs of the cluster-agent pods for prodpublick8s are filled with the following error message:

2022-09-09 08:04:42 UTC | CLUSTER | INFO | (pkg/clusteragent/admission/controllers/webhook/controller_base.go:171 in processNextWorkItem) | Couldn't reconcile Webhook datadog-webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io "datadog-webhook": the object has been modified; please apply your changes to the latest version and try again

which looks like DataDog/datadog-agent#10413 and DataDog/datadog-agent#10764
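
For context, "the object has been modified; please apply your changes to the latest version and try again" is the Kubernetes API server's standard rejection of a write based on a stale `resourceVersion`. The usual client-side answer is to re-read the object and re-apply the change. Below is a minimal in-memory sketch of that retry loop, not the cluster-agent's actual code: `FakeStore`, `update_with_retry`, and the `before_write` hook (used only to simulate a concurrent writer) are hypothetical names.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale version."""


class FakeStore:
    """Hypothetical in-memory stand-in for an API server holding one object."""

    def __init__(self, data):
        self.version = 1
        self.data = dict(data)

    def get(self):
        # Return a snapshot together with the version it was read at.
        return self.version, dict(self.data)

    def update(self, read_version, new_data):
        # Reject writes that were computed from an outdated snapshot.
        if read_version != self.version:
            raise ConflictError("the object has been modified")
        self.data = dict(new_data)
        self.version += 1


def update_with_retry(store, mutate, max_attempts=5, before_write=None):
    """Re-read and re-apply `mutate` until the write is accepted.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        version, data = store.get()
        mutate(data)
        if before_write is not None:
            before_write(attempt)  # test hook: lets another writer sneak in
        try:
            store.update(version, data)
            return attempt
        except ConflictError:
            continue  # stale snapshot: loop and read the latest version
    raise ConflictError(f"still conflicting after {max_attempts} attempts")
```

A single conflict followed by a successful retry is harmless. The log above suggests the reconcile loop kept hitting the conflict repeatedly, which is what the linked upstream issues describe.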

@dduportal
Contributor Author

Closing as it is now working.

dduportal closed this as not planned Oct 26, 2022
dduportal changed the title from "Missing datadog metrics for AKS clusters" to "Missing datadog metrics for the prodpublick8s AKS cluster" Nov 4, 2022
dduportal added this to the infra-team-sync-2022-11-08 milestone Nov 4, 2022
dduportal reopened this Nov 4, 2022
@dduportal
Contributor Author

dduportal commented Nov 4, 2022

While working on the Datadog integration with the artifact caching proxy (#2752), we discovered that the Datadog agents of this cluster are failing with the following error:

2022-11-04 17:41:05 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:kubelet | Error running check: [{"message": "Unable to detect the kubelet URL automatically: impossible to reach Kubelet with host: aks-agentpool-34540872-vmss000123. Please check if your setup requires kubelet_tls_verify = false. Activate debug logs to see all attempts made", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1116, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 311, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically: \" + kubelet_conn_info.get('err', ''))\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically: impossible to reach Kubelet with host: aks-agentpool-34540872-vmss000123. Please check if your setup requires kubelet_tls_verify = false. Activate debug logs to see all attempts made\n"}]
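
The useful part of that line is buried in a JSON payload after the `Error running check: ` prefix. Here is a minimal sketch for pulling it out, assuming the pipe-separated layout seen in the log above; `parse_check_error` is a hypothetical helper, and the sample line is abridged from the one in this comment (traceback removed).

```python
import json

PREFIX = "Error running check: "


def parse_check_error(line):
    """Split an agent log line and decode its JSON error payload."""
    # Layout assumed from the log above:
    # timestamp | component | level | (source location) | check | message
    fields = [part.strip() for part in line.split(" | ", 5)]
    if len(fields) < 6:
        raise ValueError("unexpected log line layout")
    timestamp, component, level, location, check, rest = fields
    if not rest.startswith(PREFIX):
        raise ValueError("no check error payload on this line")
    errors = json.loads(rest[len(PREFIX):])
    return {
        "timestamp": timestamp,
        "component": component,
        "level": level,
        "check": check.split(":", 1)[-1],  # "check:kubelet" -> "kubelet"
        "messages": [error["message"] for error in errors],
    }


# Abridged sample: same fields as the real line, traceback removed.
sample = (
    "2022-11-04 17:41:05 UTC | CORE | ERROR | "
    "(pkg/collector/worker/check_logger.go:69 in Error) | check:kubelet | "
    'Error running check: [{"message": "Unable to detect the kubelet URL '
    "automatically: impossible to reach Kubelet with host: "
    'aks-agentpool-34540872-vmss000123.", "traceback": "(abridged)"}]'
)

parsed = parse_check_error(sample)
print(parsed["check"], parsed["level"])
print(parsed["messages"][0])
```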

There are also a LOT of warnings like the one below, related to our custom Python checks (embedded in the Docker image from jenkins-infra/docker-datadog):

2022-11-04 17:41:05 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:125 in LogMessage) | http_check:plugins.jenkins.io:f6eaba3b8709a028 | (http.py:388) | An unverified HTTPS request is being made to https://plugins.jenkins.io/
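
To put a number on "a LOT", the agent log can be tallied by level and check. This is a rough sketch, assuming the same pipe-separated layout as the lines in this comment, with the level in the third field and the check identifier in the fifth. `count_by_level_and_check` is a hypothetical helper, and the sample messages are abridged.

```python
from collections import Counter


def count_by_level_and_check(lines):
    """Count log lines per (level, check identifier) pair."""
    counts = Counter()
    for line in lines:
        fields = [part.strip() for part in line.split(" | ")]
        if len(fields) < 5:
            continue  # not an agent log line in the expected layout
        level, check_id = fields[2], fields[4]
        counts[(level, check_id)] += 1
    return counts


# Abridged sample lines modelled on the two log excerpts in this comment.
log_lines = [
    "2022-11-04 17:41:05 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:kubelet | Error running check: (abridged)",
    "2022-11-04 17:41:05 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:125 in LogMessage) | http_check:plugins.jenkins.io:f6eaba3b8709a028 | (http.py:388) | An unverified HTTPS request is being made to https://plugins.jenkins.io/",
]

for (level, check_id), total in count_by_level_and_check(log_lines).most_common():
    print(level, check_id, total)
```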

@dduportal
Contributor Author

dduportal commented Nov 4, 2022

@dduportal
Contributor Author

By the way: https://docs.datadoghq.com/agent/troubleshooting/debug_mode/?tab=agentv6v7#containerized-agent is really useful to enable debug logs on a given agent while it is running.

@dduportal
Contributor Author

Applied and fixed 🥳

Screenshot 2022-11-04 at 20 12 21
