
"Failed to scrape node" err="Get \"https://10.100.93.58:10250/metrics/resource\": context deadline exceeded" #1352

Open
MahiraTechnology opened this issue Oct 25, 2023 · 14 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@MahiraTechnology

What happened:
Pods are not scaling based on load, which is causing the pods to restart.

What you expected to happen:
HPA should scale based on the load
Anything else we need to know?:

Environment:
Running on EKS 1.27
metrics-server 0.6.3

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS
  • Container Network Setup (flannel, calico, etc.): AWS VPC CNI
  • Kubernetes version (use kubectl version): 1.27
  • Metrics Server manifest

spoiler for Metrics Server manifest:

Using the Helm chart (metrics-server-3.10.0)

  • Kubelet config:
spoiler for Kubelet config:
  • Metrics server logs:
spoiler for Metrics Server logs:

I1025 19:47:59.616036 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E1025 19:48:28.004348 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.55.152:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-55-152.ca-central-1.compute.internal"
E1025 19:48:58.004680 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": dial tcp 10.100.48.155:10250: i/o timeout" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:13.005190 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:28.003975 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:53:29.599618 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.78.163:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-78-163.ca-central-1.compute.internal"
E1025 19:54:44.588439 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.50.210:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-50-210.ca-central-1.compute.internal"
E1025 19:55:28.004773 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.73.41:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-73-41.ca-central-1.compute.internal"

  • Status of Metrics API:
spoiler for Status of Metrics API:

kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.6.3
              helm.sh/chart=metrics-server-3.10.0
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-07-03T08:59:27Z
  Resource Version:    82108513
  UID:                 de273b86-9ba6-4d8d-929c-b972d87717e1
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2023-10-25T19:48:23Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 25, 2023
@brosef

brosef commented Oct 25, 2023

Are you using any CNI plugin (calico, weave, vpc-cni, etc.)? If so, setting hostNetwork: true in your deployment might help.
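For reference, a minimal sketch of that override as a Helm values file. The nested hostNetwork.enabled key is assumed from the metrics-server chart's values; verify it against the chart version you actually have installed:

```yaml
# values.yaml override for the metrics-server Helm chart (sketch).
# hostNetwork.enabled is the assumed chart key for running the pod
# on the host network instead of the pod (VPC CNI) network.
hostNetwork:
  enabled: true
```

Applied with something like `helm upgrade --install metrics-server metrics-server/metrics-server -n kube-system -f values.yaml`.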

@MahiraTechnology
Author

Yes, we're using the AWS VPC CNI. Let me try setting hostNetwork: true in the metrics-server deployment and update with the observation here.

@brosef

brosef commented Oct 25, 2023

@MahiraTechnology Let me know how it goes - we're on EKS and noticed in newer versions of the vpc-cni plugin, there was a communication breakdown somewhere between the pod VPC (where metrics-server runs), the node VPC, and the control plane. After setting hostnetwork: true, everything worked A-Ok.

@MahiraTechnology
Author

@brosef I tried deploying the metrics server with hostNetwork: true and started seeing the issue below.

panic: failed to create listener: failed to listen on 0.0.0.0:10250: listen tcp 0.0.0.0:10250: bind: address already in use
goroutine 1 [running]:
main.main()
/go/src/sigs.k8s.io/metrics-server/cmd/metrics-server/metrics-server.go:37 +0xa5

@brosef

brosef commented Oct 26, 2023

You probably have to change the port to something else; 10250 will clash with the kubelet API port. Try setting containerPort: 4443.
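Putting the two suggestions together, a hedged values.yaml sketch. The key names are assumed from the metrics-server chart, and the chart is assumed to wire containerPort through to the server's --secure-port flag; check your chart version's values to confirm:

```yaml
# Sketch: host networking plus a non-conflicting port, so metrics-server
# does not collide with the kubelet's 10250 while sharing the host network.
hostNetwork:
  enabled: true
containerPort: 4443  # assumed to map to --secure-port in the chart
```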

@MahiraTechnology
Author

@brosef I deployed with port 4443 but I am still seeing the same issue in the metrics-server pod.

I1026 18:09:39.580732 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I1026 18:09:40.087766 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I1026 18:09:40.087788 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I1026 18:09:40.087790 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I1026 18:09:40.087800 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.087777 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1026 18:09:40.087969 1 secure_serving.go:267] Serving securely on [::]:4443
I1026 18:09:40.088007 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1026 18:09:40.087979 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I1026 18:09:40.087993 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
W1026 18:09:40.088063 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I1026 18:09:40.188199 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I1026 18:09:40.188217 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.188231 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
E1026 18:23:38.585351 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.69.44:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-69-44.ca-central-1.compute.internal"

@brosef

brosef commented Oct 26, 2023

Check your security group and firewall rules. Ensure TCP 10250 is open between nodes.
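On EKS that usually means an inbound rule on the node security group. A sketch with the AWS CLI; the security group ID and CIDR below are placeholders, not values taken from this issue:

```shell
# Allow kubelet scrapes (TCP 10250) between nodes in the VPC.
# sg-0123456789abcdef0 and 10.100.0.0/16 are hypothetical; substitute
# your node security group ID and your VPC CIDR.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 10250 \
  --cidr 10.100.0.0/16
```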

@MahiraTechnology
Author

@brosef
I see the error below in the events:

network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

@brosef

brosef commented Oct 26, 2023

that could mean one of many things. try running through this article: https://repost.aws/knowledge-center/eks-cni-plugin-troubleshooting

@MahiraTechnology
Author

@brosef I went through the link shared above; everything looks OK. I see the error messages below on the HPA:

-failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
-invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
-failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
-failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

The metrics server continues to print the same logs as before.
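Those HPA errors all reduce to "the resource metrics API is not answering". A few standard kubectl checks (nothing specific to this cluster) can narrow down whether the API service registration or the node scrapes are at fault:

```shell
# Is the APIService registered and reporting Available?
kubectl get apiservice v1beta1.metrics.k8s.io

# Does the metrics API return node data at all?
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

# Are per-pod metrics coming through?
kubectl top pods -n kube-system
```

If the raw call times out while the APIService shows Available, the problem is usually metrics-server failing to reach the kubelets, which would match the scraper errors above.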

@dashpole

dashpole commented Nov 2, 2023

/assign @CatherineF-dev @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Nov 2, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 2, 2023
@tengqm

tengqm commented Nov 5, 2023

Connecting to the node hostname or IP from within the metrics-server Pod is the problem for me as well. I'm facing the same issue when using flannel.

@vgokul984

vgokul984 commented Nov 9, 2023

@MahiraTechnology After opening ports 10250 and 443 on the node security group with the VPC CIDR as the source range, the issue was fixed.

@brankodjurkic

brankodjurkic commented Mar 2, 2024

Same as @vgokul984.
I have containerPort: 4443 and hostNetwork enabled in values.yaml.
Opening port 10250 on the node security group, both inbound and outbound, solved the issue:

kubectl top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-144-0-146.eu-west-1.compute.internal   45m          2%     2256Mi          31%
ip-10-144-0-17.eu-west-1.compute.internal    108m         5%     3244Mi          45%
ip-10-144-0-97.eu-west-1.compute.internal    78m          4%     3510Mi          49%
