
"Failed to scrape node" err="Get \"https://10.100.93.58:10250/metrics/resource\": context deadline exceeded" #1352

Open
MahiraTechnology opened this issue Oct 25, 2023 · 14 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@MahiraTechnology

What happened:
Pods are not scaling based on load, which is causing the pods to restart.

What you expected to happen:
HPA should scale based on the load
Anything else we need to know?:

Environment:
Running on EKS 1.27
metrics-server 0.6.3

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS
  • Container Network Setup (flannel, calico, etc.): AWS VPC CNI
  • Kubernetes version (use kubectl version): 1.27
  • Metrics Server manifest

spoiler for Metrics Server manifest:

Using the Helm chart (metrics-server-3.10.0)

  • Kubelet config:
spoiler for Kubelet config:
  • Metrics server logs:
spoiler for Metrics Server logs:

I1025 19:47:59.616036 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E1025 19:48:28.004348 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.55.152:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-55-152.ca-central-1.compute.internal"
E1025 19:48:58.004680 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": dial tcp 10.100.48.155:10250: i/o timeout" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:13.005190 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:28.003975 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:53:29.599618 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.78.163:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-78-163.ca-central-1.compute.internal"
E1025 19:54:44.588439 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.50.210:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-50-210.ca-central-1.compute.internal"
E1025 19:55:28.004773 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.73.41:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-73-41.ca-central-1.compute.internal"

  • Status of Metrics API:
spoiler for Status of Metrics API:

kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.6.3
              helm.sh/chart=metrics-server-3.10.0
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-07-03T08:59:27Z
  Resource Version:    82108513
  UID:                 de273b86-9ba6-4d8d-929c-b972d87717e1
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2023-10-25T19:48:23Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 25, 2023
@brosef

brosef commented Oct 25, 2023

Are you using any CNI plugin (calico, weave, vpc-cni, etc.)? If so, setting hostNetwork: true in your deployment might help.
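For reference, a minimal sketch of that override as a Helm values file. The nested hostNetwork.enabled key is assumed from the metrics-server chart's values; verify it against the chart version you actually have installed:

```yaml
# values.yaml override for the metrics-server Helm chart (sketch).
# hostNetwork.enabled is the assumed chart key for running the pod
# on the host network instead of the pod (VPC CNI) network.
hostNetwork:
  enabled: true
```

Applied with something like `helm upgrade --install metrics-server metrics-server/metrics-server -n kube-system -f values.yaml`.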

@MahiraTechnology
Author

Yes, we're using the AWS VPC CNI. Let me try setting hostNetwork: true in the metrics-server deployment and update with the observation here.

@brosef

brosef commented Oct 25, 2023

@MahiraTechnology Let me know how it goes - we're on EKS and noticed in newer versions of the vpc-cni plugin, there was a communication breakdown somewhere between the pod VPC (where metrics-server runs), the node VPC, and the control plane. After setting hostnetwork: true, everything worked A-Ok.

@MahiraTechnology
Author

@brosef I tried deploying the metrics server with hostNetwork: true and started seeing the issue below.

panic: failed to create listener: failed to listen on 0.0.0.0:10250: listen tcp 0.0.0.0:10250: bind: address already in use
goroutine 1 [running]:
main.main()
/go/src/sigs.k8s.io/metrics-server/cmd/metrics-server/metrics-server.go:37 +0xa5

@brosef

brosef commented Oct 26, 2023

You probably have to change the port to something else; 10250 will clash with the kubelet API port. Try setting containerPort: 4443.
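Putting the two suggestions together, a hedged values.yaml sketch. The key names are assumed from the metrics-server chart, and the chart is assumed to wire containerPort through to the server's --secure-port flag; check your chart version's values to confirm:

```yaml
# Sketch: host networking plus a non-conflicting port, so metrics-server
# does not collide with the kubelet's 10250 while sharing the host network.
hostNetwork:
  enabled: true
containerPort: 4443  # assumed to map to --secure-port in the chart
```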

@MahiraTechnology
Author

@brosef I deployed with port 4443 but I am still seeing the same issue in the metrics-server pod.

I1026 18:09:39.580732 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I1026 18:09:40.087766 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I1026 18:09:40.087788 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I1026 18:09:40.087790 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I1026 18:09:40.087800 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.087777 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1026 18:09:40.087969 1 secure_serving.go:267] Serving securely on [::]:4443
I1026 18:09:40.088007 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1026 18:09:40.087979 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I1026 18:09:40.087993 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
W1026 18:09:40.088063 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I1026 18:09:40.188199 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I1026 18:09:40.188217 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.188231 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
E1026 18:23:38.585351 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.69.44:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-69-44.ca-central-1.compute.internal"

@brosef

brosef commented Oct 26, 2023

Check your security group and firewall rules. Ensure TCP 10250 is open between nodes.
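On EKS that usually means an inbound rule on the node security group. A sketch with the AWS CLI; the security group ID and CIDR below are placeholders, not values taken from this issue:

```shell
# Allow kubelet scrapes (TCP 10250) between nodes in the VPC.
# sg-0123456789abcdef0 and 10.100.0.0/16 are hypothetical; substitute
# your node security group ID and your VPC CIDR.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 10250 \
  --cidr 10.100.0.0/16
```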

@MahiraTechnology
Author

@brosef
I see the error below in the events:

network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

@brosef

brosef commented Oct 26, 2023

that could mean one of many things. try running through this article: https://repost.aws/knowledge-center/eks-cni-plugin-troubleshooting

@MahiraTechnology
Author

@brosef I went through the link shared above; everything looks OK. I see the error messages below on the HPA:

-failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
-invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
-failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
-failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

The metrics server continues to print the same logs as before.
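Those HPA errors all reduce to "the resource metrics API is not answering". A few standard kubectl checks (nothing specific to this cluster) can narrow down whether the API service registration or the node scrapes are at fault:

```shell
# Is the APIService registered and reporting Available?
kubectl get apiservice v1beta1.metrics.k8s.io

# Does the metrics API return node data at all?
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

# Are per-pod metrics coming through?
kubectl top pods -n kube-system
```

If the raw call times out while the APIService shows Available, the problem is usually metrics-server failing to reach the kubelets, which would match the scraper errors above.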

@dashpole

dashpole commented Nov 2, 2023

/assign @CatherineF-dev @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Nov 2, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 2, 2023
@tengqm

tengqm commented Nov 5, 2023

Connecting to the node hostname or IP from within the metrics-server Pod is the problem for me as well. I'm facing the same issue when using flannel.

@vgokul984

vgokul984 commented Nov 9, 2023

@MahiraTechnology After opening ports 10250 and 443 on the node security group with the VPC CIDR as the source range, the issue was fixed.

@brankodjurkic

brankodjurkic commented Mar 2, 2024

Same as @vgokul984.
I have containerPort: 4443 and hostNetwork enabled in values.yaml.
Opening port 10250 on the node security group, both inbound and outbound, solved the issue:

kubectl top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-144-0-146.eu-west-1.compute.internal   45m          2%     2256Mi          31%
ip-10-144-0-17.eu-west-1.compute.internal    108m         5%     3244Mi          45%
ip-10-144-0-97.eu-west-1.compute.internal    78m          4%     3510Mi          49%
