metrics and objects deployments generating tons of zombie processes and using up cluster node process limits #857

gvoden opened this issue Apr 27, 2023 · 1 comment

gvoden commented Apr 27, 2023

What happened:
Deploying the metrics, metrics aggregator, and kube-objects components (all images tagged 1.2.1) seems to cause large numbers of zombie processes to accumulate on the cluster node hosting the deployment, until the node is eventually overwhelmed and crashes (Amazon EKS 1.22).
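
For context on how this was observed: the zombie count on an affected node can be checked directly on the node (over SSH or a debug shell). This is a generic check, not specific to this chart:

# List processes in the Z (zombie) state together with their parent PIDs.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
# Or just count them:
ps -eo stat= | grep -c '^Z'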

What you expected to happen:
Metrics and object collection should function normally.

How to reproduce it (as minimally and precisely as possible):
Deploy Splunk Connect for Kubernetes with the values YAML below:

global:
  logLevel: info
  splunk:
    hec:
      host: http-inputs-hoopp.splunkcloud.com
      insecureSSL: false
      port: 443
      protocol: https
      token:
splunk-kubernetes-logging:
  enabled: true
  journalLogPath: /var/log/journal
  logs:
    isg-containers:
      logFormatType: cri
      from:
        container: isg-
        pod: '*'
      multiline:
        firstline: /^\d{4}-\d{2}-\d{2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3}/
      sourcetype: kube:container
      timestampExtraction:
        format: '%Y-%m-%d %H:%M:%S.%NZ'
        regexp: time="(?<time>\d{4}-\d{2}-\d{2}T[0-2]\d:[0-5]\d:[0-5]\d.\d{9}Z)"
  image:
    registry: docker.io
    name: splunk/fluentd-hec
    tag: 1.3.1
    pullPolicy: Always
  resources:
    limits:
      memory: 1.5Gi
  splunk:
    hec:
      indexName: eks_logs
splunk-kubernetes-metrics:
  image:
    registry: docker.io
    name: splunk/k8s-metrics
    tag: 1.2.1
    pullPolicy: Always
  imageAgg:
    registry: docker.io
    name: splunk/k8s-metrics-aggr
    tag: 1.2.1
    pullPolicy: Always
  rbac:
    create: true
  serviceAccount:
    create: true
    name: splunk-kubernetes-metrics
  splunk:
    hec:
      indexName: eks_metrics
splunk-kubernetes-objects:
  image:
    registry: docker.io
    name: splunk/kube-objects
    tag: 1.2.1
    pullPolicy: Always
  kubernetes:
    insecureSSL: true
  objects:
    apps:
      v1:
        - interval: 30s
          name: deployments
        - interval: 30s
          name: daemon_sets
        - interval: 30s
          name: replica_sets
        - interval: 30s
          name: stateful_sets
    core:
      v1:
        - interval: 30s
          name: pods
        - interval: 30s
          name: namespaces
        - interval: 30s
          name: nodes
        - interval: 30s
          name: services
        - interval: 30s
          name: config_maps
        - interval: 30s
          name: secrets
        - interval: 30s
          name: persistent_volumes
        - interval: 30s
          name: service_accounts
        - interval: 30s
          name: persistent_volume_claims
        - interval: 30s
          name: resource_quotas
        - interval: 30s
          name: component_statuses
        - mode: watch
          name: events
  rbac:
    create: true
  serviceAccount:
    create: true
    name: splunk-kubernetes-objects
  splunk:
    hec:
      indexName: eks_meta
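
For completeness, the values above are applied with a standard Helm install; something along these lines can be used (release name, namespace, and values file path are placeholders, and the repo URL is the one published for Splunk Connect for Kubernetes, so adjust if yours differs):

# Illustrative install; release name, namespace, and values.yaml path are placeholders.
helm repo add splunk https://splunk.github.io/splunk-connect-for-kubernetes/
helm install splunk-connect splunk/splunk-connect-for-kubernetes \
  --version 1.4.3 -n splunk-connect-k8s --create-namespace -f values.yaml
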
Anything else we need to know?:

Scaling the metrics and objects deployments down to 0 replicas makes the zombie processes disappear immediately.
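
For reference, that was done with a plain kubectl scale; the deployment names below are illustrative (they depend on the Helm release name), and the namespace matches our install:

# Deployment names are illustrative; they depend on the Helm release name.
kubectl -n splunk-connect-k8s scale deployment \
  splunk-connect-splunk-kubernetes-objects \
  splunk-connect-splunk-kubernetes-metrics-agg \
  --replicas=0
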
Environment:

  • Kubernetes version (use kubectl version): EKS 1.22

  • Ruby version (use ruby --version):

  • OS (e.g: cat /etc/os-release):
    NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
    VERSION_ID="2"
    PRETTY_NAME="Amazon Linux 2"
    ANSI_COLOR="0;33"
    CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
    HOME_URL="https://amazonlinux.com/"

  • Splunk version: see the image tags in the YAML above

  • Splunk Connect for Kubernetes helm chart version: 1.4.3

  • Others:


gvoden commented Apr 27, 2023

We found the following in our logs:
2023-04-27 18:06:26 +0000 [error]: #0 unexpected error error_class=Kubeclient::HttpError error="HTTP status code 403, v1 is forbidden: User "system:serviceaccount:splunk-connect-k8s:splunk-kubernetes-objects" cannot list resource "v1" in API group "" at the cluster scope for GET https://10.100.0.1/api/apps/v1"
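
Whether the service account actually has a given permission can be verified with kubectl auth can-i, using the account named in the log line above, e.g.:

# Check a specific permission, then dump everything the account is allowed to do.
kubectl auth can-i list deployments.apps \
  --as=system:serviceaccount:splunk-connect-k8s:splunk-kubernetes-objects
kubectl auth can-i --list \
  --as=system:serviceaccount:splunk-connect-k8s:splunk-kubernetes-objects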

The service account was missing permission to list the v1 resource. After updating the permissions in our ClusterRole, we no longer see this error, no new zombie processes are created, and the issue is resolved.
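
For anyone hitting the same 403: the change amounted to granting the objects service account list/watch access at the cluster scope. A rough sketch of an equivalent standalone ClusterRole and binding (the "-extra" names are made up for illustration; adjust the apiGroups/resources to whatever your forbidden message reports):

# Sketch only: supplementary RBAC for the objects service account.
# Names ending in "-extra" are illustrative; tune the rules to the 403 in your logs.
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: splunk-kubernetes-objects-extra
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: splunk-kubernetes-objects-extra
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: splunk-kubernetes-objects-extra
subjects:
  - kind: ServiceAccount
    name: splunk-kubernetes-objects
    namespace: splunk-connect-k8s
EOF
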
Question: why does the pod need access to this v1 endpoint, and did it require access to it in prior versions?
