WIP: Add liveness/readiness agent probe. #7791
base: main
Conversation
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
The SIGTERM is possibly part of the bootstrap process, where Elastic Agent needs to enroll with the Fleet Server instance it is supervising itself. This process is documented in https://github.com/elastic/elastic-agent/blob/main/docs/fleet-server-bootstrap.asciidoc. The URLs in the logs are a bit surprising, since I see the agent attempting to enroll through the k8s service URL. For reference, where is the logic that configures Fleet Server to run? P.S. we added a /liveness endpoint to elastic-agent itself in elastic/elastic-agent#4586, but it won't be available until 8.15.0.
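(For context, once that endpoint ships, a probe against it might look roughly like the sketch below. The port is an assumption, based on elastic-agent's default monitoring HTTP port, and HTTP monitoring would need to be enabled for the agent; this is not something the PR currently does.)

```yaml
# Hypothetical probe against the agent's /liveness endpoint (elastic/elastic-agent#4586,
# 8.15.0+). Port 6791 is an assumption: the agent's default monitoring HTTP port,
# which is only served when HTTP monitoring is enabled for the agent.
livenessProbe:
  httpGet:
    path: /liveness
    port: 6791
  initialDelaySeconds: 30
  periodSeconds: 10
```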
cloud-on-k8s/pkg/controller/agent/pod.go Lines 525 to 542 in 08c73c3
Which URL were you expecting, localhost? How is that controlled?
We expect that the agent running fleet-server would be given the URL that other agents can connect to. Our bootstrapping process will use that URL to enrol the agent that runs fleet-server, then switch to using
The error message
Is the container/pod able to route requests to
I'm sorry @michel-laterman, I think there may be a bit of confusion here (likely my fault), but this is the fleet-server itself going into CrashLoopBackOff, not the agents connected to fleet-server. This:
Rereading your statement, maybe it is clear. We give this address in the manifest, in the Kibana section. There's nothing to prevent the pod from attempting to contact that endpoint, but the endpoint will not be live, because the fleet server itself exposes it.
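If that is the chicken-and-egg at play, the probe target is served by the very pod being probed, so liveness checks would need to tolerate the bootstrap window. A minimal sketch of one possible mitigation, a startup probe, with purely illustrative thresholds (not something this PR implements):

```yaml
# Sketch only: hold off liveness failures until Fleet Server starts listening,
# so the kubelet does not kill the container mid-bootstrap. Values are illustrative.
startupProbe:
  tcpSocket:
    port: 8220         # Fleet Server HTTPS port
  periodSeconds: 10
  failureThreshold: 30 # allow up to ~5 minutes for the server to come up
livenessProbe:
  tcpSocket:
    port: 8220
  periodSeconds: 10
  failureThreshold: 3
```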
Just so my statement is clear: the
I'm going to try and clarify what I'm seeing:
Here is the full configuration of that pod:

```
❯ kc get pod -n default fleet-server-agent-79d9f44966-lx5l6 -o yaml | yq '.spec.containers[0]'
command:
- /usr/bin/env
- bash
- -c
- |
  #!/usr/bin/env bash
  set -e
  if [[ -f /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt ]]; then
    if [[ -f /usr/bin/update-ca-trust ]]; then
      cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt /etc/pki/ca-trust/source/anchors/
      /usr/bin/update-ca-trust
    elif [[ -f /usr/sbin/update-ca-certificates ]]; then
      cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt /usr/local/share/ca-certificates/
      /usr/sbin/update-ca-certificates
    fi
  fi
  /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
env:
- name: FLEET_CA
  value: /usr/share/fleet-server/config/http-certs/ca.crt
- name: FLEET_ENROLL
  value: "true"
- name: FLEET_ENROLLMENT_TOKEN
  valueFrom:
    secretKeyRef:
      key: FLEET_ENROLLMENT_TOKEN
      name: fleet-server-agent-envvars
      optional: false
- name: FLEET_SERVER_CERT
  value: /usr/share/fleet-server/config/http-certs/tls.crt
- name: FLEET_SERVER_CERT_KEY
  value: /usr/share/fleet-server/config/http-certs/tls.key
- name: FLEET_SERVER_ELASTICSEARCH_CA
  value: /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt
- name: FLEET_SERVER_ELASTICSEARCH_HOST
  value: https://elasticsearch-es-http.default.svc:9200
- name: FLEET_SERVER_ENABLE
  value: "true"
- name: FLEET_SERVER_POLICY_ID
  value: eck-fleet-server
- name: FLEET_SERVER_SERVICE_TOKEN
  value: AAEAAWVsYXN0aWMvZmxlZXQtc2VydmVyL2RlZmF1bHRfZmxlZXQtc2VydmVyX2FiMDhkM2Y1LTU4OWEtNDNlOS1iYmRhLWY5NWJiMGExODI2OTpyYzAxYzlDR3VJdGVTUURHM0tIaG9NT0dGOFB1TXlYdG5QMkpXMGJmbWNldElGbDlMRTFSU3ZnUTdqZkxPalFH
- name: FLEET_URL
  value: https://fleet-server-agent-http.default.svc:8220
- name: CONFIG_PATH
  value: /usr/share/elastic-agent
- name: NODE_NAME
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: spec.nodeName
image: docker.elastic.co/beats/elastic-agent:8.13.2
imagePullPolicy: IfNotPresent
livenessProbe:
  failureThreshold: 3
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 8220
  timeoutSeconds: 1
name: agent
ports:
- containerPort: 8220
  name: https
  protocol: TCP
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /api/status
    port: 8220
    scheme: HTTPS
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
resources:
  limits:
    cpu: 200m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 1Gi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /usr/share/elastic-agent/state
  name: agent-data
- mountPath: /etc/agent.yml
  name: config
  readOnly: true
  subPath: agent.yml
- mountPath: /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs
  name: elasticsearch-certs
  readOnly: true
- mountPath: /usr/share/fleet-server/config/http-certs
  name: fleet-certs
  readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  name: kube-api-access-584s2
  readOnly: true
```

Note that this is the fleet-server, not a standard agent.
Here is the service in the namespace:
The logs are the same as previously noted:
@michel-laterman Why is the Fleet Server acting like a normal fleet-controlled agent and trying to check in to itself?
To my knowledge fleet-server has always run in agent-mode in ECK deployments.
Ok, but if I start up an ECK version which doesn't add the liveness/readiness probes, the Fleet Server pod comes up without issues:
With the same log line where it's enrolling via the service... but it succeeds here. @michel-laterman Are you saying that fleet-server begins listening on
@michel-laterman Do you need anything further from our team to understand this failure?
Yes, the enrollment request the agent sends is against itself (on the service address). For the questions:
related: #6808
Ignore for now please. This isn't working, and is exhibiting odd behavior. Opening for discussion.
This is attempting to add both a liveness and readiness probe to the Fleet Server agent (not an agent in fleet-mode, but the actual agent running the fleet server).
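For reference, the probes under discussion have roughly this shape, taken from the generated pod spec shown earlier in the thread:

```yaml
# Shape of the generated probes (from the pod spec above): TCP liveness against the
# Fleet Server port, HTTPS readiness against the Fleet Server status endpoint.
livenessProbe:
  tcpSocket:
    port: 8220
readinessProbe:
  httpGet:
    path: /api/status
    port: 8220
    scheme: HTTPS
```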
Notes
Details
/bin/sh -c "true"
A SIGTERM is being received, but looking at the agent code, it seems possible (likely) that an underlying thread failed and sent a SIGTERM to the whole group of goroutines.
Logs
To replicate:

```sh
make run
kubectl apply -n default -f https://raw.githubusercontent.com/elastic/cloud-on-k8s/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml
```