WIP: Add liveness/readiness agent probe. #7791

Draft
wants to merge 1 commit into
base: main

naemono
Contributor

@naemono naemono commented May 10, 2024

related: #6808
Ignore for now please. This isn't working, and is exhibiting odd behavior. Opening for discussion.

This is attempting to add both a liveness and readiness probe to the Fleet server agent (not an agent in fleet-mode, but the actual agent running the fleet server)

Notes

  1. I'm not sure what the liveness probe is getting us here, tbh (I suspect that in nearly all cases where it fails, the whole process will have failed anyway), but the readiness probe seems useful.

Details

  1. If adding a single liveness or readiness probe, this works fine
  2. If both probes are used, the agent in fleet-server mode goes into a continuous crash loop (unless it has already checked in initially, in which case it (sometimes?) works).
  3. The type of liveness probe does not matter. It literally can be /bin/sh -c "true".
  4. I've checked the kubelet, and it is not killing the agent process.
  5. I've checked for OOM, and it's not being OOM killed.
  6. Something is causing the whole agent process to fail, which gets reported in the logs as a SIGTERM being received. Looking at the agent code, it seems possible (likely) that an underlying thread failed and sent a SIGTERM to the whole group of goroutines.
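
For anyone reasoning about item 6: the sketch below shows how a Go process usually ends up logging that a "terminated" signal was received, using only the standard `os/signal` package. It is not taken from the agent code; `waitForTerm` is a hypothetical helper, and the self-signal via `syscall.Kill` (Unix only) merely stands in for whatever actually sent the signal. In plain Go, a goroutine that panics or returns an error does not by itself deliver a SIGTERM, so some process (possibly the agent itself, or a parent process) has to send it explicitly.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// waitForTerm reports whether a SIGTERM reached this process within timeout.
// When sendToSelf is true it first signals its own PID, standing in for an
// external sender such as a parent process or the container runtime.
func waitForTerm(timeout time.Duration, sendToSelf bool) bool {
	// Register for SIGTERM before anything can send it. While registered, the
	// default "terminate the process" behaviour is replaced by ctx cancellation.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	if sendToSelf {
		if err := syscall.Kill(os.Getpid(), syscall.SIGTERM); err != nil {
			return false
		}
	}

	select {
	case <-ctx.Done(): // signal observed
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	fmt.Println("SIGTERM observed:", waitForTerm(2*time.Second, true))
	fmt.Println("SIGTERM observed:", waitForTerm(100*time.Millisecond, false))
}
```

If this model holds for the agent, the useful question becomes which process issued the signal rather than which goroutine failed.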

Logs

{"log.level":"info","@timestamp":"2024-05-10T14:13:52.449Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":862},"message":"Fleet Server - Starting","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:13:52.775Z","message":"Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","state":"DEGRADED","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-05-10T14:13:52.776Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed fleet-server-default-fleet-server (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server","type":"input","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-05-10T14:13:52.776Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed fleet-server-default (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:13:58.291Z","message":"Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","state":"DEGRADED","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:00.458Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":843},"message":"Fleet Server - Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.411Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":519},"message":"Starting enrollment to URL: https://fleet-server-agent-http.default.svc:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.647Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":528},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server-agent-http.default.svc:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.648Z","log.origin":{"file.name":"cmd/run.go","file.line":346},"message":"signal \"terminated\" received","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.648Z","log.origin":{"file.name":"cmd/run.go","file.line":358},"message":"Shutting down Elastic Agent and sending last events...","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.648Z","message":"On signal","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","sig":"terminated","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.848Z","log.origin":{"file.name":"cmd/run.go","file.line":367},"message":"Shutting down completed.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.848Z","log.origin":{"file.name":"reload/reload.go","file.line":68},"message":"Stopping server","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-10T14:14:01.849Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":80},"message":"Stats endpoint (127.0.0.1:6791) finished: accept tcp 127.0.0.1:6791: use of closed network connection","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
Error: fail to enroll: fail to execute request to fleet-server: dial tcp 10.51.175.157:8220: connect: connection refused
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
Error: enrollment failed: exit status 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html

To replicate:

  1. Download the branch, ensure no other ECK process is running, then run make run
  2. kubectl apply -n default -f https://raw.githubusercontent.com/elastic/cloud-on-k8s/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@cmacknz
Member

cmacknz commented May 10, 2024

The SIGTERM is possibly part of the bootstrap process where Elastic Agent needs to enroll with the Fleet Server instance it is supervising itself. This process is documented in https://github.com/elastic/elastic-agent/blob/main/docs/fleet-server-bootstrap.asciidoc. The URLs in the logs are a bit surprising since I see it attempting to enroll through the k8s service URL.

Where is the logic that configures Fleet Server to run for reference?

P.S. we added a /liveness endpoint to elastic-agent itself in elastic/elastic-agent#4586 but it won't be available until 8.15.0.

@pebrc
Collaborator

pebrc commented May 13, 2024

Where is the logic that configures Fleet Server to run for reference?

if agent.Spec.FleetServerEnabled { //nolint:nestif
	fleetURL, err := association.ServiceURL(
		client,
		types.NamespacedName{Namespace: agent.Namespace, Name: HTTPServiceName(agent.Name)},
		agent.Spec.HTTP.Protocol(),
	)
	if err != nil {
		return nil, err
	}
	fleetCfg[FleetURL] = fleetURL
	if agent.Spec.HTTP.TLS.Enabled() && fleetCerts.HasCA() {
		fleetCfg[FleetCA] = path.Join(FleetCertsMountPath, certificates.CAFileName)
	}
	// Fleet Server needs a policy ID to bootstrap itself unless a policy marked as default is used.
	if agent.Spec.KibanaRef.IsDefined() && !fleetToken.isEmpty() {
		fleetCfg[FleetServerPolicyID] = fleetToken.PolicyID
	}

The URLs in the logs are a bit surprising since I see attempting to enroll through the k8s service URL.

Which URL were you expecting, localhost? How is that controlled?

@michel-laterman

We expect that the agent running fleet-server would be given the url that other agents can connect to.

Our bootstrapping process will use that URL to enrol the agent that runs fleet-server, then switch to using localhost:8221 to use a separate set of rate limiters.

The error message 1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server-agent-http.default.svc:8220/ does not indicate why it failed (if the API returned a 503 or the port did not accept connections).

Is the container/pod able to route requests to fleet-server-agent-http.default.svc:8220 during startup?
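
One way to answer the "503 or port not accepting connections" question by hand is a one-shot check that separates the two cases. The following is a minimal sketch using only the Go standard library; `classify`, `check`, and `demo` are hypothetical names, the local test server exists purely to exercise both branches, and certificate verification is skipped only because this is a diagnostic.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"syscall"
	"time"
)

// classify turns the outcome of one HTTP request into a short diagnosis.
func classify(resp *http.Response, err error) string {
	switch {
	case errors.Is(err, syscall.ECONNREFUSED):
		return "connection refused"
	case err != nil:
		return "other error: " + err.Error()
	case resp.StatusCode == http.StatusServiceUnavailable:
		return "reachable, HTTP 503"
	default:
		return fmt.Sprintf("reachable, HTTP %d", resp.StatusCode)
	}
}

// check issues a single GET with a short timeout and classifies the result.
func check(url string) string {
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			DisableKeepAlives: true, // one-shot check, no connection reuse
			// Diagnosis only: skip certificate verification so that TLS trust
			// problems do not mask the question being asked here.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(url)
	if err == nil {
		defer resp.Body.Close()
	}
	return classify(resp, err)
}

// demo exercises both outcomes against a throwaway local server.
func demo() (whileUp, afterClose string) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	whileUp = check(srv.URL)
	srv.Close() // nothing listens on the port any more
	afterClose = check(srv.URL)
	return whileUp, afterClose
}

func main() {
	up, down := demo()
	fmt.Println(up)
	fmt.Println(down)
}
```

Run from inside the pod, `check` could be pointed at the service address from the logs (for example with the /api/status path) to see which of the two cases applies during startup.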

@naemono
Contributor Author

naemono commented May 13, 2024

We expect that the agent running fleet-server would be given the url that other agents can connect to.

Our bootstrapping process will use that URL to enrol the agent that runs fleet-server, then switch to using localhost:8221 to use a separate set of rate limiters.

The error message 1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server-agent-http.default.svc:8220/ does not indicate why it failed (if the API returned a 503 or the port did not accept connections).

Is the container/pod able to route requests to fleet-server-agent-http.default.svc:8220 during startup?

I'm sorry @michel-laterman, I think there may be a bit of confusion here (likely my fault), but this is the fleet-server itself going into CrashLoopBackOff, not the agents connected to fleet-server. This: fleet-server-agent-http.default.svc:8220 is basically pointing to itself, via the service in Kubernetes.

Rereading your statement, maybe it is clear. We give this address in the manifest, in the kibana section. There's nothing to prevent the pod from attempting to contact that endpoint, but the endpoint will not be live because the fleet server itself exposes it.

@michel-laterman

Just so my statement is clear: the fleet-server-agent-http.default.svc:8220 address was from the agent logs you provided in the description.
The agent is trying to enroll at that address, so it must be available.

@naemono
Contributor Author

naemono commented May 14, 2024

I'm going to try and clarify what I'm seeing:

  • I install this manifest into a GKE cluster with ECK running from this branch: kubectl apply -n default -f https://raw.githubusercontent.com/elastic/cloud-on-k8s/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml
  • Elasticsearch and Kibana become healthy:
NAME                         READY   STATUS    RESTARTS   AGE
elasticsearch-es-default-0   1/1     Running   0          40s
elasticsearch-es-default-2   1/1     Running   0          43s
kibana-kb-5dc4cf-mpkhl       1/1     Running   0          40s
elasticsearch-es-default-1   1/1     Running   0          46s
  • An attempt is made to bring Fleet Server online, and after a short time it fails:
fleet-server-agent-79d9f44966-lx5l6   0/1     Pending             0          0s
fleet-server-agent-79d9f44966-lx5l6   0/1     Pending             0          0s
fleet-server-agent-79d9f44966-lx5l6   0/1     ContainerCreating   0          0s
fleet-server-agent-79d9f44966-lx5l6   0/1     ContainerCreating   0          1s
fleet-server-agent-79d9f44966-lx5l6   0/1     Running             0          1s
fleet-server-agent-79d9f44966-lx5l6   0/1     Error               0          25s

here is the full configuration of that pod:

❯ kc get pod -n default fleet-server-agent-79d9f44966-lx5l6 -o yaml | yq '.spec.containers[0]'
command:
  - /usr/bin/env
  - bash
  - -c
  - |
    #!/usr/bin/env bash
    set -e
    if [[ -f /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt ]]; then
      if [[ -f /usr/bin/update-ca-trust ]]; then
        cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt /etc/pki/ca-trust/source/anchors/
        /usr/bin/update-ca-trust
      elif [[ -f /usr/sbin/update-ca-certificates ]]; then
        cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt /usr/local/share/ca-certificates/
        /usr/sbin/update-ca-certificates
      fi
    fi
    /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
env:
  - name: FLEET_CA
    value: /usr/share/fleet-server/config/http-certs/ca.crt
  - name: FLEET_ENROLL
    value: "true"
  - name: FLEET_ENROLLMENT_TOKEN
    valueFrom:
      secretKeyRef:
        key: FLEET_ENROLLMENT_TOKEN
        name: fleet-server-agent-envvars
        optional: false
  - name: FLEET_SERVER_CERT
    value: /usr/share/fleet-server/config/http-certs/tls.crt
  - name: FLEET_SERVER_CERT_KEY
    value: /usr/share/fleet-server/config/http-certs/tls.key
  - name: FLEET_SERVER_ELASTICSEARCH_CA
    value: /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt
  - name: FLEET_SERVER_ELASTICSEARCH_HOST
    value: https://elasticsearch-es-http.default.svc:9200
  - name: FLEET_SERVER_ENABLE
    value: "true"
  - name: FLEET_SERVER_POLICY_ID
    value: eck-fleet-server
  - name: FLEET_SERVER_SERVICE_TOKEN
    value: AAEAAWVsYXN0aWMvZmxlZXQtc2VydmVyL2RlZmF1bHRfZmxlZXQtc2VydmVyX2FiMDhkM2Y1LTU4OWEtNDNlOS1iYmRhLWY5NWJiMGExODI2OTpyYzAxYzlDR3VJdGVTUURHM0tIaG9NT0dGOFB1TXlYdG5QMkpXMGJmbWNldElGbDlMRTFSU3ZnUTdqZkxPalFH
  - name: FLEET_URL
    value: https://fleet-server-agent-http.default.svc:8220
  - name: CONFIG_PATH
    value: /usr/share/elastic-agent
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
image: docker.elastic.co/beats/elastic-agent:8.13.2
imagePullPolicy: IfNotPresent
livenessProbe:
  failureThreshold: 3
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 8220
  timeoutSeconds: 1
name: agent
ports:
  - containerPort: 8220
    name: https
    protocol: TCP
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /api/status
    port: 8220
    scheme: HTTPS
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
resources:
  limits:
    cpu: 200m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 1Gi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
  - mountPath: /usr/share/elastic-agent/state
    name: agent-data
  - mountPath: /etc/agent.yml
    name: config
    readOnly: true
    subPath: agent.yml
  - mountPath: /mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs
    name: elasticsearch-certs
    readOnly: true
  - mountPath: /usr/share/fleet-server/config/http-certs
    name: fleet-certs
    readOnly: true
  - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-584s2
    readOnly: true

note that this is the fleet-server, not a standard agent:

  - name: FLEET_SERVER_ENABLE
    value: "true"
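
When checking whether the kubelet could have restarted the container, it may help to work out a lower bound from the livenessProbe settings in the pod spec above (periodSeconds: 10, failureThreshold: 3, no initialDelaySeconds). The sketch below is plain arithmetic under simplifying assumptions: the first probe runs right after the initial delay, every probe fails immediately, and any scheduling jitter is ignored. `probeTiming` and `earliestRestart` are hypothetical names, not Kubernetes APIs.

```go
package main

import (
	"fmt"
	"time"
)

// probeTiming holds the three probe settings that bound how soon a liveness
// failure can trigger a restart. Field names mirror the pod spec keys.
type probeTiming struct {
	InitialDelay     time.Duration // initialDelaySeconds (0 when unset)
	Period           time.Duration // periodSeconds
	FailureThreshold int           // failureThreshold
}

// earliestRestart returns the earliest offset from container start at which
// FailureThreshold consecutive failures can have accumulated, assuming the
// first probe runs right after InitialDelay and every probe fails at once.
func earliestRestart(p probeTiming) time.Duration {
	if p.FailureThreshold < 1 {
		return p.InitialDelay
	}
	return p.InitialDelay + time.Duration(p.FailureThreshold-1)*p.Period
}

func main() {
	liveness := probeTiming{Period: 10 * time.Second, FailureThreshold: 3}
	fmt.Println("earliest liveness-triggered restart:", earliestRestart(liveness))
}
```

With the values from the spec this gives 20s after container start, which can be compared against the timestamps in the logs and the pod age at which the Error status appears.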

here is the service in the namespace:

❯ kc describe svc -n default fleet-server-agent-http
Name:              fleet-server-agent-http
Namespace:         default
Labels:            agent.k8s.elastic.co/name=fleet-server
                   common.k8s.elastic.co/type=agent
Annotations:       <none>
Selector:          agent.k8s.elastic.co/name=fleet-server,common.k8s.elastic.co/type=agent
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.51.161.168
IPs:               10.51.161.168
Port:              https  8220/TCP
TargetPort:        8220/TCP
Endpoints:
Session Affinity:  None
Events:            <none>

The logs are the same as previously noted:

{"log.level":"info","@timestamp":"2024-05-14T14:00:20.245Z","message":"Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","state":"DEGRADED","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-05-14T14:00:20.246Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed fleet-server-default-fleet-server (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server","type":"input","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-05-14T14:00:20.246Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed fleet-server-default (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:25.766Z","message":"Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","state":"DEGRADED","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-05-14T14:00:26.076Z","message":"http: TLS handshake error from 10.51.128.21:34782: EOF\n","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:27.315Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":843},"message":"Fleet Server - Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:27.723Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":519},"message":"Starting enrollment to URL: https://fleet-server-agent-http.default.svc:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:28.012Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":528},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server-agent-http.default.svc:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:28.012Z","log.origin":{"file.name":"cmd/run.go","file.line":346},"message":"signal \"terminated\" received","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:28.013Z","log.origin":{"file.name":"cmd/run.go","file.line":358},"message":"Shutting down Elastic Agent and sending last events...","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:28.014Z","message":"On signal","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","sig":"terminated","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:28.213Z","log.origin":{"file.name":"cmd/run.go","file.line":367},"message":"Shutting down completed.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:28.214Z","log.origin":{"file.name":"reload/reload.go","file.line":68},"message":"Stopping server","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T14:00:28.214Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":80},"message":"Stats endpoint (127.0.0.1:6791) finished: accept tcp 127.0.0.1:6791: use of closed network connection","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
Error: fail to enroll: fail to execute request to fleet-server: dial tcp 10.51.161.168:8220: connect: connection refused
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
Error: enrollment failed: exit status 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html

@michel-laterman Why is the Fleet Server acting like a normal fleet-controlled agent and trying to check in to itself? "Starting enrollment to URL: https://fleet-server-agent-http.default.svc:8220/"

@michel-laterman

To my knowledge fleet-server has always run in agent-mode in ECK deployments.

@naemono
Contributor Author

naemono commented May 14, 2024

To my knowledge fleet-server has always run in agent-mode in ECK deployments.

OK, but if I start up an ECK version that doesn't add the liveness/readiness probes, the Fleet Server pod comes up without issues:

{"log.level":"info","@timestamp":"2024-05-14T15:04:11.480Z","message":"Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","state":"DEGRADED","@timestamp":"2024-05-14T15:04:11.48Z","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-05-14T15:04:11.480Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed fleet-server-default-fleet-server (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server","type":"input","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-05-14T15:04:11.480Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed fleet-server-default (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:11.531Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":1382},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"count":1},"outputs":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:11.531Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":1181},"message":"Updating running component model","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:13.012Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","service.type":"fleet-server","server.address":"","ecs.version":"1.6.0","http.request.id":"01HXVT3TSN8CGDKC5A5CKNQMK2","fleet.agent.id":"73c264f4-14b9-49c5-926e-916ed81b70e4","fleet.policy.id":"eck-agent","fleet.access.apikey.id":"v9Ohd48B4pQQGpfO7prA","event.duration":1686596185,"mod":"enroll","fleet.enroll.apikey.id":"u9Nkd48B4pQQGpfOiZqp","http.response.body.bytes":1487,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:13.868Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","fleet.enroll.apikey.id":"vNNkd48B4pQQGpfOiZqp","ecs.version":"1.6.0","http.request.id":"01HXVT3WMZ44JVDSXP48TG9XHP","server.address":"","fleet.agent.id":"6885a012-b50a-4c13-8b46-99f4aa90ba90","event.duration":567143413,"http.response.body.bytes":1028,"service.type":"fleet-server","mod":"enroll","fleet.policy.id":"eck-fleet-server","fleet.access.apikey.id":"wdOhd48B4pQQGpfO85r7","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:13.868Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"mod":"enroll","service.type":"fleet-server","http.request.id":"01HXVT3WQMTYM0GRSXD0Z8GNYB","fleet.policy.id":"eck-fleet-server","http.response.body.bytes":1027,"ecs.version":"1.6.0","server.address":"","fleet.agent.id":"68c22a98-7d1b-4c92-880b-7333f256b1d1","fleet.access.apikey.id":"gsyhd48Bydb5_Fmb8xX_","event.duration":558550960,"service.name":"fleet-server","fleet.enroll.apikey.id":"vNNkd48B4pQQGpfOiZqp","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:14.133Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"http.request.id":"01HXVT3X7VGQTHV00BYDKARTB7","server.address":"","fleet.enroll.apikey.id":"u9Nkd48B4pQQGpfOiZqp","fleet.access.apikey.id":"hMyhd48Bydb5_Fmb9RUJ","service.type":"fleet-server","fleet.policy.id":"eck-agent","http.response.body.bytes":1335,"event.duration":301266621,"ecs.version":"1.6.0","fleet.agent.id":"3dd1474f-1855-4273-9547-a8662d5e57b6","mod":"enroll","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:15.537Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":843},"message":"Fleet Server - Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:15.872Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":519},"message":"Starting enrollment to URL: https://fleet-server-agent-http.default.svc:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-14T15:04:16.436Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"mod":"enroll","http.request.id":"01HXVT3ZETJB4EGT4XGECK4X7Y","ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","fleet.access.apikey.id":"w9Ohd48B4pQQGpfO_poB","http.response.body.bytes":1025,"fleet.agent.id":"1a6429cd-5f44-4a02-b677-abef8ee2f0da","fleet.enroll.apikey.id":"vNNkd48B4pQQGpfOiZqp","fleet.policy.id":"eck-fleet-server","event.duration":308229276,"server.address":"","ecs.version":"1.6.0"}
Successfully enrolled the Elastic Agent.
{"log.level":"info","@timestamp":"2024-05-14T15:04:16.650Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":304},"message":"Elastic Agent has been enrolled; start Elastic Agent","ecs.version":"1.6.0"}

With the same log line where it's enrolling via the service...but it succeeds here. @michel-laterman Are you saying that fleet-server begins listening on :8220, then attempts to check in via the service (which essentially is checking in via itself)? If so:

  1. I wonder why the liveness/readiness probes would prevent this check-in?
  2. Why is it not re-attempted, as the log line seems to suggest, instead of failing immediately? 1st enrollment attempt failed, retrying for 10m0s, every 1m0s

@naemono
Contributor Author

naemono commented May 16, 2024

@michel-laterman Do you need anything further from our team to understand this failure?

@michel-laterman

With the same log line where it's enrolling via the service...but it succeeds here. @michel-laterman Are you saying that fleet-server begins listening on :8220, then attempts to check in via the service (which essentially is checking in via itself)? If so:

  1. I wonder why the liveness/readiness probes would prevent this check-in?
  2. Why is it not re-attempted, as the log line seems to suggest, instead of failing immediately? 1st enrollment attempt failed, retrying for 10m0s, every 1m0s

Yes, the enrollment request the agent sends is against itself (on the service address)

For the questions:

  1. When do these probes execute? Is there a delay? Is it possible that a trivial probe (/bin/sh -c "true") is set to execute after the agent has attempted to enrol, such that the address is not routable at that point in time?
  2. The logs indicate that the agent received a termination signal before the enrollment could be re-attempted.
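
Point 2 can be illustrated with a small retry loop that is cancelled during its first wait. This is a sketch of the general pattern only, not the agent's enrollment code: `retry` and `demo` are hypothetical, the one-minute interval mirrors the "every 1m0s" in the log line, and the 50 ms cancellation stands in for the termination signal.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls attempt until it succeeds or ctx is cancelled, sleeping for
// interval between tries. It returns the number of attempts made.
func retry(ctx context.Context, interval time.Duration, attempt func() error) (int, error) {
	for tries := 1; ; tries++ {
		err := attempt()
		if err == nil {
			return tries, nil
		}
		select {
		case <-ctx.Done():
			// Shutdown requested while waiting: the next attempt never runs.
			return tries, fmt.Errorf("stopped after %d attempt(s): %w (last error: %v)", tries, ctx.Err(), err)
		case <-time.After(interval):
		}
	}
}

// demo mimics the log sequence: the first attempt fails, a one-minute retry
// interval begins, and a shutdown arrives almost immediately.
func demo() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(50*time.Millisecond, cancel) // stands in for the termination signal

	refused := errors.New("connection refused") // placeholder error
	return retry(ctx, time.Minute, func() error { return refused })
}

func main() {
	tries, err := demo()
	fmt.Println(tries, err)
}
```

In this model the "retrying for 10m0s" message is still logged, yet only one attempt is ever made, which matches the sequence seen in the failing logs.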
