Single node installation (SNO) sometimes fails because the etcd pod is unable to start #8225

mlacko64 commented Apr 2, 2024

Version

$ openshift-install version
4.15.0
4.15.2
stable-4.14

Platform:

Azure
AWS

IPI

What happened?

Single-node OpenShift deployments sometimes fail with a timeout while waiting for the API. After investigating on the master node, I found that the API pod is not running because it cannot contact etcd. The etcd pod restarts in a never-ending loop because it is waiting for a response from the bootstrap member, which never comes, since the bootstrap node has already been removed by the installer.
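For reference, the etcd container log below was collected on the master node roughly as follows (a sketch, the exact container names and IDs will differ; <etcd-container-id> is a placeholder):

$ sudo crictl ps -a --name etcd         # locate the etcd container, which keeps restarting
$ sudo crictl logs <etcd-container-id>  # dump its log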

The etcd log contains these repeating messages:

{"level":"info","ts":"2024-03-08T14:14:10.737756Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 is starting a new election at term 3"}
{"level":"info","ts":"2024-03-08T14:14:10.737794Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 became pre-candidate at term 3"}
{"level":"info","ts":"2024-03-08T14:14:10.737804Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 received MsgPreVoteResp from 831e2e3a53e15a35 at term 3"}
{"level":"info","ts":"2024-03-08T14:14:10.737817Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 [logterm: 3, index: 19588] sent MsgPreVote request to 7e5e8569a1ffd12a at term 3"}
{"level":"warn","ts":"2024-03-08T14:14:11.201809Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:11.702107Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:11.846431Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"7e5e8569a1ffd12a","rtt":"519.251µs","error":"dial tcp 10.242.20.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-03-08T14:14:11.947647Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"7e5e8569a1ffd12a","rtt":"10.743675ms","error":"dial tcp 10.242.20.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-03-08T14:14:12.202645Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:12.278501Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-03-08T14:14:09.278445Z","time spent":"3.000051118s","remote":"[::1]:49170","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-03-08T14:14:12.703398Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:13.204015Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:13.704522Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:14.205129Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}

The IP address 10.242.20.6 in the messages above is (or rather was) the bootstrap node, which no longer exists because the installer already removed it. So it looks like a race condition where the bootstrap node is sometimes removed too soon.
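The stale peer can be double-checked from the node itself, roughly like this (a sketch; the etcdctl helper container in the etcd pod normally has the client certificates preconfigured, and the container ID is a placeholder):

$ curl -kv --max-time 5 https://10.242.20.6:2380                              # old bootstrap peer URL, fails with "no route to host"
$ sudo crictl ps -a --name etcdctl                                            # helper container in the etcd pod
$ sudo crictl exec -it <etcdctl-container-id> etcdctl member list -w table    # the removed bootstrap member is presumably still listed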

I am attaching the full etcd log here. I also have an sosreport from the master, but it is too big to upload; I can provide it if needed.
etcd.zip

What you expected to happen?

The build finishes successfully every time.

How to reproduce it (as minimally and precisely as possible)?

Run several SNO IPI builds to reproduce the issue; it usually takes about five builds (see the command sketch after the install config below).

A minimal install config is enough. I was able to reproduce it with private and public SNO clusters in Azure and with a private cluster in AWS (I do most builds in Azure). An example of my install config is below:

apiVersion: v1
baseDomain: azureipi.mcs
controlPlane:
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      osDisk:
        diskSizeGB: 128
        diskType: Premium_LRS
      type: Standard_D8ls_v5
  replicas: 1
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      osDisk:
        diskSizeGB: 128
        diskType: Premium_LRS
      type: Standard_D4as_v5
      zones:
      - "1"
      - "2"
      - "3"
  replicas: 0
metadata:
  name: pr-merge-6636
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.242.20.0/22
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  azure:
    baseDomainResourceGroupName: pr-merge-6636-ocp-rg
    cloudName: AzurePublicCloud
    outboundType: UserDefinedRouting
    region: westus2
    networkResourceGroupName: JenkinsAutoGroup
    virtualNetwork: mcs-azure-nw
    controlPlaneSubnet: mcs-azure-subnet01
    computeSubnet: mcs-azure-subnet02
    resourceGroupName: pr-merge-6636-ocp-rg
publish: "Internal"
pullSecret: '{"auths":...<removed>}'
sshKey: ssh-rsa ...<removed>
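The repeated builds themselves are plain IPI installs; they can be scripted roughly like the following (a sketch of the idea, not my actual pipeline; the directory names and iteration count are arbitrary):

for i in $(seq 1 5); do
  mkdir -p "sno-test-$i"
  cp install-config.yaml "sno-test-$i/"                        # openshift-install consumes the copy in the asset dir
  openshift-install create cluster --dir "sno-test-$i" --log-level=info
  openshift-install destroy cluster --dir "sno-test-$i"
done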

Anything else we need to know?

If more logs or tests are needed, just let me know what I should collect. I also opened a ticket with Red Hat support, but since this only happens sometimes, they are asking for more evidence...

References

These two may be related, but it is hard to say since they contain no logs at all:
#8049
#7982
