What happened?
A single-node OpenShift (SNO) deployment sometimes fails with a timeout waiting for the API. Investigation on the master node shows that the API pod is not running because it cannot reach etcd. The etcd pod restarts in a never-ending loop because it is waiting for a response from the bootstrap node, a response it never gets because the installer has already removed the bootstrap node.
The etcd log contains these repeating messages:
{"level":"info","ts":"2024-03-08T14:14:10.737756Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 is starting a new election at term 3"}
{"level":"info","ts":"2024-03-08T14:14:10.737794Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 became pre-candidate at term 3"}
{"level":"info","ts":"2024-03-08T14:14:10.737804Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 received MsgPreVoteResp from 831e2e3a53e15a35 at term 3"}
{"level":"info","ts":"2024-03-08T14:14:10.737817Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"831e2e3a53e15a35 [logterm: 3, index: 19588] sent MsgPreVote request to 7e5e8569a1ffd12a at term 3"}
{"level":"warn","ts":"2024-03-08T14:14:11.201809Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:11.702107Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:11.846431Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"7e5e8569a1ffd12a","rtt":"519.251µs","error":"dial tcp 10.242.20.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-03-08T14:14:11.947647Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"7e5e8569a1ffd12a","rtt":"10.743675ms","error":"dial tcp 10.242.20.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-03-08T14:14:12.202645Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:12.278501Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-03-08T14:14:09.278445Z","time spent":"3.000051118s","remote":"[::1]:49170","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-03-08T14:14:12.703398Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:13.204015Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:13.704522Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-03-08T14:14:14.205129Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6500257895037094893,"retry-timeout":"500ms"}
The IP address 10.242.20.6 in the messages above belongs (or rather belonged) to the bootstrap node, which no longer exists (the installer removed it). So it looks like a race condition in which the bootstrap node is sometimes removed too soon.
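To confirm which peer etcd keeps failing to reach, the repeating prober warnings can be scanned programmatically. Below is a minimal sketch; `find_unreachable_peers` is an illustrative helper name (not part of etcd or the installer), and the sample line is abbreviated from the log above:

```python
import json
import re

def find_unreachable_peers(log_lines):
    """Extract (peer-id, ip) pairs from etcd prober warnings about unreachable peers."""
    peers = set()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON log entries
        error = entry.get("error", "")
        if "no route to host" in error:
            # error looks like: "dial tcp 10.242.20.6:2380: connect: no route to host"
            match = re.search(r"dial tcp ([\d.]+):\d+", error)
            ip = match.group(1) if match else None
            peers.add((entry.get("remote-peer-id"), ip))
    return peers

sample = ('{"level":"warn","msg":"prober detected unhealthy status",'
          '"remote-peer-id":"7e5e8569a1ffd12a",'
          '"error":"dial tcp 10.242.20.6:2380: connect: no route to host"}')
print(find_unreachable_peers([sample]))
# → {('7e5e8569a1ffd12a', '10.242.20.6')}
```

If the only peer reported is the former bootstrap node's IP, that supports the race-condition hypothesis described above.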
I'll attach the full etcd log here (etcd.zip). I also have a sosreport from the master node, but it is too big to upload; I can provide it if needed.
What you expected to happen?
The build always finishes successfully.
How to reproduce it (as minimally and precisely as possible)?
Run several SNO IPI builds to reproduce the issue; it usually takes about five builds.
A minimal install config is enough. I was able to reproduce it with SNO private and public clusters in Azure and with a private cluster in AWS (I do builds mostly in Azure). An example of my install config is here:
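The attached config itself is not reproduced here; as a rough illustration only, a minimal SNO install-config.yaml for Azure generally looks like the sketch below (the domain, names, region, resource group, and secrets are placeholder values, not taken from this report):

```yaml
apiVersion: v1
baseDomain: example.com          # placeholder domain
metadata:
  name: sno-test                 # placeholder cluster name
controlPlane:
  name: master
  replicas: 1                    # single control-plane node
compute:
- name: worker
  replicas: 0                    # no workers: this is what makes it SNO
platform:
  azure:
    baseDomainResourceGroupName: my-dns-rg   # placeholder
    region: eastus                           # placeholder
pullSecret: '<redacted>'
sshKey: '<redacted>'
```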
Anything else we need to know?
If more logs or tests would help, just let me know what I should collect. I also opened a ticket with Red Hat support, but since this occurs only sometimes, they are asking for more evidence...
References
These two issues may be related, but it is hard to say as they contain no logs at all: #8049 #7982
Version
Platform:
Azure
AWS
IPI