
OCP 4.13 @ AWS metal instances bootstrap timeout #7326

Open
naumvd95 opened this issue Jul 13, 2023 · 8 comments

naumvd95 commented Jul 13, 2023

Version

root@68f4a14c6905:/repo# ./openshift-install version
./openshift-install 4.13.4
built from commit 90acb3fa2990c35c9beeff4a188fb133fedba432
release image quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
release architecture amd64

Platform:

aws

  • IPI (automated install with openshift-install. If you don't know, then it's IPI)

What happened?

  1. Deploy OCP 4.13 on AWS using the c5n.metal instance types, following the "Installing a cluster on installer-provisioned infrastructure" procedure

Quite often (2 out of 3 cases), the deployment fails during the bootstrap phase:

[2023-07-11T13:24:20.439Z] DEBUG Using Install Config loaded from state file                                                                                                                                                                                     
INFO Waiting up to 30m0s (until 1:54PM) for bootstrapping to complete...                                                                                                                                                                                         
[2023-07-11T13:26:26.957Z] Waiting for clusters to be deployed...                                                                                                                                                                                                
[2023-07-11T13:36:26.958Z] Waiting for clusters to be deployed...                                                                                                                                                                                                
[2023-07-11T13:46:26.959Z] Waiting for clusters to be deployed...                                                                                                                                                                                                
[2023-07-11T13:54:20.453Z] DEBUG Fetching Bootstrap SSH Key Pair...    

skipping log collection.....

[2023-07-11T13:54:21.808Z] INFO Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 1 nodes are at revision 2; 1 nodes are at revision 5                                                          
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required                                                                    
INFO Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed                                                                                                                              
INFO NodeCADaemonProgressing: The daemon set node-ca is deployed                                                                                                                                                                                                 
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:                                                                                                                                                                             
INFO Cluster operator insights ClusterTransferAvailable is Unknown with :                                                                                                                                                                                        
INFO Cluster operator insights Disabled is False with AsExpected:                                                                                                                                                                                                
INFO Cluster operator insights SCAAvailable is Unknown with :                                                                                                                                                                                                    
ERROR Cluster operator kube-apiserver Degraded is True with GuardController_SyncError: GuardControllerDegraded: [Missing operand on node ip-10-0-155-163.us-west-2.compute.internal, Missing operand on node ip-10-0-194-34.us-west-2.compute.internal]          
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 6                                                                                           
ERROR Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 6                                                               
INFO Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 6                                                                                  
INFO Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 5; 0 nodes have achieved new revision 6                                                                
ERROR Cluster operator monitoring Available is Unknown with :                                                                                                                                                                                                    
ERROR Cluster operator monitoring Degraded is Unknown with :                                                                                                                                                                                                     
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.                                                                                                                                                              
INFO Cluster operator network ManagementStateDegraded is False with :                                                                                                                                                                                            
INFO Cluster operator openshift-apiserver Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation                               
ERROR Cluster operator openshift-samples Available is False with SampleUpsertsPending:                                                                                                                                                                           
ERROR Bootstrap failed to complete: timed out waiting for the condition                                                                                                                                                                                          
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.            

The cluster actually became ready after an extra ~5 minutes.

Successful examples

These show bootstrap timings very close to the 30-minute limit:

level=debug msg=Time elapsed per stage:
level=debug msg=           cluster: 5m53s
level=debug msg=         bootstrap: 1m33s
level=debug msg=Bootstrap Complete: 29m43s
level=debug msg=               API: 1m50s
level=debug msg= Bootstrap Destroy: 7m19s
level=debug msg= Cluster Operators: 17m22s

and

time="2023-07-13T13:31:01Z" level=debug msg="Time elapsed per stage:"
time="2023-07-13T13:31:01Z" level=debug msg="           cluster: 6m19s"
time="2023-07-13T13:31:01Z" level=debug msg="         bootstrap: 1m34s"
time="2023-07-13T13:31:01Z" level=debug msg="Bootstrap Complete: 29m26s"
time="2023-07-13T13:31:01Z" level=debug msg="               API: 1m44s"
time="2023-07-13T13:31:01Z" level=debug msg=" Bootstrap Destroy: 8m55s"
time="2023-07-13T13:31:01Z" level=debug msg=" Cluster Operators: 1m13s"
time="2023-07-13T13:31:01Z" level=info msg="Time elapsed: 48m13s"

What you expected to happen?

A high success rate for OCP 4.13 deployments on AWS metal instances.

Actual result: the success rate is about 40-45% as of July 10, 2023.

How to reproduce it (as minimally and precisely as possible)?

  1. Deploy OCP 4.13 on AWS using the c5n.metal instance types, following the "Installing a cluster on installer-provisioned infrastructure" procedure
  2. AWS region: us-west-2
  3. The number of control-plane and worker nodes does not matter; we've tried a 3-node setup (mixed control plane + worker) and a regular 3 workers + 3 control planes setup

Anything else we need to know?

I think the root cause is that the condition for the timeout bump in #6010 is insufficient: it should also cover AWS metal instances.



pzi123 commented Aug 7, 2023

I see the same behavior of openshift-install on a different platform: vSphere. I have been building OKD/OpenShift clusters in my vSphere cluster for quite some time now, and since release 4.9.0 the installation times out and leaves a mess behind. The bootstrap node stays active, and some 5-10 minutes past the arbitrary 30-minute timeout the bootstrap completes and stands up 3 worker nodes. Unfortunately, the bootstrap process still has additional steps to perform and leaves the cluster with certificate issues. The worst is the current 4.13.0 release, which creates some temporary volumes in the vSphere datastore in an invalid state. This in turn makes the ESX hypervisor unbootable, as the ESX boot sequence triggers bogus NFS volume checks and blocks forever.

I looked at the code at https://github.com/openshift/installer/blob/release-4.13/cmd/openshift-install/create.go, and at line 421 there is a hard-coded 30 minutes for virtual platforms and 60 minutes for bare metal. I am surprised that Red Hat left that landmine for everybody to stumble on. As more and more processing is done by the MCO (Machine Config Operator) with each release, 30 minutes is no longer enough. I hacked the code and built my own openshift-install with 'timeout := 60 * time.Minute'.
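For illustration, here is a minimal Go sketch of that kind of platform-based timeout selection, extended the way this issue proposes. The function name, the string comparisons, and the '.metal' suffix check are assumptions for the example, not the installer's actual code:

package main

import (
	"fmt"
	"strings"
	"time"
)

// bootstrapTimeout sketches platform-based timeout selection: 30 minutes
// for virtualized platforms and 60 minutes for bare metal, plus a
// hypothetical extension that treats AWS ".metal" instance types like
// bare metal, as proposed in this issue.
func bootstrapTimeout(platform, awsInstanceType string) time.Duration {
	timeout := 30 * time.Minute
	if platform == "baremetal" {
		timeout = 60 * time.Minute
	}
	// Hypothetical change: AWS metal instances boot far more slowly than
	// virtualized instances, so give them the bare-metal budget as well.
	if platform == "aws" && strings.HasSuffix(awsInstanceType, ".metal") {
		timeout = 60 * time.Minute
	}
	return timeout
}

func main() {
	fmt.Println(bootstrapTimeout("aws", "c5n.metal")) // 1h0m0s
	fmt.Println(bootstrapTimeout("aws", "m5.xlarge")) // 30m0s
	fmt.Println(bootstrapTimeout("baremetal", ""))    // 1h0m0s
}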

Interestingly, these failures are common knowledge, as I learned from the OKD maintainer, who stated that OKD releases are going ahead despite vSphere installs failing, quote: "vSphere is failing most of the time" :-) See okd-project/okd#1672.


pzi123 commented Aug 12, 2023

The real solution is not to give up after 'openshift-install create cluster' fails, but to continue with 'openshift-install wait-for bootstrap-complete' until the message 'Bootstrap status: complete' appears. Expect to run this more than once. You are not done yet: now continue with a series of 'openshift-install wait-for install-complete' runs until the message 'Install complete!' appears. How this passes a CI/CD pipeline, I have no idea.
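A rough sketch of that retry loop, assuming the openshift-install binary and its asset directory sit in the current working directory; the attempt cap of 3 is an arbitrary example choice:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// retry runs an openshift-install subcommand repeatedly until it exits
// successfully, mirroring the manual "keep running wait-for until it
// passes" workaround described above. maxAttempts is an arbitrary cap.
func retry(maxAttempts int, args ...string) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		fmt.Printf("attempt %d/%d: openshift-install %v\n", attempt, maxAttempts, args)
		cmd := exec.Command("./openshift-install", args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err = cmd.Run(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	// Wait out the bootstrap first, then the full install, retrying each
	// phase instead of giving up after the initial 30-minute timeout.
	if err := retry(3, "wait-for", "bootstrap-complete"); err != nil {
		fmt.Fprintln(os.Stderr, "bootstrap never completed:", err)
		os.Exit(1)
	}
	if err := retry(3, "wait-for", "install-complete"); err != nil {
		fmt.Fprintln(os.Stderr, "install never completed:", err)
		os.Exit(1)
	}
}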

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

The openshift-ci bot added the lifecycle/stale label on Nov 10, 2023.
@naumvd95
Author

/remove-lifecycle stale

The openshift-ci bot removed the lifecycle/stale label on Nov 10, 2023.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

The openshift-ci bot added the lifecycle/stale label on Mar 18, 2024.
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

The openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 17, 2024.
@naumvd95
Author

/remove-lifecycle rotten

The openshift-ci bot removed the lifecycle/rotten label on May 16, 2024.