
[BUG] Parallel Cluster Recovery didn't work #730

Closed
pawellrus opened this issue Feb 15, 2024 · 10 comments · Fixed by #789
Labels
bug Something isn't working

Comments

@pawellrus

pawellrus commented Feb 15, 2024

Version: 2.4.0, 2.5.1
Function: Parallel Cluster Recovery

Description:
After performing an upgrade of the managed k8s cluster, the hosted OpenSearch cluster got stuck in the "waiting for quorum" state.

Scenario:
Performed a k8s cluster upgrade. During the upgrade all nodes were recreated and pods were rescheduled TWICE at the same time.

Expected behaviour:
All nodes should be started at the same time (the statefulset has podManagementPolicy: Parallel) and the cluster should recover.

Actual behaviour:
Only one node of the cluster is started; the other nodes are not starting, so quorum cannot be achieved.
The operator does nothing about this situation. The statefulset has podManagementPolicy: OrderedReady.
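
For illustration, a minimal sketch of the StatefulSet fragment in question; the metadata names and image are placeholders, not the operator's actual generated manifest:

```yaml
# Minimal sketch; names and image are placeholders, not the operator-generated manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-cluster-masters
spec:
  serviceName: my-cluster-masters
  replicas: 3
  # Expected during disaster recovery: Parallel, so all pods start at once
  # and the OpenSearch nodes can form a quorum. Observed: OrderedReady,
  # which starts pods one by one and stalls because the first node alone
  # never reaches quorum.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: my-cluster-masters
  template:
    metadata:
      labels:
        app: my-cluster-masters
    spec:
      containers:
        - name: opensearch
          image: opensearchproject/opensearch:2.11.1
```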

Thoughts:
I fixed this problem by scaling the operator deployment down to 0 and then manually recreating the STS with podManagementPolicy: Parallel.

I am not sure that I can reproduce this case again. Maybe the operator should have some manual trigger to perform parallel cluster restoration in cases like this?

pawellrus added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels on Feb 15, 2024
@Links2004
Contributor

Hi, I have seen the same behaviour for a change from 2.10.0 to 2.11.1.
The cluster was stuck in "waiting for quorum" until manual intervention.

@pstast

pstast commented Mar 8, 2024

Just had the same issue. All my k8s nodes crashed, but after the k8s recovery the OpenSearch cluster did not recover. Only one pod from the statefulset started. I also had to stop the operator and manually set podManagementPolicy: Parallel so that all pods started and OpenSearch recovered.

Unfortunately I did not find anything in the logs; there was no error. The operator just did nothing. It did not switch to parallel recovery.

@prudhvigodithi
Collaborator

[Triage]
There is a PR contributed by @swoehrl-mw (#366) to handle cluster recovery. When using the latest version of the operator, do you still see this issue?
@pawellrus @Links2004 @pstast

Thanks
@bbarani

prudhvigodithi removed the untriaged (Issues that have not yet been triaged) label on Mar 26, 2024
@pawellrus
Author

[Triage] There is a PR contributed by @swoehrl-mw (#366) to handle cluster recovery. When using the latest version of the operator, do you still see this issue? @pawellrus @Links2004 @pstast

Thanks @bbarani

Hello, @prudhvigodithi
This issue persists on 2.4.0 and 2.5.1 in my environment.

@pstast

pstast commented Mar 27, 2024

@prudhvigodithi I also encountered this problem with the latest version (2.5.1).

@prudhvigodithi
Collaborator

@swoehrl-mw can you please take a look at this? Thanks

@swoehrl-mw
Collaborator

@pstast @pawellrus @Links2004
To verify:

  • Do you have PVCs configured as storage?
  • Is each nodepool configured with at least 3 pods/replicas?
  • Is the parallelRecovery mode enabled, i.e. not disabled via helm values? (See the values sketch below.)

There are some points in the logic where the operator can skip the recovery due to errors, but these should all produce some log output.
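
Regarding the third point above, a minimal values sketch for reference; the key name shown here (manager.parallelRecoveryEnabled) is an assumption and may differ between operator chart versions, so please check your chart's values:

```yaml
# Sketch of operator helm values; the exact key name is an assumption.
manager:
  # Must not be disabled for the operator to switch the StatefulSets to
  # podManagementPolicy: Parallel after a full cluster outage.
  parallelRecoveryEnabled: true
```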

@pawellrus
Author

@swoehrl-mw the answer to all questions is yes.
But dashboards has only two replicas.
By the way, I use ArgoCD for deployment.

@swoehrl-mw
Collaborator

Hi all, I did some digging through the code and some testing, and I think I can reproduce the problem and have found the cause:
Disaster recovery does not engage while the operator is doing an upgrade. The problem is that there is a bug in the upgrade detection: because the status information is ambiguous, the operator thinks an upgrade is still in progress although it has long finished, and so recovery never engages.
I'll create a PR with a fix in the next few days.

cc @prudhvigodithi
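
To make this more concrete, here is a hypothetical sketch of the kind of ambiguous status the detection logic trips over (the fix below pins the cause down to the status for each nodepool being added twice). The componentsStatus layout follows the OpenSearchCluster status, but the concrete component, status, and description values are assumptions for illustration only:

```yaml
# Hypothetical OpenSearchCluster status excerpt; values are illustrative assumptions.
status:
  componentsStatus:
    - component: Upgrader
      status: Upgraded
      description: masters   # upgrade of nodepool "masters" finished
    - component: Upgrader
      status: Upgraded
      description: masters   # duplicate entry for the same nodepool, added by the upgrade logic
```

With duplicated entries like these, the check for whether an upgrade is still in progress can misfire, which presumably is why recovery backs off.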

@prudhvigodithi
Collaborator

Thanks @swoehrl-mw, we can target this after the v2.6.0 release.
@bbarani @swoehrl-mw

swoehrl-mw added a commit that referenced this issue May 21, 2024
### Description
Fixes a bug where parallel recovery did not engage if the cluster had
gone through a version upgrade beforehand.
The reason was that the upgrade logic added the status for each nodepool
twice, leading the recovery logic to incorrectly detect that an upgrade
was still in progress.
Also did a small refactoring of names and constants and fixed an IDE
warning about unused parameters.

No changes to the CRDs or functionality, just internal logic fixes.

### Issues Resolved
Fixes #730 

### Check List
- [x] Commits are signed per the DCO using --signoff 
- [x] Unittest added for the new/changed functionality and all unit
tests are successful
- [x] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:
- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm
chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and
signing off your commits, please check
[here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <sebastian.woehrl@maibornwolff.de>