Alleviate race conditions in roll restart reconciler #694

Open

geobeau wants to merge 3 commits into main

Conversation


@geobeau geobeau commented Jan 2, 2024

Description

When a Roll Restart is triggered with at least 2 node pools, a race condition can trigger the roll restart of a pod in each pool at the same time. This can lead to a red cluster.

Normally, to prevent this from happening, there are 3 checks:

  • check the status.ReadyReplicas of all sts before moving forward (sketched just below this list)
  • for each nodePool, check that all replicas are ready by listing pods directly
  • before deleting a pod, check against OpenSearch that the cluster is healthy
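
For reference, the first check looks roughly like this (a sketch based on the snippet quoted in the review below; the requeue delay and surrounding names are assumptions, not the exact code):

// Sketch: bail out and requeue while any StatefulSet still reports unready replicas.
// Assumes `pointer` is k8s.io/utils/pointer and `ctrl` is sigs.k8s.io/controller-runtime.
for _, sts := range statefulSets {
	if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
		// Not every replica is ready yet; retry later instead of rolling another pod.
		return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
	}
}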

In practice, this is not enough.

Consider a rollRestart of 2 nodePools:

  • data
  • masterData

The following sequence can happen:

  1. a rollRestart is triggered
  2. the reconcile function is called
  3. data and masterData have all their pods ready
  4. a data pod is deleted; the pod is terminating (NOT terminated yet)
  5. the reconcile function is called again
  6. data and masterData have all their pods ready from the status.ReadyReplicas point of view (because it has not observed the change yet)
  7. data is seen as unhealthy thanks to CountRunningPodsForNodePool
  8. masterData is seen as healthy because all of its pods are ready
  9. OpenSearch is still healthy, because the deleted pod has not terminated yet
  10. a pod in masterData is restarted
  11. the cluster goes red!

Additionally, I added 2 commits that refactor the Reconcile function to make the intent of each block more obvious.

Issues Resolved

Probably #650

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@prudhvigodithi
Collaborator

Hey @geobeau, thanks for your contribution. We can ship this change soon; is it possible to add some unit tests to this PR?
Adding @salyh @pchmielnik @swoehrl-mw @jochenkressin
@bbarani

@salyh
Collaborator

salyh commented Mar 30, 2024

@geobeau can you please rebase from main? A recent PR (#767) has been merged that should fix the CI checks.

Refactor the code to separate each step clearly.
The first loop performed 2 operations:
- check if there is a pending update
- check if all pods are ready
However, under some conditions we can break out early and skip
the health checking.
To simplify maintainability, it is now split into
2 separate loops:
- first, check if restarts are pending
- then, check that all pools are healthy

Signed-off-by: Geoffrey Beausire <g.beausire@criteo.com>
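
A minimal sketch of the two-loop split described above (the revision comparison, requeue delay, and surrounding names are assumptions rather than the PR's literal code):

// Loop 1: is a rolling restart actually pending for any StatefulSet?
pendingUpdate := false
for _, sts := range statefulSets {
	if sts.Status.UpdateRevision != "" && sts.Status.CurrentRevision != sts.Status.UpdateRevision {
		pendingUpdate = true
		break
	}
}
if !pendingUpdate {
	return ctrl.Result{}, nil
}

// Loop 2: only once a restart is pending, require every pool to be fully ready.
for _, sts := range statefulSets {
	if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
		return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
	}
}
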
Refactor the order of restarts to simplify the reconcile function
and make it more deterministic.

Before, stuck pods could be restarted after the roll restart was
performed. Now, stuck pods are always restarted first.
Also, stuck pods are restarted before checking the number
of ready replicas, because if we checked first we would never
reach this point.
Finally, we don't proceed with a rolling restart if any stuck pod
was deleted, to avoid performing any dangerous actions.

Signed-off-by: Geoffrey Beausire <g.beausire@criteo.com>
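
Roughly, the ordering described above looks like this (a sketch; helpers.DeleteStuckPodWithOlderRevision is taken from the diff below and assumed to return (bool, error), and the requeue delay is an assumption):

// First pass: delete any pod stuck on an older revision before anything else.
anyRestartedPod := false
for _, sts := range statefulSets {
	restarted, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)
	if err != nil {
		return ctrl.Result{}, err
	}
	if restarted {
		anyRestartedPod = true
	}
}

// If a stuck pod was just deleted, skip the rolling restart for this pass entirely.
if anyRestartedPod {
	return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
}
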
When a Roll Restart is triggered with at least 2 pools, a race
condition can trigger the roll restart of a pod in each pool.
This can lead to a red cluster.
Normally, to prevent this from happening, there are 3 checks:
- check the status.ReadyReplicas of all sts before moving forward.
- for each nodePool, check that all replicas are ready by listing
  pods directly
- before deleting a pod, a check is made on OpenSearch to see
  if the cluster is healthy

In practice, this is not enough. Consider the rollRestart of 2 nodePools:
- data
- masterData

The following sequence can happen:
- a rollRestart is triggered
- reconcile function is called
- data and masterData have all their pods ready
- a data pod is deleted; the pod is terminating (NOT terminated yet)
- reconcile function is recalled
- data and masterData have all their pods ready from the status.ReadyReplicas
  point of view (because it has not observed the change yet)
- data is seen as unhealthy thanks to CountRunningPodsForNodePool
- masterData is seen as healthy because all its pods are ready
- OpenSearch is still healthy, because the deleted pod has not terminated yet
- A pod in masterData is restarted
- Cluster is red!

This commit makes sure we check the readiness of all nodePools using
CountRunningPodsForNodePool before trying to restart any pool.

Signed-off-by: Geoffrey Beausire <g.beausire@criteo.com>
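
A minimal sketch of the resulting check (the signature of helpers.CountRunningPodsForNodePool, the nodePool fields, and the requeue delay are assumptions; the real helper may differ):

// Before restarting anything, verify every node pool has all of its pods actually
// running, counted directly, so a terminating pod is not mistaken for a ready one.
for _, nodePool := range r.instance.Spec.NodePools {
	numReadyPods, err := helpers.CountRunningPodsForNodePool(r.client, r.instance, &nodePool)
	if err != nil {
		return ctrl.Result{}, err
	}
	if numReadyPods != int(nodePool.Replicas) {
		// Some pool is not fully running; requeue instead of touching another pool.
		return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
	}
}
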
@geobeau
Author

geobeau commented Apr 8, 2024

@salyh hello, I rebased and added some more unit tests.

@prudhvigodithi
Collaborator

Thanks @geobeau let me take a look at this PR.
Adding @swoehrl-mw @salyh

@prudhvigodithi
Collaborator

Adding @salyh @swoehrl-mw @jochenkressin @pchmielnik, please check this PR.
Thanks

@prudhvigodithi prudhvigodithi mentioned this pull request Apr 18, 2024

@swoehrl-mw swoehrl-mw left a comment


Logic looks good, just some minor gripes about naming and stuff.

// Check if there is any crashed pod. Delete it if there is any update in sts.
any_restarted_pod := false
for _, sts := range statefulSets {
restared_pod, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)

Typo

Suggested change
restared_pod, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)
restarted_pod, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)

@@ -118,6 +116,47 @@ func (r *RollingRestartReconciler) Reconcile() (ctrl.Result, error) {
return ctrl.Result{}, nil
}

// Check if there is any crashed pod. Delete it if there is any update in sts.
any_restarted_pod := false

Please use camelCase for any variables

return ctrl.Result{}, err
}
if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
return ctrl.Result{

Please add a log line (can be debug) so it is visible that the operator is waiting for pods to become ready
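
For example (a sketch; the logger field name, message, and requeue delay are placeholders, not the suggested exact code):

if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
	// Make it visible that the reconciler is waiting on this StatefulSet's pods.
	r.logger.V(1).Info("Waiting for all pods to be ready before rolling restart", "statefulset", sts.Name)
	return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
}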

)

func TestHelpers(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Helpers Suite")
}

var _ = Describe("Helpers", func() {

This is not the correct file for the tests; they should go in a file helpers_test.go. The helpers_suite_test.go file only contains the init/start code for the tests of the entire package.
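
For reference, a typical Ginkgo layout under that convention (the spec content here is hypothetical):

// helpers_suite_test.go: only wires Ginkgo/Gomega into `go test` for the package.
func TestHelpers(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Helpers Suite")
}

// helpers_test.go: the actual specs live here.
var _ = Describe("Helpers", func() {
	It("deletes stuck pods with an older revision", func() {
		// ... assertions with Expect(...) ...
	})
})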
