Alleviate race conditions in roll restart reconciler #694

Open

geobeau wants to merge 3 commits into main

Conversation


@geobeau geobeau commented Jan 2, 2024

Description

When a Roll Restart is triggered with at least 2 node pools, a race condition can trigger the roll restart of a pod in each pool at the same time. This can lead to a red cluster.

Normally, to prevent this from happening, there are 3 checks:

  • check the status.ReadyReplicas of all sts before moving forward (sketched just below this list)
  • for each nodePool, check that all replicas are ready by listing pods directly
  • before deleting a pod, check against OpenSearch that the cluster is healthy
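
For reference, the first check looks roughly like this (a sketch based on the snippet quoted in the review below; the requeue delay and surrounding names are assumptions, not the exact code):

// Sketch: bail out and requeue while any StatefulSet still reports unready replicas.
// Assumes `pointer` is k8s.io/utils/pointer and `ctrl` is sigs.k8s.io/controller-runtime.
for _, sts := range statefulSets {
	if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
		// Not every replica is ready yet; retry later instead of rolling another pod.
		return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
	}
}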

In practice, this is not enough.

Consider a rollRestart of 2 nodePools:

  • data
  • masterData

The following sequence can happen:

  1. a rollRestart is triggered
  2. the reconcile function is called
  3. data and masterData have all their pods ready
  4. a data pod is deleted; the pod is terminating (NOT terminated yet)
  5. the reconcile function is called again
  6. data and masterData have all their pods ready from the status.ReadyReplicas point of view (because it has not observed the change yet)
  7. data is seen as unhealthy thanks to CountRunningPodsForNodePool
  8. masterData is seen as healthy because all of its pods are ready
  9. OpenSearch is still healthy, because the deleted pod has not terminated yet
  10. a pod in masterData is restarted
  11. the cluster goes red!

Additionally, I added 2 commits that refactor the Reconcile function to make the intent of each block more obvious.

Issues Resolved

Probably #650

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@prudhvigodithi
Collaborator

Hey @geobeau, thanks for your contribution. We can ship this change soon; is it possible to add some unit tests to this PR?
Adding @salyh @pchmielnik @swoehrl-mw @jochenkressin
@bbarani

@salyh
Collaborator

salyh commented Mar 30, 2024

@geobeau can you please rebase from main? A recent PR (#767) has been merged that should fix the CI checks.

Refactor the code to separate each step clearly.
The first loop performed 2 operations:
- check if there is a pending update
- check if all pods are ready
However, under some conditions we can break out early and skip
the health checking.
To simplify maintainability, it is now split into
2 separate loops:
- first, check if restarts are pending
- then, check that all pools are healthy

Signed-off-by: Geoffrey Beausire <g.beausire@criteo.com>
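
A minimal sketch of the two-loop split described above (the revision comparison, requeue delay, and surrounding names are assumptions rather than the PR's literal code):

// Loop 1: is a rolling restart actually pending for any StatefulSet?
pendingUpdate := false
for _, sts := range statefulSets {
	if sts.Status.UpdateRevision != "" && sts.Status.CurrentRevision != sts.Status.UpdateRevision {
		pendingUpdate = true
		break
	}
}
if !pendingUpdate {
	return ctrl.Result{}, nil
}

// Loop 2: only once a restart is pending, require every pool to be fully ready.
for _, sts := range statefulSets {
	if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
		return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
	}
}
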
Refactor the order of restarts to simplify the reconcile function
and make it more deterministic.

Before, stuck pods could be restarted after the roll restart was
performed. Now, stuck pods are always restarted first.
Also, stuck pods are restarted before checking the number
of ready replicas, because if we checked first we would never
reach this point.
Finally, we don't proceed with a rolling restart if any stuck pod
was deleted, to avoid performing any dangerous actions.

Signed-off-by: Geoffrey Beausire <g.beausire@criteo.com>
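
Roughly, the ordering described above looks like this (a sketch; helpers.DeleteStuckPodWithOlderRevision is taken from the diff below and assumed to return (bool, error), and the requeue delay is an assumption):

// First pass: delete any pod stuck on an older revision before anything else.
anyRestartedPod := false
for _, sts := range statefulSets {
	restarted, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)
	if err != nil {
		return ctrl.Result{}, err
	}
	if restarted {
		anyRestartedPod = true
	}
}

// If a stuck pod was just deleted, skip the rolling restart for this pass entirely.
if anyRestartedPod {
	return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
}
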
When a Roll Restart is triggered with at least 2 pools, a race
condition can trigger the roll restart of a pod in each pool.
This can lead to a red cluster.
Normally, to prevent this from happening, there are 3 checks:
- check the status.ReadyReplicas of all sts before moving forward.
- for each nodePool, check that all replicas are ready by listing
  pods directly
- before deleting a pod, a check is made on OpenSearch to see
  if the cluster is healthy

In practice, this is not enough. Consider the rollRestart of 2 nodePools:
- data
- masterData

The following sequence can happen:
- a rollRestart is triggered
- reconcile function is called
- data and masterData have all their pods ready
- a data pod is deleted; the pod is terminating (NOT terminated yet)
- reconcile function is recalled
- data and masterData have all their pods ready from the status.ReadyReplicas
  point of view (because it has not observed the change yet)
- data is seen as unhealthy thanks to CountRunningPodsForNodePool
- masterData is seen as healthy because all its pods are ready
- OpenSearch is still healthy, because the deleted pod has not terminated yet
- A pod in masterData is restarted
- Cluster is red!

This commit makes sure we check the readiness of all nodePools using
CountRunningPodsForNodePool before trying to restart any pool.

Signed-off-by: Geoffrey Beausire <g.beausire@criteo.com>
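
A minimal sketch of the resulting check (the signature of helpers.CountRunningPodsForNodePool, the nodePool fields, and the requeue delay are assumptions; the real helper may differ):

// Before restarting anything, verify every node pool has all of its pods actually
// running, counted directly, so a terminating pod is not mistaken for a ready one.
for _, nodePool := range r.instance.Spec.NodePools {
	numReadyPods, err := helpers.CountRunningPodsForNodePool(r.client, r.instance, &nodePool)
	if err != nil {
		return ctrl.Result{}, err
	}
	if numReadyPods != int(nodePool.Replicas) {
		// Some pool is not fully running; requeue instead of touching another pool.
		return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
	}
}
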
@geobeau
Author

geobeau commented Apr 8, 2024

@salyh hello, I rebased and added some more unit tests.

@prudhvigodithi
Collaborator

Thanks @geobeau let me take a look at this PR.
Adding @swoehrl-mw @salyh

@prudhvigodithi
Collaborator

Adding @salyh @swoehrl-mw @jochenkressin @pchmielnik, please check this PR.
Thanks

@prudhvigodithi prudhvigodithi mentioned this pull request Apr 18, 2024

@swoehrl-mw swoehrl-mw left a comment


Logic looks good, just some minor gripes about naming and stuff.

// Check if there is any crashed pod. Delete it if there is any update in sts.
any_restarted_pod := false
for _, sts := range statefulSets {
restared_pod, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)

Typo

Suggested change
restared_pod, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)
restarted_pod, err := helpers.DeleteStuckPodWithOlderRevision(r.client, &sts)

@@ -118,6 +116,47 @@ func (r *RollingRestartReconciler) Reconcile() (ctrl.Result, error) {
return ctrl.Result{}, nil
}

// Check if there is any crashed pod. Delete it if there is any update in sts.
any_restarted_pod := false

Please use camelCase for any variables

return ctrl.Result{}, err
}
if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
return ctrl.Result{

Please add a log line (can be debug) so it is visible that the operator is waiting for pods to become ready
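
For example (a sketch; the logger field name, message, and requeue delay are placeholders, not the suggested exact code):

if sts.Status.ReadyReplicas != pointer.Int32Deref(sts.Spec.Replicas, 1) {
	// Make it visible that the reconciler is waiting on this StatefulSet's pods.
	r.logger.V(1).Info("Waiting for all pods to be ready before rolling restart", "statefulset", sts.Name)
	return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}, nil
}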

)

func TestHelpers(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Helpers Suite")
}

var _ = Describe("Helpers", func() {

This is not the correct file for the tests; they should go in a file helpers_test.go. The helpers_suite_test.go file only contains the init/start code for the tests of the entire package.
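
For reference, a typical Ginkgo layout under that convention (the spec content here is hypothetical):

// helpers_suite_test.go: only wires Ginkgo/Gomega into `go test` for the package.
func TestHelpers(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Helpers Suite")
}

// helpers_test.go: the actual specs live here.
var _ = Describe("Helpers", func() {
	It("deletes stuck pods with an older revision", func() {
		// ... assertions with Expect(...) ...
	})
})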
