Always clean up FailedToScheduleReplica with wrong HardNodeAffinity #2792

Merged: 2 commits into longhorn:master from 8522-local-replica-corner-case on May 17, 2024

Contributor

@ejweber ejweber commented May 10, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#8522

What this PR does / why we need it:

Always delete a failed-to-schedule replica that has HardNodeAffinity set if DataLocality is disabled.

Previously, we only did this when there were already enough healthy replicas. That led to the "deadlock" in longhorn/longhorn#8522: we would not schedule more replicas until we deleted the failed-to-schedule one, but we would not delete the failed-to-schedule one until there were enough healthy ones.
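
To make the change in condition concrete, here is a minimal sketch of the old and new cleanup decisions. It is not the actual code in controller/volume_controller.go: the `Replica` struct, the `DataLocality` constants, and the `shouldCleanupOld` / `shouldCleanupNew` helpers are hypothetical names invented for illustration, and the real function considers more state than this.

```go
package main

import "fmt"

// Replica is a simplified, hypothetical stand-in for a Longhorn replica.
type Replica struct {
	Name             string
	HardNodeAffinity string // node the replica is pinned to; empty if not pinned
	Scheduled        bool   // false models a replica that failed to schedule
}

// DataLocality models the volume setting. The values are illustrative only.
type DataLocality string

const (
	DataLocalityDisabled   DataLocality = "disabled"
	DataLocalityBestEffort DataLocality = "best-effort"
)

// shouldCleanupOld sketches the previous behavior (an assumption based on the
// description above): the pinned, unscheduled replica is only removed once
// the volume already has enough healthy replicas.
func shouldCleanupOld(r Replica, dl DataLocality, healthy, desired int) bool {
	if r.Scheduled || r.HardNodeAffinity == "" {
		return false
	}
	return dl == DataLocalityDisabled && healthy >= desired
}

// shouldCleanupNew sketches the behavior after this PR: the healthy replica
// count no longer gates the cleanup.
func shouldCleanupNew(r Replica, dl DataLocality) bool {
	if r.Scheduled || r.HardNodeAffinity == "" {
		return false
	}
	return dl == DataLocalityDisabled
}

func main() {
	// One healthy replica out of two desired, plus a replica stuck on node-a.
	stuck := Replica{Name: "r-1", HardNodeAffinity: "node-a"}

	// Old rule: not enough healthy replicas, so the stuck replica is kept,
	// and it in turn blocks scheduling of a replacement.
	fmt.Println("old:", shouldCleanupOld(stuck, DataLocalityDisabled, 1, 2))

	// New rule: the stuck replica is removed so a schedulable one can replace it.
	fmt.Println("new:", shouldCleanupNew(stuck, DataLocalityDisabled))
}
```

With one healthy replica out of two desired, the old rule keeps the stuck replica and the new rule removes it. In this sketch, a replica is left alone whenever DataLocality is not disabled.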

Special notes for your reviewer:

I experimented with changing the behavior of this function further, but that caused some unexpected side effects. I think it is better to keep it working almost exactly as it did before. I did, however, rearrange the code a bit.

@ejweber
Contributor Author

ejweber commented May 10, 2024

Test steps

  1. Follow the reproduce steps in longhorn/longhorn#8522 (comment).
  2. Observe that the failed-to-schedule replica is deleted and a new one (that can be scheduled) replaces it.

@ejweber
Contributor Author

ejweber commented May 10, 2024

Regression testing in: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6924/.

Results show five failures:

  • tests.test_basic.test_backup_lock_creation_during_deletion[s3] - 50% success rate in last 20 runs
  • tests.test_ha.test_engine_image_not_fully_deployed_perform_auto_upgrade_engine - 71% success rate in last 20 runs
  • tests.test_ha.test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume[s3] - 100% success rate in last 20 runs
  • tests.test_scheduling.test_replica_auto_balance_node_best_effort - 94% 0% success rate in last 20 runs
  • tests.test_system_backup_restore.test_system_backup_and_restore_volume_with_backingimage[s3] - 0% success rate in last 20 runs

Investigating:

Contributor

@james-munson james-munson left a comment

One question.

controller/volume_controller.go (resolved)
@ejweber ejweber force-pushed the 8522-local-replica-corner-case branch from 5c5e165 to ec61947 Compare May 15, 2024 15:17
Contributor

@james-munson james-munson left a comment

LGTM.

Longhorn 8522

Signed-off-by: Eric Weber <eric.weber@suse.com>
…reconcile

Longhorn 8522

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber force-pushed the 8522-local-replica-corner-case branch from ec61947 to 0ca5eb6 Compare May 16, 2024 20:59
@shuo-wu shuo-wu merged commit 00a3b8e into longhorn:master May 17, 2024
7 checks passed
@ejweber
Contributor Author

ejweber commented May 20, 2024

@mergify backport v1.5.x

mergify bot commented May 20, 2024

backport v1.5.x

✅ Backports have been created

@ejweber
Contributor Author

ejweber commented May 29, 2024

@mergify backport v1.6.x

mergify bot commented May 29, 2024

backport v1.6.x

✅ Backports have been created

3 participants