Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: simplify leader election #2152

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

acuteaura
Copy link
Contributor

mostly based on the upstream example:
https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go

this also incidentally fixes bugs where run returns nil (i.e. when failing to reach the Admin API) by causing the context to cancel via defer.

  • Bugfix
  • Refactor

@acuteaura
Copy link
Contributor Author

acuteaura commented Jan 31, 2024

some remarks on this PR:

  • a bunch of code seems to treat leader status as "skipping writes" (via the Elector). this is pretty hard to reason about and prone to errors, so this PR changes it to a more traditional standby (with informers warmed up) that doesn't run any controllers.
  • alongside that, the controller now hard exits when it loses leader status. this likely only happens when a node netsplits from k8s apiserver, so it wouldn't be able to update things there anyway.
  • run got an error return value, because a lot of error conditions were just silently discarded. A future PR should improve these errors with some kind of wrapping and remove explicit logging in run itself so it all bubbles up
  • I did not touch the API server, because I'm not 100% about all the things it does, but we should consider failing a readiness probe if it is not leader if code in there relies on controllers running - though it did not have access to Elector until this point either.

@Revolyssup
Copy link
Contributor

@acuteaura Can you fix the merge conflicts?

@Revolyssup
Copy link
Contributor

@acuteaura Unit tests failing

zap.String("namespace", c.namespace),
zap.String("pod", c.name),
)
// rootCancel might be to slow, and controllers may have bugs that cause them to not yield
// the safest way to step down is to simply cause a pod restart
os.Exit(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@Revolyssup
Copy link
Contributor

an e2e test related to leader election was also failing. I reran it

@acuteaura
Copy link
Contributor Author

The unit test is testing specific strings in the output instead of functionality, I think it can go.

The e2e test.... I don't even know, it seems to fail before setup, but the Restart Count between the describe and the get are different and... it doesn't make sense to me why this test would fail its BeforeRun.

@acuteaura
Copy link
Contributor Author

@Revolyssup could you have a look at this failed test? I'm not really sure how it could fail in the beforeEach step when that's all shared among tests. Or maybe re-run the tests to make sure this isn't a flake.

It's unfortunately not so trivial to run this test locally, but I'm getting to it...

@zll600
Copy link
Contributor

zll600 commented Mar 19, 2024

ping @Revolyssup

Copy link

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 30 days if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants