Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set Topology feature gate default to true #1167

Merged
merged 1 commit into from May 21, 2024

Conversation

huww98
Copy link
Contributor

@huww98 huww98 commented Feb 28, 2024

What type of PR is this?

/kind cleanup

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

This feature gate has already existed for 5 years. And should be safe to enable by default.

Besides, because this feature gate has been marked as GA, it will show warnings: Setting GA feature gate Topology=true. It will be removed in a future release.. So I went ahead and removed this feature gate from command line. Then something breaks, which is very confusing. I would expect a GA feature gate to default to true.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Action required: Feature gate Topology is enabled by default. This feature gate is already GA and will be removed in a future release. CSI drivers are required to report capability `VOLUME_ACCESSIBILITY_CONSTRAINTS` correctly: when a CSI drivers reports the capability, topology is enabled in the external-provisioner. When the driver does not report it, external-provisioner disables topology support. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 28, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @huww98!

It looks like this is your first PR to kubernetes-csi/external-provisioner 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/external-provisioner has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 28, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @huww98. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 28, 2024
@mowangdk
Copy link

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 28, 2024
@mowangdk
Copy link

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2024
@huww98
Copy link
Contributor Author

huww98 commented Feb 29, 2024

/assign @pohly
PTAL. Can you add your approval?

@pohly
Copy link
Contributor

pohly commented Feb 29, 2024

@divyenpatel, @msau42, @jsafrane: do you remember why the flag was left at default false in #516?

Is it because that would have changed the default behavior and that wasn't desirable? If that's the reason, are we stuck with "false" forever?

It's usually not a good idea to use feature gates to control optional behavior, because sooner or later one arrives at a situation like this where the feature gets promoted and (ideally) the entire feature gate gets removed. But if that is what was done, then there's not much that can be done about it without a breaking change.

@huww98
Copy link
Contributor Author

huww98 commented Feb 29, 2024

Is it because that would have changed the default behavior and that wasn't desirable?

@pohly I think it is not the case. Even this feature gate is enabled, the relevant behaviors are only enabled if the driver declares support. And if the driver supports topology, I think it is always desirable to enable this, or we are not CSI compliant, and the driver may not work properly.

So, this PR only changes the behavior for those who use driver that support topology, but do not enable this feature gate. Do such use-case exist for any reason?

@msau42
Copy link
Collaborator

msau42 commented Mar 8, 2024

I believe the reason why the gate was there was to control rollout of a driver that wants to upgrade itself to add topology support.

Because if topology was enabled in the controller but the driver has not been upgraded on the nodes to start reporting topology, then the provisioner would start failing to provision (and schedule pods). So the expected flow is:

  1. Upgrade and rollout node csi driver so that topology is being reported
  2. Upgrade controller csi driver to turn on topology.

I agree that for a permanent flag it is better to make it a config flag rather than a feature gate.

@huww98
Copy link
Contributor Author

huww98 commented Mar 9, 2024

@msau42 I think even after we remove the Topology feature gate, we can still update driver with the same flow, because this feature is still gated by VOLUME_ACCESSIBILITY_CONSTRAINTS plugin capability, and this can be used to control the rollout. And the old version of driver should not report this capability, so updating all nodes before controller should work naturally.

@jsafrane
Copy link
Contributor

We're about to release a new provisioner. I personally like this PR - it goes the right direction (feature gate removal), yet it still allows users to override the feature gate to false for another release or two.
I also like the idea of no configuration of topology in the provisioner binary, a CSI driver should report correct VOLUME_ACCESSIBILITY_CONSTRAINTS capability instead. Especially, it's the driver task to report the capability on nodes first and then in the controller if it wants to enable topology in a deterministic way.

@huww98 can you please edit the release note to emphasize that Topology feature gate is deprecated and will be removed in a future release?

@msau42 what do you think?
(BTW, would change of the default value or removal require bump of the major version? IMO we should not require it for feature gates.)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2024
This feature gate has already existed for 5 years. And should be safe to enable by default.
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 16, 2024
@msau42
Copy link
Collaborator

msau42 commented May 16, 2024

I am ok with the approach that @huww98 outlined. We need to document it and probably in the release notes we should call out action required. I think we should make it a major bump because of the action required. But normally if turing on a feature by default doesn't require action then it doens't need to be a major bump.

@huww98
Copy link
Contributor Author

huww98 commented May 17, 2024

@msau42 to confirm, does this match what you think?

Action Required: Users who deploy controller that is reporting topology, but do not have corresponding node plugin version deployed should upgrade the node plugin before upgrading the controller.

@jsafrane
Copy link
Contributor

I think the action required should be to explicitly disable Topology feature gate if corresponding CSI driver incorrectly reports VOLUME_ACCESSIBILITY_CONSTRAINTS capability. And warn them that the feature gate will be removed in a future release.

We should document probably in README.md that if a driver wants to enable topology support while it was disabled before, they should update node DaemonSet first to report VOLUME_ACCESSIBILITY_CONSTRAINTS capability, populate Nodes with topology labels and then update the controller to report VOLUME_ACCESSIBILITY_CONSTRAINTS capability.

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 21, 2024
@jsafrane
Copy link
Contributor

I edited the release note
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: huww98, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2024
@k8s-ci-robot k8s-ci-robot merged commit c031ee3 into kubernetes-csi:master May 21, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants