Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement checking etcd version to warn about deprecated etcd versions #123192

Closed
Tracked by #2340
serathius opened this issue Feb 8, 2024 · 6 comments · Fixed by #124612
Closed
Tracked by #2340

Implement checking etcd version to warn about deprecated etcd versions #123192

serathius opened this issue Feb 8, 2024 · 6 comments · Fixed by #124612
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@serathius
Copy link
Contributor

serathius commented Feb 8, 2024

What would you like to be added?

Part of kubernetes/enhancements#2340

From https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache/README.md#bug-in-etcd-progress-notification

Only recently community discovered a bug etcd-io/etcd#15220 for requesting progress notification. This bug causes a race between sending an event and progress notification with the same revision. Normally we would expect to event to always come first, however due to the bug, progress notification might be sent first. Serving a consistent lists in that situation could result in returning too early and missing the event. The bug was only fixed in v3.4.25 and v3.5.8 (both 9 months old), which is too fresh for Kubernetes to not handle.

Our tests have shown that with the bug manual progress notification cannot be trusted at all. It causes a silent corruption that cannot be automatically detected prior to acting upon the corrupted data. For Beta, we propose to conditionally enable the feature in a way that mitigates the risk of unaware users stumbling on the bug.

We propose to implement a safeguard that will prevent enabling ConsistentListFromCache on etcd version with broken progress notification. During kube-apiserver startup we will verify etcd version by calling Status method and checking etcd patch version. The condition will only pass if all etcd endpoints have etcd version that includes the fix for etcd-io/etcd#15220.

The proposed behavior of the safeguard for Beta:

etcd version could not be acquired or etcd new enough - warn, but continue with whatever the current value of the feature gate is

  • etcd too old, feature gate not explicitly specified - warn and disable the feature
  • etcd too old, feature gate explicitly enabled - abort
  • etcd new enough - use whatever the current value of the feature gate is

The proposed behavior of the safeguard for GA (feature gate locked to true):

  • etcd version could not be acquired - warn, but continue with the feature enabled
  • etcd too old - abort
  • etcd new enough - enable feature

Checking etcd version during start should be a good enough solution as we generally don't expect etcd clusters to downgrade. As of etcd v3.5 it is not an officially supported feature.

For now posted as a "help wanted" issue, however if we cannot hit the timeline for 1.30 code freeze (5th March) I would take it over so we don't miss. Still issue is simple enough that it can be quickly implemented.

/sig api-machinery
/triage accepted
/help

Why is this needed?

Warn users about running on buggy etcd version

@serathius serathius added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 8, 2024
@k8s-ci-robot
Copy link
Contributor

@serathius:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

What would you like to be added?

Part of kubernetes/enhancements#2340

From https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache/README.md#bug-in-etcd-progress-notification

Only recently community discovered a bug https://github.com/etcd-io/etcd/issues/15220 for requesting progress notification. This bug causes a race between sending an event and progress notification with the same revision. Normally we would expect to event to always come first, however due to the bug, progress notification might be sent first. Serving a consistent lists in that situation could result in returning too early and missing the event. The bug was only fixed in v3.4.25 and v3.5.8 (both 9 months old), which is too fresh for Kubernetes to not handle.

Our tests have shown that with the bug manual progress notification cannot be trusted at all. It causes a silent corruption that cannot be automatically detected prior to acting upon the corrupted data. For Beta, we propose to conditionally enable the feature in a way that mitigates the risk of unaware users stumbling on the bug.

We propose to implement a safeguard that will prevent enabling ConsistentListFromCache on etcd version with broken progress notification. During kube-apiserver startup we will verify etcd version by calling Status method and checking etcd patch version. The condition will only pass if all etcd endpoints have etcd version that includes the fix for https://github.com/etcd-io/etcd/issues/15220.

The proposed behavior of the safeguard for Beta:

etcd version could not be acquired or etcd new enough - warn, but continue with whatever the current value of the feature gate is
etcd too old, feature gate not explicitly specified - warn and disable the feature
etcd too old, feature gate explicitly enabled - abort
etcd new enough - use whatever the current value of the feature gate is
The proposed behavior of the safeguard for GA (feature gate locked to true):

etcd version could not be acquired - warn, but continue with the feature enabled
etcd too old - abort
etcd new enough - enable feature
Checking etcd version during start should be a good enough solution as we generally don't expect etcd clusters to downgrade. As of etcd v3.5 it is not an officially supported feature.

For now posted as a "help wanted" issue, however if we cannot hit the timeline for 1.30 code freeze (5th March) I would take it over so we don't miss. Still issue is simple enough that it can be quickly implemented.

/sig api-machinery
/triage accepted
/help

Why is this needed?

Warn users about running on buggy etcd version

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 8, 2024
@serathius
Copy link
Contributor Author

/sig etcd

@k8s-ci-robot k8s-ci-robot added the sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. label Feb 8, 2024
@kumarankit999
Copy link

I would like to work on this!

@serathius
Copy link
Contributor Author

Thanks

/assign @kumarankit999

Please reach out on Slack if you need any help.

@ah8ad3
Copy link
Member

ah8ad3 commented Feb 12, 2024

I can take a look at this and maybe help if it is still open to work on.

@ah8ad3
Copy link
Member

ah8ad3 commented Feb 13, 2024

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
4 participants