Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forward ClusterIP traffic to inter-node communication port before CQL is ready #1614

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

zimnx
Copy link
Collaborator

@zimnx zimnx commented Nov 30, 2023

Kubernetes reconciles Endpoints and EndpointSlices for Services having
selector by taking entire Pod readiness into account. Because we need to
allow inter-node communication before CQL traffic we used
PublishNonReadyEndpoints on Services to overcome this.
But at the same time, we would like stop accepting new connections when
Pod is tearing down. These two requirements contradicts with each other.

To satisfy both requirements, Operator will reconcile Endpoints and
EndpointSlice resources in per-container way. If Pod container specifies
port and has its own Readiness probe, Endpoint/EndpointSlice
will become ready for this port when given container is ready. If these
two conditions aren't met, given port becomes ready when entire Pod is
ready.
Controller logic will consider Pod being deleted (non-nil
deletionTimestamp) as fully non-ready so that all endpoints for all ports
and containers will removed.

Additional container added to Scylla Pod controls readiness of ports
used for inter-node communication.

With above logic, nodes observing other node going down, won't be
able to reconnect until restarted node opens a port for inter-node
communication because traffic will be rejected immediately when node
starts terminating. This fixes an issue of traffic disruption during rolling
restarts happening on ScyllaClusters using ClusterIP for inter-node
communication which was caused by nodes being stuck on reconnection
attempts.

Prerequisites:

Fixes #1077

@scylla-operator-bot scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-kind needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 30, 2023
@scylla-operator-bot scylla-operator-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 30, 2023
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2023
@zimnx zimnx force-pushed the mz/endpointslices-controller branch 5 times, most recently from 4dc2f8f to d77e0f4 Compare December 4, 2023 17:33
@zimnx zimnx force-pushed the mz/endpointslices-controller branch 3 times, most recently from 8d9c038 to 28f6bd1 Compare December 8, 2023 15:41
@zimnx zimnx changed the title [WIP] Endpointslices controller [WIP] Fix traffic disruption happening during rolling restarts Dec 8, 2023
@zimnx zimnx changed the title [WIP] Fix traffic disruption happening during rolling restarts Fix traffic disruption happening during rolling restarts Dec 8, 2023
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2023
@zimnx zimnx added kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Dec 8, 2023
@zimnx zimnx added do-not-merge/needs-kind and removed kind/feature Categorizes issue or PR as related to a new feature. labels Dec 8, 2023
@zimnx zimnx added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. do-not-merge/needs-kind labels Dec 8, 2023
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Dec 8, 2023
@zimnx zimnx force-pushed the mz/endpointslices-controller branch 3 times, most recently from 15ac37f to e7b623b Compare February 14, 2024 13:31
@zimnx zimnx force-pushed the mz/endpointslices-controller branch 5 times, most recently from 4509e1c to 2709a33 Compare February 20, 2024 21:09
@scylla-operator-bot scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2024
@zimnx zimnx force-pushed the mz/endpointslices-controller branch from 2709a33 to dbeb62b Compare March 1, 2024 15:30
@scylla-operator-bot scylla-operator-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2024
@zimnx
Copy link
Collaborator Author

zimnx commented Mar 1, 2024

Flake - #1516 (comment)
/retest

@zimnx
Copy link
Collaborator Author

zimnx commented Mar 11, 2024

/test images
/retest

@zimnx
Copy link
Collaborator Author

zimnx commented Mar 11, 2024

Cassandra timeouts, both seems to be related to PDs
/retest

1 similar comment
@zimnx
Copy link
Collaborator Author

zimnx commented Mar 11, 2024

Cassandra timeouts, both seems to be related to PDs
/retest

@zimnx zimnx force-pushed the mz/endpointslices-controller branch from dbeb62b to 4f0d7d4 Compare March 11, 2024 13:58
@zimnx zimnx force-pushed the mz/endpointslices-controller branch from 4f0d7d4 to b9aa243 Compare March 19, 2024 20:41
pkg/controller/scyllacluster/conditions.go Show resolved Hide resolved
pkg/controller/scyllacluster/resource.go Outdated Show resolved Hide resolved
pkg/controller/scyllacluster/resource.go Show resolved Hide resolved
}

endpointSliceLabels := map[string]string{}
maps.Copy(endpointSliceLabels, sc.Labels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do just the svc labels (the have their inheriting rules from scyllacluster) already. (also the controlling object of the endpoint is the service)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was added on your request: #1614 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one of those guys was wrong but I am not sure which one :)

Member services may have labels/annotation with special meaning like allocating a LB or its options so it may or may not be desirable to copy them. I am afraid this will end up with extra API field but I'd not like to make you start with it until such case is raised and justified. Should we start with just the scyllacluster annotations to be safe?

endpointSliceLabels[discoveryv1.LabelServiceName] = svc.Name

endpointSliceAnnotations := map[string]string{}
maps.Copy(endpointSliceAnnotations, sc.Annotations)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}

// Container serves given Service port and decides about readiness
if ok && containerServingPort.ReadinessProbe != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to check the probe setup? I'd assume that kubelet uniformly reports the pod status as ready if there is no probe set.
(This would avoid the cases where say liveness set and readiness missing would imply the other.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is where it could instead find the container status as well would could avoid another loop and have it nicely colocated - unless I am missing something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubelet reports container readiness as ready if container doesn't have readiness probe. Which is opposite of what it's implemented here. Why would you want to change this behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which container doesn't define readiness and shouldn't be ready? the container ports are defined on the container that defines the probe as well, right? At least the storage one is

ReadinessProbe: &corev1.Probe{

(From the PR description)

If these
two conditions aren't met, given port becomes ready when entire Pod is
ready.

It would be useful to explain why an possibly mention a concrete case.

pkg/controller/scyllacluster/resource.go Show resolved Hide resolved
return h, err
}

fnvHash := fnv.New32a()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encoding seem to be more appropriate then hashing a hash

Copy link
Collaborator Author

@zimnx zimnx Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encoding sha512 wouldn't solve it, as it's too long to be used as part of a name, hence different hashing algorithm. I changed it to not hash a hash though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encoding sha512 wouldn't solve it, as it's too long to be used as part of a name

I didn't get that. Encoding converts the char sets. Hashes (and their encodings) can be arbitrarily truncated as they have random distribution.

Kubernetes reconciles Endpoints and EndpointSlices for Services having
selector by taking entire Pod readiness into account. Because we need to
allow inter-node communication before CQL traffic we used
PublishNonReadyEndpoints on Services to overcome this.
But at the same time, we would like stop accepting new connections when
Pod is tearing down. These two requirements contradicts with each other.

To satisfy both requirements, Operator will reconcile Endpoints and
EndpointSlice resources in per-container way. If Pod container specifies
port and has its own Readiness probe, Endpoint/EndpointSlice
will become ready for this port when given container is ready. If these
two conditions aren't met, given port becomes ready when entire Pod is
ready.
Controller logic will consider Pod being deleted (non-nil
deletionTimestamp) as fully non-ready so that all endpoints for all
ports
and containers will removed.

Additional container added to Scylla Pod controls readiness of ports
used for inter-node communication.
@zimnx zimnx force-pushed the mz/endpointslices-controller branch from b9aa243 to 1f80b92 Compare March 27, 2024 22:18
@zimnx zimnx requested a review from tnozicka March 27, 2024 22:20
@tnozicka
Copy link
Member

Hi, apologies for the delay getting back with everything that has been going on. This is large enough so I don't want you and me to context switch in small batches and rather allocate a continuous chunk to get this finished. I'll be on PTO next week but I have scheduled this for the top of my list when I get back. If I don't get this submitted by Apr 30th EOD, please ping me!

Copy link
Member

@rzetelskik rzetelskik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a couple of comments, only skimmed through the tests for now

},
)
if err != nil {
objectErrs = append(objectErrs, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the other invocations don't wrap this error, but is there actually a reason for not wrapping it?

PublishNotReadyAddresses: true,
Type: corev1.ServiceTypeClusterIP,
Ports: servicePorts(sc),
Selector: nil,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall we have a comment explaining why there's no selector? Or maybe even a higher-level comment how we work around it? It's obvious in this PR's context, but probably not so much for someone looking at it in the future. On the other hand it can be backtracked to this commit, so I'm just leaving it as a nit.

`,
},
ReadinessProbe: &corev1.Probe{
TimeoutSeconds: int32(30),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall we have comments explaining these values?

return endpointSlices, fmt.Errorf("can't get Pod %q: %w", naming.ManualRef(sc.Namespace, svc.Name), err)
}

// Don't publish endpoints for Pod that are being deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Don't publish endpoints for Pod that are being deleted.
// Don't publish endpoints for Pods that are being deleted.

// Don't publish endpoints for Pod that are being deleted.
// Removing endpoints prevents from new connections being established while still allowing
// for existing connections to survive and finish their requests.
// We need to do this early, as gap between when Pod is actually deleted and Endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We need to do this early, as gap between when Pod is actually deleted and Endpoint
// We need to do this early, as a gap between when Pod is actually deleted and Endpoint

}

for i := range tt {
tc := tt[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for this anymore

}

for i := range tt {
tc := tt[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}

for i := range tt {
tc := tt[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

t.Errorf("expected found %#v, got %#v", tc.expectedFound, found)
}
if !reflect.DeepEqual(container, tc.expectedContainer) {
t.Errorf("expected container %#v, got %#v", tc.expectedContainer, container)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use cmp.Diff instead?


collidedContainerName, collision := hashes[hash]
if collision {
t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName)
t.Errorf("found collision on container endpoint ports, both pod and its controlled ports and %q container with its serving port generate same hash", collidedContainerName)

t.Fatal(err)
}
if !apiequality.Semantic.DeepEqual(endpoints, tc.expectedEndpoints) {
t.Errorf("expected and actual Endpoints(s) differ: %s", cmp.Diff(tc.expectedEndpoints, endpoints))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Endpoints(s) doesn't really make sense

},
},
{
name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
name: "EndpointSlices for Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",

for consistency mostly

},
},
{
name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
name: "EndpointSlice has IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",

Comment on lines +4268 to +4286
for _, svcPort := range svc.Spec.Ports {
containerServingPort, ok, err := controllerhelpers.FindContainerServingPort(svcPort, sts.Spec.Template.Spec.Containers)
if err != nil {
t.Fatal(err)
}

ep := discoveryv1.EndpointPort{
Name: pointer.Ptr(svcPort.Name),
Port: pointer.Ptr(svcPort.Port),
Protocol: pointer.Ptr(svcPort.Protocol),
AppProtocol: svcPort.AppProtocol,
}

if ok && containerServingPort.ReadinessProbe != nil {
containerPortsMap[containerServingPort.Name] = append(containerPortsMap[containerServingPort.Name], ep)
} else {
podPorts = append(podPorts, ep)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be abstracted to a func shared between the test and implementation, so that we're sure they don't diverge?

},
},
{
name: "multi EndpointSlices per Service",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: multiple

t.Errorf("expected error %#v, got %#v", tc.expectedError, err)
}
if !reflect.DeepEqual(found, tc.expectedFound) {
t.Errorf("expected found %#v, got %#v", tc.expectedFound, found)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%t instead of %#v?

@tnozicka
Copy link
Member

tnozicka commented May 2, 2024

@zimnx can you please link the issue on the kubernetes repo we've talked about filling in the past that explains the problem and why the existing API concepts didn't cover it? Was there any feedback you got on the approach or alternative ideas from the community?

pkg/controller/scyllacluster/resource.go Show resolved Hide resolved
pod, err := podLister.Pods(sc.Namespace).Get(svc.Name)
if err != nil {
if apierrors.IsNotFound(err) {
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V4 log?

}

endpointSliceLabels := map[string]string{}
maps.Copy(endpointSliceLabels, sc.Labels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one of those guys was wrong but I am not sure which one :)

Member services may have labels/annotation with special meaning like allocating a LB or its options so it may or may not be desirable to copy them. I am afraid this will end up with extra API field but I'd not like to make you start with it until such case is raised and justified. Should we start with just the scyllacluster annotations to be safe?

@@ -249,3 +251,38 @@ func GetScyllaContainerID(pod *corev1.Pod) (string, error) {

return cs.ContainerID, nil
}

func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
func FindContainerServingPort(svcPort corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {

port is ambiguous

Comment on lines +260 to +266
if port.Name == cp.Name {
return true
}

if port.Port == cp.ContainerPort {
return true
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the name matches but not the port?

}

// Container serves given Service port and decides about readiness
if ok && containerServingPort.ReadinessProbe != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which container doesn't define readiness and shouldn't be ready? the container ports are defined on the container that defines the probe as well, right? At least the storage one is

ReadinessProbe: &corev1.Probe{

(From the PR description)

If these
two conditions aren't met, given port becomes ready when entire Pod is
ready.

It would be useful to explain why an possibly mention a concrete case.

SidecarInjectorContainerName = "sidecar-injection"
PerftuneContainerName = "perftune"
CleanupContainerName = "cleanup"
InterNodeTrafficProbeSidecarContainerName = "inter-node-traffic-probe-sidecar"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
InterNodeTrafficProbeSidecarContainerName = "inter-node-traffic-probe-sidecar"
InterNodeTrafficProbeContainerName = "inter-node-traffic-probe"

return h, err
}

fnvHash := fnv.New32a()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encoding sha512 wouldn't solve it, as it's too long to be used as part of a name

I didn't get that. Encoding converts the char sets. Hashes (and their encodings) can be arbitrarily truncated as they have random distribution.

pkg/controller/scyllacluster/conditions.go Show resolved Hide resolved
Copy link
Contributor

@zimnx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip 1f80b92 link true /test e2e-gke-parallel-clusterip
ci/prow/e2e-gke-release-script-latest 1f80b92 link true /test e2e-gke-release-script-latest

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster
3 participants