Forward ClusterIP traffic to inter-node communication port before CQL is ready #1614
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: zimnx. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 4dc2f8f to d77e0f4
Force-pushed from 8d9c038 to 28f6bd1
Force-pushed from 15ac37f to e7b623b
Force-pushed from 4509e1c to 2709a33
Force-pushed from 2709a33 to dbeb62b
Flake - #1516 (comment)
/test images
Cassandra timeouts; both seem to be related to PDs.
Force-pushed from dbeb62b to 4f0d7d4
Force-pushed from 4f0d7d4 to b9aa243
}

endpointSliceLabels := map[string]string{}
maps.Copy(endpointSliceLabels, sc.Labels)
I'd do just the svc labels (they already have their inheriting rules from the scyllacluster). (Also, the controlling object of the endpoint is the service.)
It was added on your request: #1614 (comment)
One of those guys was wrong but I am not sure which one :)
Member services may have labels/annotations with special meaning, like allocating a LB or its options, so it may or may not be desirable to copy them. I am afraid this will end up as an extra API field, but I'd not like to make you start with it until such a case is raised and justified. Should we start with just the ScyllaCluster annotations to be safe?
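For illustration, the conservative variant floated above could look roughly like this in the reconcile code; a sketch reusing the `sc`/`svc` variables from the diff, not the merged implementation:

```go
// Sketch: set only the required service-name label and inherit only the
// ScyllaCluster annotations, per the discussion above. Illustrative only.
endpointSliceLabels := map[string]string{
	discoveryv1.LabelServiceName: svc.Name,
}

endpointSliceAnnotations := map[string]string{}
maps.Copy(endpointSliceAnnotations, sc.Annotations)
```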
endpointSliceLabels[discoveryv1.LabelServiceName] = svc.Name

endpointSliceAnnotations := map[string]string{}
maps.Copy(endpointSliceAnnotations, sc.Annotations)
and here
ditto
}

// Container serves given Service port and decides about readiness
if ok && containerServingPort.ReadinessProbe != nil {
Why do you need to check the probe setup? I'd assume that kubelet uniformly reports the pod status as ready if there is no probe set.
(This would avoid the cases where, say, liveness is set and a missing readiness probe would imply the other.)
I think this is where it could instead find the container status as well, which could avoid another loop and have it nicely colocated - unless I am missing something.
Kubelet reports container readiness as ready if the container doesn't have a readiness probe, which is the opposite of what's implemented here. Why would you want to change this behavior?
Which container doesn't define readiness and shouldn't be ready? The container ports are defined on the container that defines the probe as well, right? At least the storage one is:
ReadinessProbe: &corev1.Probe{
(From the PR description) "If these two conditions aren't met, given port becomes ready when entire Pod is ready."
It would be useful to explain why and possibly mention a concrete case.
	return h, err
}

fnvHash := fnv.New32a()
Encoding seems more appropriate than hashing a hash.
Encoding sha512 wouldn't solve it, as it's too long to be used as part of a name, hence the different hashing algorithm. I changed it to not hash a hash though.
encoding sha512 wouldn't solve it, as it's too long to be used as part of a name
I didn't get that. Encoding converts the char sets. Hashes (and their encodings) can be arbitrarily truncated as they have random distribution.
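As an aside, the truncation idea is easy to demonstrate; a minimal standalone sketch (the helper name and suffix length are arbitrary choices, not from the PR):

```go
package main

import (
	"crypto/sha512"
	"encoding/base32"
	"fmt"
	"strings"
)

// truncatedHashSuffix base32-encodes a SHA-512 digest and truncates it.
// Because digest bits are uniformly distributed, any prefix of the encoding
// still works as a short, stable, name-safe suffix.
func truncatedHashSuffix(data []byte, n int) string {
	sum := sha512.Sum512(data)
	enc := base32.HexEncoding.WithPadding(base32.NoPadding).EncodeToString(sum[:])
	return strings.ToLower(enc)[:n]
}

func main() {
	// E.g. an 8-character suffix for an EndpointSlice name.
	fmt.Println(truncatedHashSuffix([]byte("cluster-dc-rack-0"), 8))
}
```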
Force-pushed from b9aa243 to 1f80b92
Hi, apologies for the delay getting back, with everything that has been going on. This is large enough that I don't want you and me to context switch in small batches; I'd rather allocate a continuous chunk to get this finished. I'll be on PTO next week, but I have scheduled this for the top of my list when I get back. If I don't get this submitted by Apr 30th EOD, please ping me!
left a couple of comments, only skimmed through the tests for now
	},
)
if err != nil {
	objectErrs = append(objectErrs, err)
I know the other invocations don't wrap this error, but is there actually a reason for not wrapping it?
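For example (the message text here is just an illustration, not the repo's actual wording):

```go
if err != nil {
	// Wrapping keeps the cause available to errors.Is/errors.As and adds context.
	objectErrs = append(objectErrs, fmt.Errorf("can't make required endpoint slices: %w", err))
}
```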
PublishNotReadyAddresses: true,
Type:                     corev1.ServiceTypeClusterIP,
Ports:                    servicePorts(sc),
Selector:                 nil,
nit: shall we have a comment explaining why there's no selector? Or maybe even a higher-level comment how we work around it? It's obvious in this PR's context, but probably not so much for someone looking at it in the future. On the other hand it can be backtracked to this commit, so I'm just leaving it as a nit.
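A sketch of what such a comment could look like, on a compilable stand-in for the spec above (the function name and parameters are made up; field values mirror the diff):

```go
package resource

import (
	corev1 "k8s.io/api/core/v1"
)

// selectorlessMemberServiceSpec mirrors the spec from the diff above, with
// the explanatory comment the nit asks for (ports elided). Illustrative only.
func selectorlessMemberServiceSpec(ports []corev1.ServicePort) corev1.ServiceSpec {
	return corev1.ServiceSpec{
		// Intentionally no selector: a selector would make
		// kube-controller-manager own the Endpoints/EndpointSlices and derive
		// their readiness from whole-Pod readiness. The operator reconciles
		// them itself so readiness can be decided per container/port.
		Selector:                 nil,
		Type:                     corev1.ServiceTypeClusterIP,
		Ports:                    ports,
		PublishNotReadyAddresses: true,
	}
}
```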
	`,
},
ReadinessProbe: &corev1.Probe{
	TimeoutSeconds: int32(30),
nit: shall we have comments explaining these values?
	return endpointSlices, fmt.Errorf("can't get Pod %q: %w", naming.ManualRef(sc.Namespace, svc.Name), err)
}

// Don't publish endpoints for Pod that are being deleted.
- // Don't publish endpoints for Pod that are being deleted.
+ // Don't publish endpoints for Pods that are being deleted.
// Don't publish endpoints for Pod that are being deleted.
// Removing endpoints prevents from new connections being established while still allowing
// for existing connections to survive and finish their requests.
// We need to do this early, as gap between when Pod is actually deleted and Endpoint
- // We need to do this early, as gap between when Pod is actually deleted and Endpoint
+ // We need to do this early, as a gap between when Pod is actually deleted and Endpoint
}

for i := range tt {
	tc := tt[i]
no need for this anymore
}

for i := range tt {
	tc := tt[i]
ditto
}

for i := range tt {
	tc := tt[i]
ditto
	t.Errorf("expected found %#v, got %#v", tc.expectedFound, found)
}
if !reflect.DeepEqual(container, tc.expectedContainer) {
	t.Errorf("expected container %#v, got %#v", tc.expectedContainer, container)
can we use cmp.Diff instead?
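For example, with github.com/google/go-cmp/cmp, which the surrounding tests already use for other assertions:

```go
if diff := cmp.Diff(tc.expectedContainer, container); diff != "" {
	t.Errorf("expected and actual container differ: %s", diff)
}
```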
collidedContainerName, collision := hashes[hash]
if collision {
	t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName)
- t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName)
+ t.Errorf("found collision on container endpoint ports, both pod and its controlled ports and %q container with its serving port generate same hash", collidedContainerName)
	t.Fatal(err)
}
if !apiequality.Semantic.DeepEqual(endpoints, tc.expectedEndpoints) {
	t.Errorf("expected and actual Endpoints(s) differ: %s", cmp.Diff(tc.expectedEndpoints, endpoints))
nit: Endpoints(s) doesn't really make sense
	},
},
{
	name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
- name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
+ name: "EndpointSlices for Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
for consistency mostly
	},
},
{
	name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
- name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
+ name: "EndpointSlice has IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
for _, svcPort := range svc.Spec.Ports {
	containerServingPort, ok, err := controllerhelpers.FindContainerServingPort(svcPort, sts.Spec.Template.Spec.Containers)
	if err != nil {
		t.Fatal(err)
	}

	ep := discoveryv1.EndpointPort{
		Name:        pointer.Ptr(svcPort.Name),
		Port:        pointer.Ptr(svcPort.Port),
		Protocol:    pointer.Ptr(svcPort.Protocol),
		AppProtocol: svcPort.AppProtocol,
	}

	if ok && containerServingPort.ReadinessProbe != nil {
		containerPortsMap[containerServingPort.Name] = append(containerPortsMap[containerServingPort.Name], ep)
	} else {
		podPorts = append(podPorts, ep)
	}
}
could this be abstracted to a func shared between the test and implementation, so that we're sure they don't diverge?
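A sketch of what such a shared helper could look like (the name, placement, and exact signature are assumptions):

```go
// GroupEndpointPortsByServingContainer classifies Service ports the way the
// controller does: ports served by a container that has its own readiness
// probe follow that container; everything else follows whole-Pod readiness.
func GroupEndpointPortsByServingContainer(svc *corev1.Service, containers []corev1.Container) ([]discoveryv1.EndpointPort, map[string][]discoveryv1.EndpointPort, error) {
	var podPorts []discoveryv1.EndpointPort
	containerPorts := map[string][]discoveryv1.EndpointPort{}

	for _, svcPort := range svc.Spec.Ports {
		containerServingPort, ok, err := FindContainerServingPort(svcPort, containers)
		if err != nil {
			return nil, nil, fmt.Errorf("can't find container serving port %q: %w", svcPort.Name, err)
		}

		ep := discoveryv1.EndpointPort{
			Name:        pointer.Ptr(svcPort.Name),
			Port:        pointer.Ptr(svcPort.Port),
			Protocol:    pointer.Ptr(svcPort.Protocol),
			AppProtocol: svcPort.AppProtocol,
		}

		if ok && containerServingPort.ReadinessProbe != nil {
			containerPorts[containerServingPort.Name] = append(containerPorts[containerServingPort.Name], ep)
		} else {
			podPorts = append(podPorts, ep)
		}
	}

	return podPorts, containerPorts, nil
}
```

The test could then build its expected EndpointPorts through the same function, so the two can't silently diverge.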
	},
},
{
	name: "multi EndpointSlices per Service",
nit: multiple
	t.Errorf("expected error %#v, got %#v", tc.expectedError, err)
}
if !reflect.DeepEqual(found, tc.expectedFound) {
	t.Errorf("expected found %#v, got %#v", tc.expectedFound, found)
%t instead of %#v?
@zimnx can you please link the issue on the kubernetes repo we've talked about filing in the past, that explains the problem and why the existing API concepts didn't cover it? Was there any feedback you got on the approach, or alternative ideas from the community?
pod, err := podLister.Pods(sc.Namespace).Get(svc.Name)
if err != nil {
	if apierrors.IsNotFound(err) {
		continue
V4 log?
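For example (message and key names are illustrative):

```go
if apierrors.IsNotFound(err) {
	klog.V(4).InfoS("Pod matching member Service doesn't exist yet, skipping its endpoints",
		"Service", naming.ManualRef(sc.Namespace, svc.Name))
	continue
}
```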
@@ -249,3 +251,38 @@ func GetScyllaContainerID(pod *corev1.Pod) (string, error) {

	return cs.ContainerID, nil
}

func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
- func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
+ func FindContainerServingPort(svcPort corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
port is ambiguous
if port.Name == cp.Name {
	return true
}

if port.Port == cp.ContainerPort {
	return true
}
what if the name matches but not the port?
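One way to tighten that, as a sketch (the helper name is made up): prefer the name match when both ports are named, and fall back to the numeric port only for unnamed ones.

```go
// containerPortMatches requires named ports to agree on the name, so a
// matching name can't be combined with a mismatching number (or vice versa).
func containerPortMatches(svcPort corev1.ServicePort, cp corev1.ContainerPort) bool {
	if len(svcPort.Name) != 0 && len(cp.Name) != 0 {
		return svcPort.Name == cp.Name
	}
	return svcPort.Port == cp.ContainerPort
}
```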
SidecarInjectorContainerName              = "sidecar-injection"
PerftuneContainerName                     = "perftune"
CleanupContainerName                      = "cleanup"
InterNodeTrafficProbeSidecarContainerName = "inter-node-traffic-probe-sidecar"
- InterNodeTrafficProbeSidecarContainerName = "inter-node-traffic-probe-sidecar"
+ InterNodeTrafficProbeContainerName = "inter-node-traffic-probe"
return h, err | ||
} | ||
|
||
fnvHash := fnv.New32a() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encoding sha512 wouldn't solve it, as it's too long to be used as part of a name
I didn't get that. Encoding converts the char sets. Hashes (and their encodings) can be arbitrarily truncated as they have random distribution.
@zimnx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Kubernetes reconciles Endpoints and EndpointSlices for Services that have a
selector by taking the readiness of the entire Pod into account. Because we
need to allow inter-node communication before CQL traffic, we used
PublishNotReadyAddresses on Services to overcome this.
But at the same time, we would like to stop accepting new connections when a
Pod is tearing down. These two requirements contradict each other.
To satisfy both requirements, the Operator will reconcile Endpoints and
EndpointSlice resources in a per-container way. If a Pod container specifies
the port and has its own readiness probe, the Endpoint/EndpointSlice will
become ready for this port when the given container is ready. If these two
conditions aren't met, the given port becomes ready when the entire Pod is
ready.
The controller logic will consider a Pod being deleted (non-nil
deletionTimestamp) as fully non-ready, so that all endpoints for all ports
and containers will be removed.
An additional container added to the Scylla Pod controls readiness of the
ports used for inter-node communication.
With the above logic, nodes observing another node going down won't be able
to reconnect until the restarted node opens its inter-node communication
port, because traffic is rejected immediately when the node starts
terminating. This fixes the traffic disruption during rolling restarts seen
on ScyllaClusters using ClusterIP for inter-node communication, which was
caused by nodes being stuck on reconnection attempts.
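In rough Go, the per-port readiness decision described above boils down to something like this (function and parameter names are illustrative, not the actual controller code):

```go
// portReady decides readiness of a single endpoint port for a member Pod.
func portReady(pod *corev1.Pod, servingContainer string, hasOwnReadinessProbe bool) bool {
	// A Pod being deleted is fully non-ready: all of its endpoints are
	// withdrawn so no new connections are routed to it.
	if pod.DeletionTimestamp != nil {
		return false
	}

	// Ports served by a container with its own readiness probe follow that
	// container's readiness.
	if hasOwnReadinessProbe {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.Name == servingContainer {
				return cs.Ready
			}
		}
		return false
	}

	// Otherwise the port follows the readiness of the entire Pod.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```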
Prerequisites:
Fixes #1077