
Run manager with 2 replicas #180

Open
wants to merge 1 commit into base: main
Conversation

@k-keiichi-rh (Contributor) commented Jan 27, 2024

This PR runs the manager with 2 replicas for high availability.
We are testing whether this resolves the issue tracked in https://issues.redhat.com/browse/ECOPROJECT-1855.
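For context, a minimal sketch of the kind of change involved, assuming the usual operator-sdk layout (config/manager/manager.yaml); the exact file and fields touched by this PR may differ:

    # Sketch only: illustrates raising the manager replica count, not the exact diff in this PR.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: controller-manager
    spec:
      replicas: 2  # was 1; run a second manager instance for high availability
      selector:
        matchLabels:
          control-plane: controller-manager
      template:
        metadata:
          labels:
            control-plane: controller-manager
        spec:
          containers:
          - name: manager
            image: controller:latest  # placeholder image name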

openshift-ci bot commented Jan 27, 2024

Hi @k-keiichi-rh. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mshitrit marked this pull request as draft January 28, 2024

@mshitrit (Member):

/test 4.14-openshift-e2e

@k-keiichi-rh changed the title from "[WIP] Run manager with 2 replicas" to "Run manager with 2 replicas" on Feb 22, 2024
Signed-off-by: Keiichi Kii <k-keiichi@nec.com>
Signed-off-by: Keiichi Kii <kkii@redhat.com>
@k-keiichi-rh (Contributor, Author):

Just increasing the replicas to 2 is not enough to resolve ECOPROJECT-1855: anti-affinity is not configured on the deployment, so both controllers may end up running on the same node.
The PR now also configures podAntiAffinity, following https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

@mshitrit (Member):

/test 4.14-openshift-e2e

@mshitrit (Member):

Just increasing the replicas to 2 is not enough to resolve ECOPROJECT-1855: anti-affinity is not configured on the deployment, so both controllers may end up running on the same node.

That's an interesting point; we should probably consider it for our other operators, which use the same approach.

However, from what I understand, the change (i.e. anti-affinity for control-plane) only makes sure the manager pod isn't running on a control-plane node, so I'm not sure how it would solve the problem.
Am I missing something?

@k-keiichi-rh (Contributor, Author) commented Feb 22, 2024

The following configuration ensures that no two pods with the label "control-plane: controller-manager" can't be scheduled on the same node(host).

              affinity:
                podAntiAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                      - key: control-plane
                        operator: In
                        values:
                        - controller-manager
                    topologyKey: kubernetes.io/hostname

For example, the deployment tries to schedule two manager pods, but only one worker node is available at the moment.
In my environment, the manager pod doesn't have a toleration matching the "node-role.kubernetes.io/master" taint, so the manager pods can only be deployed on worker nodes.

[root@lbdns self-node-remediation]# oc get node
NAME       STATUS                     ROLES                  AGE   VERSION
master-0   Ready                      control-plane,master   43d   v1.27.6+f67aeb3
master-1   Ready                      control-plane,master   43d   v1.27.6+f67aeb3
master-2   Ready                      control-plane,master   43d   v1.27.6+f67aeb3
worker-0   Ready                      worker                 43d   v1.27.6+f67aeb3
worker-1   Ready,SchedulingDisabled   worker                 43d   v1.27.6+f67aeb3
worker-2   Ready,SchedulingDisabled   worker                 43d   v1.27.6+f67aeb3

Before applying this patch, both manager pods could be deployed on worker-0. Now that the above anti-affinity is configured on the deployment, the 2nd manager pod can't be scheduled on worker-0 and ends up in "Pending" status.

[root@lbdns self-node-remediation]# oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
self-node-remediation-controller-manager-88646cd74-cs5m8          1/1     Running     0          20m     10.128.2.69   worker-0   <none>           <none>
self-node-remediation-controller-manager-88646cd74-jbglq          0/1     Pending     0          2m29s   <none>        <none>     <none>           <none>

The 1st manager pod, self-node-remediation-controller-manager-88646cd74-cs5m8, is already deployed on worker-0.
It carries the "control-plane=controller-manager" label, so no other pod with that label can be scheduled on worker-0. This is why the 2nd manager pod is in "Pending" status.

[root@lbdns self-node-remediation]# oc describe pod self-node-remediation-controller-manager-88646cd74-cs5m8
Name:                 self-node-remediation-controller-manager-88646cd74-cs5m8
Namespace:            openshift-operators
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      self-node-remediation-controller-manager
Node:                 worker-0/192.168.28.50
Start Time:           Thu, 22 Feb 2024 16:10:31 +0000
Labels:               control-plane=controller-manager
                      pod-template-hash=88646cd74
                      self-node-remediation-operator=

We can confirm from the events that the anti-affinity is working as expected.

[root@lbdns self-node-remediation]# oc describe pod self-node-remediation-controller-manager-88646cd74-jbglq
...
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  77s (x4 over 4m31s)  default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 5 Preemption is not helpful for scheduling..

After one more worker node (worker-1) becomes available, the 2nd manager pod is scheduled on worker-1 shortly afterwards.

[root@lbdns self-node-remediation]# oc adm uncordon worker-1
node/worker-1 uncordoned
[root@lbdns self-node-remediation]# oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
ed37de5c79174808125e3dbffd6aef0c79224569c22a7bd5a9056f7af1gh962   0/1     Completed   0          15h   10.128.2.10   worker-0   <none>           <none>
quay-io-kkii-self-node-remediation-operator-bundle-latest         1/1     Running     0          15h   10.128.2.9    worker-0   <none>           <none>
self-node-remediation-controller-manager-88646cd74-cs5m8          1/1     Running     0          42m   10.128.2.69   worker-0   <none>           <none>
self-node-remediation-controller-manager-88646cd74-jbglq          0/1     Running     0          24m   10.131.0.21   worker-1   <none>           <none>
self-node-remediation-ds-c74fx                                    1/1     Running     0          15h   10.129.2.17   worker-2   <none>           <none>
self-node-remediation-ds-jgjlj                                    1/1     Running     0          15h   10.131.0.16   worker-1   <none>           <none>
self-node-remediation-ds-nbslh                                    1/1     Running     0          15h   10.129.0.44   master-1   <none>           <none>
self-node-remediation-ds-p5fsv                                    1/1     Running     0          15h   10.128.0.20   master-0   <none>           <none>
self-node-remediation-ds-x4g7p                                    1/1     Running     0          15h   10.128.2.12   worker-0   <none>           <none>
self-node-remediation-ds-x8f67                                    1/1     Running     0          15h   10.130.0.40   master-2   <none>           <none>

@mshitrit (Member):

Thanks for the explanation, and really good job catching this use case!

/lgtm

Just a small Nit: I think you meant "can" instead of "can't"

no two pods with the label "control-plane: controller-manager" can't be

@openshift-ci bot added the lgtm label Feb 26, 2024
@mshitrit marked this pull request as ready for review February 26, 2024
openshift-ci bot commented Feb 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: k-keiichi-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes (Member) commented Feb 26, 2024

/hold

@slintes (Member) left a comment:

In general, good catch with the affinity. I have some questions/concerns though, see the comments inline.

@@ -337,6 +337,16 @@ spec:
        control-plane: controller-manager
        self-node-remediation-operator: ""
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
Member:

Would this fail operator installation in case we only have one node available during installation?

Also, what happens on updates, when there aren't enough nodes to start the pods of the new version?

(The latter can be solved with matchLabelKeys and pod-template-hash, but that's alpha in k8s 1.29 only. topologySpreadConstraints, also with matchLabelKeys and pod-template-hash, would also work AFAIU, but there matchLabelKeys is also too new: beta since k8s 1.27 / OCP 4.14.)

Maybe preferredDuringSchedulingIgnoredDuringExecution is a better choice? It could potentially still place both pods on the same node, but that's better than nothing, isn't it? However, I'm not sure if that would still make a difference compared to the default scheduling behaviour, which already respects ReplicaSets... 🤔
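For reference, a minimal sketch of the soft (preferred) variant being discussed; the label selector simply mirrors the required rule quoted above and is illustrative only:

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100  # soft preference: the scheduler tries to spread, but may co-locate if it must
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: control-plane
                operator: In
                values:
                - controller-manager
            topologyKey: kubernetes.io/hostname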

Member:

looks like the update concern can also be solved with this:

strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

it will force k8s to first delete one of the pods before starting a new one

Contributor Author:

Would this fail operator installation in case we only have one node available during installation?

I will test this case. In my opinion, the operator installation should fail.
There are the following cases:

  1. We have only one node available, like SNO (Single Node OpenShift)
    => No reason to deploy SNR/FAR/NHC
  2. We plan to deploy a cluster with multiple nodes, but only one node is available during installation due to node failures
    => We don't want the cluster deployment to succeed without noticing the node failures; the operator installation should stop.

looks like the update concern can also be solved with this:

strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

it will force k8s to first delete one of the pods before starting a new one

I will fix it as above. The Workload Availability project also has the Node Maintenance Operator for maintenance, so we additionally need to define a PodDisruptionBudget like https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/thanos-querier/pod-disruption-budget.yaml.
It will prevent all of the managers from being stopped accidentally.
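A minimal sketch of what such a PodDisruptionBudget might look like for this deployment; the name and label selector here are assumptions based on the snippets in this thread, not a final manifest:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: self-node-remediation-controller-manager-pdb  # hypothetical name
      namespace: openshift-operators
    spec:
      minAvailable: 1  # keep at least one manager running during voluntary disruptions
      selector:
        matchLabels:
          self-node-remediation-operator: ""  # SNR-specific manager label (see the review comment below)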

Member:

We have only one node available like SNO(Single Node OpenShift)
=> No reason to deploy SNR/FAR/NHC

agree

We have a plan to deploy a cluster with multiple nodes, but there is only one node available during installation due to the node failures
=> We are unhappy to succeed to deploy the cluster without noticing the node failures. It should stop the operator installation.

Hm... yes, actually it does not make sense to succeed; it wouldn't help fix the cluster, because the agents would not be deployed on the failed nodes.

we also need to define PodDisruptionBudget

Let's discuss this in an issue please. As with this affinity, we need to be aware of side effects. Let's not make our operators more important than they are, compared to other deployments.

requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
    matchExpressions:
    - key: control-plane
Member:

control-plane: controller-manager is a default label added by operator-sdk when scaffolding a new operator, so there is a very real risk that other operators' pods also carry this label, which can lead to undesired behavior. Please use the self-node-remediation-operator: "" label instead; it is unique to SNR and also doesn't conflict with the daemonset pods.

Contributor Author:

I will use the self-node-remediation-operator: "" label. Thank you for the suggestion.
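For illustration, the anti-affinity term from above rewritten against the unique label; a sketch of the suggested change, not necessarily the exact manifest that will land in the PR:

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              self-node-remediation-operator: ""  # SNR-specific manager label, empty value
          topologyKey: kubernetes.io/hostname  # at most one matching pod per node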

@mshitrit (Member) commented Mar 7, 2024

/test ?

openshift-ci bot commented Mar 7, 2024

@mshitrit: The following commands are available to trigger required jobs:

  • /test 4.12-ci-index-self-node-remediation-bundle
  • /test 4.12-images
  • /test 4.12-openshift-e2e
  • /test 4.12-test
  • /test 4.13-ci-index-self-node-remediation-bundle
  • /test 4.13-images
  • /test 4.13-openshift-e2e
  • /test 4.13-test
  • /test 4.14-ci-index-self-node-remediation-bundle
  • /test 4.14-images
  • /test 4.14-openshift-e2e
  • /test 4.14-test
  • /test 4.15-ci-index-self-node-remediation-bundle
  • /test 4.15-images
  • /test 4.15-openshift-e2e
  • /test 4.15-test

Use /test all to run all jobs.

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mshitrit (Member) commented Mar 7, 2024

/test 4.14-openshift-e2e

openshift-ci bot commented May 27, 2024

@k-keiichi-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/4.16-openshift-e2e
Commit: 136e5f9
Required: true
Rerun command: /test 4.16-openshift-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
