upgrade: retry if default DSCI creation fails #1008

Merged

@ykaliuta ykaliuta commented May 13, 2024

Description

After leader election was removed, the operator fails to start if it is instructed to create the default DSCI. It looks like the webhook is not yet ready at that point:

create default DSCI CR.
{"level":"error","ts":"2024-05-13T09:25:58Z","logger":"setup","msg":"unable to create initial setup for the operator","error":"Internal error occurred: failed calling webhook \"operator.opendatahub.io\": failed to call webhook: Post \"https://opendatahub-operator-controller-manager-service.oo-2ts9m.svc:443/validate-opendatahub-io-v1?timeout=10s\": no endpoints available for service \"opendatahub-operator-controller-manager-service\"","stacktrace":"main.main.func1\n\t/workspace/main.go:200\nsigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/manager.go:336\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}

Leader election previously added some delay, which apparently gave the webhook time to become ready.

The problem does not occur in the default configuration, since the manifests explicitly disable DSCI creation:

       containers:
       - command:
         - /manager
         env:
           - name: DISABLE_DSC_CONFIG
             value: 'true'
         args:
         - --operator-name=opendatahub
         image: controller:latest

Add a wrapper function, cluster.CreateWithRetry, that retries client.Object creation until a timeout expires. It uses a 5s interval, which simply seems reasonable, and takes the timeout in minutes as a parameter.

This requires disabling the nilerr linter for that code: for the polling function, an error from creation is a valid condition, one that the function waits to go away.

Fixes: 3610b0b ("feat: remove leader election for operator (#1000)")

How Has This Been Tested?

  • Enable DSCI creation in the manifests.
  • Deploy the operator.
  • Check the operator pod logs.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@ykaliuta ykaliuta requested a review from AjayJagan May 13, 2024 17:25
@openshift-ci openshift-ci bot requested review from aslakknutsen and zdtsw May 13, 2024 17:25

@AjayJagan AjayJagan left a comment

lgtm

@VaishnaviHire

/retest

openshift-ci bot commented May 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AjayJagan, zdtsw

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ykaliuta

/retest

@ykaliuta

Actually, a generic is overkill here; taking an interface is enough.

@AjayJagan

/retest

@ykaliuta

/retest-required

@AjayJagan

/lgtm

@openshift-ci openshift-ci bot added the lgtm label May 14, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit e26100e into opendatahub-io:incubation May 14, 2024
8 checks passed
VaishnaviHire added a commit that referenced this pull request May 14, 2024
* Update version to v2.12.0 (#1007)

* upgrade: retry if default DSCI creation fails (#1008)

---------

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Co-authored-by: Yauheni Kaliuta <ykaliuta@redhat.com>