
Handle GroupDiscoveryFailedError in proxy.go (#5596) #6222

Conversation

@gabe-l-hart commented Dec 15, 2022

Description of the change:

Similar to the fix in Helm (helm/helm#6361), this change allows a GroupDiscoveryFailedError to be tolerated rather than aborting the process of managing apiserver resource types.

Motivation for the change:

The fix proposed in #5596 requires repairing the cluster's state (upgrading or cleaning up the metrics-server apiserver resource). While that is correct, it is not a fix the author of an operator can make, since it requires the user of the operator to fix their cluster. Handling the error on the client side (in the operator itself) will save headaches and support tickets in the future.
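For context, here is a minimal Go sketch of the pattern this change follows (this is not the actual proxy.go diff; the kubeconfig loading and logging below are placeholders). client-go's discovery can return partial results together with an ErrGroupDiscoveryFailed, and discovery.IsGroupDiscoveryFailedError lets the caller log the unreachable groups and continue, which is essentially what the Helm fix does:

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig loading; proxy.go obtains its rest.Config differently.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// ServerGroupsAndResources can return partial results together with an
	// ErrGroupDiscoveryFailed when an aggregated apiserver (e.g. an orphaned
	// metrics-server APIService) is unreachable.
	_, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		if discovery.IsGroupDiscoveryFailedError(err) {
			// Tolerate the failure: skip the broken group(s) and keep
			// working with the resource types that were discovered.
			log.Printf("WARNING: skipping unreachable API groups: %v", err)
		} else {
			log.Fatal(err)
		}
	}

	for _, list := range resources {
		fmt.Println(list.GroupVersion)
	}
}
```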

Checklist

If the pull request includes user-facing changes, extra documentation is required:

…work#5596)

Similar to the fix in helm (helm/helm#6361), this
fix allows GroupDiscoveryFailedError to not error out the process of
managing apiserver resource types.

Fixes operator-framework#5596

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@openshift-ci bot added the do-not-merge/work-in-progress label Dec 15, 2022
Add a changelog fragment for the merge automation

operator-framework#5596

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart (Author)

Hi maintainers! I think this fix should address the linked issue, but I'm still working on getting a valid repro case so I can fully validate it. I also plan to look into a unit test that exercises the corner case before moving to full review.

@gabe-l-hart (Author)

I've verified that make test-e2e-ansible passes cleanly. I'm still learning my way around the test framework, but it looks like the most appropriate place to add tests would be in cluster_tests.go. It's not quite clear to me how we would simulate this kind of apiserver failure in those tests, so I'm open to suggestions!

@gabe-l-hart marked this pull request as ready for review December 15, 2022 21:16
@openshift-ci bot removed the do-not-merge/work-in-progress label Dec 15, 2022
@everettraven (Contributor)

@gabe-l-hart Thanks for taking the time to make this contribution!

I looked through the changes and they seem relatively straightforward. I would definitely like to find a way to add a test case for this PR before merging, though, to ensure the change works as expected. cluster_test.go looks to be the right place to add one.

Regarding the implementation of the test case itself, this comment on the Helm issue you linked makes me think that, as part of the test case, we would need to install an APIService onto the test cluster and then intentionally orphan it somehow (I'm honestly not sure how to do this - I'm not very familiar with aggregated apiservers).

@gabe-l-hart (Author)

Thanks for the pointer @everettraven! I'll look into adding a test and see how far I get. I also don't know yet how to simulate an orphaned apiserver (we just saw this come up on a customer cluster, but I have no idea how it got in that state!) but I'm sure it should be doable somehow, if not in a real cluster at least in some kind of a mock.

@everettraven (Contributor)

> I also don't know yet how to simulate an orphaned apiserver (we just saw this come up on a customer cluster, but I have no idea how it got in that state!) but I'm sure it should be doable somehow, if not in a real cluster at least in some kind of a mock.

@gabe-l-hart I was doing some pretty rudimentary digging around and I think a good starting point for understanding how the apiservers work could be: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/

I also saw this: https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/api-service-v1/ - it covers the APIService resource, and I wonder if we could use it to make the test cluster believe there is an apiserver running for a specific group/version without an actual backend implemented (maybe this would cause the error to happen?).
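A rough sketch of that idea, assuming an e2e helper written in Go with controller-runtime and the kube-aggregator types (the group, version, namespace, and Service name below are made up for illustration): registering an APIService whose backing Service never exists should leave that group/version permanently unavailable and trigger the discovery failure.

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	apiregistrationv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createOrphanedAPIService registers an aggregated API with no backend so
// that group discovery for fake.example.com/v1alpha1 fails on the cluster.
func createOrphanedAPIService(ctx context.Context) error {
	scheme := runtime.NewScheme()
	if err := apiregistrationv1.AddToScheme(scheme); err != nil {
		return err
	}

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		return err
	}

	port := int32(443)
	apiSvc := &apiregistrationv1.APIService{
		ObjectMeta: metav1.ObjectMeta{Name: "v1alpha1.fake.example.com"},
		Spec: apiregistrationv1.APIServiceSpec{
			Group:                 "fake.example.com",
			Version:               "v1alpha1",
			InsecureSkipTLSVerify: true,
			// Points at a Service that is never created, so the aggregated
			// API stays unavailable and discovery of this group fails.
			Service: &apiregistrationv1.ServiceReference{
				Namespace: "default",
				Name:      "does-not-exist",
				Port:      &port,
			},
			GroupPriorityMinimum: 100,
			VersionPriority:      100,
		},
	}
	return c.Create(ctx, apiSvc)
}
```

Deleting the APIService at the end of the test should restore normal discovery on the cluster.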

Just some more thoughts from doing some brief digging around about aggregated apiservers - I'm looking forward to seeing what you come up with (and feel free to ping me on here if you just want to bounce ideas around :))

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label Mar 22, 2023
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 21, 2023
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci bot closed this May 22, 2023
@openshift-ci bot commented May 22, 2023

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
