TRT-1576: Fail if operator has Available=False unless in upgrade window #28735

Merged

Contributor

@DennisPeriquet DennisPeriquet commented Apr 23, 2024

For the test "[bz-%v] clusteroperator/%v should not change condition/Available":

  • For non-upgrade jobs, fail when an operator goes to Available=False.
  • For upgrade jobs, fail when an operator goes to Available=False unless it happens during an upgrade window and the condition lasts for less than 10 minutes.

Once the PR where storage operator stops reporting Available status merges, we can remove the exception for it.
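
To make the rule above concrete, here is a minimal sketch of the pass/fail decision in Python. It is not the code from this PR (origin is written in Go). The names `Interval`, `should_fail`, `MAX_UPGRADE_BLIP`, and `EXCEPTED_OPERATORS` are hypothetical, and two points are assumptions rather than facts from the description: that "during an upgrade window" means the Available=False interval lies entirely inside a window, and that the storage exception applies to every job type.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

# From the PR description: blips shorter than 10 minutes are tolerated during upgrades.
MAX_UPGRADE_BLIP = timedelta(minutes=10)

# Hypothetical exception list; the description says storage is excepted for now.
EXCEPTED_OPERATORS = frozenset({"storage"})


@dataclass(frozen=True)
class Interval:
    """A closed time span, used for both Available=False periods and upgrade windows."""
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start

    def contains(self, other: "Interval") -> bool:
        # Assumption: "during an upgrade window" means fully inside the window.
        return self.start <= other.start and other.end <= self.end


def should_fail(operator: str,
                unavailable: Interval,
                is_upgrade_job: bool,
                upgrade_windows: Iterable[Interval]) -> bool:
    """Return True if this Available=False interval should fail the test."""
    if operator in EXCEPTED_OPERATORS:
        # Assumption: the exception applies regardless of job type.
        return False
    if not is_upgrade_job:
        # Non-upgrade jobs: any Available=False is a failure.
        return True
    in_window = any(w.contains(unavailable) for w in upgrade_windows)
    # Upgrade jobs: tolerate only short blips that fall inside an upgrade window.
    return not (in_window and unavailable.duration < MAX_UPGRADE_BLIP)
```

Under these assumptions, a 3-minute blip inside an upgrade window passes, while a 15-minute outage in the same window, or any blip outside the window, fails.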

@openshift-ci openshift-ci bot requested review from deads2k and soltysh April 23, 2024 11:25
@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Apr 23, 2024
@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

This will see if my new exception allows the upgrade job to pass despite the single storage operator replica.

Contributor

openshift-ci bot commented Apr 23, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/272b5a20-0187-11ef-95a0-20b3d6d376a7-0

@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

Retrying because the last one didn't really run.

Contributor

openshift-ci bot commented Apr 23, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/61bc6960-0194-11ef-8313-791cce82a878-0

@openshift-trt-bot

Job Failure Risk Analysis for sha: 63d0936

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd IncompleteTests
Tests for this run (16) are below the historical average (536): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@openshift-trt-bot

Job Failure Risk Analysis for sha: 3014822

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd IncompleteTests
Tests for this run (25) are below the historical average (531): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@DennisPeriquet DennisPeriquet changed the title from "DO NOT MERGE: See how many jobs fail with Degraded=True and Available=False" to "DO NOT MERGE: See how many jobs fail with Available=False" Apr 26, 2024
@openshift-trt-bot

Job Failure Risk Analysis for sha: d950634

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-csi High
[OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace
This test has passed 100.00% of 25 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-csi'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade Medium
[OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace
This test has passed 96.70% of 818 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-upgrade' 'periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade'] in the last 14 days.

@openshift-trt-bot

Job Failure Risk Analysis for sha: 2e4493a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade High
[sig-apps] job-upgrade
This test has passed 100.00% of 32 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Low
[bz-apiserver-auth] clusteroperator/authentication should not change condition/Available
This test has passed 0.00% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
Single short-lived operand blip shouldn't cause authentication operator Available=False
---
[bz-Storage] clusteroperator/storage should not change condition/Available
This test has passed 0.00% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
Setup new vsphere informing job
---
[sig-arch] events should not repeat pathologically for ns/openshift-etcd-operator
This test has passed 51.61% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
This test has passed 1.61% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
Showing 4 of 12 test results

@DennisPeriquet
Contributor Author

/test unit

@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

Contributor

openshift-ci bot commented Apr 29, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8a3d2950-0627-11ef-99cb-168bfde7d9b7-0

@openshift-trt-bot

Job Failure Risk Analysis for sha: 80a02e7

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade High
[sig-apps] job-upgrade
This test has passed 100.00% of 23 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Low
[bz-Storage] clusteroperator/storage should not change condition/Available
This test has passed 0.00% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
Setup new vsphere informing job
---
[bz-Routing] clusteroperator/ingress should not change condition/Available
This test has passed 0.00% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[bz-Image Registry] clusteroperator/image-registry should not change condition/Available
This test has passed 22.22% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
CI: fail update suite if any ClusterOperator go Available=False outside of updates
---
[bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
This test has passed 2.22% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
Showing 4 of 11 test results

@DennisPeriquet
Contributor Author

/test unit

@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

Contributor

openshift-ci bot commented Apr 30, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6ff37c20-0690-11ef-86e4-c1c128b91d20-0

@DennisPeriquet DennisPeriquet changed the title from "DO NOT MERGE: See how many jobs fail with Available=False" to "TRT-1576: Fail if operator has Available=False unless in upgrade window" Apr 30, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Apr 30, 2024
@openshift-ci-robot

openshift-ci-robot commented Apr 30, 2024

@DennisPeriquet: This pull request references TRT-1576 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Fail the "[bz-%v] clusteroperator/%v should not change condition/Available" test for operators that go Available=False outside of any upgrade window.

Add an exception for storage operator since it has only one replica.

This will give me a list of failures to look into. From the list of failures, we can see if there are already Jiras and decide if we want to add exceptions. Then, we'll update the PR with exceptions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Apr 30, 2024

@DennisPeriquet: This pull request references TRT-1576 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

For the test "[bz-%v] clusteroperator/%v should not change condition/Available":

  • For non-upgrade jobs, fail when an operator goes to Available=False.
  • For upgrade jobs, fail when an operator goes to Available=False unless it happens during an upgrade window and the condition lasts for less than 10 minutes.

Once the PR where storage operator stops reporting Available status merges, we can remove the exception for it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-trt-bot

Job Failure Risk Analysis for sha: efde445

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade Low
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
This test has passed 61.54% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.
---
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available
This test has passed 61.54% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.

Open Bugs
control-plane-machine-set goes Available=False with UnavailableReplicas during updates

@DennisPeriquet
Contributor Author

/test e2e-agnostic-ovn-cmd

@DennisPeriquet
Contributor Author

/test verify

@DennisPeriquet
Contributor Author

The e2e-gcp-ovn test failure is tracked in https://issues.redhat.com/browse/TRT-1680

@DennisPeriquet
Contributor Author

/test e2e-aws-ovn-single-node-upgrade
/test e2e-gcp-ovn

@openshift-trt-bot

Job Failure Risk Analysis for sha: 556ee0a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn Medium
[sig-storage] PersistentVolumes GCEPD [Feature:StorageProvider] should test that deleting the PV before the pod does not cause pod deletion to fail on PD detach [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
This test has passed 93.09% of 680 runs on release 4.17 [Overall] in the last week.

Open Bugs
4.17 ci failures: persistentvolumes "gce-" is forbidden ... GCE PD ...disk is not found
---
[sig-storage] Multi-AZ Cluster Volumes should schedule pods in the same zones as statically provisioned PVs [Suite:openshift/conformance/parallel] [Suite:k8s]
This test has passed 91.18% of 680 runs on release 4.17 [Overall] in the last week.

@openshift-trt-bot

Job Failure Risk Analysis for sha: 3ddbed1

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[bz-Node Tuning Operator] clusteroperator/node-tuning should not change condition/Available
This test has passed 99.79% of 4804 runs on release 4.17 [Overall] in the last week.
---
[sig-network] can collect pod-to-host poller pod logs
This test has passed 99.39% of 4759 runs on release 4.17 [Overall] in the last week.
---
[sig-network] can collect host-to-host poller pod logs
This test has passed 99.39% of 4759 runs on release 4.17 [Overall] in the last week.
---
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 99.75% of 4793 runs on release 4.17 [Overall] in the last week.
---
Showing 4 of 6 test results

@DennisPeriquet
Contributor Author

/test e2e-aws-ovn-serial

@DennisPeriquet
Contributor Author

/test e2e-aws-ovn-single-node-upgrade

@DennisPeriquet
Contributor Author

/test e2e-metal-ipi-sdn

Contributor

openshift-ci bot commented May 29, 2024

@DennisPeriquet: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-aws-jenkins
  • /test e2e-aws-ovn-edge-zones
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-image-registry
  • /test e2e-aws-ovn-serial
  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-builds
  • /test e2e-gcp-ovn-image-ecosystem
  • /test e2e-gcp-ovn-upgrade
  • /test e2e-metal-ipi-ovn-ipv6
  • /test images
  • /test lint
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
  • /test e2e-agnostic-ovn-cmd
  • /test e2e-aws
  • /test e2e-aws-csi
  • /test e2e-aws-disruptive
  • /test e2e-aws-etcd-recovery
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-cgroupsv2
  • /test e2e-aws-ovn-etcd-scaling
  • /test e2e-aws-ovn-kubevirt
  • /test e2e-aws-ovn-single-node
  • /test e2e-aws-ovn-single-node-serial
  • /test e2e-aws-ovn-single-node-upgrade
  • /test e2e-aws-ovn-upgrade
  • /test e2e-aws-ovn-upi
  • /test e2e-aws-proxy
  • /test e2e-azure
  • /test e2e-azure-ovn-etcd-scaling
  • /test e2e-azure-ovn-upgrade
  • /test e2e-baremetalds-kubevirt
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-etcd-scaling
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-ovn-techpreview
  • /test e2e-gcp-ovn-techpreview-serial
  • /test e2e-metal-ipi-ovn
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-dualstack-local-gateway
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack-ovn
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test e2e-vsphere-ovn-dualstack-primaryv6
  • /test e2e-vsphere-ovn-etcd-scaling
  • /test okd-e2e-gcp
  • /test okd-scos-images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-ovn-edge-zones
  • pull-ci-openshift-origin-master-e2e-aws-ovn-fips
  • pull-ci-openshift-origin-master-e2e-aws-ovn-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-e2e-openstack-ovn
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-unit
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps

In response to this:

/test e2e-metal-ipi-sdn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented May 29, 2024

@DennisPeriquet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-metal-ipi-sdn | 97f3a73 | link | false | /test e2e-metal-ipi-sdn
ci/prow/e2e-aws-ovn-single-node-upgrade | 3ddbed1 | link | false | /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt-bot

Job Failure Risk Analysis for sha: 3ddbed1

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[bz-Node Tuning Operator] clusteroperator/node-tuning should not change condition/Available
This test has passed 99.78% of 4507 runs on release 4.17 [Overall] in the last week.

@DennisPeriquet
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) May 30, 2024
@DennisPeriquet
Contributor Author

The pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade job has not had any passing runs in the past 2 months.

@dgoodwin
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) May 30, 2024
Contributor

openshift-ci bot commented May 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DennisPeriquet, dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [DennisPeriquet,dgoodwin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a64ef19 and 2 for PR HEAD 3ddbed1 in total

@openshift-trt-bot

Job Failure Risk Analysis for sha: 3ddbed1

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[bz-Node Tuning Operator] clusteroperator/node-tuning should not change condition/Available
This test has passed 99.98% of 4915 runs on release 4.17 [Overall] in the last week.

@openshift-merge-bot openshift-merge-bot bot merged commit e18dccb into openshift:master May 30, 2024
22 of 23 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-tests-container-v4.17.0-202405310013.p0.ge18dccb.assembly.stream.el9 for distgit openshift-enterprise-tests.
All builds following this will include this PR.

@DennisPeriquet DennisPeriquet deleted the excepted_failures1 branch May 31, 2024 10:07
DennisPeriquet added a commit to DennisPeriquet/origin that referenced this pull request May 31, 2024
…ed_failures1"

This reverts commit e18dccb, reversing
changes made to a64ef19.
openshift-merge-bot bot added a commit that referenced this pull request May 31, 2024
TRT-1691: Revert #28735 "TRT-1576: Fail if operator has Available=False unless in upgrade window"
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • vendor-update: Touching vendor dir or related files