
Release Test Plan 1.5.0


Issues targeted for this release - List of GitHub issues

Contributors testing this release: Chris, Yang, Roger, Khushboo, Ray, James, Jack, Eric

Testing Versions: v1.5.0-rc1, v1.5.0-rc2, v1.5.0-rc3

e2e Pipelines

✅ - Analyze test result of SLES on 1.5.x (AMD) - Roger

v1.5.0-rc1: 8 failures.

Seven of them were due to flaky tests that are already documented in issue #5460.

The remaining issue has been reported in issue #6089.

v1.5.0-rc2: 8 failures.

These failures have already been documented in issue #5460.

v1.5.0-rc3: 3 failures.

These failures have already been documented in issue #5460.

✅ - Analyze test result of SLES on 1.5.x (ARM) - Roger

v1.5.0-rc1: 7 failures.

Five of them were due to flaky tests that are already documented in issue #5460.

The remaining issues have been reported in issues #6076 and #6089.

v1.5.0-rc2: 7 failures.

These failures have already been documented in issue #5460.

v1.5.0-rc3: 5 failures.

These failures have already been documented in issue #5460.

✅ - Analyze upgrade test result of SLES on 1.5.x (AMD) - Yang

v1.5.0-rc1: 6 failures

v1.5.0-rc2: 7 failures

✅ - Analyze upgrade test result of SLES on 1.5.x (ARM) - Yang

v1.5.0-rc1: 5 failures

v1.5.0-rc2: 9 failures

✅ - Analyze two-stage test result of SLES on 1.5.x (AMD) 1.3.2 → 1.4.2 → 1.5.0-rc - Chris

v1.5.0-rc1: 14 failures.

Most of them were known issues already documented in issue #5460.

Rerun pass

  • test_rebuild_with_inc_restoration
  • test_engine_image_not_fully_deployed_perform_engine_upgrade
  • test_recovery_from_im_deletion
  • test_snapshot_hash_detect_corruption_in_global_enabled_mode

v1.5.0-rc2

Rerun pass

  • tests.test_migration.test_migration_with_unscheduled_replica
  • tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode
  • tests.test_system_backup_restore.test_system_backup_with_volume_backup_policy_if_not_present
  • tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode

v1.5.0-rc2 2nd round: 12 failures

Most of them were known issues already documented in issue #5460.

Rerun pass

  • tests.test_ha.test_salvage_auto_crash_all_replicas
  • tests.test_migration.test_migration_confirm

v1.5.0-rc3 - 11 failures

Rerun Pass

  • test_basic.test_backuptarget_available_during_engine_image_not_ready
  • test_basic.test_backup_failed_disable_auto_cleanup[s3]
  • test_ha.test_recovery_from_im_deletion
  • test_node.test_replica_scheduler_rebuild_restore_is_too_big[s3]
  • test_ha.test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume[s3]
  • test_migration.test_migration_with_rebuilding_replica

Failure

  • test_upgrade.test_upgrade[from_transient]: issue #6200

✅ - Analyze two-stage test result of SLES on 1.5.x (ARM) 1.3.2 → 1.4.2 → 1.5.0-rc - Chris

v1.5.0-rc1: 13 failures.

Most of them were known issues already documented in issue #5460.

Rerun pass

  • test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume
  • test_migration.test_migration_with_restore_volume
  • test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode

Passed in RC2 round

  • test_settings.test_setting_concurrent_volume_backup_restore_limit

v1.5.0-rc2: 10 failures

Most of them were known issues already documented in issue #5460.

Rerun pass

  • tests.test_migration.test_migration_with_failed_replica
  • tests.test_settings.test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[s3]

v1.5.0-rc2 2nd round: 7 failures

Rerun pass

  • tests.test_basic.test_engine_image_daemonset_restart
  • tests.test_basic.test_backuptarget_available_during_engine_image_not_ready

v1.5.0-rc3

Rerun pass

  • test_orphan.test_orphan_with_same_orphaned_dir_name_in_another_disk
  • test_ha.test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume

Failure

  • test_upgrade.test_upgrade[from_transient]: issue #6200

Distro Matrix Support (e2e Pipelines)

❌ - SLE Micro 5.3 (AMD & ARM) - Roger

v1.5.0-rc1 AMD64: 13 failures.

Eleven of them were due to flaky tests that are already documented in issue #5460.

The remaining issues have been reported in issues #6076 and #6089.

v1.5.0-rc2 AMD64: 6 failures.

v1.5.0-rc2 ARM64: 5 failures.

These failures have already been documented in issue #5460.

✅ - RHEL 9.1 (AMD & ARM) - Yang

v1.5.0-rc1 amd64: 7 failures

v1.5.0-rc1 arm64: 5 failures

v1.5.0-rc2 amd64: 7 failures

v1.5.0-rc2 arm64: 3 failures

❌ - CentOS 8.5 (AMD & ARM) - Chris: no suitable AMI for testing

Note: CentOS was removed from the verified distro list per https://github.com/longhorn/website/commit/bcbdec90ce9450428891db6660ea38d52f3516b2

v1.5.0-rc1 AMD64, ARM64

v1.5.0-rc2 AMD64, ARM64

No test results on either the amd64 or arm64 pipeline:

Error: Your query returned no results. Please change your search criteria and try again.

  with data.aws_ami.aws_ami_centos,
  on data.tf line 6, in data "aws_ami" "aws_ami_centos":
   6: data "aws_ami" "aws_ami_centos" {

✅ - Ubuntu 22.04 (AMD & ARM) - Ray

v1.5.0-rc1

v1.5.0-rc2

✅❌ - Rocky Linux 9 (AMD & ARM) - Eric

v1.5.0-rc1, round 1, AMD64, ARM64
v1.5.0-rc1, round 2, AMD64, ARM64

11 failures in both AMD64 rounds:
tests.test_backing_image.test_snapshot_with_backing_image - known to be flaky
tests.test_ha.test_rebuild_with_restoration[s3] - test fixed in rc2
tests.test_ha.test_rebuild_with_restoration[nfs] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[s3] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[nfs] - test fixed in rc2
tests.test_ha.test_dr_volume_with_restore_command_error[nfs] - known to be flaky
tests.test_kubernetes.test_kubernetes_status - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[s3] - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[nfs] - known in Kubernetes v1.27
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs]
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky

8 failures in both ARM64 rounds:
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - known to be flaky
tests.test_ha.test_rebuild_with_restoration[s3] - test fixed in rc2
tests.test_ha.test_rebuild_with_restoration[nfs] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[s3] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[nfs] - test fixed in rc2
tests.test_kubernetes.test_kubernetes_status - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[s3] - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[nfs] - known in Kubernetes v1.27

v1.5.0-rc2, round 1, AMD64, ARM64

10 failures in AMD64:
tests.test_csi.test_csi_block_volume_online_expansion - https://github.com/longhorn/longhorn/issues/6076
tests.test_ha.test_rebuild_with_inc_restoration[s3]
tests.test_ha.test_dr_volume_with_restore_command_error[s3] - https://github.com/longhorn/longhorn/issues/6130
tests.test_migration.test_migration_with_rebuilding_replica - known to be flaky
tests.test_node.test_node_eviction_multiple_volume - known to be flaky
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[s3] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[nfs] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs] - has failed every time
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[s3] - known to be flaky
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[nfs]

6 failures in ARM64:
tests.test_basic.test_backup_status_for_unavailable_replicas[s3]
tests.test_ha.test_dr_volume_with_restore_command_error[nfs]
tests.test_ha.test_engine_image_not_fully_deployed_perform_replica_scheduling - https://github.com/longhorn/longhorn/issues/6130
tests.test_infra.test_offline_node
tests.test_ha.test_dr_volume_with_restore_command_error[s3] - https://github.com/longhorn/longhorn/issues/6130
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs]

v1.5.0-rc2, round 2, AMD64

9 failures in AMD64. Most from round 1 did not repeat. All are likely flakes or known issues.
tests.test_basic.test_space_usage_for_rebuilding_only_volume - likely flaky (only failed this way once)
tests.test_ha.test_dr_volume_with_restore_command_error[nfs] - https://github.com/longhorn/longhorn/issues/6130
tests.test_migration.test_migration_with_rebuilding_replica - repeat, known to be flaky
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[nfs] - repeat, https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[s3] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - https://github.com/longhorn/longhorn/issues/6124
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky
tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode - https://github.com/longhorn/longhorn/issues/6129
tests.test_infra.test_offline_node - known to be flaky

❌ - RHEL/CentOS/Rocky Linux SELinux enabled (AMD & ARM) - https://github.com/longhorn/longhorn/issues/5627

v1.5.0-rc1: all tests failed with ERROR.

v1.5.0-rc2: tested outside of the Jenkins platform while waiting on longhorn-tests PRs.

Failed due to flake or known issue:
test_ha.py::test_dr_volume_with_restore_command_error[nfs] - https://github.com/longhorn/longhorn/issues/6130
test_ha.py::test_engine_image_not_fully_deployed_perform_engine_upgrade - known to be flaky, passed on rerun
test_ha.py::test_autosalvage_with_data_locality_enabled - https://github.com/longhorn/longhorn/issues/4814
test_orphan.py::test_orphan_with_same_orphaned_dir_name_in_another_disk - passed on rerun
test_recurring_job.py::test_recurring_jobs_allow_detached_volume[nfs] - https://github.com/longhorn/longhorn/issues/6124
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3] - https://github.com/longhorn/longhorn/issues/6124
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - https://github.com/longhorn/longhorn/issues/6124
test_settings.py::test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky, passed on rerun
test_settings.py::test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[nfs] - known to be flaky, passed on rerun
test_snapshot.py::test_snapshot_hash_detect_corruption_in_global_enabled_mode - https://github.com/longhorn/longhorn/issues/6129, passed on rerun

Failed due to local environment:
test_csi_snapshotter.py::all - did not install snapshot controller
test_infra.py::test_offline_node - did not inject necessary variable

Net-new issue that did not pass over multiple attempts:
test_ha.py::test_recovery_from_im_deletion - https://github.com/longhorn/longhorn/issues/6171

✅ - Oracle Linux (AMD & ARM) - Chris

❌ v1.5.0-rc1 AMD64, ARM64

✅ v1.5.0-rc2 AMD64 - 369 failures, ARM64 - not supported

The pipeline does not support arm64:

Error: Your query returned no results. Please change your search criteria and try again.

  with data.aws_ami.aws_ami_oraclelinux,
  on data.tf line 6, in data "aws_ami" "aws_ami_oraclelinux":
   6: data "aws_ami" "aws_ami_oraclelinux" {

The Oracle AMD64 pipeline hit a volume-cannot-attach issue. Manually applied the workaround and ran the core tests in a local environment, as shown below (screenshot from 2023-06-16 20-24-17).

The failed cases below all passed after rerun:

  • test_restore_inc_with_offline_expansion
  • test_single_replica_failed_during_engine_start

✅ v1.5.0-rc2 2nd round [AMD64]: core tests all passed

Non-e2e Pipelines

✅ - Benchmark (compare with 1.4.2) - Yang

✅ RC1 ✅ RC2

✅ - Air-Gap - Chris

✅ RC1 manifest, helm chart

✅ RC2 - manifest, helm chart

✅ - Vulnerability scanning - Chris

v1.5.0-rc1: 74 failures

v1.5.0-rc2: 39 failures

✅ - Negative testing

Manual Testing

Environment:

❌ - Test Longhorn deployment on RKE2 v1.24- with CIS-1.6 profile - Eric

The test fails in the kubectl case at step 12. longhorn-uninstall-rm58r uses the global-restricted-psp, which disallows containers that run as root, while we clearly intend it to use the longhorn-uninstall-psp. This looks very similar to the failure in a PR for the original issue. That PR originally targeted the Helm case and was only merged into v1.1.1 specifically, so this is all a bit confusing. Worked around it and uninstalled by adding the securityContext from that PR to the job in uninstall.yaml, which forces the uninstall pod to use longhorn-uninstall-psp instead.

> kubectl get -n longhorn-system pod
...
longhorn-uninstall-rm58r                            0/1     CreateContainerConfigError   0          11s

> kubectl describe -n longhorn-system pod longhorn-uninstall-rm58r
Name:         longhorn-uninstall-rm58r
Namespace:    longhorn-system
Priority:     0
Node:         eweber-cis-test-rocky-02/142.93.197.135
Start Time:   Fri, 09 Jun 2023 20:47:01 +0000
Labels:       controller-uid=bef88e5d-019c-401f-8f9d-44e5aec282cf
              job-name=longhorn-uninstall
Annotations:  cni.projectcalico.org/containerID: 90eb1dc91a570495c85223db56a075390a470ca0ca98a109f02783e5b616f7e0
              cni.projectcalico.org/podIP: 10.42.1.49/32
              cni.projectcalico.org/podIPs: 10.42.1.49/32
              kubernetes.io/psp: global-restricted-psp
Status:       Pending
IP:           10.42.1.49
IPs:
  IP:           10.42.1.49
Controlled By:  Job/longhorn-uninstall
Containers:
  longhorn-uninstall:
    Container ID:
    Image:         longhornio/longhorn-manager:v1.5.0-rc1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      longhorn-manager
      uninstall
      --force
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Environment:
      LONGHORN_NAMESPACE:  longhorn-system
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k87m8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-k87m8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  27s                default-scheduler  Successfully assigned longhorn-system/longhorn-uninstall-rm58r to eweber-cis-test-rocky-02
  Normal   Pulled     14s (x4 over 27s)  kubelet            Container image "longhornio/longhorn-manager:v1.5.0-rc1" already present on machine
  Warning  Failed     14s (x4 over 27s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "longhorn-uninstall-rm58r_longhorn-system(3efe856d-cda4-4444-833d-ca68e3ad2d5e)", container: longhorn-uninstall)
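
For reference, a minimal sketch of the workaround in uninstall.yaml. The image, command, and env values are taken from the describe output above; the securityContext content itself comes from the referenced PR, so treat runAsUser: 0 as an assumption about what that change contains. Other fields of the real job (service account, labels, backoff) are omitted.

apiVersion: batch/v1
kind: Job
metadata:
  name: longhorn-uninstall
  namespace: longhorn-system
spec:
  template:
    spec:
      # Assumption: explicitly requesting root makes the pod invalid under
      # global-restricted-psp (runAsNonRoot), so PSP admission falls back to
      # longhorn-uninstall-psp, which allows the uninstall container to run.
      securityContext:
        runAsUser: 0
      containers:
      - name: longhorn-uninstall
        image: longhornio/longhorn-manager:v1.5.0-rc1
        command:
        - longhorn-manager
        - uninstall
        - --force
        env:
        - name: LONGHORN_NAMESPACE
          value: longhorn-system
      restartPolicy: Never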

✅ - Test Longhorn deployment on RKE2 v1.25+ with CIS-1.23 profile - Eric

HA:

✅ - Replica Rebuilding - James

✅ RC1 ✅ RC2

✅ - Entire cluster down - Khushboo

✅ RC1 -

  1. All the nodes including etcd/control plane were power cycled and brought back after 1 hour - PASSED
  2. All the nodes rebooted together - PASSED

Node Down

✅ - Node drain and deletion test - Jack

✅ RC1

✅ - The node the DR volume attached to is rebooted - Roger

✅ RC1 ✅ RC2

Resiliency:

✅ - Test Longhorn components recovery - Jack

✅ RC1

Stability

✅ - Checksum enabled large volume with multiple rebuilding - James

✅ RC1

  • Rebuilding needs about 3 minutes and 20 seconds for v1.3.x and 3 minutes for v1.5.0-rc1

✅ RC2

  • Rebuilding needs about 2 minutes for v1.5.0-rc2

✅ - Uninstallation Checks - James

✅ RC1 ✅ RC2

Upgrade:

✅ - Test Engine Crash During Live Upgrade - Jack

✅ RC1 ✅ RC2 upgrade from v1.4.2 to v1.5.0-rc2

✅ - Kubernetes upgrade test - James

✅ RC1 upgrade k8s/k3s cluster from 1.26.5 to 1.27.2 with and without draining nodes; OS: Ubuntu 22.04; underlying infrastructure: KVM

✅ RC2 upgrade k8s/k3s cluster from 1.26.5 to 1.27.2 with and without draining nodes; OS: Ubuntu 22.04; underlying infrastructure: KVM

✅ - Longhorn upgrade test - Yang

✅ RC3 Upgrade from v1.4.2 to v1.5.0-rc3

✅ - Upgrade Conflict Handling test - Jack

✅ RC1 Upgrade from v1.4.2 to v1.5.0-rc1 ✅ RC2 Upgrade from v1.4.2 to v1.5.0-rc2

Rancher Integration

✅ - Drain using Rancher UI - Khushboo

✅ - Longhorn in a hardened cluster - Khushboo

✅ - Upgrade Kubernetes using Rancher UI - Khushboo

Chart Testing

✅❌ - Rancher v2.7.x - Khushboo

   Prereq: Set Concurrent Automatic Engine Upgrade Per Node Limit to 0 (see the sketch after these steps)
   Test steps:
   1. Fresh installation
   2. Uninstallation
   3. Upgrade from v1.4.2
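
One way to satisfy that prereq outside the Rancher UI is to apply the Longhorn setting directly with kubectl; a minimal sketch, assuming the usual setting name concurrent-automatic-engine-upgrade-per-node-limit and the v1beta2 Setting CRD:

# Hypothetical manifest for setting the prereq via kubectl apply instead of the UI.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  # Assumed setting name; the value is a string in the Setting CR.
  name: concurrent-automatic-engine-upgrade-per-node-limit
  namespace: longhorn-system
value: "0"

For the Longhorn Chart item below, the same setting is set to a value greater than 0 instead.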

✅ - Longhorn Chart - Chris

   Prereq: Set Concurrent Automatic Engine Upgrade Per Node Limit to greater than 0
   Test steps:
   1. Fresh installation
   2. Uninstallation
   3. Upgrade from v1.4.2

✅ RC1 ✅ RC2
