
Release Test Plan 1.5.0


Issues targeted for this release - List of GitHub issues

Contributors testing this release: Chris, Yang, Roger, Khushboo, Ray, James, Jack, Eric

Testing Versions: v1.5.0-rc1, v1.5.0-rc2, v1.5.0-rc3

e2e Pipelines

✅ - Analyze test result of SLES on 1.5.x (AMD) - Roger

v1.5.0-rc1: 8 failures.

Seven of them were due to flaky tests that are already documented in issue #5460.

The remaining issue has been reported in issue #6089.

v1.5.0-rc2: 8 failures.

These failures have already been documented in issue #5460.

v1.5.0-rc3: 3 failures.

These failures have already been documented in issue #5460.

✅ - Analyze test result of SLES on 1.5.x (ARM) - Roger

v1.5.0-rc1: 7 failures.

Five of them were due to flaky tests that are already documented in issue #5460.

The remaining issues have been reported in issues #6076 and #6089.

v1.5.0-rc2: 7 failures.

These failures have already been documented in issue #5460.

v1.5.0-rc3: 5 failures.

These failures have already been documented in issue #5460.

✅ - Analyze upgrade test result of SLES on 1.5.x (AMD) - Yang

v1.5.0-rc1: 6 failures

v1.5.0-rc2: 7 failures

✅ - Analyze upgrade test result of SLES on 1.5.x (ARM) - Yang

v1.5.0-rc1: 5 failures

v1.5.0-rc2: 9 failures

✅ - Analyze two-stage test result of SLES on 1.5.x (AMD) 1.3.2 → 1.4.2 → 1.5.0-rc - Chris

v1.5.0-rc1: 14 failures.

Most of them were known issues already documented in issue #5460.

Rerun pass

  • test_rebuild_with_inc_restoration
  • test_engine_image_not_fully_deployed_perform_engine_upgrade
  • test_recovery_from_im_deletion
  • test_snapshot_hash_detect_corruption_in_global_enabled_mode

v1.5.0-rc2

Rerun pass

  • tests.test_migration.test_migration_with_unscheduled_replica
  • tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode
  • tests.test_system_backup_restore.test_system_backup_with_volume_backup_policy_if_not_present
  • tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode

v1.5.0-rc2 2nd round: 12 failures

Most of them were known issues already documented in issue #5460.

Rerun pass

  • tests.test_ha.test_salvage_auto_crash_all_replicas
  • tests.test_migration.test_migration_confirm

v1.5.0-rc3 - 11 failures

Rerun Pass

  • test_basic.test_backuptarget_available_during_engine_image_not_ready
  • test_basic.test_backup_failed_disable_auto_cleanup[s3]
  • test_ha.test_recovery_from_im_deletion
  • test_node.test_replica_scheduler_rebuild_restore_is_too_big[s3]
  • test_ha.test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume[s3]
  • test_migration.test_migration_with_rebuilding_replica

Failure

  • test_upgrade.test_upgrade[from_transient]: issue #6200

✅ - Analyze two-stage test result of SLES on 1.5.x (ARM) 1.3.2 → 1.4.2 → 1.5.0-rc - Chris

v1.5.0-rc1: 13 failures.

Most of them were known issues already documented in issue #5460.

Rerun pass

  • test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume
  • test_migration.test_migration_with_restore_volume
  • test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode

Passed in RC2 round

  • test_settings.test_setting_concurrent_volume_backup_restore_limit

v1.5.0-rc2: 10 failures

Most of them were known issues already documented in issue #5460.

Rerun pass

  • tests.test_migration.test_migration_with_failed_replica
  • tests.test_settings.test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[s3]

v1.5.0-rc2 2nd round: 7 failures

Rerun pass

  • tests.test_basic.test_engine_image_daemonset_restart
  • tests.test_basic.test_backuptarget_available_during_engine_image_not_ready

v1.5.0-rc3

Rerun pass

  • test_orphan.test_orphan_with_same_orphaned_dir_name_in_another_disk
  • test_ha.test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume

Failure

  • test_upgrade.test_upgrade[from_transient]: issue #6200

Distro Matrix Support (e2e Pipelines)

❌ - SLE Micro 5.3 (AMD & ARM) - Roger

v1.5.0-rc1 AMD64: 13 failures.

Eleven of them were due to flaky tests that are already documented in issue #5460.

The remaining issues have been reported in issues #6076 and #6089.

v1.5.0-rc2 AMD64: 6 failures.

v1.5.0-rc2 ARM64: 5 failures.

These failures have already been documented in issue #5460.

✅ - RHEL 9.1 (AMD & ARM) - Yang

v1.5.0-rc1 amd64: 7 failures

v1.5.0-rc1 arm64: 5 failures

v1.5.0-rc2 amd64: 7 failures

v1.5.0-rc2 arm64: 3 failures

❌ - CentOS 8.5 (AMD & ARM) - Chris: no suitable AMI for testing

Note: CentOS was removed from the verified distro list per https://github.com/longhorn/website/commit/bcbdec90ce9450428891db6660ea38d52f3516b2

v1.5.0-rc1 AMD64, ARM64

v1.5.0-rc2 AMD64, ARM64

No test results on either the amd64 or arm64 pipeline:

Error: Your query returned no results. Please change your search criteria and try again.

  with data.aws_ami.aws_ami_centos,
  on data.tf line 6, in data "aws_ami" "aws_ami_centos":
   6: data "aws_ami" "aws_ami_centos" {

✅ - Ubuntu 22.04 (AMD & ARM) - Ray

v1.5.0-rc1

v1.5.0-rc2

✅❌ - Rocky Linux 9 (AMD & ARM) - Eric

v1.5.0-rc1, round 1, AMD64, ARM64
v1.5.0-rc1, round 2, AMD64, ARM64

11 failures in both AMD64 rounds:
tests.test_backing_image.test_snapshot_with_backing_image - known to be flaky
tests.test_ha.test_rebuild_with_restoration[s3] - test fixed in rc2
tests.test_ha.test_rebuild_with_restoration[nfs] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[s3] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[nfs] - test fixed in rc2
tests.test_ha.test_dr_volume_with_restore_command_error[nfs] - known to be flaky
tests.test_kubernetes.test_kubernetes_status - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[s3] - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[nfs] - known in Kubernetes v1.27
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs]
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky

8 failures in both ARM64 rounds:
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - known to be flaky
tests.test_ha.test_rebuild_with_restoration[s3] - test fixed in rc2
tests.test_ha.test_rebuild_with_restoration[nfs] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[s3] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[nfs] - test fixed in rc2
tests.test_kubernetes.test_kubernetes_status - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[s3] - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[nfs] - known in Kubernetes v1.27

v1.5.0-rc2, round 1, AMD64, ARM64

10 failures in AMD64:
tests.test_csi.test_csi_block_volume_online_expansion - https://github.com/longhorn/longhorn/issues/6076
tests.test_ha.test_rebuild_with_inc_restoration[s3]
tests.test_ha.test_dr_volume_with_restore_command_error[s3] - https://github.com/longhorn/longhorn/issues/6130
tests.test_migration.test_migration_with_rebuilding_replica - known to be flaky
tests.test_node.test_node_eviction_multiple_volume - known to be flaky
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[s3] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[nfs] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs] - has failed every time
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[s3] - known to be flaky
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[nfs]

6 failures in ARM64:
tests.test_basic.test_backup_status_for_unavailable_replicas[s3]
tests.test_ha.test_dr_volume_with_restore_command_error[nfs]
tests.test_ha.test_engine_image_not_fully_deployed_perform_replica_scheduling - https://github.com/longhorn/longhorn/issues/6130
tests.test_infra.test_offline_node
tests.test_ha.test_dr_volume_with_restore_command_error[s3] - https://github.com/longhorn/longhorn/issues/6130
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs]

v1.5.0-rc2, round 2, AMD64

9 failures in AMD64. Most from round 1 did not repeat. All are likely flakes or known issues.
tests.test_basic.test_space_usage_for_rebuilding_only_volume - likely flaky (only failed this way once)
tests.test_ha.test_dr_volume_with_restore_command_error[nfs] - https://github.com/longhorn/longhorn/issues/6130
tests.test_migration.test_migration_with_rebuilding_replica - repeat, known to be flaky
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[nfs] - repeat, https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[s3] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - https://github.com/longhorn/longhorn/issues/6124
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky
tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode - https://github.com/longhorn/longhorn/issues/6129
tests.test_infra.test_offline_node - known to be flaky

❌ - RHEL/CentOS/Rocky Linux SELinux enabled (AMD & ARM) - https://github.com/longhorn/longhorn/issues/5627

v1.5.0-rc1: all tests failed with ERROR.

v1.5.0-rc2: tested outside of the Jenkins platform while waiting on longhorn-tests PRs.

Failed due to flake or known issue:
test_ha.py::test_dr_volume_with_restore_command_error[nfs] - https://github.com/longhorn/longhorn/issues/6130
test_ha.py::test_engine_image_not_fully_deployed_perform_engine_upgrade - known to be flaky, passed on rerun
test_ha.py::test_autosalvage_with_data_locality_enabled - https://github.com/longhorn/longhorn/issues/4814
test_orphan.py::test_orphan_with_same_orphaned_dir_name_in_another_disk - passed on rerun
test_recurring_job.py::test_recurring_jobs_allow_detached_volume[nfs] - https://github.com/longhorn/longhorn/issues/6124
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3] - https://github.com/longhorn/longhorn/issues/6124
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - https://github.com/longhorn/longhorn/issues/6124
test_settings.py::test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky, passed on rerun
test_settings.py::test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[nfs] - known to be flaky, passed on rerun
test_snapshot.py::test_snapshot_hash_detect_corruption_in_global_enabled_mode - https://github.com/longhorn/longhorn/issues/6129, passed on rerun

Failed due to local environment:
test_csi_snapshotter.py::all - did not install snapshot controller
test_infra.py::test_offline_node - did not inject necessary variable

Net-new issue that did not pass over multiple attempts:
test_ha.py::test_recovery_from_im_deletion - https://github.com/longhorn/longhorn/issues/6171

✅ - Oracle Linux (AMD & ARM) - Chris

❌ v1.5.0-rc1 AMD64, ARM64

✅ v1.5.0-rc2 AMD64 - 369 failures, ARM64 - not supported

The pipeline does not support arm64:

Error: Your query returned no results. Please change your search criteria and try again.

  with data.aws_ami.aws_ami_oraclelinux,
  on data.tf line 6, in data "aws_ami" "aws_ami_oraclelinux":
   6: data "aws_ami" "aws_ami_oraclelinux" {

The Oracle AMD64 pipeline hit a volume-cannot-attach issue. Manually applied the workaround and ran the core tests in a local environment, as shown below (screenshot from 2023-06-16 20-24-17).

The failed cases below all passed after rerun:

  • test_restore_inc_with_offline_expansion
  • test_single_replica_failed_during_engine_start

✅ v1.5.0-rc2 2nd round [AMD64]: core tests all passed

Non-e2e Pipelines

✅ - Benchmark (compare with 1.4.2) - Yang

✅ RC1 ✅ RC2

✅ - Air-Gap - Chris

✅ RC1 manifest, helm chart

✅ RC2 - manifest, helm chart

✅ - Vulnerability scanning - Chris

v1.5.0-rc1: 74 failures

v1.5.0-rc2: 39 failures

✅ - Negative testing

Manual Testing

Environment:

❌ - Test Longhorn deployment on RKE2 v1.24- with CIS-1.6 profile - Eric

The test fails in the kubectl case at step 12. longhorn-uninstall-rm58r uses the global-restricted-psp, which disallows containers that run as root, while we clearly intend it to use the longhorn-uninstall-psp. This looks very similar to the failure in a PR for the original issue. That PR originally targeted the Helm case and was only merged into v1.1.1 specifically, so this is all a bit confusing. Worked around it and uninstalled by adding the securityContext from that PR to the job in uninstall.yaml, which forces the uninstall pod to use longhorn-uninstall-psp instead.

> kubectl get -n longhorn-system pod
...
longhorn-uninstall-rm58r                            0/1     CreateContainerConfigError   0          11s

> kubectl describe -n longhorn-system pod longhorn-uninstall-rm58r
Name:         longhorn-uninstall-rm58r
Namespace:    longhorn-system
Priority:     0
Node:         eweber-cis-test-rocky-02/142.93.197.135
Start Time:   Fri, 09 Jun 2023 20:47:01 +0000
Labels:       controller-uid=bef88e5d-019c-401f-8f9d-44e5aec282cf
              job-name=longhorn-uninstall
Annotations:  cni.projectcalico.org/containerID: 90eb1dc91a570495c85223db56a075390a470ca0ca98a109f02783e5b616f7e0
              cni.projectcalico.org/podIP: 10.42.1.49/32
              cni.projectcalico.org/podIPs: 10.42.1.49/32
              kubernetes.io/psp: global-restricted-psp
Status:       Pending
IP:           10.42.1.49
IPs:
  IP:           10.42.1.49
Controlled By:  Job/longhorn-uninstall
Containers:
  longhorn-uninstall:
    Container ID:
    Image:         longhornio/longhorn-manager:v1.5.0-rc1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      longhorn-manager
      uninstall
      --force
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Environment:
      LONGHORN_NAMESPACE:  longhorn-system
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k87m8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-k87m8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  27s                default-scheduler  Successfully assigned longhorn-system/longhorn-uninstall-rm58r to eweber-cis-test-rocky-02
  Normal   Pulled     14s (x4 over 27s)  kubelet            Container image "longhornio/longhorn-manager:v1.5.0-rc1" already present on machine
  Warning  Failed     14s (x4 over 27s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "longhorn-uninstall-rm58r_longhorn-system(3efe856d-cda4-4444-833d-ca68e3ad2d5e)", container: longhorn-uninstall)
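
For reference, a minimal sketch of the workaround in uninstall.yaml. The image, command, and env values are taken from the describe output above; the securityContext content itself comes from the referenced PR, so treat runAsUser: 0 as an assumption about what that change contains. Other fields of the real job (service account, labels, backoff) are omitted.

apiVersion: batch/v1
kind: Job
metadata:
  name: longhorn-uninstall
  namespace: longhorn-system
spec:
  template:
    spec:
      # Assumption: explicitly requesting root makes the pod invalid under
      # global-restricted-psp (runAsNonRoot), so PSP admission falls back to
      # longhorn-uninstall-psp, which allows the uninstall container to run.
      securityContext:
        runAsUser: 0
      containers:
      - name: longhorn-uninstall
        image: longhornio/longhorn-manager:v1.5.0-rc1
        command:
        - longhorn-manager
        - uninstall
        - --force
        env:
        - name: LONGHORN_NAMESPACE
          value: longhorn-system
      restartPolicy: Never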

✅ - Test Longhorn deployment on RKE2 v1.25+ with CIS-1.23 profile - Eric

HA:

✅ - Replica Rebuilding - James

✅ RC1 ✅ RC2

✅ - Entire cluster down - Khushboo

✅ RC1 -

  1. All the nodes including etcd/control plane were power cycled and brought back after 1 hour - PASSED
  2. All the nodes rebooted together - PASSED

Node Down

✅ - Node drain and deletion test - Jack

✅ RC1

✅ - The node the DR volume attached to is rebooted - Roger

✅ RC1 ✅ RC2

Resiliency:

✅ - Test Longhorn components recovery - Jack

✅ RC1

Stability

✅ - Checksum enabled large volume with multiple rebuilding - James

✅ RC1

  • Rebuilding needs about 3 minutes and 20 seconds for v1.3.x and 3 minutes for v1.5.0-rc1

✅ RC2

  • Rebuilding needs about 2 minutes for v1.5.0-rc2

✅ - Uninstallation Checks - James

✅ RC1 ✅ RC2

Upgrade:

✅ - Test Engine Crash During Live Upgrade - Jack

✅ RC1 ✅ RC2 upgrade from v1.4.2 to v1.5.0-rc2

✅ - Kubernetes upgrade test - James

✅ RC1 upgrade k8s/k3s cluster from 1.26.5 to 1.27.2 with and without draining nodes; OS: Ubuntu 22.04; underlying infrastructure: KVM

✅ RC2 upgrade k8s/k3s cluster from 1.26.5 to 1.27.2 with and without draining nodes; OS: Ubuntu 22.04; underlying infrastructure: KVM

✅ - Longhorn upgrade test - Yang

✅ RC3 Upgrade from v1.4.2 to v1.5.0-rc3

✅ - Upgrade Conflict Handling test - Jack

✅ RC1 Upgrade from v1.4.2 to v1.5.0-rc1 ✅ RC2 Upgrade from v1.4.2 to v1.5.0-rc2

Rancher Integration

✅ - Drain using Rancher UI - Khushboo

✅ - Longhorn in a hardened cluster - Khushboo

✅ - Upgrade Kubernetes using Rancher UI - Khushboo

Chart Testing

✅❌ - Rancher v2.7.x - Khushboo

   Prereq: Set Concurrent Automatic Engine Upgrade Per Node Limit to 0 (see the sketch after these steps)
   Test steps:
   1. Fresh installation
   2. Uninstallation
   3. Upgrade from v1.4.2
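
One way to satisfy that prereq outside the Rancher UI is to apply the Longhorn setting directly with kubectl; a minimal sketch, assuming the usual setting name concurrent-automatic-engine-upgrade-per-node-limit and the v1beta2 Setting CRD:

# Hypothetical manifest for setting the prereq via kubectl apply instead of the UI.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  # Assumed setting name; the value is a string in the Setting CR.
  name: concurrent-automatic-engine-upgrade-per-node-limit
  namespace: longhorn-system
value: "0"

For the Longhorn Chart item below, the same setting is set to a value greater than 0 instead.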

✅ - Longhorn Chart - Chris

   Prereq: Set Concurrent Automatic Engine Upgrade Per Node Limit to greater than 0
   Test steps:
   1. Fresh installation
   2. Uninstallation
   3. Upgrade from v1.4.2

✅ RC1 ✅ RC2
