Release Test Plan 1.5.0
Issues targeted for this release: list of GitHub issues
Contributors testing: Chris, Yang, Roger, Khushboo, Ray, James, Jack, Eric
✅ - Analyze test result of SLES on 1.5.x (AMD) - Roger
v1.5.0-rc1: 8 failures.
Seven of them were due to flaky tests that are already documented in issue #5460.
The remaining issue has been reported in issue #6089.
v1.5.0-rc2: 8 failures.
These failures have already been documented in issue #5460.
v1.5.0-rc3: 3 failures.
These failures have already been documented in issue #5460.
✅ - Analyze test result of SLES on 1.5.x (ARM) - Roger
v1.5.0-rc1: 7 failures.
Five of them were due to flaky tests that are already documented in issue #5460.
The remaining issues have been reported in issues #6076 and #6089.
v1.5.0-rc2: 7 failures.
These failures have already been documented in issue #5460.
v1.5.0-rc3: 5 failures.
These failures have already been documented in issue #5460.
✅ - Analyze upgrade test result of SLES on 1.5.x (AMD) - Yang
v1.5.0-rc1: 6 failures
v1.5.0-rc2: 7 failures
✅ - Analyze upgrade test result of SLES on 1.5.x (ARM) - Yang
v1.5.0-rc1: 5 failures
v1.5.0-rc2: 9 failures
✅ - Analyze two-stage test result of SLES on 1.5.x (AMD) 1.3.2 → 1.4.2 → 1.5.0-rc - Chris
v1.5.0-rc1: 14 failures.
Most of them were known issues already documented in issue #5460.
Rerun pass
- test_rebuild_with_inc_restoration
- test_engine_image_not_fully_deployed_perform_engine_upgrade
- test_recovery_from_im_deletion
- test_snapshot_hash_detect_corruption_in_global_enabled_mode
Rerun pass
- tests.test_migration.test_migration_with_unscheduled_replica
- tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode
- tests.test_system_backup_restore.test_system_backup_with_volume_backup_policy_if_not_present
- tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode
v1.5.0-rc2 2nd round: 12 failures
Most of them were known issues already documented in issue #5460.
Rerun pass
- tests.test_ha.test_salvage_auto_crash_all_replicas
- tests.test_migration.test_migration_confirm
❌ v1.5.0-rc3 - 11 failures
Rerun pass
- test_basic.test_backuptarget_available_during_engine_image_not_ready
- test_basic.test_backup_failed_disable_auto_cleanup[s3]
- test_ha.test_recovery_from_im_deletion
- test_node.test_replica_scheduler_rebuild_restore_is_too_big[s3]
- test_ha.test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume[s3]
- test_migration.test_migration_with_rebuilding_replica
Failure
- test_upgrade.test_upgrade[from_transient]: issue #6200
✅ - Analyze two-stage test result of SLES on 1.5.x (ARM) 1.3.2 → 1.4.2 → 1.5.0-rc - Chris
v1.5.0-rc1: 13 failures.
Most of them were known issues already documented in issue #5460.
Rerun pass
- test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume
- test_migration.test_migration_with_restore_volume
- test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode
Passed in the RC2 round
- test_settings.test_setting_concurrent_volume_backup_restore_limit
v1.5.0-rc2: 10 failures
Most of them were known issues already documented in issue #5460.
Rerun pass
- tests.test_migration.test_migration_with_failed_replica
- tests.test_settings.test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[s3]
v1.5.0-rc2 2nd round: 7 failures
Rerun pass
- tests.test_basic.test_engine_image_daemonset_restart
- tests.test_basic.test_backuptarget_available_during_engine_image_not_ready
Rerun pass
- test_orphan.test_orphan_with_same_orphaned_dir_name_in_another_disk
- test_ha.test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume
Failure
- test_upgrade.test_upgrade[from_transient]: issue #6200
❌ - SLE Micro 5.3 (AMD & ARM) - Roger
v1.5.0-rc1 AMD64: 13 failures.
Eleven of them were due to flaky tests that are already documented in issue #5460.
The remaining issues have been reported in issues #6076 and #6089.
v1.5.0-rc2 AMD64: 6 failures.
v1.5.0-rc2 ARM64: 5 failures.
These failures have already been documented in issue #5460.
✅ - RHEL 9.1 (AMD & ARM) - Yang
v1.5.0-rc1 amd64: 7 failures
v1.5.0-rc1 arm64: 5 failures
v1.5.0-rc2 amd64: 7 failures
v1.5.0-rc2 arm64: 3 failures
❌ - CentOS 8.5 (AMD & ARM) - Chris: no suitable AMI for testing
Note: CentOS was removed from the verified distro list per https://github.com/longhorn/website/commit/bcbdec90ce9450428891db6660ea38d52f3516b2
v1.5.0-rc1 AMD64, ARM64
No test results on either the amd64 or arm64 pipeline; both failed with the Terraform error below (a command-line spot check for the missing AMI is sketched after it).
Error: Your query returned no results. Please change your search criteria and try again.
with data.aws_ami.aws_ami_centos,
on data.tf line 6, in data "aws_ami" "aws_ami_centos":
6: data "aws_ami" "aws_ami_centos" {
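For reference, a spot check from the command line that no matching image exists in the region; the owner and name filter below are illustrative assumptions, not the actual values from the pipeline's data.tf:

```sh
# Hypothetical query mirroring the pipeline's aws_ami data source; an empty
# result here matches the "query returned no results" error above.
aws ec2 describe-images \
  --owners aws-marketplace \
  --region us-east-1 \
  --filters "Name=name,Values=CentOS*8.5*" \
  --query 'Images[].{Name:Name,Id:ImageId}'
```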
✅ - Ubuntu 22.04 (AMD & ARM) - Ray
v1.5.0-rc1
- ✅ AMD64
- 3 failures
- All three failed cases passed after rerun
v1.5.0-rc2
- ✅ AMD64
- The following failed cases passed after rerun:
- tests.test_ha.test_dr_volume_with_restore_command_error[s3]
- tests.test_ha.test_engine_image_not_fully_deployed_perform_auto_upgrade_engine
- tests.test_ha.test_autosalvage_with_data_locality_enabled
- tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode
- Issue #6129: waiting for replica rebuilding timed out
- The test passes after extending the replica rebuild wait time
- tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode
- ✅ ARM64
- Re-triggered with longhornio/longhorn-manager:v1.5.x-head
- 9 failures
- tests.test_backing_image.test_exporting_backing_image_from_volume
- tests.test_basic.test_space_usage_for_rebuilding_only_volume
- tests.test_engine_upgrade.test_engine_live_upgrade_while_replica_concurrent_rebuild
- tests.test_ha.test_ha_recovery_with_expansion
- tests.test_scheduling.test_replica_rebuild_per_volume_limit
- tests.test_scheduling.test_data_locality_basic
- tests.test_settings.test_setting_concurrent_rebuild_limit
- tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_enabled_mode
- tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode
- All failed cases passed after rerun; a minimal rerun sketch follows this section
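The reruns noted throughout this plan invoke the failed cases directly by their pytest node ids. A minimal sketch, reusing two case names from the lists above; the repository checkout and any extra options the Jenkins pipelines pass are omitted:

```sh
# Rerun only the named failed cases; parametrized cases keep their
# [s3]/[nfs] suffix and must be quoted so the shell ignores the brackets.
python -m pytest \
  "tests/test_ha.py::test_dr_volume_with_restore_command_error[s3]" \
  "tests/test_snapshot.py::test_snapshot_hash_detect_corruption_in_global_enabled_mode"
```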
✅❌ - Rocky Linux 9 (AMD & ARM) - Eric
v1.5.0-rc1, round 1, AMD64, ARM64
v1.5.0-rc1, round 2, AMD64, ARM64
11 failures in both AMD64 rounds:
tests.test_backing_image.test_snapshot_with_backing_image - known to be flaky
tests.test_ha.test_rebuild_with_restoration[s3] - test fixed in rc2
tests.test_ha.test_rebuild_with_restoration[nfs] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[s3] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[nfs] - test fixed in rc2
tests.test_ha.test_dr_volume_with_restore_command_error[nfs] - known to be flaky
tests.test_kubernetes.test_kubernetes_status - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[s3] - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[nfs] - known in Kubernetes v1.27
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs]
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky
8 failures in both ARM64 rounds:
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - known to be flaky
tests.test_ha.test_rebuild_with_restoration[s3] - test fixed in rc2
tests.test_ha.test_rebuild_with_restoration[nfs] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[s3] - test fixed in rc2
tests.test_ha.test_single_replica_restore_failure[nfs] - test fixed in rc2
tests.test_kubernetes.test_kubernetes_status - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[s3] - known in Kubernetes v1.27
tests.test_kubernetes.test_backup_kubernetes_status[nfs] - known in Kubernetes v1.27
10 failures in AMD64:
tests.test_csi.test_csi_block_volume_online_expansion - https://github.com/longhorn/longhorn/issues/6076
tests.test_ha.test_rebuild_with_inc_restoration[s3]
tests.test_ha.test_dr_volume_with_restore_command_error[s3] - https://github.com/longhorn/longhorn/issues/6130
tests.test_migration.test_migration_with_rebuilding_replica - known to be flaky
tests.test_node.test_node_eviction_multiple_volume - known to be flaky
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[s3] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[nfs] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs] - has failed every time
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[s3] - known to be flaky
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[nfs]
6 failures in ARM64:
tests.test_basic.test_backup_status_for_unavailable_replicas[s3]
tests.test_ha.test_dr_volume_with_restore_command_error[nfs]
tests.test_ha.test_engine_image_not_fully_deployed_perform_replica_scheduling - https://github.com/longhorn/longhorn/issues/6130
tests.test_infra.test_offline_node
tests.test_ha.test_dr_volume_with_restore_command_error[s3] - https://github.com/longhorn/longhorn/issues/6130
tests.test_recurring_job.test_recurring_job_restored_from_backup_target[nfs]
v1.5.0-rc2, round 2, AMD64
9 failures in AMD64; most failures from round 1 did not repeat, and all are likely flakes or known issues.
tests.test_basic.test_space_usage_for_rebuilding_only_volume - likely flaky (only failed this way once)
tests.test_ha.test_dr_volume_with_restore_command_error[nfs] - https://github.com/longhorn/longhorn/issues/6130
tests.test_migration.test_migration_with_rebuilding_replica - repeat, known to be flaky
tests.test_recurring_job.test_recurring_jobs_allow_detached_volume[nfs] - repeat, https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[s3] - https://github.com/longhorn/longhorn/issues/6124
tests.test_recurring_job.test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - https://github.com/longhorn/longhorn/issues/6124
tests.test_settings.test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky
tests.test_snapshot.test_snapshot_hash_detect_corruption_in_global_fast_check_mode - https://github.com/longhorn/longhorn/issues/6129
tests.test_infra.test_offline_node - known to be flaky
❌ - RHEL/CentOS/Rocky Linux SELinux enabled (AMD & ARM) - https://github.com/longhorn/longhorn/issues/5627
v1.5.0-rc1: all tests failed with ERROR.
- Fixes from https://github.com/longhorn/longhorn-tests/pull/1424 will allow testing to proceed with both RKE2 and k3s.
- A fix for https://github.com/longhorn/longhorn/issues/6108 will be required to get backing image tests working with RKE2.
v1.5.0-rc2: tested outside of the Jenkins platform while waiting on longhorn-tests PRs.
Failed due to flakes or known issues:
test_ha.py::test_dr_volume_with_restore_command_error[nfs] - https://github.com/longhorn/longhorn/issues/6130
test_ha.py::test_engine_image_not_fully_deployed_perform_engine_upgrade - known to be flaky, passed on rerun
test_ha.py::test_autosalvage_with_data_locality_enabled - https://github.com/longhorn/longhorn/issues/4814
test_orphan.py::test_orphan_with_same_orphaned_dir_name_in_another_disk - passed on rerun
test_recurring_job.py::test_recurring_jobs_allow_detached_volume[nfs] - https://github.com/longhorn/longhorn/issues/6124
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3] - https://github.com/longhorn/longhorn/issues/6124
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs] - https://github.com/longhorn/longhorn/issues/6124
test_settings.py::test_setting_concurrent_volume_backup_restore_limit[nfs] - known to be flaky, passed on rerun
test_settings.py::test_setting_concurrent_volume_backup_restore_limit_should_not_effect_dr_volumes[nfs] - known to be flaky, passed on rerun
test_snapshot.py::test_snapshot_hash_detect_corruption_in_global_enabled_mode - https://github.com/longhorn/longhorn/issues/6129, passed on rerun
Failed due to the local environment (an install sketch for the snapshot controller follows this list):
test_csi_snapshotter.py::all - snapshot controller was not installed
test_infra.py::test_offline_node - necessary variable was not injected
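For reference, a minimal sketch of installing the snapshot CRDs and controller from kubernetes-csi/external-snapshotter, which would have avoided the test_csi_snapshotter errors; the release tag is illustrative and should be pinned to one suited to the cluster:

```sh
# Install the volume snapshot CRDs and the snapshot controller from the
# upstream external-snapshotter repo via remote kustomize bases.
kubectl apply -k "github.com/kubernetes-csi/external-snapshotter/client/config/crd?ref=v6.2.1"
kubectl apply -k "github.com/kubernetes-csi/external-snapshotter/deploy/kubernetes/snapshot-controller?ref=v6.2.1"
```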
Net-new issue; cannot pass over multiple attempts:
test_ha.py::test_recovery_from_im_deletion - https://github.com/longhorn/longhorn/issues/6171
✅ - Oracle Linux (AMD & ARM) - Chris
✅ v1.5.0-rc2 AMD64 - 369 failures; ARM64 - not supported
The pipeline does not support arm64:
Error: Your query returned no results. Please change your search criteria and try again.
with data.aws_ami.aws_ami_oraclelinux,
on data.tf line 6, in data "aws_ami" "aws_ami_oraclelinux":
6: data "aws_ami" "aws_ami_oraclelinux" {
The Oracle AMD64 pipeline hit a volume-cannot-attach issue. The workaround was applied manually and the core tests were run on a local environment, as below.
The failed cases below all passed after rerun:
- test_restore_inc_with_offline_expansion
- test_single_replica_failed_during_engine_start
✅ v1.5.0-rc2 2nd round [AMD64]: all core tests passed
✅ - Benchmark (compare with 1.4.2) - Yang
✅ RC1 ✅ RC2
✅ - Air-Gap - Chris
✅ RC1 - manifest, helm chart
✅ RC2 - manifest, helm chart
✅ - Vulnerability scanning - Chris
v1.5.0-rc1: 74 failures
v1.5.0-rc2: 39 failures
✅ - Negative testing
- Node not ready - Ray
- Node reboot
- rc1 report files
- rc2 report files
- Node Power Off
- rc1 report files
- rc2 report files
- Restart Kubelet
- rc1 report files
- rc2 report files
❌ - Test Longhorn deployment on RKE2 v1.24- with CIS-1.6 profile - Eric
Test fails in the
kubectl
case at step 12.longhorn-uninstall-rm58r
uses theglobal-restricted-psp
which disallows containers that run as root. We clearly intend it to use thelonghorn-uninstall-psp
. This looks very similar to the failure in a PR for the original issue. That PR originally targeted the Helm case and was only merged into v1.1.1 specifically, so this is all a bit confusing. Work around and uninstall by adding the securityContext from that PR to the job inuninstall.yaml
(forces the uninstall pod to uselonghorn-uninstall-psp
instead.
> kubectl get -n longhorn-system pod
...
longhorn-uninstall-rm58r 0/1 CreateContainerConfigError 0 11s
> kubectl describe -n longhorn-system pod longhorn-uninstall-rm58r
Name: longhorn-uninstall-rm58r
Namespace: longhorn-system
Priority: 0
Node: eweber-cis-test-rocky-02/142.93.197.135
Start Time: Fri, 09 Jun 2023 20:47:01 +0000
Labels: controller-uid=bef88e5d-019c-401f-8f9d-44e5aec282cf
job-name=longhorn-uninstall
Annotations: cni.projectcalico.org/containerID: 90eb1dc91a570495c85223db56a075390a470ca0ca98a109f02783e5b616f7e0
cni.projectcalico.org/podIP: 10.42.1.49/32
cni.projectcalico.org/podIPs: 10.42.1.49/32
kubernetes.io/psp: global-restricted-psp
Status: Pending
IP: 10.42.1.49
IPs:
IP: 10.42.1.49
Controlled By: Job/longhorn-uninstall
Containers:
longhorn-uninstall:
Container ID:
Image: longhornio/longhorn-manager:v1.5.0-rc1
Image ID:
Port: <none>
Host Port: <none>
Command:
longhorn-manager
uninstall
--force
State: Waiting
Reason: CreateContainerConfigError
Ready: False
Restart Count: 0
Environment:
LONGHORN_NAMESPACE: longhorn-system
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k87m8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-k87m8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27s default-scheduler Successfully assigned longhorn-system/longhorn-uninstall-rm58r to eweber-cis-test-rocky-02
Normal Pulled 14s (x4 over 27s) kubelet Container image "longhornio/longhorn-manager:v1.5.0-rc1" already present on machine
Warning Failed 14s (x4 over 27s) kubelet Error: container has runAsNonRoot and image will run as root (pod: "longhorn-uninstall-rm58r_longhorn-system(3efe856d-cda4-4444-833d-ca68e3ad2d5e)", container: longhorn-uninstall)
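A sketch of the workaround described above. The exact securityContext fields come from the referenced PR, so treat runAsUser: 0 as an assumption; the idea is that a pod explicitly requesting root cannot be admitted under global-restricted-psp, so PSP admission selects longhorn-uninstall-psp instead:

```sh
# 1. Edit uninstall.yaml and add a securityContext to the Job's pod template
#    (field value assumed from the referenced PR):
#
#      spec:
#        template:
#          spec:
#            securityContext:
#              runAsUser: 0
#
# 2. Recreate the uninstall job with the patched manifest:
kubectl delete -f uninstall.yaml
kubectl create -f uninstall.yaml
```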
✅ - Test Longhorn deployment on RKE2 v1.25+ with CIS-1.23 profile - Eric
✅ - Replica Rebuilding - James
✅ RC1 ✅ RC2
✅ - Entire cluster down - Khushboo
✅ RC1 -
- All the nodes, including etcd/control plane, were power-cycled and brought back after 1 hour - PASSED
- All the nodes rebooted together - PASSED
✅ - Node drain and deletion test - Jack
✅ RC1
✅ - The node the DR volume attached to is rebooted - Roger
✅ RC1 ✅ RC2
✅ - Test Longhorn components recovery - Jack
✅ RC1
✅ - Checksum enabled large volume with multiple rebuilding - James
✅ RC1
- Rebuilding takes about 3 minutes and 20 seconds on v1.3.x and about 3 minutes on v1.5.0-rc1
✅ RC2
- Rebuilding takes about 2 minutes on v1.5.0-rc2
✅ - Uninstallation Checks - James
✅ RC1 ✅ RC2
✅ - Test Engine Crash During Live Upgrade - Jack
✅ RC1 ✅ RC2 upgrade from v1.4.2 to v1.5.0-rc2
✅ - Kubernetes upgrade test - James
✅ RC1 upgraded the k8s/k3s cluster from 1.26.5 to 1.27.2 with and without draining nodes; OS: Ubuntu 22.04; underlying infrastructure: KVM (a drain sketch follows below)
✅ RC2 upgraded the k8s/k3s cluster from 1.26.5 to 1.27.2 with and without draining nodes; OS: Ubuntu 22.04; underlying infrastructure: KVM
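For the "with drain" variants, each node was drained before its upgrade along these lines; the node name is a placeholder and the flags are the usual ones for clusters running Longhorn, not the exact pipeline invocation:

```sh
# Evict workloads from the node before upgrading it; --ignore-daemonsets lets
# the drain proceed past DaemonSet-managed pods (e.g. longhorn-manager) and
# --delete-emptydir-data allows eviction of pods using emptyDir volumes.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ... upgrade Kubernetes on the node, then allow scheduling again:
kubectl uncordon <node-name>
```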
✅ - Longhorn upgrade test - Yang
✅ RC3 Upgrade from v1.4.2 to v1.5.0-rc3
✅ - Upgrade Conflict Handling test - Jack
✅ RC1 Upgrade from v1.4.2 to v1.5.0-rc1 ✅ RC2 Upgrade from v1.4.2 to v1.5.0-rc2
✅ - Drain using Rancher UI - Khushboo
✅ - Longhorn in a hardened cluster - Khushboo
✅ - Upgrade Kubernetes using Rancher UI - Khushboo
✅❌ - Rancher v2.7.x - Khushboo
Prereq: set Concurrent Automatic Engine Upgrade Per Node Limit to 0 (one way to set it is sketched after the steps below)
Test steps:
1. Fresh installation
2. Uninstallation
3. Upgrade from v1.4.2
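One way to satisfy the prerequisite from the command line; the setting resource name below is assumed to be the kebab-case id Longhorn uses for "Concurrent Automatic Engine Upgrade Per Node Limit", and the value can equally be changed in the UI (the chart test below uses a nonzero value instead):

```sh
# Assumed setting name; patch the Longhorn Setting CR's value to "0".
kubectl -n longhorn-system patch settings.longhorn.io \
  concurrent-automatic-engine-upgrade-per-node-limit \
  --type merge -p '{"value": "0"}'
```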
✅ - Longhorn Chart - Chris
Prereq: set Concurrent Automatic Engine Upgrade Per Node Limit to a value greater than 0
Test steps:
1. Fresh installation
2. Uninstallation
3. Upgrade from v1.4.2
✅ RC1 ✅ RC2