[BUG] Extra unexpected replicas are created when volume creation results in more replicas than the specified numberOfReplicas and fails the test case #8536

Open
yangchiu opened this issue May 10, 2024 · 0 comments
Labels

  • area/volume-replica-scheduling (Volume replica scheduling related)
  • kind/bug
  • priority/1 (Highly recommended to fix in this release; managed by PO)
  • reproduce/rare (< 50% reproducible)
  • severity/2 (Function working but has a major issue w/o workaround; a major incident with significant impact)

Describe the bug

The test case test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume first creates a volume with 2 replicas:

# step 1
volume1 = create_and_check_volume(client, "vol-1",
                                  num_of_replicas=2,
                                  size=str(1 * Gi))

Normally, the created volume has exactly 2 replicas:

[{
	'address': '',
	'currentImage': '',
	'dataEngine': 'v1',
	'dataPath': '/var/lib/longhorn/replicas/vol-1-24518c0e',
	'diskID': '2cc7cf5c-1845-497d-af08-72b7001d492c',
	'diskPath': '/var/lib/longhorn/',
	'failedAt': '',
	'hostId': 'ip-10-0-2-28',
	'image': 'longhornio/longhorn-engine:master-head',
	'instanceManagerName': '',
	'mode': '',
	'name': 'vol-1-r-e8a59577',
	'running': False
}, {
	'address': '',
	'currentImage': '',
	'dataEngine': 'v1',
	'dataPath': '/var/lib/longhorn/replicas/vol-1-764e4cbd',
	'diskID': '11526ea7-957a-48f9-93cf-96abd59430ec',
	'diskPath': '/var/lib/longhorn/',
	'failedAt': '',
	'hostId': 'ip-10-0-2-30',
	'image': 'longhornio/longhorn-engine:master-head',
	'instanceManagerName': '',
	'mode': '',
	'name': 'vol-1-r-fdce2ad4',
	'running': False
}]

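However, in roughly 1 out of 100 runs, the created volume has more than 2 replicas. In the example below it has 4: two of them are on the same node and disk (ip-10-0-2-159, diskID 4e5b43bf-3f49-4d02-b26c-1e9742922935), and the last one (vol-1-r-de984af1) is unscheduled, with an empty hostId and diskID: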

[{
	'address': '',
	'currentImage': '',
	'dataEngine': 'v1',
	'dataPath': '/var/lib/longhorn/replicas/vol-1-2663f221',
	'diskID': '4e5b43bf-3f49-4d02-b26c-1e9742922935',
	'diskPath': '/var/lib/longhorn/',
	'failedAt': '',
	'hostId': 'ip-10-0-2-159',
	'image': 'longhornio/longhorn-engine:master-head',
	'instanceManagerName': '',
	'mode': '',
	'name': 'vol-1-r-2ad8951f',
	'running': False
}, {
	'address': '',
	'currentImage': '',
	'dataEngine': 'v1',
	'dataPath': '/var/lib/longhorn/replicas/vol-1-f748f68e',
	'diskID': '8f5320d2-3699-4ce4-87ca-021fc0845c35',
	'diskPath': '/var/lib/longhorn/',
	'failedAt': '',
	'hostId': 'ip-10-0-2-246',
	'image': 'longhornio/longhorn-engine:master-head',
	'instanceManagerName': '',
	'mode': '',
	'name': 'vol-1-r-9df2c517',
	'running': False
}, {
	'address': '',
	'currentImage': '',
	'dataEngine': 'v1',
	'dataPath': '/var/lib/longhorn/replicas/vol-1-1eb6dc20',
	'diskID': '4e5b43bf-3f49-4d02-b26c-1e9742922935',
	'diskPath': '/var/lib/longhorn/',
	'failedAt': '',
	'hostId': 'ip-10-0-2-159',
	'image': 'longhornio/longhorn-engine:master-head',
	'instanceManagerName': '',
	'mode': '',
	'name': 'vol-1-r-b1c95c31',
	'running': False
}, {
	'address': '',
	'currentImage': '',
	'dataEngine': 'v1',
	'dataPath': 'replicas',
	'diskID': '',
	'diskPath': '',
	'failedAt': '',
	'hostId': '',
	'image': 'longhornio/longhorn-engine:master-head',
	'instanceManagerName': '',
	'mode': '',
	'name': 'vol-1-r-de984af1',
	'running': False
}]
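
To make the anomalies in the 4-replica list above easier to spot, here is a minimal sketch that summarizes such a list. The helper name `summarize_replicas` and its return shape are hypothetical and not part of longhorn-tests; it relies only on the `name`, `hostId`, and `diskID` fields shown in the output, and it assumes that an empty `hostId` means the replica has not been scheduled.

```python
from collections import defaultdict


def summarize_replicas(replicas, expected_count):
    """Summarize a replica list (hypothetical helper, not part of longhorn-tests).

    Each replica is a dict with at least 'name', 'hostId', and 'diskID',
    as in the output above.
    """
    # Assumption: an empty hostId means the replica is not scheduled yet.
    unscheduled = [r["name"] for r in replicas if not r["hostId"]]

    # Group scheduled replicas by (node, disk) to find co-located ones.
    by_disk = defaultdict(list)
    for r in replicas:
        if r["hostId"]:
            by_disk[(r["hostId"], r["diskID"])].append(r["name"])
    colocated = {key: names for key, names in by_disk.items() if len(names) > 1}

    return {
        "extra": len(replicas) - expected_count,
        "unscheduled": unscheduled,
        "colocated": colocated,
    }


# Trimmed copy of the 4-replica output above (only the fields used here).
replicas = [
    {"name": "vol-1-r-2ad8951f", "hostId": "ip-10-0-2-159",
     "diskID": "4e5b43bf-3f49-4d02-b26c-1e9742922935"},
    {"name": "vol-1-r-9df2c517", "hostId": "ip-10-0-2-246",
     "diskID": "8f5320d2-3699-4ce4-87ca-021fc0845c35"},
    {"name": "vol-1-r-b1c95c31", "hostId": "ip-10-0-2-159",
     "diskID": "4e5b43bf-3f49-4d02-b26c-1e9742922935"},
    {"name": "vol-1-r-de984af1", "hostId": "", "diskID": ""},
]

summary = summarize_replicas(replicas, expected_count=2)
print(summary)
```

For this input the summary reports 2 extra replicas, one unscheduled replica, and one (node, disk) pair that holds two replicas.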

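The test case then fails at step 15, where it asserts that the volume has exactly 2 replicas. In the failing run below, the volume has 3 replicas: two failed replicas on ip-10-0-2-30 and one running replica on ip-10-0-2-28: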

        # step 15
        volume1 = wait_for_volume_degraded(client, volume1.name)
        print(f"after crash, volume1.replicas = {volume1.replicas}")
        for i in range(RETRY_COUNTS_SHORT * 2):
            volume1 = client.by_id_volume(volume1.name)
>           assert len(volume1.replicas) == 2, f"volume1 = {volume1}"
E           AssertionError: volume1 = {'accessMode': 'rwo', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [{'backupURL': 's3://backupbucket@us-east-1/backupstore?backup=backup-6a9ff04bdaab4808&volume=vol-1', 'error': '', 'progress': 100, 'replica': 'tcp://10.42.1.10:10010', 'size': '6291456', 'snapshot': 'ac8067c8-f874-4f4a-a3b4-6e734858a2c8', 'state': 'Completed'}], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'Restore': {'lastProbeTime': '', 'lastTransitionTime': '2024-05-09T15:48:17Z', 'message': '', 'reason': '', 'status': 'False'}, 'Scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2024-05-09T15:48:28Z', 'message': '', 'reason': '', 'status': 'True'}, 'TooManySnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2024-05-09T15:48:17Z', 'message': '', 'reason': '', 'status': 'False'}, 'WaitForBackingImage': {'lastProbeTime': '', 'lastTransitionTime': '2024-05-09T15:48:17Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '4194304', 'address': '10.42.3.10', 'currentImage': 'longhornio/longhorn-engine:master-head', 'endpoint': '/dev/longhorn/vol-1', 'hostId': 'ip-10-0-2-28', 'image': 'longhornio/longhorn-engine:master-head', 'instanceManagerName': 'instance-manager-e1ceb33a5d95478fb0b1322eb92fa2ad', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': '', 'name': 'vol-1-e-0', 'requestedBackupRestore': '', 'running': True, 'size': '1073741824', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2024-05-09 15:48:17 +0000 UTC', 'currentImage': 'longhornio/longhorn-engine:master-head', 'dataEngine': 'v1', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': False, 'diskSelector': [], 'encrypted': False, 'fromBackup': '', 'frontend': 'blockdev', 'image': 'longhornio/longhorn-engine:master-head', 'kubernetesStatus': {'lastPVCRefAt': '', 'lastPodRefAt': '', 'namespace': '', 'pvName': '', 'pvStatus': '', 
'pvcName': '', 'workloadsStatus': None}, 'lastAttachedBy': '', 'lastBackup': 'backup-6a9ff04bdaab4808', 'lastBackupAt': '2024-05-09T15:48:29Z', 'migratable': False, 'name': 'vol-1', 'nodeSelector': [], 'numberOfReplicas': 2, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': [{'error': '', 'isPurging': False, 'progress': 0, 'replica': 'vol-1-r-6b1cd831', 'state': ''}], 'ready': True, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 'replicaDiskSoftAntiAffinity': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '', 'currentImage': '', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/vol-1-87c2822d', 'diskID': '11526ea7-957a-48f9-93cf-96abd59430ec', 'diskPath': '/var/lib/longhorn/', 'failedAt': '2024-05-09T15:50:37Z', 'hostId': 'ip-10-0-2-30', 'image': 'longhornio/longhorn-engine:master-head', 'instanceManagerName': '', 'mode': '', 'name': 'vol-1-r-5f7c46ff', 'running': False}, {'address': '', 'currentImage': '', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/vol-1-c888d9e6', 'diskID': '11526ea7-957a-48f9-93cf-96abd59430ec', 'diskPath': '/var/lib/longhorn/', 'failedAt': '2024-05-09T15:50:38Z', 'hostId': 'ip-10-0-2-30', 'image': 'longhornio/longhorn-engine:master-head', 'instanceManagerName': '', 'mode': '', 'name': 'vol-1-r-6055fdb4', 'running': False}, {'address': '10.42.3.10', 'currentImage': 'longhornio/longhorn-engine:master-head', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/vol-1-42234f29', 'diskID': '2cc7cf5c-1845-497d-af08-72b7001d492c', 'diskPath': '/var/lib/longhorn/', 'failedAt': '', 'hostId': 'ip-10-0-2-28', 'image': 'longhornio/longhorn-engine:master-head', 'instanceManagerName': 'instance-manager-e1ceb33a5d95478fb0b1322eb92fa2ad', 'mode': 'RW', 'name': 'vol-1-r-6b1cd831', 'running': True}], 'restoreInitiated': False, 'restoreRequired': False, 'restoreStatus': 
[{'backupURL': '', 'error': '', 'filename': '', 'isRestoring': False, 'lastRestored': '', 'progress': 0, 'replica': 'vol-1-r-6b1cd831', 'state': ''}], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'degraded', 'shareEndpoint': '', 'shareState': '', 'size': '1073741824', 'snapshotDataIntegrity': 'ignored', 'snapshotMaxCount': 250, 'snapshotMaxSize': '0', 'staleReplicaTimeout': 0, 'standby': False, 'state': 'attached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'': {'attachmentID': '', 'attachmentType': 'longhorn-api', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2024-05-09T15:48:23Z', 'message': '', 'reason': '', 'status': 'True'}], 'nodeID': 'ip-10-0-2-28', 'parameters': {'disableFrontend': 'false', 'lastAttachedBy': ''}, 'satisfied': True}}, 'volume': 'vol-1'}}
E           assert 3 == 2
E             +3
E             -2

test_ha.py:3131: AssertionError

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6880/testReport/junit/tests/test_ha/test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume_s3_21_25_/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6877/
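
The volume dump in the assertion message above is hard to read by eye. As a rough sketch, the replica entries can be split into failed and non-failed ones. The helper `split_replicas` is hypothetical, and it assumes that a non-empty `failedAt` marks a failed replica, which matches how the field appears in this output but is not confirmed by the issue.

```python
def split_replicas(replicas):
    """Split replicas into (failed, healthy) name lists.

    Hypothetical helper. Assumption: a non-empty 'failedAt' timestamp
    means the replica has failed.
    """
    failed = [r["name"] for r in replicas if r["failedAt"]]
    healthy = [r["name"] for r in replicas if not r["failedAt"]]
    return failed, healthy


# Trimmed copy of the 'replicas' list from the assertion message above.
replicas = [
    {"name": "vol-1-r-5f7c46ff", "hostId": "ip-10-0-2-30",
     "failedAt": "2024-05-09T15:50:37Z", "running": False},
    {"name": "vol-1-r-6055fdb4", "hostId": "ip-10-0-2-30",
     "failedAt": "2024-05-09T15:50:38Z", "running": False},
    {"name": "vol-1-r-6b1cd831", "hostId": "ip-10-0-2-28",
     "failedAt": "", "running": True},
]

failed, healthy = split_replicas(replicas)
print(f"total={len(replicas)} failed={failed} healthy={healthy}")
```

The total of 3 is what the step 15 assertion compares against 2.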

To Reproduce

Run test case test_engine_image_not_fully_deployed_perform_dr_restoring_expanding_volume repeatedly.
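
Because the failure rate is roughly 1 in 100, a handful of runs is unlikely to hit it. The sketch below estimates how many repetitions are needed, assuming each run fails independently with the same probability p, so that the chance of at least one reproduction in n runs is 1 - (1 - p)^n. Both the independence assumption and the function names are illustrative, and the ~1/100 figure is itself only an estimate.

```python
import math


def chance_of_reproducing(p_fail, runs):
    """Probability of at least one failure in `runs` independent runs."""
    return 1 - (1 - p_fail) ** runs


def runs_needed(p_fail, confidence):
    """Smallest n such that 1 - (1 - p_fail)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))


p = 0.01  # the ~1/100 failure rate estimated in this issue
print(f"chance after 100 runs: {chance_of_reproducing(p, 100):.3f}")
print(f"runs for a 95% chance: {runs_needed(p, 0.95)}")
```

Under these assumptions, 100 runs reproduce the failure less than two times out of three, and about 300 runs are needed for a 95% chance.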

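Expected behavior

A volume created with numberOfReplicas = 2 should have exactly 2 replicas, and no extra replicas should be created.

Support bundle for troubleshooting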

longhorn-tests-regression-6877-bundle.zip

Environment

  • Longhorn version: master-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.29.3+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): aws
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu added the kind/bug, reproduce/rare, priority/1, area/volume-replica-scheduling, and severity/2 labels on May 10, 2024
@yangchiu added this to the v1.7.0 milestone on May 10, 2024
@yangchiu changed the title from "[BUG] Extra unexpected replicas are created when volume creation causes there to be more replicas than numberOfReplicas and fails the test case" to "[BUG] Extra unexpected replicas are created when volume creation results in more replicas than the specified numberOfReplicas and fails the test case" on May 10, 2024
@derekbit changed the milestone from v1.7.0 to v1.8.0 on May 17, 2024