DRBD resource stuck in SyncTarget state and unable to proceed with Syncing #40

Dipanshu-Sehjal opened this issue May 26, 2022 · 1 comment


Dipanshu-Sehjal commented May 26, 2022

Hi,
After a serial/rolling upgrade of the k8s cluster, one of the DRBD resources was found stuck in the SyncTarget state.


Linstor version -
[root@flex-103 ~]# k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor --version
linstor 1.13.0; GIT-hash: 840cf57c75c166659509e22447b2c0ca6377ee6d

DRBD version -
[root@flex-103 ~]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-jrhwv -c linstor-satellite -- drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ 087ee6b4961ca154d76e4211223b03149373bed8\ build\ by\ @buildsystem,\ 2022-01-28\ 12:19:33
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090106
DRBD_KERNEL_VERSION=9.1.6
DRBDADM_VERSION_CODE=0x091402
DRBDADM_VERSION=9.20.2

Piraeus version - 1.8.0


Setup details -

The K8s cluster is a 3-node setup - 2 disk nodes and 1 diskless node, with Protocol C replication (commands to confirm this layout are sketched below).
The disk node with the InUse resource in SyncTarget state is flex-106 (abbreviated as 106).
The disk node with the Unused resource in UpToDate state is flex-107 (abbreviated as 107).
The diskless node with the Unused resource is flex-108 (abbreviated as 108).
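
A minimal sketch for confirming this layout from the LINSTOR controller (controller deployment and namespace names taken from this issue; output columns depend on the LINSTOR client version):

k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor node list
k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor storage-pool list

The storage-pool listing should show a diskful pool only on the two disk nodes, while the diskless node carries only the default diskless pool.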

Some relevant info -

[root@flex-103 ~]# k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor r l | grep pvc-353143c5-e55d-4c75-98be-5248124dd160
| pvc-353143c5-e55d-4c75-98be-5248124dd160 | flex-106.dr.avaya.com | 7009 | InUse  | Ok    | SyncTarget(64.11%) | 2022-05-25 22:59:58 |
| pvc-353143c5-e55d-4c75-98be-5248124dd160 | flex-107.dr.avaya.com | 7009 | Unused | Ok    |           UpToDate | 2022-05-25 23:00:00 |
| pvc-353143c5-e55d-4c75-98be-5248124dd160 | flex-108.dr.avaya.com | 7009 | Unused | Ok    |           Diskless | 2022-05-25 22:59:59 |


DRBD logs from the disk node (flex-106) that has this PVC resource stuck in SyncTarget -

[root@flex-106 ccmuser]# grep pvc-353143c5-e55d-4c75-98be-5248124dd160  /var/log/messages 
May 25 17:51:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
May 25 17:51:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: quorum( no -> yes )
May 25 17:51:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: Would lose quorum, but using tiebreaker logic to keep
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 3657581006 (0->1 496/16)
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 3657581006: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF9
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 3657581006 (3ms)
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: ack_receiver terminated
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Terminating ack_recv thread
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Terminating sender thread
May 25 19:14:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Starting sender thread (from drbd_r_pvc-3531 [7369])
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Connection closed
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Disconnecting -> StandAlone )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Terminating receiver thread
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( StandAlone -> Unconnected )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Starting receiver thread (from drbd_w_pvc-3531 [7349])
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Unconnected -> Connecting )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Handshake to peer 1 successful: Agreed network protocol version 121
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Peer authenticated using 20 bytes HMAC
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: Starting ack_recv thread (from drbd_r_pvc-3531 [83213])
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 484053172 (0->1 499/146)
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: drbd_sync_handshake:
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: self 93E3EF437B3C9C06:0000000000000000:CB4D79960DC2245A:43030D49A0AAB1B8 bits:0 flags:100
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: peer 2744834DBA493B50:0000000000000000:93E3EF437B3C9C06:CB4D79960DC2245A bits:2589 flags:1100
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: uuid_compare()=target-set-bitmap by rule=history-peer
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Setting and writing one bitmap slot, after drbd_sync_handshake
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 484053172: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 484053172 (22ms)
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-107.dr.avaya.com: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: disk( Consistent -> Outdated )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: disk( Outdated -> Inconsistent )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: resync-susp( no -> connection dependency )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: repl( WFBitMapT -> SyncTarget )
May 25 19:14:06 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Began resync as SyncTarget (will sync 2100760 KB [525190 bits set]).
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 236192581 (0->2 8176/3088)
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 236192581: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF9
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 236192581 (2ms)
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Connected -> Disconnecting ) peer( Primary -> Unknown )
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: ack_receiver terminated
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Terminating ack_recv thread
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Terminating sender thread
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Starting sender thread (from drbd_r_pvc-3531 [7370])
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Connection closed
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Disconnecting -> StandAlone )
May 25 19:14:13 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Terminating receiver thread
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( StandAlone -> Unconnected )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Starting receiver thread (from drbd_w_pvc-3531 [7349])
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Unconnected -> Connecting )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Handshake to peer 2 successful: Agreed network protocol version 121
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Peer authenticated using 20 bytes HMAC
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: Starting ack_recv thread (from drbd_r_pvc-3531 [83820])
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 3661494316 (0->2 499/146)
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 3661494316: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 3661494316 (6ms)
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: conn( Connecting -> Connected ) peer( Unknown -> Primary )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: strategy = target-set-bitmap
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: drbd_sync_handshake:
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: self 93E3EF437B3C9C06:0000000000000000:CB4D79960DC2245A:43030D49A0AAB1B8 bits:333702 flags:104
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: peer 2744834DBA493B50:0000000000000000:93E3EF437B3C9C06:CB4D79960DC2245A bits:2589 flags:1100
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: uuid_compare()=target-set-bitmap by rule=history-peer
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Setting and writing one bitmap slot, after drbd_sync_handshake
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009: ASSERTION current != device->resource->worker.task FAILED in drbd_bitmap_io
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Becoming WFBitMapT because primary is diskless
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change failed: Can not start OV/resync since it is already active
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-108.dr.avaya.com: Failed: resync-susp( connection dependency -> no )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: Failed: repl( SyncTarget -> WFBitMapT )
May 25 19:14:14 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: ...postponing this until current resync finished
May 25 19:14:22 flex-106 kubelet[2255]: I0525 19:14:22.787201    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:22 flex-106 kubelet[2255]: E0525 19:14:22.787266    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:23.28724099 -0600 MDT m=+5071.728082294 (durationBeforeRetry 500ms). Error: Volume has not been added to the list of VolumesInUse in the node's volume status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:22 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: BAD! enr=0 rs_left=-1 rs_failed=0 count=1 cstate=Connected SyncTarget
May 25 19:14:22 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160 flex-108.dr.avaya.com: peer( Primary -> Secondary )
May 25 19:14:23 flex-106 kubelet[2255]: I0525 19:14:23.293106    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:23 flex-106 kubelet[2255]: E0525 19:14:23.293189    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:24.293174982 -0600 MDT m=+5072.734016279 (durationBeforeRetry 1s). Error: Volume has not been added to the list of VolumesInUse in the node's volume status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:24 flex-106 kubelet[2255]: I0525 19:14:24.301762    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:24 flex-106 kubelet[2255]: E0525 19:14:24.301836    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:26.301822172 -0600 MDT m=+5074.742663467 (durationBeforeRetry 2s). Error: Volume has not been added to the list of VolumesInUse in the node's volume status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:26 flex-106 kubelet[2255]: I0525 19:14:26.319722    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:26 flex-106 kubelet[2255]: E0525 19:14:26.324195    2255 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160 podName: nodeName:}" failed. No retries permitted until 2022-05-25 19:14:30.324175949 -0600 MDT m=+5078.765017244 (durationBeforeRetry 4s). Error: Volume not attached according to node status for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.347525    2255 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.354760    2255 operation_generator.go:1523] Controller attach succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") device path: ""
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.448399    2255 reconciler.go:269] "operationExecutor.MountVolume started for volume \"pvc-353143c5-e55d-4c75-98be-5248124dd160\" (UniqueName: \"kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160\") pod \"alarming-db-service-pgo-repo-host-0\" (UID: \"617bdf9e-7c3d-4263-9799-2a89567677d3\") "
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.448643    2255 operation_generator.go:587] MountVolume.WaitForAttach entering for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") DevicePath ""
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.453864    2255 operation_generator.go:597] MountVolume.WaitForAttach succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") DevicePath "csi-a9e5b60d7e750fc54fdefe9a0b376ceb20ade3d5650f9c5533faaf827c6ac10e"
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.457200    2255 operation_generator.go:630] MountVolume.MountDevice succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-353143c5-e55d-4c75-98be-5248124dd160/globalmount"
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 266334151 (0->-1 3/1)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 266334151: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 266334151 (1ms)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: role( Secondary -> Primary )
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: role( Primary -> Secondary )
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Preparing cluster-wide state change 195577500 (0->-1 3/1)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: State change 195577500: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFF8
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: Committing cluster-wide state change 195577500 (1ms)
May 25 19:14:30 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160: role( Secondary -> Primary )
May 25 19:14:30 flex-106 kubelet[2255]: I0525 19:14:30.682158    2255 operation_generator.go:712] MountVolume.SetUp succeeded for volume "pvc-353143c5-e55d-4c75-98be-5248124dd160" (UniqueName: "kubernetes.io/csi/linstor.csi.linbit.com^pvc-353143c5-e55d-4c75-98be-5248124dd160") pod "alarming-db-service-pgo-repo-host-0" (UID: "617bdf9e-7c3d-4263-9799-2a89567677d3")
May 25 20:00:05 flex-106 kernel: drbd pvc-353143c5-e55d-4c75-98be-5248124dd160/0 drbd1009 flex-107.dr.avaya.com: BAD! enr=1 rs_left=-55 rs_failed=0 count=55 cstate=Connected SyncTarget

[root@flex-103 cust]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm dstate pvc-353143c5-e55d-4c75-98be-5248124dd160
Inconsistent/Diskless/UpToDate
[root@flex-103 cust]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm cstate pvc-353143c5-e55d-4c75-98be-5248124dd160
Connected
Connected
[root@flex-103 ~]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm status pvc-353143c5-e55d-4c75-98be-5248124dd160 --verbose
drbdsetup status pvc-353143c5-e55d-4c75-98be-5248124dd160
pvc-353143c5-e55d-4c75-98be-5248124dd160 role:Primary
  disk:Inconsistent
  flex-107.dr.avaya.com role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:64.12
  flex-108.dr.avaya.com role:Secondary
    peer-disk:Diskless peer-client:yes resync-suspended:dependency


[root@flex-103 ~]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdsetup status pvc-353143c5-e55d-4c75-98be-5248124dd160
pvc-353143c5-e55d-4c75-98be-5248124dd160 role:Primary
  disk:Inconsistent
  flex-107.dr.avaya.com role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:64.12
  flex-108.dr.avaya.com role:Secondary
    peer-disk:Diskless peer-client:yes resync-suspended:dependency
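
For reference, the resync has been sitting at ~64% with resync-suspended:dependency toward the diskless peer. A connection bounce on the SyncTarget node forces DRBD to re-run its sync handshake, which is one thing that could be tried here; this is a sketch only (not verified against this cluster state, and it briefly disconnects the resource from all of its peers):

k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm disconnect pvc-353143c5-e55d-4c75-98be-5248124dd160
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdadm connect pvc-353143c5-e55d-4c75-98be-5248124dd160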


The pod using this PVC is deployed on the disk node with this SyncTarget resource -

[root@flex-103 ~]# k get pods -o wide | grep alarming-db-service-pgo-repo-host-0
alarming-db-service-pgo-repo-host-0                               1/1     Running     0               6h8m    10.200.56.55     flex-106.dr.avaya.com   <none>           <none>
[root@flex-103 cust]# k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdsetup show pvc-353143c5-e55d-4c75-98be-5248124dd160 
resource "pvc-353143c5-e55d-4c75-98be-5248124dd160" {
    options {
        quorum          	majority;
        on-no-quorum    	io-error;
    }
    _this_host {
        node-id			0;
        volume 0 {
            device			minor 1009;
            disk			"/dev/vg_sds/pvc-353143c5-e55d-4c75-98be-5248124dd160_00000";
            meta-disk			internal;
            disk {
                rs-discard-granularity	262144; # bytes
            }
        }
    }
    connection {
        _peer_node_id 1;
        path {
            _this_host ipv4 10.129.185.106:7009;
            _remote_host ipv4 10.129.185.107:7009;
        }
        net {
            max-epoch-size  	10000;
            cram-hmac-alg   	"sha1";
            shared-secret   	"HK+1vTym1caM9w4aR3uh";
            verify-alg      	"crct10dif-pclmul";
            max-buffers     	10000;
            _name           	"flex-107.dr.avaya.com";
        }
    }
    connection {
        _peer_node_id 2;
        path {
            _this_host ipv4 10.129.185.106:7009;
            _remote_host ipv4 10.129.185.108:7009;
        }
        net {
            max-epoch-size  	10000;
            cram-hmac-alg   	"sha1";
            shared-secret   	"HK+1vTym1caM9w4aR3uh";
            verify-alg      	"crct10dif-pclmul";
            max-buffers     	10000;
            _name           	"flex-108.dr.avaya.com";
        }
        volume 0 {
            disk {
                bitmap          	no;
            }
        }
    }
}

[root@flex-103 ~]# k get pods -n piraeus -o wide | grep ns
piraeus-op-piraeus-operator-ns-node-4vmfm                     2/2     Running   4 (7h35m ago)   9h      10.129.185.106   flex-106.dr.avaya.com   <none>           <none>
piraeus-op-piraeus-operator-ns-node-jrhwv                     2/2     Running   2 (168m ago)    9h      10.129.185.108   flex-108.dr.avaya.com   <none>           <none>
piraeus-op-piraeus-operator-ns-node-nhrxp                     2/2     Running   2 (171m ago)    9h      10.129.185.107   flex-107.dr.avaya.com   <none>           <none>


To avoid cluttering the issue, I have attached all logs related to this PVC here. The attachments cover the DRBD states, DRBD kernel logs, the StorageClass definition, and linstor r l output:

drbadm-status-verbose-on-replica-disk-node-107.log
drbdadm-cstate-disk-node-106.log
drbdadm-cstate-diskless.log
drbdadm-cstate-replica-disk-node-107.log
drbdadm-dstate-disk-node-106.log
drbdadm-dstate-diskless.log
drbdadm-dstate-replica-disk-node-107.log
drbdadm-show-resource-on-InUse.log
linstor-resource-list.log
node-associated-with-pod-using-this-pvc.log
pvc-description.log
sc-info.yaml.log
diskless-node-stuck-target-108.log
disk-node-stuck-target-primary-106.log
disk-node-stuck-target-secondary-107.log
drbadm-status-verbose-on-diskless-node-108.log
drbadm-status-verbose-on-InUse-node.log

@bitemarcz

How do you get the resource name for a DRBD resource? It seems we are using RES: *, but every name we try is reported as not a valid resource name.

We also tried drbdadm down all, but that also returns that no resources are defined.
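
For reference, a sketch of how the names can be listed (pod and namespace names are taken from this issue; adjust to your cluster). The LINSTOR-generated DRBD .res files normally live inside the linstor-satellite container (under /var/lib/linstor.d), so drbdadm run directly on the host typically reports no resources; listing from inside the satellite container, or via the controller, should show the names (LINSTOR uses the PV name as the DRBD resource name, e.g. pvc-353143c5-e55d-4c75-98be-5248124dd160):

k exec -n piraeus piraeus-op-piraeus-operator-ns-node-4vmfm -c linstor-satellite -- drbdsetup status
k exec -n piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor resource list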
