
mkfs.xfs: libxfs_device_zero write failed: Input/output error #8604

Open
reysravanga opened this issue May 20, 2024 · 6 comments
Labels
kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@reysravanga

Describe the bug

While mounting the PVC to a pod, Longhorn failed to mount the volume with a timeout error.

To Reproduce

  1. Longhorn failed to mount the PVC to the pod with the following error: volume attachment is being deleted
  2. Scaled the pod replicas down to zero and detached the volume
  3. The volume turned into faulted mode; to fix the faulted volume, salvaged the volume
  4. Scaled the pod replicas up to 1 and encountered the error below (an illustrative kubectl sketch of these steps follows the error output):
 Warning  FailedMount             6m20s  kubelet                  MountVolume.MountDevice failed for volume "pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" : rpc error: code = Internal desc = format of disk "/dev/longhorn/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" failed: type:("xfs") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount") options:("defaults") errcode:(exit status 1) output:(error reading existing superblock: No data available
mkfs.xfs: pwrite failed: No data available
libxfs_bwrite: write failed on (unknown) bno 0x1116ffff00/0x100, err=61
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No data available
libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=61
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No data available
libxfs_bwrite: write failed on xfs_sb bno 0x0/0x8, err=61
mkfs.xfs: Releasing dirty buffer to free list!
mkfs.xfs: libxfs_device_zero write failed: Input/output error
meta-data=/dev/longhorn/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf isize=512    agcount=35, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=9175040000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
)
  Warning  FailedMount  57s (x3 over 5m30s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[storage-cloned], unattached volumes=[kube-api-access-7p6b4 storage-cloned]: timed out waiting for the condition
  Warning  FailedMount  7s (x10 over 6m20s)  kubelet  MountVolume.MountDevice failed for volume "pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" : rpc error: code = InvalidArgument desc = volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf hasn't been attached yet
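
For clarity, an illustrative sketch of the scale-down/salvage/scale-up steps above, assuming the workload is a Deployment (the namespace and deployment names are placeholders):

# 1. Scale the workload down so the Longhorn volume can be detached (names are placeholders).
kubectl -n <namespace> scale deployment <deployment> --replicas=0

# 2. Wait for the volume to detach, then salvage the faulted volume from the Longhorn UI
#    (Volume -> Salvage) once a healthy replica is available.

# 3. Scale the workload back up; the error events above appeared at this point.
kubectl -n <namespace> scale deployment <deployment> --replicas=1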

Expected behavior

The Longhorn volume should be mounted without any error.

Environment

  • Longhorn version:
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: baremetal k8s
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: ubuntu 22.04
    • Kernel version: Linux node1 5.4.0-174-generic #193-Ubuntu SMP Thu Mar 7 14:29:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
    • CPU per node: 48
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 64

Additional context

@reysravanga reysravanga added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels May 20, 2024
@reysravanga

Hi Team,

Could someone help me with a possible fix?

shuo-wu commented May 24, 2024

The volume turned into faulted mode; to fix the faulted volume, salvaged the volume

Note that your volume became faulted before this remount. The filesystem inside it is probably corrupted as well. Could you try checking and fixing the filesystem manually first (via fsck or xfs_repair)?

  1. Log in to the node host the volume is currently attached to
  2. Find the volume block device /dev/longhorn/<volume name> and run the repair commands (see the sketch below)
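
A minimal sketch of those repair commands, assuming the volume uses XFS as shown in the error output and is not mounted (the volume name is a placeholder):

# Run on the node host where the volume is attached; <volume name> is a placeholder.
# First, check the filesystem without modifying it.
xfs_repair -n /dev/longhorn/<volume name>

# If problems are reported, run the actual repair (the filesystem must be unmounted).
xfs_repair /dev/longhorn/<volume name>

# As a last resort, -L zeroes a dirty log that cannot be replayed (recent data may be lost).
# xfs_repair -L /dev/longhorn/<volume name>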

@reysravanga

Hi @shuo-wu ,

Thanks for the steps. However, after restarting the instance manager, the pod could be scheduled. Is there any way we can prevent these Longhorn failures?

shuo-wu commented May 29, 2024

I don't know what caused the Longhorn failure. Are there any error logs in the longhorn-manager or instance-manager pods from when your volume became Faulted? (I guess the instance-manager pod log is lost since you already restarted it.)
BTW, restarting the instance manager is pretty dangerous, as it may crash all volumes and replicas on the corresponding node.
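
For reference, a hedged sketch of how those logs could be collected before any restart (pod names are placeholders; Longhorn typically runs in the longhorn-system namespace):

# List the Longhorn pods to find the manager and instance-manager pod names.
kubectl -n longhorn-system get pods -o wide

# Capture logs covering the time window when the volume became Faulted (pod names are placeholders).
kubectl -n longhorn-system logs <longhorn-manager-pod> --since=24h > longhorn-manager.log
kubectl -n longhorn-system logs <instance-manager-pod> --since=24h > instance-manager.log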

@reysravanga

Unfortunately, we don't have the longhorn-manager and instance-manager logs; however, I have the support bundle here:
https://drive.google.com/file/d/1GDjK2hEKrEIgiiFezMZZQM4BLiQV6v_i/view?usp=sharing

shuo-wu commented Jun 5, 2024

In the instance-manager pod instance-manager-406f2d88d3fb749fba8b723a5ea318ac, I found error logs indicating that tgtd times out or fails to receive responses from the engine, while the requests are actually handled successfully:

2024-05-29T21:47:52.731004839Z timeout_handler: Timeout request 26029821 due to disconnection
2024-05-29T21:47:52.731115669Z timeout_handler: Timeout request 26029822 due to disconnection
......
2024-05-29T21:47:52.731193708Z tgtd: tgtd: bs_longhorn_request(97) fail to write at 2490052608 for 4096
2024-05-29T21:47:52.731197585Z tgtd: bs_longhorn_request(97) fail to write at 1619369984 for 4096
2024-05-29T21:47:52.731201432Z tgtd: tgtd: bs_longhorn_request(97) fail to write at 2099392512 for 4096
2024-05-29T21:47:52.731205180Z tgtd: tgtd: tgtd: tgtd: bs_longhorn_request(97) fail to write at 4096 for 8192
2024-05-29T21:47:52.731209388Z bs_longhorn_request(210) io error 0x232d380 8a -14 4096 1619369984, Success
2024-05-29T21:47:52.731212864Z tgtd: bs_longhorn_request(210) io error 0x223b100 8a -14 4096 2099392512, Success
......
2024-05-29T21:49:06.766894083Z response_process: Unknown response sequence 26029821
2024-05-29T21:49:06.766913409Z response_process: Unknown response sequence 26029833

After this short period of timeouts, everything seems to be normal again, and the corresponding volume finally looks good in the longhorn-csi-plugin pod:

2024-06-02T18:00:00.327353476Z time="2024-06-02T18:00:00Z" level=info msg="NodePublishVolume: req: {\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount\",\"target_path\":\"/var/lib/kubelet/pods/05032b26-9933-42fa-83e5-0cd6ce6f9f8f/volumes/kubernetes.io~csi/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/mount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"xfs\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"csi.storage.k8s.io/ephemeral\":\"false\",\"csi.storage.k8s.io/pod.name\":\"backup-minio-28622520-wzdrm\",\"csi.storage.k8s.io/pod.namespace\":\"avs\",\"csi.storage.k8s.io/pod.uid\":\"05032b26-9933-42fa-83e5-0cd6ce6f9f8f\",\"csi.storage.k8s.io/serviceAccount.name\":\"default\",\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"xfs\",\"numberOfReplicas\":\"1\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1687657401492-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf\"}" func=csi.logGRPC file="server.go:132"
2024-06-02T18:00:00.328223473Z time="2024-06-02T18:00:00Z" level=info msg="NodePublishVolume is called with req volume_id:\"pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf\" staging_target_path:\"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount\" target_path:\"/var/lib/kubelet/pods/05032b26-9933-42fa-83e5-0cd6ce6f9f8f/volumes/kubernetes.io~csi/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/mount\" volume_capability:<mount:<fs_type:\"xfs\" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:\"csi.storage.k8s.io/ephemeral\" value:\"false\" > volume_context:<key:\"csi.storage.k8s.io/pod.name\" value:\"backup-minio-28622520-wzdrm\" > volume_context:<key:\"csi.storage.k8s.io/pod.namespace\" value:\"avs\" > volume_context:<key:\"csi.storage.k8s.io/pod.uid\" value:\"05032b26-9933-42fa-83e5-0cd6ce6f9f8f\" > volume_context:<key:\"csi.storage.k8s.io/serviceAccount.name\" value:\"default\" > volume_context:<key:\"dataLocality\" value:\"disabled\" > volume_context:<key:\"fromBackup\" value:\"\" > volume_context:<key:\"fsType\" value:\"xfs\" > volume_context:<key:\"numberOfReplicas\" value:\"1\" > volume_context:<key:\"staleReplicaTimeout\" value:\"30\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"1687657401492-8081-driver.longhorn.io\" > " func="csi.(*NodeServer).NodePublishVolume" file="node_server.go:82" component=csi-node-server function=NodePublishVolume
2024-06-02T18:00:00.339179122Z time="2024-06-02T18:00:00Z" level=info msg="Volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf using user and longhorn provided xfs fs creation params: -ssize=4096 -bsize=4096" func="csi.(*NodeServer).getMounter" file="node_server.go:829"
2024-06-02T18:00:00.346910169Z time="2024-06-02T18:00:00Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount" func=csi.ensureMountPoint file="util.go:295"
2024-06-02T18:00:00.346981355Z time="2024-06-02T18:00:00Z" level=info msg="Mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount try opening and syncing dir to make sure it's healthy" func=csi.ensureMountPoint file="util.go:303"
2024-06-02T18:00:01.178594601Z time="2024-06-02T18:00:01Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/pods/05032b26-9933-42fa-83e5-0cd6ce6f9f8f/volumes/kubernetes.io~csi/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/mount" func=csi.ensureMountPoint file="util.go:295"
2024-06-02T18:00:01.217698509Z time="2024-06-02T18:00:01Z" level=info msg="NodePublishVolume: rsp: {}" func=csi.logGRPC file="server.go:141"

Are the corresponding workload pods working fine now?
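
A quick illustrative sketch for checking that, assuming the workload mounts the volume at a known path (namespace, pod name, and mount path are placeholders):

# Confirm the workload pod is Running and the XFS volume is mounted inside it.
kubectl -n <namespace> get pod <workload-pod> -o wide
kubectl -n <namespace> exec <workload-pod> -- df -h <mount path>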
