
mkfs.xfs: libxfs_device_zero write failed: Input/output error #8604

Open
reysravanga opened this issue May 20, 2024 · 6 comments
Labels
kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@reysravanga

Describe the bug

While mounting the PVC to a pod, Longhorn failed to mount the volume with a timeout error.

To Reproduce

  1. Longhorn failed to mount the PVC to the pod with the following error: volume attachment is being deleted
  2. Scaled the pod replicas down to zero and detached the volume
  3. The volume turned into faulted mode; to fix the faulted volume, salvaged the volume
  4. Scaled the pod replicas up to 1 and encountered the error below (an illustrative kubectl sketch of these steps follows the error output):
 Warning  FailedMount             6m20s  kubelet                  MountVolume.MountDevice failed for volume "pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" : rpc error: code = Internal desc = format of disk "/dev/longhorn/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" failed: type:("xfs") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount") options:("defaults") errcode:(exit status 1) output:(error reading existing superblock: No data available
mkfs.xfs: pwrite failed: No data available
libxfs_bwrite: write failed on (unknown) bno 0x1116ffff00/0x100, err=61
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No data available
libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=61
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No data available
libxfs_bwrite: write failed on xfs_sb bno 0x0/0x8, err=61
mkfs.xfs: Releasing dirty buffer to free list!
mkfs.xfs: libxfs_device_zero write failed: Input/output error
meta-data=/dev/longhorn/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf isize=512    agcount=35, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=9175040000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
)
  Warning  FailedMount  57s (x3 over 5m30s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[storage-cloned], unattached volumes=[kube-api-access-7p6b4 storage-cloned]: timed out waiting for the condition
  Warning  FailedMount  7s (x10 over 6m20s)  kubelet  MountVolume.MountDevice failed for volume "pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" : rpc error: code = InvalidArgument desc = volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf hasn't been attached yet
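
For clarity, an illustrative sketch of the scale-down/salvage/scale-up steps above, assuming the workload is a Deployment (the namespace and deployment names are placeholders):

# 1. Scale the workload down so the Longhorn volume can be detached (names are placeholders).
kubectl -n <namespace> scale deployment <deployment> --replicas=0

# 2. Wait for the volume to detach, then salvage the faulted volume from the Longhorn UI
#    (Volume -> Salvage) once a healthy replica is available.

# 3. Scale the workload back up; the error events above appeared at this point.
kubectl -n <namespace> scale deployment <deployment> --replicas=1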

Expected behavior

The Longhorn volume should be mounted without any error.

Environment

  • Longhorn version:
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: baremetal k8s
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: ubuntu 22.04
    • Kernel version: Linux node1 5.4.0-174-generic #193-Ubuntu SMP Thu Mar 7 14:29:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
    • CPU per node: 48
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 64

Additional context

@reysravanga reysravanga added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels May 20, 2024
@reysravanga

Hi Team,

Could someone help me with a possible fix?

shuo-wu commented May 24, 2024

The volume turned into faulted mode; to fix the faulted volume, salvaged the volume

Note that your volume became faulted before this remount. The filesystem inside it is probably corrupted as well. Could you try checking and fixing the filesystem manually first (via fsck or xfs_repair)?

  1. Log in to the node host the volume is currently attached to
  2. Find the volume block device /dev/longhorn/<volume name> and run the repair commands (see the sketch below)
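
A minimal sketch of those repair commands, assuming the volume uses XFS as shown in the error output and is not mounted (the volume name is a placeholder):

# Run on the node host where the volume is attached; <volume name> is a placeholder.
# First, check the filesystem without modifying it.
xfs_repair -n /dev/longhorn/<volume name>

# If problems are reported, run the actual repair (the filesystem must be unmounted).
xfs_repair /dev/longhorn/<volume name>

# As a last resort, -L zeroes a dirty log that cannot be replayed (recent data may be lost).
# xfs_repair -L /dev/longhorn/<volume name>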

@reysravanga

Hi @shuo-wu ,

Thanks for the steps. However, after restarting the instance manager, the pod could be scheduled. Is there any way we can prevent these Longhorn failures?

shuo-wu commented May 29, 2024

I don't know what caused the Longhorn failure. Are there any error logs in the longhorn-manager or instance-manager pods from when your volume became Faulted? (I guess the instance-manager pod log is lost since you already restarted it.)
BTW, restarting the instance manager is pretty dangerous, as it may crash all volumes and replicas on the corresponding node.
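
For reference, a hedged sketch of how those logs could be collected before any restart (pod names are placeholders; Longhorn typically runs in the longhorn-system namespace):

# List the Longhorn pods to find the manager and instance-manager pod names.
kubectl -n longhorn-system get pods -o wide

# Capture logs covering the time window when the volume became Faulted (pod names are placeholders).
kubectl -n longhorn-system logs <longhorn-manager-pod> --since=24h > longhorn-manager.log
kubectl -n longhorn-system logs <instance-manager-pod> --since=24h > instance-manager.log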

@reysravanga

Unfortunately, we don't have the longhorn-manager and instance-manager logs; however, I have the support bundle here:
https://drive.google.com/file/d/1GDjK2hEKrEIgiiFezMZZQM4BLiQV6v_i/view?usp=sharing

shuo-wu commented Jun 5, 2024

In the instance-manager pod instance-manager-406f2d88d3fb749fba8b723a5ea318ac, I found error logs indicating that tgtd times out or fails to receive responses from the engine, while the requests are actually handled successfully:

2024-05-29T21:47:52.731004839Z timeout_handler: Timeout request 26029821 due to disconnection
2024-05-29T21:47:52.731115669Z timeout_handler: Timeout request 26029822 due to disconnection
......
2024-05-29T21:47:52.731193708Z tgtd: tgtd: bs_longhorn_request(97) fail to write at 2490052608 for 4096
2024-05-29T21:47:52.731197585Z tgtd: bs_longhorn_request(97) fail to write at 1619369984 for 4096
2024-05-29T21:47:52.731201432Z tgtd: tgtd: bs_longhorn_request(97) fail to write at 2099392512 for 4096
2024-05-29T21:47:52.731205180Z tgtd: tgtd: tgtd: tgtd: bs_longhorn_request(97) fail to write at 4096 for 8192
2024-05-29T21:47:52.731209388Z bs_longhorn_request(210) io error 0x232d380 8a -14 4096 1619369984, Success
2024-05-29T21:47:52.731212864Z tgtd: bs_longhorn_request(210) io error 0x223b100 8a -14 4096 2099392512, Success
......
2024-05-29T21:49:06.766894083Z response_process: Unknown response sequence 26029821
2024-05-29T21:49:06.766913409Z response_process: Unknown response sequence 26029833

After this short period of timeouts, everything seems to be normal again, and the corresponding volume finally looks good in the longhorn-csi-plugin pod:

2024-06-02T18:00:00.327353476Z time="2024-06-02T18:00:00Z" level=info msg="NodePublishVolume: req: {\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount\",\"target_path\":\"/var/lib/kubelet/pods/05032b26-9933-42fa-83e5-0cd6ce6f9f8f/volumes/kubernetes.io~csi/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/mount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"xfs\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"csi.storage.k8s.io/ephemeral\":\"false\",\"csi.storage.k8s.io/pod.name\":\"backup-minio-28622520-wzdrm\",\"csi.storage.k8s.io/pod.namespace\":\"avs\",\"csi.storage.k8s.io/pod.uid\":\"05032b26-9933-42fa-83e5-0cd6ce6f9f8f\",\"csi.storage.k8s.io/serviceAccount.name\":\"default\",\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"xfs\",\"numberOfReplicas\":\"1\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1687657401492-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf\"}" func=csi.logGRPC file="server.go:132"
2024-06-02T18:00:00.328223473Z time="2024-06-02T18:00:00Z" level=info msg="NodePublishVolume is called with req volume_id:\"pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf\" staging_target_path:\"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount\" target_path:\"/var/lib/kubelet/pods/05032b26-9933-42fa-83e5-0cd6ce6f9f8f/volumes/kubernetes.io~csi/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/mount\" volume_capability:<mount:<fs_type:\"xfs\" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:\"csi.storage.k8s.io/ephemeral\" value:\"false\" > volume_context:<key:\"csi.storage.k8s.io/pod.name\" value:\"backup-minio-28622520-wzdrm\" > volume_context:<key:\"csi.storage.k8s.io/pod.namespace\" value:\"avs\" > volume_context:<key:\"csi.storage.k8s.io/pod.uid\" value:\"05032b26-9933-42fa-83e5-0cd6ce6f9f8f\" > volume_context:<key:\"csi.storage.k8s.io/serviceAccount.name\" value:\"default\" > volume_context:<key:\"dataLocality\" value:\"disabled\" > volume_context:<key:\"fromBackup\" value:\"\" > volume_context:<key:\"fsType\" value:\"xfs\" > volume_context:<key:\"numberOfReplicas\" value:\"1\" > volume_context:<key:\"staleReplicaTimeout\" value:\"30\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"1687657401492-8081-driver.longhorn.io\" > " func="csi.(*NodeServer).NodePublishVolume" file="node_server.go:82" component=csi-node-server function=NodePublishVolume
2024-06-02T18:00:00.339179122Z time="2024-06-02T18:00:00Z" level=info msg="Volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf using user and longhorn provided xfs fs creation params: -ssize=4096 -bsize=4096" func="csi.(*NodeServer).getMounter" file="node_server.go:829"
2024-06-02T18:00:00.346910169Z time="2024-06-02T18:00:00Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount" func=csi.ensureMountPoint file="util.go:295"
2024-06-02T18:00:00.346981355Z time="2024-06-02T18:00:00Z" level=info msg="Mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/globalmount try opening and syncing dir to make sure it's healthy" func=csi.ensureMountPoint file="util.go:303"
2024-06-02T18:00:01.178594601Z time="2024-06-02T18:00:01Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/pods/05032b26-9933-42fa-83e5-0cd6ce6f9f8f/volumes/kubernetes.io~csi/pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf/mount" func=csi.ensureMountPoint file="util.go:295"
2024-06-02T18:00:01.217698509Z time="2024-06-02T18:00:01Z" level=info msg="NodePublishVolume: rsp: {}" func=csi.logGRPC file="server.go:141"

Are the corresponding workload pods working fine now?
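
A quick illustrative sketch for checking that, assuming the workload mounts the volume at a known path (namespace, pod name, and mount path are placeholders):

# Confirm the workload pod is Running and the XFS volume is mounted inside it.
kubectl -n <namespace> get pod <workload-pod> -o wide
kubectl -n <namespace> exec <workload-pod> -- df -h <mount path>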
