iscsi failure on "some" cluster nodes. #395

Open
MadOtis opened this issue May 6, 2024 · 5 comments
Comments

@MadOtis

MadOtis commented May 6, 2024

So, I've had democratic-csi deployed and running well for over a year with no issues. After recently updating expired Kube certs and restarting kubelet on all nodes in my homelab, I'm suddenly facing an issue on 2 of my 5 Kube nodes: previously created iscsi PVCs can't log back into the NAS, and newly deployed pods fail with an iscsi error. The following error gets tossed out by pods on those nodes (either new deployments or existing pods forced off the working nodes onto either of the 2 problematic nodes):

MountVolume.MountDevice failed for volume "pvc-29a229db-2143-45f7-811a-299585e200dd" : rpc error: code = Internal desc = {"code":19,"stdout":"Logging in to [iface: default, target: iqn.2005-10.org.freenas.ctl:csi-pvc-29a229db-2143-45f7-811a-299585e200dd, portal: 10.0.1.249,3260]\n","stderr":"iscsiadm: Could not login to [iface: default, target: iqn.2005-10.org.freenas.ctl:csi-pvc-29a229db-2143-45f7-811a-299585e200dd, portal: 10.0.1.249,3260].\niscsiadm: initiator reported error (19 - encountered non-retryable iSCSI login failure)\niscsiadm: Could not log into all portals\n","timeout":false}

I DO see the associated pvc-prefixed dataset created in TrueNAS, and the PVC shows as Bound within the cluster. But I never see anything via lsblk on the node host itself. Conversely, when the pods run on any of the 3 WORKING nodes, I DO see the corresponding names/mountpoints created for those PVCs.
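
In case it helps with debugging, roughly the same login the node plugin attempts can be tried by hand on an affected node. A sketch only, with the IQN/portal copied from the error above:

```sh
# Discover the targets exposed by the portal, then attempt the login manually
sudo iscsiadm -m discovery -t sendtargets -p 10.0.1.249
sudo iscsiadm -m node \
  -T iqn.2005-10.org.freenas.ctl:csi-pvc-29a229db-2143-45f7-811a-299585e200dd \
  -p 10.0.1.249:3260 --login

# Check whether a session was established and whether the LUN shows up as a block device
sudo iscsiadm -m session
lsblk
```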

All nodes in my cluster are managed and updated through Ansible with no errors during updates, and spot-checks of kernel versions, lib versions, etc. show they are all the same.

While I know this is most likely an open-iscsi problem, I'm just not sure how or where to look to resolve it. So ANY pointers on what/where to look, even though it might be slightly out of scope, would be IMMENSELY appreciated, because I'm kind of at a loss as to what to try next.

Details on OS, etc., if needed:
- Node hosts: all Debian 11 with daily apt updates, running CRI-O (1.28.0), Open-iSCSI (2.1.3-5), K8s/kubelet (1.28.9)
- TrueNAS SCALE (Cobia), all latest patches applied
- democratic-csi version: Helm chart 0.14.6

@travisghansen
Member

You need to check the server-side (iscsi target) logs to see why it's rejecting the connections.

Also be careful that your initiators each have unique names. Sometimes automation will end up with all nodes using the same initiatorname, which can/will cause issues.
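
Something like this makes the comparison quick (a sketch; the hostnames are placeholders for your actual nodes):

```sh
# Every node must report a different InitiatorName
for node in node1 node2 node3 node4 node5; do
  printf '%s: ' "$node"
  ssh "$node" sudo grep ^InitiatorName /etc/iscsi/initiatorname.iscsi
done
```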

@MadOtis
Author

MadOtis commented May 6, 2024

Yeah, I've confirmed initiator names in /etc/iscsi/initiatorname.iscsi are unique across all nodes.

And thanks; I'll take a look at each node and see if I can figure out what is borked. I'm NOT an iscsi expert; I've only started using it in my cluster because NFS seems to have issues with some pods, so it's been largely a "learn iscsi at gunpoint" approach thus far.

@MadOtis
Author

MadOtis commented May 8, 2024

I think this can be closed. I couldn't figure out why the two nodes could not log in to the portal, and there was nothing at all usable in the iscsiadm logs (at least that I could decipher). So, I took the hard route and just stood up 2 new nodes to replace the two that weren't working. After the controller spun up new democratic-csi pods on those new nodes, it started working. So, I suspect there was some borkage on those nodes.

On the plus side, I upgraded Debian to 12 while I was at it, so I now have playbooks to replace the other, older Deb 11 nodes.

@travisghansen
Member

I meant the logs on the storage server (not sure if you're on TrueNAS or something else). On SCALE, for example, that would be the scst service logs.
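
Something along these lines on the SCALE host should show the target side while the mount is retried (a sketch; the exact service/unit name may differ between releases):

```sh
# Follow the SCST target service logs on the TrueNAS SCALE host
journalctl -u scst -f

# SCST runs kernel-side, so the kernel ring buffer is worth watching as well
dmesg -w | grep -iE 'iscsi|scst'
```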

@MadOtis
Author

MadOtis commented May 8, 2024

Yes, I run TrueNAS SCALE (Cobia) on both my NAS hosts. I do have plans to update to Dragonfish in a few weeks, but wanted to get this sorted out first.

Regarding logging: I ship logs from all my nodes/servers/Docker containers to a local Graylog server, and I put both initiator and target output into a stream I could query, so I can see the initiator and target logs in concert with each other. I saw the creation of the PVC by the CSI, but it fell apart when the node itself tried to mount that newly created target. The server showed the connection request as successful, but the initiator just failed with an "error 19" and no real log output as to why; and, in all transparency, I'm not versed enough to know how to increase the logging level on either end to get more details.
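
For reference, iscsiadm does take a debug-level flag (0-8), so the initiator side can at least be made more verbose. A sketch, reusing the IQN/portal from the first error in this issue:

```sh
# Retry the failing login with maximum iscsiadm debug output
sudo iscsiadm -m node \
  -T iqn.2005-10.org.freenas.ctl:csi-pvc-29a229db-2143-45f7-811a-299585e200dd \
  -p 10.0.1.249:3260 --login -d 8

# iscsid logs to the journal on Debian, so follow it during the retry
sudo journalctl -u iscsid -f
```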
