'sudo: simple-file-writer: command not found' when resizing #390

Closed
Routhinator opened this issue Apr 26, 2024 · 18 comments

@Routhinator

This error seems new. It now comes up when resizing a volume on the new TrueNAS SCALE Dragonfish release. Is this executable supposed to be on the TrueNAS side?

Driver freenas-iscsi-csi - Democratic CSI Chart 0.14.6
TrueNAS SCALE Dragonfish Train - Version TrueNAS-SCALE-24.04.0

0s          Warning   VolumeResizeFailed         persistentvolumeclaim/nextcloud-db-3          resize volume "pvc-52d505c0-91b1-451f-a168-b66506e14846" by resizer "org.democratic-csi.iscsi" failed: rpc error: code = Unknown desc = error reloading iscsi daemon: {"stderr":"sudo: simple-file-writer: command not found\n","code":1}
@Routhinator
Author

I tried rolling the image tag back from v1.9.0 to v1.8.4, and that resulted in a different error:

0s          Warning   VolumeResizeFailed   persistentvolumeclaim/nextcloud-db-1        resize volume "pvc-bf920f1b-9270-437c-9193-8724cf1eee24" by resizer "org.democratic-csi.iscsi" failed: rpc error: code = Unknown desc = error reloading iscsi daemon: {"stderr":"bash: line 1: /sys/kernel/scst_tgt/devices/csi-pvc-bf920f1b-9270-437c-9193-8724cf1eee24-cluster/resync_size: Permission denied\n","code":1}

Not sure what is wrong here, but any advice would be helpful. I have workloads that cannot schedule until the PVC is done resizing, and a Postgres instance has been down for 24 hours now. I cannot figure out how to get the resize to finish.

The volumes have already had quota changes applied on the TrueNAS side.

@Routhinator
Author

Routhinator commented Apr 26, 2024

Actually, after chasing this down for a bit, I believe this is related to #295.

It is related but not the same, as I do not use nameTemplate.

Here is my full iSCSI config:

controller:
  externalAttacher:
    resources:
      limits:
        cpu: 50m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 50Mi
  externalProvisioner:
    resources:
      limits:
        cpu: 50m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 50Mi
  externalSnapshotter:
    resources:
      limits:
        cpu: 50m
        memory: 30Mi
      requests:
        cpu: 50m
        memory: 30Mi
  externalResizer:
    resources:
      limits:
        cpu: 50m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 50Mi
  driver:
    image: ghcr.io/democratic-csi/democratic-csi:v1.9.0
    resources:
      limits:
        cpu: 200m
        memory: 200Mi
      requests:
        cpu: 200m
        memory: 200Mi
node:
  driver:
    image: ghcr.io/democratic-csi/democratic-csi:v1.9.0
    resources:
      limits:
        cpu: 200m
        memory: 128Mi
      requests:
        cpu: 200m
        memory: 128Mi
csiDriver:
  # should be globally unique for a given cluster
  name: "org.democratic-csi.iscsi"

# add note here about volume expansion requirements
storageClasses:
- name: freenas-iscsi-csi
  defaultClass: true
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
  allowVolumeExpansion: true
  parameters:
    # for block-based storage can be ext3, ext4, xfs
    # for nfs should be nfs
    fsType: ext4

    # if true, volumes created from other snapshots will be
    # zfs send/received instead of zfs cloned
    # detachedVolumesFromSnapshots: "false"

    # if true, volumes created from other volumes will be
    # zfs send/received instead of zfs cloned
    # detachedVolumesFromVolumes: "false"

  mountOptions: []
  secrets:
    provisioner-secret:
    controller-publish-secret:
    node-stage-secret:
#      # any arbitrary iscsiadm entries can be added by creating keys starting with node-db.<entry.name>
#      # if doing CHAP
#      node-db.node.session.auth.authmethod: CHAP
#      node-db.node.session.auth.username: foo
#      node-db.node.session.auth.password: bar
#
#      # if doing mutual CHAP
#      node-db.node.session.auth.username_in: baz
#      node-db.node.session.auth.password_in: bar
    node-publish-secret:
    controller-expand-secret:

# if your cluster supports snapshots you may enable below
volumeSnapshotClasses: []
#- name: freenas-iscsi-csi
#  parameters:
#  # if true, snapshots will be created with zfs send/receive
#  # detachedSnapshots: "false"
#  secrets:
#    snapshotter-secret:

driver:
  config:
    # please see the most up-to-date example of the corresponding config here:
    # https://github.com/democratic-csi/democratic-csi/tree/master/examples
    # YOU MUST COPY THE DATA HERE INLINE!
    driver: ${driver_name}
    instance_id:
    httpConnection:
      protocol: https
      host: ${hostname}
      port: 8443
      apiKey: ${api_key}
      allowInsecure: false
      apiVersion: 2
    sshConnection:
      host: ${hostname}
      port: 22
      username: ${username}
      privateKey: |
        ${indent(8, ssh_priv_key)}
    zfs:
      # the example below is useful for TrueNAS 12
      cli:
        sudoEnabled: true
        paths:
          zfs: /sbin/zfs
          zpool: /sbin/zpool
          sudo: /bin/sudo
          chroot: /sbin/chroot
      # total volume name (zvol/<datasetParentName>/<pvc name>) length cannot exceed 63 chars
      # https://www.ixsystems.com/documentation/freenas/11.2-U5/storage.html#zfs-zvol-config-opts-tab
      # standard volume naming overhead is 46 chars
      # datasetParentName should therefore be 17 chars or less
      datasetParentName: ssd/block/v
      detachedSnapshotsDatasetParentName: ssd/block/s
      datasetEnableQuotas: true
      datasetEnableReservation: false
      # "" (inherit), lz4, gzip-9, etc
      zvolCompression:
      # "" (inherit), on, off, verify
      zvolDedup:
      zvolEnableReservation: false
      # 512, 1K, 2K, 4K, 8K, 16K, 64K, 128K default is 16K
      zvolBlocksize:
    iscsi:
      targetPortal: "${hostname}:${portal_port}"
      targetPortals: []
      # leave empty to omit usage of -I with iscsiadm
      interface:
      namePrefix: csi-
      nameSuffix: "-cluster"
      # add as many as needed
      targetGroups:
        # get the correct ID from the "portal" section in the UI
        - targetGroupPortalGroup: 1
          # get the correct ID from the "initiators" section in the UI
          targetGroupInitiatorGroup: 1
          # None, CHAP, or CHAP Mutual
          targetGroupAuthType: None
          # get the correct ID from the "Authorized Access" section of the UI
          # only required if using Chap
          targetGroupAuthGroup:
      extentInsecureTpc: true
      extentXenCompat: false
      extentDisablePhysicalBlocksize: true
      # 512, 1024, 2048, or 4096,
      extentBlocksize: 4096
      # "" (let FreeNAS decide, currently defaults to SSD), Unknown, SSD, 5400, 7200, 10000, 15000
      extentRpm: "SSD"
      # 0-100 (0 == ignore)
      extentAvailThreshold: 0

@Routhinator
Author

Routhinator commented Apr 26, 2024

I've rolled forward to v1.9.0 again for now, and I'm back to sudo: simple-file-writer: command not found

@travisghansen if you have any ideas on how I can get the resize unstuck for now, that would be appreciated.

@Routhinator
Author

Well, the command exists in the image and is in the path, so I'm confused about why the CSI cannot find it during execution...

-> % docker run -it --entrypoint bash ghcr.io/democratic-csi/democratic-csi:v1.9.0
root@9ceb6701b729:/home/csi/app# ls
bin  csi_proto  csi_proxy_proto  LICENSE  node_modules  package.json  package-lock.json  src
root@9ceb6701b729:/home/csi/app# find / -type f -name "simple-file-writer"
/usr/local/bin/simple-file-writer
root@9ceb6701b729:/home/csi/app# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
root@9ceb6701b729:/home/csi/app# simple-file-writer
/usr/local/bin/simple-file-writer: line 3: ${2}: ambiguous redirect
root@9ceb6701b729:/home/csi/app# cat /usr/local/bin/simple-file-writer
#!/bin/bash

echo ${1} > ${2}
root@9ceb6701b729:/home/csi/app#

@Routhinator
Author

I have a feeling the expandVolume function somehow doesn't have the same $PATH as the container, and so may not be picking up /usr/local/bin. As a long shot, I've jumped into the CSI code and edited ssh.js to use the full path instead of just the wrapper script name, to see if that's it.

command = execClient.buildCommand("simple-file-writer", [

@Routhinator
Author

OK, it took a few iterations of building an image variant to confirm, since I can't build the container itself because I can't download objectivefs, but:

{"host":"truenas-iscsi-democratic-csi-controller-5f4b6cfd5d-vjsrk","level":"error","message":"handler error - driver: FreeNASSshDriver method: ControllerExpandVolume error: {\"name\":\"GrpcError\",\"code\":2,\"message\":\"error reloading iscsi daemon: {\\\"stderr\\\":\\\"sudo: /usr/local/bin/simple-file-writer: command not found\\\\n\\\",\\\"code\\\":1}\"}","service":"democratic-csi","timestamp":"2024-04-26T21:36:25.603Z"}

It's not a path problem: I get the same error with the full path. I think this is being executed on the FreeNAS host?

Another issue I hit while looking at this:

My cert expired on the TrueNAS host and I started getting:

{"host":"truenas-iscsi-democratic-csi-controller-64bcdd9859-jlh6w","level":"error","message":"handler error - driver: FreeNASSshDriver method: Probe error: TypeError: err.getMessage is not a function TypeError: err.getMessage is not a function\n    at FreeNASSshDriver.Probe (/home/csi/app/src/driver/freenas/ssh.js:49:46)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async requestHandlerProxy (/home/csi/app/bin/democratic-csi:222:18)","service":"democratic-csi","timestamp":"2024-04-26T21:22:46.912Z"}

So the error message I posted showing the full path looks weird because to find out the cert expired, I had to modify this:

        throw new GrpcError(
          grpc.status.FAILED_PRECONDITION,
          `TrueNAS api is unavailable: ${err.getMessage()}`
        );

to

        throw new GrpcError(
          grpc.status.FAILED_PRECONDITION,
          `TrueNAS api is unavailable: ${err}`
        );

That allowed me to see the cert error. It seems that whatever throws the cert error isn't building the error object the same way.
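A minimal sketch of a more defensive way to build that message, assuming the thrown value may be either an object exposing a `getMessage()` method or a plain `Error` (such as a TLS certificate failure). The `describeError` helper and the two sample error values are hypothetical and are not part of democratic-csi:

```javascript
// Hypothetical helper (not part of democratic-csi): turn any thrown value
// into a readable string without assuming a particular error class.
function describeError(err) {
  // Some error types expose getMessage(); only call it when it exists.
  if (err && typeof err.getMessage === "function") {
    return String(err.getMessage());
  }
  // Plain Error objects carry their text in .message.
  if (err instanceof Error) {
    return err.message;
  }
  // Fall back to string coercion for anything else.
  return String(err);
}

// Sample values standing in for the two shapes of error (assumptions).
const apiError = { getMessage: () => "HTTP 401" };
const tlsError = new Error("certificate has expired");

console.log(`TrueNAS api is unavailable: ${describeError(apiError)}`);
console.log(`TrueNAS api is unavailable: ${describeError(tlsError)}`);
```

With something like this, the Probe handler would not throw a secondary TypeError when the underlying error lacks `getMessage()`, so the real cause stays visible in the logs.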

@Routhinator
Author

After playing around in the TrueNAS shell, I noticed the TODO and realized that neither the if statement causing this nor the wrapper shell script is needed if you double-quote the echo statement passed to sh -c in the sudo call.

csi@truenas01:~$ sudo echo 1 > /sys/kernel/scst_tgt/devices/csi-pvc-bf920f1b-9270-437c-9193-8724cf1eee24-cluster/resync_size
-bash: /sys/kernel/scst_tgt/devices/csi-pvc-bf920f1b-9270-437c-9193-8724cf1eee24-cluster/resync_size: Permission denied
csi@truenas01:~$ sudo sh -c echo 1 > /sys/kernel/scst_tgt/devices/csi-pvc-bf920f1b-9270-437c-9193-8724cf1eee24-cluster/resync_size
-bash: /sys/kernel/scst_tgt/devices/csi-pvc-bf920f1b-9270-437c-9193-8724cf1eee24-cluster/resync_size: Permission denied
csi@truenas01:~$ sudo sh -c "echo 1 > /sys/kernel/scst_tgt/devices/csi-pvc-bf920f1b-9270-437c-9193-8724cf1eee24-cluster/resync_size"

So I modified your original code from

          if (process.env.DEMOCRATIC_CSI_IS_CONTAINER == "true") {
            // use the built-in wrapper script that works with sudo
            command = execClient.buildCommand("simple-file-writer", [
              "1",
              `/sys/kernel/scst_tgt/devices/${kName}/resync_size`,
            ]);
          } else {
            // TODO: syntax fails with sudo
            command = execClient.buildCommand("sh", [
              "-c",
              `echo 1 > /sys/kernel/scst_tgt/devices/${kName}/resync_size`,
            ]);
          }

to

          command = execClient.buildCommand("sh", [
            "-c",
            `"echo 1 > /sys/kernel/scst_tgt/devices/${kName}/resync_size"`,
          ]);

And that has resolved it for me.
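A minimal sketch of why the quotes matter, assuming the command and its arguments are simply joined with spaces before being handed to a shell on the TrueNAS host. The `joinCommand` function is a hypothetical stand-in, not the real `execClient.buildCommand`, and `example-device` is a placeholder device name:

```javascript
// Hypothetical stand-in for how a command line might be assembled before it
// is sent over SSH; assumes arguments are simply joined with spaces.
function joinCommand(command, args, sudo = true) {
  const parts = sudo ? ["sudo", command, ...args] : [command, ...args];
  return parts.join(" ");
}

// Placeholder path; the real one is built from the volume's kName.
const target = "/sys/kernel/scst_tgt/devices/example-device/resync_size";

// Unquoted: the login shell on the remote side sees the ">" and opens the
// file itself, as the unprivileged user, so sudo never covers the write.
const unquoted = joinCommand("sh", ["-c", `echo 1 > ${target}`]);

// Quoted: the whole script is a single argument to sh -c, so the redirect
// is performed by the shell that sudo started as root.
const quoted = joinCommand("sh", ["-c", `"echo 1 > ${target}"`]);

console.log(unquoted);
console.log(quoted);
```

The two printed lines correspond to the second and third commands I tried in the TrueNAS shell above: only the quoted form lets the root shell do the redirect.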

@Routhinator
Author

I'm currently running off my patched version, registry.gitlab.com/routhio/docker/democratic-csi:v1.9.6. It is public, so if you want to see the changes you can pull it.

I can put up a PR, but I'm uncertain how to properly test the build while the objectivefs step blocks me from building.

@travisghansen
Member

You are entirely correct about running on the TN machine. Not sure what I was thinking. Is the code you have working with sudo?

@Routhinator
Author

Yes

@travisghansen
Member

I’ll get this incorporated shortly. Thanks for taking the time to sort it out! Good point about objectivefs too, I think I’ll make that more friendly to these kinds of scenarios.

@Routhinator
Author

Yeah if you can fix that mate then I can likely start throwing PRs your way when I hit stuff like this. I have enough JS and Node skills to be of use, and had to hydrate on the codebase a wee bit yesterday.

@rouke-broersma

rouke-broersma commented Apr 30, 2024

I have similar issues, but my errors are different. Is this the same issue, or should I create a new one?

Kubernetes event:

NodeExpandVolume.NodeExpandVolume failed for volume "pvc-3e0e9bd7-5c3a-418f-b6a4-6f008acccca7" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = {"code":1,"stdout":"Filesystem at /dev/sdl is mounted on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi.hdd/0ee0287604124898644c00f65443d390c8f67b62f865064d56ec33147efbf033/globalmount; on-line resizing required\nold_desc_blocks = 1, new_desc_blocks = 1\n","stderr":"resize2fs 1.47.0 (5-Feb-2023)\nresize2fs: Permission denied to resize filesystem\n","timeout":false}

Dmesg:

kern: warning: [2024-04-30T15:18:05.903900173Z]: EXT4-fs warning (device sdl): ext4_resize_begin:83: There are errors in the filesystem, so online resizing is not allowed

I am coincidentally also blocked on a postgres pod.

@Routhinator I have an error running your image: exec bin/democratic-csi: exec format error

@Routhinator
Author

That is a different error and not related.

And the exec error you get with my image suggests you are not running it on x86 infrastructure. I did not build arm64 images.

@rouke-broersma

> That is a different error and not related.
>
> And the exec error you get with my image suggests you are not running it on x86 infrastructure. I did not build arm64 images

Ah I see, it is indeed only my arm node that has the issue.

@travisghansen
Member

#295

That syntax was previously reported not to work. Need to figure out what the deal is here.

@travisghansen
Member

Oh, I see yours has quotes. Never mind :)

@travisghansen
Member

Should be fixed here: 38bee21

Give v1.9.1 a try.
