Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMPROVEMENT] RWX volume scheduling #7872

Closed
mazocode opened this issue Feb 6, 2024 · 10 comments
Closed

[IMPROVEMENT] RWX volume scheduling #7872

mazocode opened this issue Feb 6, 2024 · 10 comments
Assignees
Labels
area/volume-rwx Volume RWX related kind/improvement Request for improvement of existing function priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/backport Require backport. Only used when the specific versions to backport have not been definied. require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal require/manual-test-plan Require adding/updating manual test cases if they can't be automated
Milestone

Comments

@mazocode
Copy link

mazocode commented Feb 6, 2024

Is your improvement request related to a feature? Please describe (馃憤 if you like this request)

The RWX functionality is hardly usable for us because the share-manager does not honor the storage classes nodeSelector, topology, nor any other nodeSelector configured. This means that it gets scheduled on nodes that are not suitable for this kind of workload or do not match the requirements. In addition, placing it randomly might not be beneficial to the traffic flow.

Describe the solution you'd like

Add a nodeSelector derived from storage class topolgy configuration or nodeSelector tag.

Describe alternatives you've considered

So far I could not find a workaround.

Additional context

RWX storage class example

apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: driver.longhorn.io
allowVolumeExpansion: true
allowedTopologies:
- matchLabelExpressions:
  - key: storage-provisioning/longhorn
    values:
    - ""
metadata:
  name: rep2x-rwx-hdd-fra4-b1c
parameters:
  diskSelector: hdd
  nfsOptions: vers=4.1,noresvport,softerr,timeo=600,retrans=5
  nodeSelector: fra4-b1c
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

Neither topology nor nodeSelector tag gets honored when scheduling the share-manager pod.

@mazocode mazocode added kind/improvement Request for improvement of existing function require/backport Require backport. Only used when the specific versions to backport have not been definied. require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated labels Feb 6, 2024
Copy link

github-actions bot commented Mar 8, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Mar 8, 2024
Copy link

This issue was closed because it has been stalled for 5 days with no activity.

@mazocode
Copy link
Author

Can we please reopen this, so that I don't feel like I waste my time here?

@innobead innobead reopened this Mar 13, 2024
@innobead innobead removed the stale label Mar 13, 2024
@derekbit derekbit added this to the v1.7.0 milestone Apr 2, 2024
@derekbit derekbit added the area/volume-rwx Volume RWX related label Apr 2, 2024
@derekbit derekbit self-assigned this Apr 2, 2024
@derekbit derekbit added the require/lep Require adding/updating enhancement proposal label Apr 2, 2024
@derekbit
Copy link
Member

derekbit commented Apr 2, 2024

Just realized nodeSelector in StorageClass.Parameters is a bit confusing. The field is for selecting the candidates for replace placement rather than engine placement.

https://longhorn.io/docs/1.6.1/references/storage-class-parameters/

cc @shuo-wu @innobead

@derekbit
Copy link
Member

derekbit commented Apr 3, 2024

@mazocode Thanks for raising the issue. I've implemented the feature. Feel free to check the design #8294 and welcome any feedback.

@derekbit derekbit added require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated priority/0 Must be fixed in this release (managed by PO) labels Apr 3, 2024
@shuo-wu
Copy link
Contributor

shuo-wu commented Apr 3, 2024

Just realized nodeSelector in StorageClass.Parameters is a bit confusing. The field is for selecting the candidates for replace placement rather than engine placement.

For regular volumes, the engine scheduling is determined by the workload pods hence typically nodeSelector has no ambiguity. But for the RWX volume, YES it's confusing. I am fine with renaming it to replicaNodeSelector.

@longhorn-io-github-bot
Copy link

longhorn-io-github-bot commented Apr 6, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
  1. Configure the node selector, allowedTopologies or tolerations according to the doc https://github.com/longhorn/website/pull/891/files
  2. Check the attached node of the rwx volume is expected.
  • Does the PR include the explanation for the fix or the feature?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at rwx: add "Configuring Volume Locality for an RWX Volume"聽website#891

  • Which areas/issues this PR might have potential impacts on?
    Area: RWX volume, scheduling
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

#8294

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

longhorn/website#891

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@davidfrickert
Copy link

davidfrickert commented Apr 15, 2024

Also interested in this feature. I've had a look at the design/implementation - it depends on storage class parameters.
Generally fine but as storage classes are not mutable this means anyone that wants to use this feature and is already using longhorn needs to create new storage classes and potentially migrate data, which is pretty annoying.

Would it be possible to implement a fallback by checking PV annotations if the new storageclass params are not present?

@derekbit
Copy link
Member

Also interested in this feature. I've had a look at the design/implementation - it depends on storage class parameters. Generally fine but as storage classes are not mutable this means anyone that wants to use this feature and is already using longhorn needs to create new storage classes and potentially migrate data, which is pretty annoying.

Would it be possible to implement a fallback by checking PV annotations if the new storageclass params are not present?

@davidfrickert
checking PV annotations => Sounds good to me. Can you open a ticket for the feature request? Thank you.

@innobead innobead changed the title [BUG/IMPROVEMENT] Share manager ignores node selector and/or storage class topology [IMPROVEMENT] Share manager ignores node selector and/or storage class topology Apr 19, 2024
@derekbit derekbit changed the title [IMPROVEMENT] Share manager ignores node selector and/or storage class topology [IMPROVEMENT] RWX volume scheduling Apr 20, 2024
@chriscchien chriscchien self-assigned this May 29, 2024
@chriscchien
Copy link
Contributor

Verified pass on longhorn master(longhorn-manager ad7420) with test steps

Share manager followed storageclass parameter shareManagerNodeSelector, allowedTopologies and shareManagerTolerations to attach node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/volume-rwx Volume RWX related kind/improvement Request for improvement of existing function priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/backport Require backport. Only used when the specific versions to backport have not been definied. require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal require/manual-test-plan Require adding/updating manual test cases if they can't be automated
Projects
Development

No branches or pull requests

7 participants