totalResource calculation error in proportion plugin #3444

Open
bysph opened this issue Apr 26, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


bysph commented Apr 26, 2024

Issue Description:

I found that the proportion plugin has some issues:

  1. It does not consider node label isolation: for example, impala nodes that are unavailable to the spark queue are still factored into the available resources calculated for the flink queue. (I think the node-group plugin may also fail to consider this.)
  2. It does not filter out cordoned nodes.
  3. It does not consider nodeSelectors: some nodes that are not managed by Volcano are also included in the calculation.
  4. It does not consider multi-scheduler scenarios: for example, "deserved" is calculated from the cluster's overall allocatable resources, but in a multi-scheduler setting the resources actually available to Volcano are smaller. Each queue's "deserved" is therefore larger than it should be, which renders the constraints ineffective. (This scenario might be acceptable, as handling multiple schedulers can indeed be challenging.)

These issues may inflate the deserved value of every queue, potentially rendering the guarantee ineffective.
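
To make the inflation concrete, here is a minimal sketch in Go. It is not Volcano's code: the `deservedShares` helper, the queue weights, and the CPU figures are all hypothetical, and the plugin's real calculation is reduced here to a plain weight-proportional split.

```go
package main

import "fmt"

// deservedShares splits total (in millicores) across queues in proportion to
// their weights. Hypothetical helper: a simplification, not the plugin's algorithm.
func deservedShares(total int64, weights map[string]int64) map[string]int64 {
	var sum int64
	for _, w := range weights {
		sum += w
	}
	shares := make(map[string]int64, len(weights))
	if sum == 0 {
		return shares
	}
	for name, w := range weights {
		shares[name] = total * w / sum
	}
	return shares
}

func main() {
	weights := map[string]int64{"spark": 1, "flink": 1}

	// Assumed figures: 100 CPUs in the cluster, of which 40 sit on nodes the
	// scheduler cannot use (cordoned, isolated by label, or owned by another scheduler).
	clusterTotal := int64(100_000)
	usableTotal := int64(60_000)

	inflated := deservedShares(clusterTotal, weights)
	actual := deservedShares(usableTotal, weights)

	for _, q := range []string{"spark", "flink"} {
		fmt.Printf("%s: deserved from cluster total = %dm, from usable nodes = %dm\n",
			q, inflated[q], actual[q])
	}
}
```

With these assumed numbers, each queue is told it deserves 50 CPUs while only 60 CPUs are usable in total, so the deserved values add up to more than the scheduler could ever place.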

Description of case 1 (node label isolation scenarios):
image

Expected Behavior:

The total resources available to each queue should be calculated per queue, rather than uniformly from the total resources of the entire cluster.

Steps to Reproduce:

For example, for case 2:

  1. Set a nodeSelector for nodes in the cluster.
  2. Use the proportion plugin to manage resources.
  3. Attach a debugger (for example, VS Code debug mode) and inspect the value of ssn.TotalResource or the list of ssn.Nodes.

Additional Information:

Volcano Version: v1.8.2
Kubernetes Version: v1.19.3

Proposed Solution:

The total resources of the queue need to be recalculated as follows:

  1. Sum up the allocatable resources of all nodes available to the queue (according to node labels and nodegroup).
  2. Deduct the resources of cordoned nodes.

And the "deserved" value also needs to be recalculated:

  1. Accumulate the weights of queues in the same nodegroup, then calculate proportionally.
    .....
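
The steps listed so far could be sketched roughly as follows. This is only an illustration under stated assumptions, not a patch against Volcano: the `Node` and `Queue` types, the single `Group` field standing in for node labels and nodegroup affinity, and the helper names `groupTotals` and `deservedByGroup` are all hypothetical, and only CPU is tracked.

```go
package main

import "fmt"

// Node and Queue are hypothetical, simplified stand-ins for the scheduler's types.
type Node struct {
	Name           string
	Group          string // value of an assumed node-group label
	Unschedulable  bool   // true when the node is cordoned
	AllocatableCPU int64  // millicores
}

type Queue struct {
	Name   string
	Group  string // node group the queue is allowed to use (assumption)
	Weight int64
}

// groupTotals sums allocatable CPU per node group, skipping cordoned nodes.
func groupTotals(nodes []Node) map[string]int64 {
	totals := make(map[string]int64)
	for _, n := range nodes {
		if n.Unschedulable {
			continue
		}
		totals[n.Group] += n.AllocatableCPU
	}
	return totals
}

// deservedByGroup accumulates queue weights per node group, then splits each
// group's total among its queues in proportion to weight.
func deservedByGroup(nodes []Node, queues []Queue) map[string]int64 {
	totals := groupTotals(nodes)
	weightSum := make(map[string]int64)
	for _, q := range queues {
		weightSum[q.Group] += q.Weight
	}
	deserved := make(map[string]int64, len(queues))
	for _, q := range queues {
		if weightSum[q.Group] == 0 {
			deserved[q.Name] = 0
			continue
		}
		deserved[q.Name] = totals[q.Group] * q.Weight / weightSum[q.Group]
	}
	return deserved
}

func main() {
	nodes := []Node{
		{Name: "n1", Group: "batch", AllocatableCPU: 32000},
		{Name: "n2", Group: "batch", AllocatableCPU: 32000, Unschedulable: true},
		{Name: "n3", Group: "impala", AllocatableCPU: 64000},
	}
	queues := []Queue{
		{Name: "spark", Group: "batch", Weight: 3},
		{Name: "flink", Group: "batch", Weight: 1},
	}
	for name, cpu := range deservedByGroup(nodes, queues) {
		fmt.Printf("%s deserved: %dm\n", name, cpu)
	}
}
```

In this sketch the cordoned node n2 and the impala node n3 contribute nothing to the spark and flink queues, so their deserved values are derived from n1 alone. A real implementation would also have to handle queues that may use several node groups, which this sketch does not attempt.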
bysph added the kind/bug label Apr 26, 2024

bysph commented Apr 28, 2024

I found that the proportion plugin has some issues:

  1. It does not consider multi-scheduler scenarios: for example, "deserved" is calculated from the cluster's overall allocatable resources, but in a multi-scheduler setting the resources actually available to Volcano are smaller. Each queue's "deserved" is therefore larger than it should be, which renders the constraints ineffective.
  2. It also fails to consider node label isolation: for example, impala nodes that are unavailable to the flink queue are still factored into the available resources calculated for the flink queue.
  3. Additionally, it does not filter out cordoned nodes, which are marked unschedulable and should therefore be excluded from the calculation of a queue's available resources.

@Monokaix @lowang-bh Hi, could you help confirm this issue? Thanks.

bysph changed the title from "totalResource calculation error when setting nodeSelector" to "totalResource calculation error in proportion plugin" Apr 28, 2024
Monokaix (Member) commented

>   1. It does not consider nodeSelectors: some nodes that are not managed by Volcano are also included in the calculation.

Do you mean when a node-selector is specified?


bysph commented Apr 29, 2024

>   1. It does not consider nodeSelectors: some nodes that are not managed by Volcano are also included in the calculation.
>
> Do you mean when a node-selector is specified?

Yes. I found that AddPod adds all nodes to the cache, although AddNode filters them out.
image


bysph commented Apr 29, 2024

>   1. It does not consider nodeSelectors: some nodes that are not managed by Volcano are also included in the calculation.
>
> Do you mean when a node-selector is specified?
>
> Yes. I found that AddPod adds all nodes to the cache, although AddNode filters them out. image

Sorry, my response above is incorrect; the nodeSelector issue may be caused by the CSINode informer.
image
