[RFE] Implement pod prioritization #977

Open
bdunne opened this issue Jun 26, 2023 · 4 comments · May be fixed by #978
Member

bdunne commented Jun 26, 2023

We have issues with pods being killed and rescheduled in busier environments. Unfortunately, postgres is just as likely to be killed as any worker pod. After a discussion with @Fryguy and @jrafanie, we think the design should be as follows:

  • Add RBAC permissions for the operator to read, list, and write PriorityClasses
  • Add 3 items to the CRD for high, medium, and low priorityClassName values
  • Assign class name values as follows:
    • If all values are specified in the CR, use them
    • If no values are set, detect the cluster default, then set low = cluster default, medium = low + 100, high = medium + 100
  • Validate that values are reasonable:
    • High should not be more than 1,000,000,000 (use CRD JSON schema validation)
    • Error if high, medium, and low are out of order (code validation)
    • Warn if low is less than the cluster default? Warn if low is less than 0? (code validation)
  • Assign pod priorities:
    • High: postgres, memcached, kafka, httpd
    • Medium: UI & API, orchestrator, and possibly the operators (may not work if the class names don't exist yet)
    • Low: all other workers
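
The assignment and validation rules above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the operator's actual code: `PrioritySpec`, `Priorities`, `ResolvePriorities`, `TierFor`, and the lowercase component names are hypothetical, the three CR items are treated as numeric priority values, a partially filled CR is rejected (the proposal does not cover that case), and equal neighbouring values are tolerated. The two warnings correspond to the open questions in the list.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// maxPriority is the cap named in the proposal (1,000,000,000).
	maxPriority int32 = 1_000_000_000
	// tierStep is the gap used when deriving medium and high from low.
	tierStep int32 = 100
)

// PrioritySpec is a hypothetical stand-in for the three optional CR items.
// A nil pointer means the value was not set.
type PrioritySpec struct {
	Low, Medium, High *int32
}

// Priorities holds the resolved value for each tier.
type Priorities struct {
	Low, Medium, High int32
}

// ResolvePriorities returns the tier values plus any non-fatal warnings.
func ResolvePriorities(spec PrioritySpec, clusterDefault int32) (Priorities, []string, error) {
	var p Priorities
	switch {
	case spec.Low != nil && spec.Medium != nil && spec.High != nil:
		// All values specified in the CR: use them as given.
		p = Priorities{Low: *spec.Low, Medium: *spec.Medium, High: *spec.High}
	case spec.Low == nil && spec.Medium == nil && spec.High == nil:
		// Nothing specified: derive all three tiers from the cluster default.
		p = Priorities{
			Low:    clusterDefault,
			Medium: clusterDefault + tierStep,
			High:   clusterDefault + 2*tierStep,
		}
	default:
		// Assumption: a partially filled CR is treated as an error.
		return p, nil, errors.New("set all three priority values or none")
	}

	if p.High > maxPriority {
		return p, nil, fmt.Errorf("high priority %d exceeds maximum %d", p.High, maxPriority)
	}
	// Assumption: only inversions fail; equal neighbouring values pass.
	if p.Low > p.Medium || p.Medium > p.High {
		return p, nil, fmt.Errorf("priorities out of order: low=%d medium=%d high=%d", p.Low, p.Medium, p.High)
	}

	var warnings []string
	if p.Low < clusterDefault {
		warnings = append(warnings, fmt.Sprintf("low priority %d is below the cluster default %d", p.Low, clusterDefault))
	}
	if p.Low < 0 {
		warnings = append(warnings, fmt.Sprintf("low priority %d is negative", p.Low))
	}
	return p, warnings, nil
}

// TierFor maps a component name to its tier. The component names here are
// illustrative; anything not listed falls into the low tier.
func TierFor(component string) string {
	switch component {
	case "postgres", "memcached", "kafka", "httpd":
		return "high"
	case "ui", "api", "orchestrator":
		return "medium"
	default:
		return "low"
	}
}
```

With an empty `PrioritySpec` and a cluster default of 0, this resolves to low = 0, medium = 100, high = 200, matching the derivation rule in the list.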
bdunne self-assigned this Jun 26, 2023
Member Author

bdunne commented Jun 26, 2023

@Fryguy @jrafanie throw 🍅 🍅

Fryguy added this to the Quinteros milestone Jun 26, 2023
Fryguy added this to In progress in Roadmap Jun 26, 2023
Member

Fryguy commented Jun 26, 2023

> High should not be more than 1,000,000,000 (use CRD JSON schema validation)

Good call. This keeps us at or below the OpenShift defaults for critical values:

$ oc get priorityclasses
NAME                      VALUE        GLOBAL-DEFAULT   AGE
openshift-user-critical   1000000000   false            89d
system-cluster-critical   2000000000   false            89d
system-node-critical      2000001000   false            89d
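
If the CRD schema is generated from Go types with controller-gen, the 1,000,000,000 cap could be expressed as a kubebuilder validation marker. The fragment below is a sketch under that assumption: the `PriorityFields` type, its field names, and the JSON keys are hypothetical, and the markers only take effect once the CRD manifests are regenerated.

```go
package main

// PriorityFields is a hypothetical fragment of the CR spec holding the three
// optional priority values. Pointers distinguish "not set" from zero.
type PriorityFields struct {
	// HighPriority is capped so it never exceeds openshift-user-critical.
	// +kubebuilder:validation:Maximum=1000000000
	// +optional
	HighPriority *int32 `json:"highPriority,omitempty"`

	// MediumPriority is checked against the other tiers in code, not in the schema.
	// +optional
	MediumPriority *int32 `json:"mediumPriority,omitempty"`

	// LowPriority falls back to the cluster default when unset.
	// +optional
	LowPriority *int32 `json:"lowPriority,omitempty"`
}
```

Schema validation can only bound a single field, which is why the ordering check between high, medium, and low is left to code validation.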

bdunne linked a pull request Jun 26, 2023 that will close this issue
miq-bot added the stale label Oct 2, 2023
Member

miq-bot commented Oct 2, 2023

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

Member

miq-bot commented Jan 8, 2024

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Fryguy removed this from the Quinteros milestone Mar 8, 2024
Fryguy moved this from In progress to To do in Roadmap Mar 8, 2024