
KEP #77 Dynamically Sized Jobs #1851

Merged
merged 10 commits into from Apr 3, 2024

Conversation

vicentefb
Contributor

@vicentefb vicentefb commented Mar 15, 2024

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

KEP for Dynamically Sized Jobs

Which issue(s) this PR fixes:

Fixes #77

Special notes for your reviewer:

A WIP for Phase 1 can be found here: #1852

Does this PR introduce a user-facing change?


@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 15, 2024
@k8s-ci-robot
Contributor

Hi @vicentefb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 15, 2024

netlify bot commented Mar 15, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit d8f5ef0
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/660d99aae146b5000837ecce
😎 Deploy Preview https://deploy-preview-1851--kubernetes-sigs-kueue.netlify.app

@tenzen-y
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 15, 2024
@vicentefb
Contributor Author


## Phases for MVP (alpha)

### Phase 1 - Scale Down

WIP implementation for Phase 1: #1852

@alculquicondor
Contributor

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 15, 2024
@alculquicondor
Contributor

@astefanutti, is this something your team is still interested in?

## Design Details

### Workload Slices
To support horizontal scaling of jobs, we will introduce the concept of a "Workload Slice". A Workload Slice is a Workload object with an owner reference to the original Workload for a job. Workload Slices represent per-replica changes to a job that were not initially accounted for when the job was created.
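
As a rough illustration of the definition above, the following Python sketch models a Workload Slice as a Workload that carries an owner reference to the original Workload. It is a minimal sketch only: the class names, field names, and the `-slice-N` naming scheme are assumptions made for this example and are not taken from the KEP or from the Kueue API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OwnerReference:
    # Simplified stand-in for a Kubernetes ownerReference (kind and name only).
    kind: str
    name: str


@dataclass
class Workload:
    # Hypothetical, heavily reduced model of a Workload object.
    name: str
    pod_count: int
    owner_references: List[OwnerReference] = field(default_factory=list)


def make_slice(original: Workload, index: int, pod_delta: int) -> Workload:
    """Build a slice covering `pod_delta` replicas not accounted for at creation."""
    return Workload(
        name=f"{original.name}-slice-{index}",  # naming scheme is an assumption
        pod_count=pod_delta,
        owner_references=[OwnerReference(kind="Workload", name=original.name)],
    )


def is_slice_of(candidate: Workload, original: Workload) -> bool:
    """True if `candidate` has an owner reference pointing at `original`."""
    return any(
        ref.kind == "Workload" and ref.name == original.name
        for ref in candidate.owner_references
    )


original = Workload(name="raycluster-demo", pod_count=3)
slice_1 = make_slice(original, index=1, pod_delta=2)
print(slice_1.name, is_slice_of(slice_1, original))
```

The owner reference is what ties the slice back to the job's original Workload; the original itself carries no such reference in this model.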
Contributor

Questions to confirm / update my understanding:

  1. Do the workload slices need to be submitted to the same LocalQueue, or just the same ClusterQueue is enough?
  2. Do the workload slices need to use identical PodTemplates?

If the two need to be true, what if there is a workload slice that violates the constraints?

Also:

  3. Are the workload slices admitted by the Kueue scheduler to the same resource flavors? (I suppose this is required so that we can aggregate them.)

I think notes clarifying the above questions would be helpful.

Member

> Do the workload slices need to be submitted to the same LocalQueue, or just the same ClusterQueue is enough?

Because the LocalQueue is referenced by the top-level RayCluster, every slice will be submitted to the same LocalQueue.

> Do the workload slices need to use identical PodTemplates?

Yes, this is already enforced by KubeRay.

> If the two need to be true, what if there is a workload slice that violates the constraints?

From our initial prototyping so far, I think it will be hard to break these constraints. Both the LocalQueue and the Pod template are referenced at the top level of the RayCluster API.

> 3. Are the workload slices admitted by the Kueue scheduler to the same resource flavors? (I suppose this is required so that we can aggregate them.)

Yes, I think from discussions with Aldo we agreed that all workload slices need to belong to the same resource flavor.
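
The three constraints discussed in this reply (same LocalQueue, identical Pod template, same resource flavor) could be checked along the lines of the sketch below. This is only an illustration under stated assumptions: `SliceSpec`, its fields, and `find_mismatches` are hypothetical names, the Pod template is reduced to an opaque hash string, and the thread does not say how Kueue would react to a mismatch, so the function only reports which constraints a candidate slice breaks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SliceSpec:
    # All fields are hypothetical stand-ins, not Kueue API fields.
    name: str
    local_queue: str
    pod_template_hash: str
    resource_flavor: str


# The three properties the thread says every slice must share with the original.
SHARED_FIELDS = ("local_queue", "pod_template_hash", "resource_flavor")


def find_mismatches(original: SliceSpec, candidate: SliceSpec) -> List[str]:
    """Return the names of shared fields on which `candidate` differs from `original`."""
    return [f for f in SHARED_FIELDS if getattr(original, f) != getattr(candidate, f)]


original = SliceSpec("ray-demo", "team-a-queue", "tmpl-abc", "on-demand")
generated = SliceSpec("ray-demo-slice-1", "team-a-queue", "tmpl-abc", "on-demand")
hand_made = SliceSpec("ray-demo-slice-2", "team-b-queue", "tmpl-abc", "spot")

print(find_mismatches(original, generated))  # []
print(find_mismatches(original, hand_made))  # ['local_queue', 'resource_flavor']
```

A slice derived from the top-level RayCluster would pass such a check by construction; the check mainly matters for slices created by hand.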

Contributor

> Yes, this is already enforced by KubeRay

Are the Workload slices going to be created by KubeRay?

> From our initial prototyping so far, I think it will be hard to break these constraints. Both the LocalQueue and Pod template are referenced top-level in the RayCluster API.

The constraints probably can be violated if a user creates the Workload slices by hand. Might be worth a note how we handle a mismatch.

Member

> Are the Workload slices going to be created by KubeRay?

No, I meant that KubeRay enforces identical Pod templates across replicas, which translates to identical Pod templates across Workload Slices.

> The constraints probably can be violated if a user creates the Workload slices by hand. Might be worth a note how we handle a mismatch.

Good point, we should document this in the KEP.

Member

> > The constraints probably can be violated if a user creates the Workload slices by hand. Might be worth a note how we handle a mismatch.
>
> Good point, we should document this in the KEP

I'm especially interested in the behavior with the prebuilt workload feature.

Contributor

prebuilt workload feature for context #1575

Contributor

@alculquicondor alculquicondor left a comment

I don't have any further questions.

I'll leave the lgtm to @tenzen-y

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2024
@vicentefb vicentefb requested a review from tenzen-y April 2, 2024 21:36
Member

@tenzen-y tenzen-y left a comment

Otherwise lgtm.

I left a comment to clarify MultiKueue feature.

Member

@tenzen-y tenzen-y left a comment

Thank you!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 3, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0b6b913a553f2022fc45dc516e7044467714352c

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, tenzen-y, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit e63709b into kubernetes-sigs:main Apr 3, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Apr 3, 2024
vsoch pushed a commit to researchapps/kueue that referenced this pull request Apr 18, 2024
* added kep

* kep updated

applied toc

* updated kep

* toc updated

* added info in unit tests and integration tests section

* added details about workload slices

* rephrase scale down section

* updated and added details on slices, generalized design details and typos

* update

* added details about mutikueue and removed users from approvers
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

Support dynamically sized (elastic) jobs
9 participants