Discussion for Adding a Batch Scheduling Sub-Landscape #3761

Open
stackedsax opened this issue Feb 23, 2024 · 6 comments

Comments

@stackedsax

As part of the CNCF Batch Working Group (part of the TAG Runtime), we'd like to discuss adding a sub-landscape focused on Batch Scheduling, similar to the wasm sub-landscape.

Example Draft

To illustrate what we were hoping to do, we worked up an example Batch Scheduling landscape here:

Please note that this is merely a rough draft of what a Batch Scheduling landscape could look like. We anticipate more projects will be added as we socialize this landscape throughout the community.

If this discussion would be better in a PR, we'd be happy to submit the changes that would be necessary and we can have the discussion there.

Rationale

The conversation around Batch Schedulers in the context of cloud and Kubernetes has been a complicated one over the last couple of years. As AI/ML continues to dominate discussions, the desire for solutions in this space has amplified. However, we find that people who want to solve this particular challenge often don't know where to start and don't know that there are existing options available.

As a result, companies often create their own bespoke solutions. At just about every KubeCon, another company announces that it is planning to open-source its new Batch Scheduler, often with properties extremely similar to those of the existing solutions. We'd much prefer to guide people to join forces on the existing solutions, ideally contributing to the ongoing conversations in the Kubernetes Batch Working Group (a sister working group to the CNCF group, working on k8s-specific issues) around Kueue and around improving the core of Kubernetes to be more Batch Scheduling-friendly.

We think adding a landscape for Batch Scheduling could help bring awareness to the community that potential solutions already exist and that they have a place to start from.

We don't intend for the landscape to answer every question people have about Batch Scheduling on Kubernetes. Much like the vast CNCF landscape itself, it will be a starting point for people to work from and do their own diligence on what will work for them.

We don't relish bringing more complexity to an already overwhelming array of options on the existing landscape (and we really appreciate the improvements and simplifications in the recent update). However, there did not seem to be any meaningful way of describing the current landscape of Batch Schedulers within the context of the larger landscape. We are open to ideas, of course, which is why we're reaching out for discussion.

@caniszczyk
Contributor

caniszczyk commented Feb 23, 2024 via email

@stackedsax
Author

We tried to make that approach work at first and it really didn't fit. The problem is that there are a bunch of batch schedulers that need to be mentioned (Slurm/SUNK, LSF, PBS, etc.) that don't really belong in the larger cloud landscape. Yet, in terms of batch scheduling, we'd like to acknowledge that there are ways of using these more traditional batch schedulers in the context of k8s.

@caniszczyk
Contributor

caniszczyk commented Feb 23, 2024 via email

@k82cn

k82cn commented Feb 24, 2024

> honestly I don't mind listing SLURM and some of that in the larger scape
> but that's just me.

IMO, it's better to identify the projects that are batch systems for HPC & AI, as some projects have dependencies on others, e.g. Volcano/Kueue vs. k8s.

We're trying to propose different solutions in the ecosystem with those projects.

> I need to figure out how to come up with some rules about sub landscapes
> and how to ensure we don't have TOO MANY of them.

Do we have something like a label/tag to filter projects? That may make it easier.

@ArangoGutierrez
Contributor

> We tried to make that approach work at first and it really didn't fit.

I am curious about where and why it didn't fit.

> I need to figure out how to come up with some rules about sub landscapes
> and how to ensure we don't have TOO MANY of them.

I agree, once you open the gates, there's no going back.

Projects like Slurm do belong to orchestration + management/Scheduling & Orchestration, but Volcano/Kueue are plugins to K8s and they don't modify the scheduler functionality. So that does open the question of a subcategory in orchestration + management for these types of applications. There is an entire ecosystem being built ON TOP OF Kubernetes to enable AI/ML workloads (let's leave the HPC word out of this). So it makes me think in favor of an orchestration + management/batch + workload engines category.

my 2 cents...

@resker

resker commented Apr 10, 2024

+1 to melding this (and most other notions of "sub-landscape") into the larger one, albeit w/ an appropriate tag/label and the ability to depict it through a lens (e.g. batch) based on a query filter. Even within the context of "batch" there are a number of things that IMO ought to show up that are substantial but don't wholly live within the category (e.g. multi-cluster scheduling, gang / co-scheduling, feature discovery, DRA, et cetera are all relevant but certainly not confined to batch).
