Heads up: Kubernetes script adapter! #434

Open
vsoch opened this issue Jan 29, 2024 · 4 comments

Comments

@vsoch

vsoch commented Jan 29, 2024

hey maestro maintainers! 👋

Heads up that we wrote a script adapter (following maestro's design) for Kubernetes - it creates a batch job and an associated config map with an entrypoint to run the step as a Kubernetes Job. Instead of using the factory (which would require the code to live here), I simply instantiate it directly while we are developing. It's currently working pretty nicely for me locally (though it has only existed for a few hours, so likely more work TBA), but I wanted to give you a heads up in case there are any development details I should know about (e.g., next release date, a change in design of something, etc.). I also added an OCI registry handle to our mini-mummi component that knows how to handle pushing/pulling artifacts shared between steps, and likely we want to imagine something like that for here too (I haven't thought about it yet). I'm opening this issue for any discussion we might have, and for when the PR is opened, to track potential features or needs.
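
For a rough idea of the pattern, here is a simplified sketch using the kubernetes Python client - the function name, mount path, and file name are illustrative assumptions, not the actual adapter code:

```python
# Simplified sketch of the Job + ConfigMap pattern described above.
# Everything named here is illustrative, not the adapter's real code.
from kubernetes import client, config


def submit_step(name, namespace, image, script):
    config.load_kube_config()
    core = client.CoreV1Api()
    batch = client.BatchV1Api()

    # The rendered step script becomes a ConfigMap (the "entrypoint").
    core.create_namespaced_config_map(
        namespace,
        client.V1ConfigMap(
            metadata=client.V1ObjectMeta(name=f"{name}-entrypoint"),
            data={"entrypoint.sh": script},
        ),
    )

    # A batch/v1 Job mounts that ConfigMap read-only and executes the script.
    container = client.V1Container(
        name=name,
        image=image,
        command=["/bin/bash", "/workflow/entrypoint.sh"],
        volume_mounts=[
            client.V1VolumeMount(name="entrypoint", mount_path="/workflow", read_only=True)
        ],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[container],
                    volumes=[
                        client.V1Volume(
                            name="entrypoint",
                            config_map=client.V1ConfigMapVolumeSource(name=f"{name}-entrypoint"),
                        )
                    ],
                )
            ),
        ),
    )
    batch.create_namespaced_job(namespace, job)
```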

We are prototyping it for a mini-mummi (and eventually more scheduling experiments I suspect) and I'd like to submit a PR when we are further along with that.

Ping @tpatki and @milroy to stay in the loop.

@jwhite242
Collaborator

jwhite242 commented Jan 29, 2024

Well, this sounds like an interesting development! So, I do have a few comments/questions on future tasks that may impact this, and more general questions on what you might need for this script adapter. And apologies in advance for any dumb questions about containers in general as I've not had much experience with them yet.

  • For the resource specification, is there anything special you need for this, i.e. things like procs/nodes/cores per task/etc that the scheduler adapters all support? We're looking at refactoring this to be more unified/comprehensive across the adapters, so I'd be very interested in knowing what/if anything else might be needed here?
  • What, if any, are the implications for the shell used in kube launched steps? Are they still nominally written in bash, or something else?
  • Do you think there's a need to move to more of a per step script adapter, enabling mixing of the kube adapter alongside local and the other hpc adapters in a single study?
  • What about nested adapters? So one case I can maybe see applying here that we're already looking to refactor is decoupling the parallelize command from the step script adapters -> i.e. enable slurm batch script with flux $(LAUNCHER) in a single step, or other such mixing and matching. And related, what about the $(LAUNCHER) token for this new adapter?

@vsoch
Author

vsoch commented Jan 30, 2024

I'm running out the door for a quick run, but some quick answers (and can follow up with more detail where needed)

> For the resource specification, is there anything special you need for this, i.e. things like procs/nodes/cores per task/etc that the scheduler adapters all support? We're looking at refactoring this to be more unified/comprehensive across the adapters, so I'd be very interested in knowing what/if anything else might be needed here?

For scheduling to kubernetes what we are generally taking are cores, tasks, gpu, the basic high level stuff (for now). We can also support memory (which flux cannot). This goes into a resource request for the kubelet. So for now, just scope it to the same stuff that flux might use.
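
Roughly, the mapping is just something like this (a sketch only - the key names are assumptions, not the adapter's actual options):

```python
# Sketch of mapping a step's high-level resources onto a kubelet resource
# request for the Job's container; the key names are made up for illustration.
from kubernetes import client


def to_resource_requirements(step):
    requests = {}
    if step.get("cores"):
        requests["cpu"] = str(step["cores"])
    if step.get("memory"):
        requests["memory"] = step["memory"]  # e.g. "4Gi"
    if step.get("gpus"):
        requests["nvidia.com/gpu"] = str(step["gpus"])
    # Attached to the container via V1Container(resources=...)
    return client.V1ResourceRequirements(requests=requests, limits=requests)
```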

> What, if any, are the implications for the shell used in kube launched steps? Are they still nominally written in bash, or something else?

It comes down to the shell provided for the container - e.g., if there is no bash in the container, the entrypoint would need to use /bin/sh. As for the script contents, we write them on the fly (just a bash script) so it can be whatever. Right now I'm writing a config map entrypoint that gets mounted read only and executed with /bin/bash and that could be a variable for the user to tweak (e.g., sh or python).
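
As a tiny sketch of what "the interpreter as a variable" could look like (option names made up):

```python
# The mounted script stays the same; only the command used to run it changes.
def entrypoint_command(interpreter="/bin/bash", script="/workflow/entrypoint.sh"):
    # e.g. "/bin/sh" for images without bash, or "python3" for a python entrypoint
    return [interpreter, script]
```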

> Do you think there's a need to move to more of a per step script adapter, enabling mixing of the kube adapter alongside local and the other hpc adapters in a single study?

I'm assuming one script == one kubernetes abstraction. That could be a single node / pod, or it could be an entire flux operator cluster. I might ask for more detail on this because I'm not sure about what a "step" means here.

> What about nested adapters? So one case I can maybe see applying here that we're already looking to refactor is decoupling the parallelize command from the step script adapters -> i.e. enable slurm batch script with flux $(LAUNCHER) in a single step, or other such mixing and matching. And related, what about the $(LAUNCHER) token for this new adapter?

If you mean to have the inner logic of the script further run maestro, I'd stay away from that design for now. Snakemake does something similar and it works pretty well for controlled environments, but for kubernetes I've found it adds a ton of complexity because (for the most part) I don't want to build a container with my application and snakemake (or maestro).

OK - I'm off! 🏃

@jwhite242
Collaborator

> I'm running out the door for a quick run, but some quick answers (and can follow up with more detail where needed)

>> For the resource specification, is there anything special you need for this, i.e. things like procs/nodes/cores per task/etc that the scheduler adapters all support? We're looking at refactoring this to be more unified/comprehensive across the adapters, so I'd be very interested in knowing what/if anything else might be needed here?

> For scheduling to kubernetes what we are generally taking are cores, tasks, gpu, the basic high level stuff (for now). We can also support memory (which flux cannot). This goes into a resource request for the kubelet. So for now, just scope it to the same stuff that flux might use.

Sounds nice and easy there!

>> What, if any, are the implications for the shell used in kube launched steps? Are they still nominally written in bash, or something else?

> It comes down to the shell provided for the container - e.g., if there is no bash in the container, the entrypoint would need to use /bin/sh. As for the script contents, we write them on the fly (just a bash script) so it can be whatever. Right now I'm writing a config map entrypoint that gets mounted read only and executed with /bin/bash and that could be a variable for the user to tweak (e.g., sh or python).

>> Do you think there's a need to move to more of a per step script adapter, enabling mixing of the kube adapter alongside local and the other hpc adapters in a single study?

> I'm assuming one script == one kubernetes abstraction. That could be a single node / pod, or it could be an entire flux operator cluster. I might ask for more detail on this because I'm not sure about what a "step" means here.

So, bundling these two as they're closely related. First, just to clarify and make sure we're talking about the same thing, I was really asking about what the 'cmd' block in the maestro spec looks like in this mode. Currently they're all bash, so question was aimed really at does the step definition in the maestro spec change appreciably for this mode, or is the only real difference being that the step's cmd/script is just executing in a container vs by an hpc scheduler? And then the question becomes where's that container defined, and how, as defining containers on a per step basis seems like a requirement for this mode vs the one scheduler per study mode it has now.

>> What about nested adapters? So one case I can maybe see applying here that we're already looking to refactor is decoupling the parallelize command from the step script adapters -> i.e. enable slurm batch script with flux $(LAUNCHER) in a single step, or other such mixing and matching. And related, what about the $(LAUNCHER) token for this new adapter?

> If you mean to have the inner logic of the script further run maestro, I'd stay away from that design for now. Snakemake does something similar and it works pretty well for controlled environments, but for kubernetes I've found it adds a ton of complexity because (for the most part) I don't want to build a container with my application and snakemake (or maestro).

> OK - I'm off! 🏃

Not quite; this wouldn't be maestro executing itself inside a step (though I'm sure there'll be requests for something like that at some point). The $(LAUNCHER) token is very much like a jinja template -> maestro just drops the flux run ... or srun ... string here at step construction time, and then it's up to whatever's running the step's cmd script to do the work. So by nested schedulers, I really mean there's a difference between allocating resources and actually executing/doing work. For example, consider a container that uses openmpi with mpirun/mpiexec needed for parallel tasks -> this would be the nested idea, where the step launch mechanism would be requesting resources from kube, but inside that container the flux scheduler is executing the mpiexec strings dumped in there from $(LAUNCHER). Currently those two things are tightly coupled. We are planning on decoupling that, so I'd be interested in additional cases to handle when kube's in the loop on the resource acquisition phase (and a potential second layer of bootstrapping allocations or containers and then scheduling the study itself to said allocation or container).
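
To make the token behavior concrete, here's a toy illustration (the step text and run strings are made up, not maestro's actual internals):

```python
# $(LAUNCHER) gets swapped for a scheduler-specific run string at step
# construction time; everything below is invented for illustration.
step_cmd = "$(LAUNCHER) ./my_app --input data.in"

launchers = {
    "slurm": "srun -n 4",
    "flux": "flux run -n 4",
    "in-container-mpi": "mpiexec -n 4",  # what a kube-launched step might want
}

rendered = {name: step_cmd.replace("$(LAUNCHER)", run) for name, run in launchers.items()}
# rendered["flux"] == "flux run -n 4 ./my_app --input data.in"
```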

@vsoch
Author

vsoch commented Feb 7, 2024

> So, bundling these two as they're closely related. First, just to clarify and make sure we're talking about the same thing, I was really asking about what the 'cmd' block in the maestro spec looks like in this mode. Currently they're all bash, so question was aimed really at does the step definition in the maestro spec change appreciably for this mode, or is the only real difference being that the step's cmd/script is just executing in a container vs by an hpc scheduler?

It doesn't really change - it's still a lump of bash script stuff that just gets executed when the container starts. From what I've seen, there would usually be a flux submit or similar in that block, and you wouldn't have that here. Once that script is running, you are already in an isolated job (the batch/v1 Job that has one or more pods with container(s)).

> And then the question becomes where's that container defined, and how, as defining containers on a per step basis seems like a requirement for this mode vs the one scheduler per study mode it has now.

I'm defining the container on the level of the job step. You are correct that for different abstractions (e.g., JobSet) where there could be two containers deployed for one job, you'd have variation there.

> The $(LAUNCHER) token is very much like a jinja template -> maestro just drops the flux run ... or srun ...

We would not want that in a container. mpirun, yes, but not a flux submit, unless it's the flux operator, in which case that is handled for you.
