Allow --dependency on job name #5917

Open
vsoch opened this issue Apr 27, 2024 · 25 comments

@vsoch
Member

vsoch commented Apr 27, 2024

Right now I think I'm required to put a flux job id for a job dependency. I'd ideally like to be able to do this:

flux submit --job-name task-1 -N 1 bash -c "echo Starting task 1; sleep 3; echo Finishing task 1"
flux submit --job-name task-2 --dependency=task-1 -N 1 bash -c "echo Starting task 2; sleep 3; echo Finishing task 2"

It could be the case this is only allowed in environments where the user owns all their jobs (one level in an instance). Having to keep a lookup of ids for different tasks is possible, but hugely adds to the complexity. I'd like to be able to read in a jobspec (where a task to submit has a name) and not to depend on the jobid that is going to be output. Could something like this be possible?

@grondo
Contributor

grondo commented Apr 28, 2024

Could something like this be possible?

Yes, dependencies are now mostly implemented as jobtap plugins that register a callback of the form job.dependency.<scheme>. So you could write a dependency-name plugin that registers for job.dependency.name and handles adding job dependencies based on a job name.

I guess one thing to look out for is that job names are not guaranteed to be unique like jobids, so the implementation would have to decide how to handle that.

Edit: Oh, and the usage would be more like:

flux submit --job-name task-2 --dependency=name:task-1 -N 1 bash -c "echo Starting task 2; sleep 3; echo Finishing task 2"

to tell the dependency system you're using the name dependency scheme.
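
A minimal sketch of the name-resolution step such a scheme would need, written in plain Python so it runs without Flux. The `Job` record, the `resolve_name` helper, and the "most recently submitted match wins" policy are all assumptions for illustration, not flux-core behavior. It only shows one possible way to handle job names that are not unique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; not a flux-core type."""
    jobid: int
    name: str
    t_submit: float


def resolve_name(jobs, name, policy="latest"):
    """Map a job name to a single jobid.

    policy="latest": pick the most recently submitted match (assumed policy).
    policy="unique": raise if more than one job shares the name.
    """
    matches = [j for j in jobs if j.name == name]
    if not matches:
        raise LookupError(f"no job named {name!r}")
    if policy == "unique" and len(matches) > 1:
        raise LookupError(f"job name {name!r} is ambiguous ({len(matches)} matches)")
    return max(matches, key=lambda j: j.t_submit).jobid


jobs = [
    Job(jobid=101, name="task-1", t_submit=1.0),
    Job(jobid=102, name="task-1", t_submit=2.0),
    Job(jobid=103, name="task-2", t_submit=3.0),
]
print(resolve_name(jobs, "task-1"))  # 102
```

With `policy="unique"` the same lookup for `task-1` raises instead of guessing, which may be the safer choice when names are not centrally controlled.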

@vsoch
Member Author

vsoch commented Apr 28, 2024

I can try writing one! And I understand this point:

I guess one thing to look out for is that job names are not guaranteed to be unique like jobids, so the implementation would have to decide how to handle that.

At least how I'm setting things up, I'm controlling all the --job-name so if there is redundancy it is explicitly done.

And is the idea of "jobtap" like you are tapping a job on the shoulder? tap tap, are you my dependency?

@grondo
Contributor

grondo commented Apr 28, 2024

And is the idea of "jobtap" like you are tapping a job on the shoulder? tap tap, are you my dependency?

😆 Yeah, like you're "tapping" in to the job flow control in the job manager.

Also I should note that jobtap plugins are true plugins and can be developed without flux-core source, just add the <flux/jobtap.h> header. They can be dynamically loaded with flux jobtap load /path/to/plugin.so.
Read all about it here.

@vsoch
Member Author

vsoch commented Apr 29, 2024

@grondo if we put together a PR to flux-core, could it be considered? I took a look at the design this weekend, and it could be that we add a job-manager/plugins/dependency-name.c but also could work to add an entry for dependency.name to dependency-after.c

The reason I'm hoping it might be considered for core is that it would be an essential component of our JobSpec next generation library, which tightly controls the job names and (I think) should be able to set dependency relationships without needing to get back jobids to hand around to different tasks. I don't like any approach that requires that (because what if you need to write the logic before you have an ID)? And I don't like the idea that the user will need to configure a custom plugin otherwise the library depends_on won't work at all. What do you think? Would you at least consider a small PR if we can put one together?

@garlick
Member

garlick commented Apr 29, 2024

Just FYI, in case you were unaware, there is a design by Tom and Steven for "openmp style" dependencies described in RFC 26 that IIRC tackles the need to express a workflow DAG in advance. What we have now was intended just to reach parity with slurm.

Hackathon project here: https://github.com/flux-framework/flux-depend

I don't have a problem with dependencies based on job name being built in. I think @grondo just meant that you could probably get something working without waiting on us, since we are totally slammed with el cap problems right now. We can definitely consider a PR but we are experiencing a large call volume right now and you may need to listen to the slack hold #music channel for a few weeks.

Changing jobspec in flux-core is a whole other kettle of fish that AFAIK is not on the plan at the moment.

@vsoch
Member Author

vsoch commented Apr 30, 2024

We can definitely consider a PR but we are experiencing a large call volume right now and you may need to listen to the slack hold #music channel for a few weeks.

Very happy to wait! Likely I'd do a hackathon with the rest of the thrust team to work on it because I'm terrible at C++.

Changing jobspec in flux-core is a whole other kettle of fish that AFAIK is not on the plan at the moment.

Oh gosh, absolutely not - the "JobSpec the next generation" is a different jobspec, a more abstract one that maps into the existing flux jobspec. From the FAQ on the first page:

image

@garlick
Member

garlick commented Apr 30, 2024

Alright, but if you call that "jobspec", we're going to have an operator in flux core with a cute gopher mascot. It does the same thing as the flux operator except it also helps you make collect phone calls and can read off the time in a robotic female voice.

@grondo
Contributor

grondo commented Apr 30, 2024

flux operator - keeping your jobs on hold since 1967

@vsoch
Member Author

vsoch commented Apr 30, 2024

image

That's how I roll 😎

@trws
Member

trws commented May 7, 2024

Just to make sure I'm not giving @vsoch bad advice here, I think we can pull this together as a python-based jobtap plugin with cffi producing a regular plugin.so but internals in python. Would either of you have issues with that approach? No additional dependencies required since we already require python and cffi of a sufficient version for our command line interfaces. Then if it ever becomes a performance bottleneck or similar we can optimize, being the thought.

Here's the cffi feature I'm talking about for embedding python as a shared library: https://cffi.readthedocs.io/en/latest/embedding.html

@grondo
Contributor

grondo commented May 7, 2024

That would be a neat little demo

@trws
Member

trws commented May 7, 2024

Another thought here, I was reading through the dependency-after plugin and had a thought. That already implements all the functionality necessary for this, and most dependency resolution really, it just doesn't accept the new scheme. Is there a feasible way to add the scheme and translate it into an after: dependency? Maybe a frobnicator plugin or a jobtap plugin that just translates the name to a jobid then passes that over to after:?

@grondo
Contributor

grondo commented May 7, 2024

Well, why not just do that on your submit line?

$ flux run --dependency=after:$(flux jobs -ac1 --name=job-a -no {id}) -vvv hostname
jobid: fD1XCrosV
0.000s: job.submit {"userid":6885,"urgency":16,"flags":0,"version":1}
0.013s: job.dependency-add {"description":"after-start=fCC1iVf35"}
0.013s: job.dependency-remove {"description":"after-start=fCC1iVf35"}

BTW, one gotcha here is that after: means after-start - if you want to remove the dependency after the job finishes use afterany: or afterok: (for success only)
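
A small sketch of the same translation done in a script rather than on the submit line, in plain Python so it runs without Flux. The `to_after_dependency` helper and the `name_to_id` mapping are hypothetical (the mapping could be built from a `flux jobs` query like the one above). The default here is `afterany` rather than `after`, purely because of the after-start gotcha; that default is an assumption, not a Flux convention.

```python
# Conditions mentioned in this thread; "after" releases when the job *starts*.
CONDITIONS = ("after", "afterany", "afterok")


def to_after_dependency(name, name_to_id, condition="afterany"):
    """Rewrite a job-name dependency as '<condition>:<jobid>'.

    name_to_id is a hypothetical mapping of job name -> jobid string.
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unsupported condition {condition!r}")
    try:
        jobid = name_to_id[name]
    except KeyError:
        raise LookupError(f"no job named {name!r}") from None
    return f"{condition}:{jobid}"


name_to_id = {"job-a": "fCC1iVf35"}
print(to_after_dependency("job-a", name_to_id))             # afterany:fCC1iVf35
print(to_after_dependency("job-a", name_to_id, "afterok"))  # afterok:fCC1iVf35
```

The returned string is what would be passed as the value of `--dependency=`.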

@trws
Member

trws commented May 7, 2024

Oof, after: meaning after-start is a bit of a trip-hazard... I saw that in the code but didn't internalize it. That one's honestly rough enough I'd kinda be in favor of deprecating after and maybe make after be equivalent to afterany, but that's tangential.

@vsoch mentioned in slack that part of her reason is so that it can be expressed declaratively in a file, say as part of jobspec. That's my main use-case for the OpenMP-style dependencies that could probably be implemented in a similar way. In the end it's always possible to do that if you have a shell running the flux run command, but interactive submission is not the main use for either of these IMO (though some users will probably want it for that).

@grondo
Contributor

grondo commented May 7, 2024

Oof, after: meaning after-start is a bit of a trip-hazard...

Yeah, I agree. This syntax was borrowed directly from Slurm for ease of transition, so it was a darned-if-you-do, darned-if-you-don't situation.

@grondo
Contributor

grondo commented May 7, 2024

@vsoch mentioned in slack that part of her reason is so that it can be expressed declaratively in a file, say as part of jobspec.

Oh, that makes sense. Doing as a frobnicator plugin is compelling and an interesting little trick :-)

@trws
Member

trws commented May 7, 2024

Yeah, I think it would work for the more complex one too, so we might get the dependencies Stephen and I were discussing all that time ago.

@garlick
Member

garlick commented May 7, 2024

Might be good to review RFC 26 as part of designing this. That spec is a bit of a chimera - the openmp deps were defined earlier, but not implemented, and then the "simple dependencies" were added. Perhaps a proposal could start with a PR on that document, if we're pivoting on the declarative approach.

@trws
Member

trws commented May 7, 2024

Yeah, I agree. This syntax was borrowed directly from Slurm for ease of transition, so it was a darned-if-you-do, darned-if-you-don't situation.

Oh man, yeah that stings. At least it's consistent with what people are used to, thanks for the explanation.

@trws
Member

trws commented May 8, 2024

Might be good to review RFC 26 as part of designing this. That spec is a bit of a chimera - the openmp deps were defined earlier, but not implemented, and then the "simple dependencies" were added. Perhaps a proposal could start with a PR on that document, if we're pivoting on the declarative approach.

I was basically planning to go back and implement the OpenMP-style dependencies; this issue would be about being able to use the job name instead of a FLUID. Now that you say it, some thought on how that part fits makes sense. It's not immediately clear whether it should actually be a different scheme (thus needing to replicate the ok/failed/... stuff) or something else that indicates the value isn't a FLUID and needs to be translated, like afterok:name:<name>

@grondo
Contributor

grondo commented May 8, 2024

Yeah, it would be nice to have a different scheme with simpler interface.

BTW, there's also #5872. I've been pondering in the back of my mind how to expand the dependencies to allow this kind of logic. Would the OpenMP-style work for that? If so, it sounds like that would be a big win.

@trws
Member

trws commented May 8, 2024

Hmm, not directly. Did they have a use-case for that specific thing? There is something similar, that I thought was the same and actually had to rewrite this when I realized otherwise, called the inoutset and (the horrifically named) mutexinoutset dependence types. They allow jobs that are part of the set to execute in any order, concurrently without the mutex part, but the next out or inout waits for the whole set.

I don't think I've ever heard someone ask for a way to wait for at least two of the past three, or something like that, other than when what they really meant was "run no more than N at a time" as a kind of end-run to implement a semaphore. Is that what that request was or was there something else there?

Either way, I don't think we could express that in terms of what the existing after module provides. It probably wouldn't be too hard to do, but I'd want to know more about the use-case.

@grondo
Contributor

grondo commented May 8, 2024

I think it was something like this: after every Nth job their ensemble is potentially converged, so they want to release a dependent job after any N jobs finish, to combine the results of the N jobs that have finished so far, while still allowing the compute jobs to continue. The runtime of each job can vary by a factor of 4 or 5, which is why they don't know which N will finish first, nor is it feasible to submit N and wait for them all to finish, since resources would go idle.
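
To make the "release after any N finish" idea concrete, here is a rough sketch in plain Python. The `CountDependency` class is hypothetical and not part of flux-core; it only models the counting logic, not how a plugin would learn about job completions.

```python
class CountDependency:
    """Release a dependent job once `count` jobs from a group have finished.

    Hypothetical helper for illustration; not part of flux-core.
    """

    def __init__(self, count):
        if count < 1:
            raise ValueError("count must be >= 1")
        self.count = count
        self.finished = []      # jobids in completion order
        self.released = False

    def job_finished(self, jobid):
        """Record a completion; return True only on the call that releases."""
        if self.released:
            return False
        self.finished.append(jobid)
        if len(self.finished) >= self.count:
            self.released = True
            return True
        return False


dep = CountDependency(count=3)
# Completion order is unpredictable because runtimes vary widely.
for jobid in ["e5", "e2", "e9", "e1"]:
    if dep.job_finished(jobid):
        print("release combine job with inputs:", dep.finished)
# prints: release combine job with inputs: ['e5', 'e2', 'e9']
```

Nothing has to be enumerated up front: the dependent job is released by whichever three ensemble members happen to finish first, and later completions are ignored.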

@trws
Member

trws commented May 8, 2024

Ok that's interesting. It seems like that would still be kinda hard to express with the "or" syntax, since you'd have to enumerate all the permutations. If we had the symbolic dependencies, or some other way to group them, maybe we could enable users to say something like --dependency=stringcountok:myensemble?count=4 to mean "after four of the jobs that are currently in the system with this label as an output have completed"? Have to chew on that some, it might make the implementation of the symbolic dependencies a bit more expensive because you would have to actually keep the information around while the jobs are running, where normally you only need to keep a hash-table of /->jobid where jobid is the most recent job with an output or set dependency.

Actually that reminds me, I've been looking over the code for some of these trying to think through how to do each of these. The name dependency translation should be pretty trivial because it doesn't need state. The symbolic dependencies need that hash table to exist, and probably to persist across restarts or be re-generated. Is there a good way to do that in the frobnicator? It could always use the KVS I suppose, but that seems kinda wasteful at first glance.
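
For reference, a deliberately simplified sketch of the state the symbolic dependencies would need, in plain Python. The `SymbolTable` class and its `submit` method are hypothetical names. It only tracks the most recent writer per symbol, as described above, and ignores the ordering an `out` would need against earlier readers under full OpenMP-style semantics.

```python
class SymbolTable:
    """Track the most recent job that wrote each symbol (simplified model)."""

    def __init__(self):
        self.last_writer = {}   # symbol -> jobid of most recent "out" job

    def submit(self, jobid, ins=(), outs=()):
        """Return the jobids this job must wait for, then record its outputs."""
        deps = set()
        for sym in (*ins, *outs):
            writer = self.last_writer.get(sym)
            if writer is not None:
                deps.add(writer)
        for sym in outs:
            self.last_writer[sym] = jobid
        return sorted(deps)


table = SymbolTable()
print(table.submit(1, outs=["data"]))                   # []
print(table.submit(2, ins=["data"]))                    # [1]
print(table.submit(3, ins=["data"], outs=["result"]))   # [1]
print(table.submit(4, ins=["result"]))                  # [3]
```

The whole state is one plain dictionary, so it could be serialized and restored across restarts; where to store it is the open question raised above.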

Also, to your note about flux-depend, it never got to the point of actually being a module. That could in principle be resurrected, though I would probably simplify some aspects of the scope and make it a jobtap plugin rather than a module. Also we'd have to work through incorporating rust into core, not that I would argue too hard about that. 😁

@vsoch
Member Author

vsoch commented May 8, 2024

I really just want to have differently named jobs (that I'm controlling everything for) and say "Mr. Job B, you depend on A." So just

flux run --dependency.name="A" --name="B" hostname

That's all I need!
