
Remove/revise Slurm launch integration #12471

Open
rhc54 opened this issue Apr 17, 2024 · 17 comments

Comments

@rhc54
Contributor

rhc54 commented Apr 17, 2024

I've been getting reports lately of problems launching OMPI and PRRTE jobs under Slurm. Apparently, Slurm is now attempting to modify the ORTE/PRRTE internal srun command line used to launch the mpirun daemons by injecting an MCA parameter into the environment when the allocation is created.

This isn't the first time we have encountered problems with either discovering the Slurm allocation (where we rely on specific envars) or launching the daemons (where we rely on a stable srun cmd line). It has been a recurring problem that rears its head every couple of years. The other two MPIs that support the Slurm environment (MPICH and IntelMPI) follow the same integration approach we do, so this is something that has impacted us all for some time.

In talking with SchedMD, the feeling on their side is that the various MPIs are interfering with their ability to develop Slurm, particularly when trying to add features. The requirement that specific envars and the srun cmd line remain stable is simply no longer going to be acceptable - some of their new features introduced last year just cannot accommodate it.

What they were trying to do was use the MCA param system to add a flag to the internal srun command that instructed srun to fall back to a "legacy" mode. However, that really doesn't work in a reliable manner. They can only inject it at the envar level, but we read values from default param files (which then get silently overwritten by their value) and take the final value of the param from the cmd line - which then overwrites the Slurm setting. So either the user unwittingly removes the new flag (and subsequently gets bizarre behavior as the daemons don't properly launch), or sys admins and users wind up confused when their flags simply "disappear".
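To make that precedence concrete, here is a minimal sketch. It is not OMPI's actual MCA machinery - the resolver below is simplified for illustration - but it shows the ordering described above: a default param file value loses to the environment value Slurm injects, which in turn loses to an explicit cmd-line setting.

```c
/* Illustrative only - this is not OMPI's actual MCA machinery.  It
 * merely demonstrates the precedence order described above for a
 * single param: cmd line beats environment, environment beats a
 * default param file. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *resolve_plm_slurm_args(const char *file_value,
                                          int argc, char **argv)
{
    /* 1. Highest precedence: an explicit "--mca plm_slurm_args <value>"
     *    on the mpirun cmd line. */
    for (int i = 1; i + 2 < argc; i++) {
        if (0 == strcmp(argv[i], "--mca") &&
            0 == strcmp(argv[i + 1], "plm_slurm_args")) {
            return argv[i + 2];
        }
    }
    /* 2. Next: the environment - this is the level at which Slurm
     *    injects its flag via OMPI_MCA_plm_slurm_args. */
    const char *env_value = getenv("OMPI_MCA_plm_slurm_args");
    if (NULL != env_value) {
        return env_value;
    }
    /* 3. Lowest precedence: whatever a default MCA param file supplied. */
    return file_value;
}

int main(int argc, char **argv)
{
    const char *value = resolve_plm_slurm_args(NULL, argc, argv);
    printf("plm_slurm_args resolves to: %s\n", value ? value : "(unset)");
    return 0;
}
```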

One could argue that the param in question (plm_slurm_args) isn't that heavily used. However, that may not be universally true - and is impossible to really validate. My initial report came from an AWS user who couldn't get PRRTE to launch. Wenduo and I tried to advise and help debug the situation, to no avail. It was only after the user gave up that I finally figured out the root cause of the problem - aided by getting other reports of similar issues.

I have tried to propose alternative approaches based on what other environments have done. For example, LSF and PBS provide us with a launch API that is stable, but leaves them free to modify the underlying implementation at will to conform to their own changes. For resource discovery, pretty much everyone else provides a hostfile that we can read.

Unfortunately, SchedMD has rejected those proposals. Their basic position at the moment is that (a) we shouldn't be constraining their development path, and (b) they shouldn't have to provide us with alternative integration methods. I'm still pressing them for a positive proposal, but not getting very far at the moment. My sense is that they are frustrated by the problems over the years and are basically just giving up.

Assuming their stance doesn't change, the only alternative I can see is to remove the RAS and PLM Slurm components, and drop back to using hostfiles and ssh for starting the job. Since SchedMD is refusing to provide a hostfile upon making the allocation, it would place the burden on the user to generate it (note that executing srun hostname may not result in the correct hostfile due to recent changes in the srun behavior).
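If the burden of generating the hostfile does land on the user, one workaround is to expand the allocation's node list directly rather than running srun hostname. A minimal sketch (POSIX), assuming scontrol is in the PATH and that "scontrol show hostnames" defaults to expanding $SLURM_JOB_NODELIST when given no argument - verify against your Slurm version:

```c
/* Minimal sketch (POSIX): build a hostfile from the current Slurm
 * allocation by expanding the node list with scontrol, instead of
 * relying on "srun hostname".  Assumes scontrol is in the PATH and
 * that "scontrol show hostnames" defaults to $SLURM_JOB_NODELIST
 * when no argument is given. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *in = popen("scontrol show hostnames", "r");
    if (NULL == in) {
        perror("popen");
        return EXIT_FAILURE;
    }

    FILE *out = fopen("hostfile", "w");
    if (NULL == out) {
        perror("fopen");
        pclose(in);
        return EXIT_FAILURE;
    }

    /* scontrol prints one hostname per line - copy them through. */
    char line[256];
    while (NULL != fgets(line, sizeof(line), in)) {
        fputs(line, out);
    }

    fclose(out);
    return 0 == pclose(in) ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The resulting file can then be handed to mpirun via --hostfile.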

Beyond the issue of generating the hostfile, a move to using ssh means the loss of fine-grained accounting since Slurm will not see the processes launched by mpirun. When we launch the daemons via srun, the accounting system "sees" the daemon and therefore tracks all resources used by the daemon and its children (i.e., the app procs). When launched by ssh, this doesn't happen. The overall job still gets accounted for, since the system "sees" mpirun itself and the use of resources on the nodes - but it doesn't provide per-proc tracking.

So we should discuss how we want to resolve this. The release of Slurm 23.11 is where the breakage occurs, so it is still early in the adoption process - but it breaks all existing OMPI releases, which is a problem we cannot control.

@wenduwan
Contributor

Related openpmix/prrte#1960

@bosilca
Member

bosilca commented Apr 17, 2024

A fair question is: what does SchedMD see as a potential development path in which none of the parties constrains the development of the others?

@rhc54
Contributor Author

rhc54 commented Apr 17, 2024

That is a fair question - unfortunately, they haven't provided an answer other than insisting we stop using their cmd line and envars, as they will not assure their stability.

@hppritcha
Member

A problem with removing srun-based launch of the prrteds on systems of some interest, like ORNL Frontier and NERSC Perlmutter, is that we will lose support for using the OFI libfabric CXI provider, as it is via the slurmd that we acquire (or, more precisely, the CXI provider acquires) the VNI information needed for initialization.

@rhc54
Contributor Author

rhc54 commented Apr 17, 2024

True, though that is readily remedied. Cray was contractually obligated to provide a method for doing that as part of Aurora, but refused to do so at the last minute. So the "hooks" are all present in PMIx - we are just missing the last step of inserting it into the CXI server on the compute nodes.

FWIW: many organizations that use Slingshot simply turn the VNI "off" as it actually does virtually nothing for you. It simply enforces that only procs operating in the same allocation can communicate with each other - but it takes little effort to demonstrate that attempting to decipher what an unknown application in a different allocation is doing by "sniffing" raw MPI exchanges is effectively impossible.

Obviously, policy overrides reality in some situations, so that may not be an option for some. Still, there are some fairly simple ways of solving this if someone cares to provide access to an appropriate machine, or if someone with such access is willing to collaborate on the solution.

@rhc54
Contributor Author

rhc54 commented Apr 17, 2024

I had a very long (sometimes heated) discussion with SchedMD this afternoon, and I think we have hit upon a solution that is acceptable to both sides. I'll lay it out in more detail at the meeting next week, but basically we have agreed that:

  • what is out there today is not something we can call back - it is done and gone. The changes introduced by Slurm 23.11 were well intentioned, but flawed - and we are going to have to live with them. Fortunately, they hopefully will not impact many people.
  • we need to focus on the long-term solution for Slurm integration to try and break this continual cycle of problems, both in discovering the allocation and launching the job. We also need to define a method for this that provides adequate isolation between the two projects so Slurm can do what it wants/needs, without being hindered by OMPI's installed base.

We spent a fair amount of time kicking around possible approaches before finally agreeing on creating a separate Slurm integration library that would provide at least two APIs - one to retrieve allocated resources and the other to launch the job. Ideas were exchanged on how to make those APIs future-proof, how to detect which methods were supported by the Slurm version we are operating under, etc. What Slurm does inside the library to implement those APIs will be invisible to us - and frankly, we don't care. SchedMD is free to modify that to meet their evolving needs.
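For discussion purposes only, here is a sketch of what such a library's surface could look like. Every name below is hypothetical - the actual API has not been designed yet - the point is simply that allocation discovery and daemon launch become the only contract, and everything behind them belongs to SchedMD.

```c
/* Hypothetical sketch only - none of these names exist in Slurm today.
 * It just illustrates the agreed shape: one entry point to retrieve
 * the allocation, one to launch the daemons, with everything behind
 * them owned by SchedMD and free to change between releases. */

/* Opaque description of the nodes/slots in the current allocation.
 * The internal representation (e.g., the JSON mentioned later in this
 * thread) is invisible to the caller. */
typedef struct slurm_launch_alloc slurm_launch_alloc_t;
int slurm_launch_get_allocation(slurm_launch_alloc_t **alloc);

/* Launch one daemon (e.g., prted) per allocated node with the given
 * argv/env; returns an opaque handle the caller can wait on or kill. */
typedef struct slurm_launch_job slurm_launch_job_t;
int slurm_launch_spawn_daemons(const slurm_launch_alloc_t *alloc,
                               char *const argv[], char *const envp[],
                               slurm_launch_job_t **job);

int slurm_launch_wait(slurm_launch_job_t *job);
void slurm_launch_free(slurm_launch_alloc_t *alloc);
```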

At the end, I think we are on a good path. Tim went off to pursue this more and will circle back with me next week for a more detailed look at the design. Initial goal is to (a) introduce a slight modification to the current envar method in Slurm 24.05 which will require a mod to PRRTE to support, and (b) include the new library-based method in Slurm 24.11 so it can be used in OMPI v6 and PRRTE v4 (or v5, depending on release schedules).

@jsquyres
Member

I think we need to update the Open MPI docs to explicitly state that SLURM >= 23.11 is silently injecting (and/or modifying?) the OMPI_MCA_plm_slurm_args environment variable. This may cause unexpected behavior in all versions of Open MPI (because the plm_slurm_args MCA param goes back many years / many Open MPI versions).

What else do we need to tell Open MPI users?

@rhc54
Contributor Author

rhc54 commented Apr 18, 2024

A few more things for you folks to consider - just a "heads up" to help seed the discussion next week:

  1. The modification SchedMD is requesting is that we (a) add the new --external-launcher flag to our Slurm plm plugin, and (b) extend that option to include the OMPI version (i.e., --external-launcher=ompi5.0.4). In order to do that, we will have to detect at runtime the Slurm version we are operating under so we know that we can add that flag, and whether we can/should add the extension. SchedMD would provide a sentinel envar for this purpose - if that isn't sufficient, then we would have to execute srun -v to obtain the version prior to launch. They would like us to include this in the next OMPI and PRRTE releases.
  2. SchedMD proposes that we use JSON as the go-between for the new library interfaces. It was clear that JSON would be used to return the allocation - not quite clear how JSON would be used in the launch API. This will require that we add a JSON parser to OMPI. I link to and use an external JSON library (libjansson) in PMIx for the PALS support (since everything in PALS is a web service accessed via curl), but I don't know how widely installed that library is or how willing admins will be to add it. It isn't a GPL library, so we could consider absorbing it or replicating just the parts we need - either way, this is something to consider.
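As a rough illustration of what consuming a JSON allocation via libjansson could look like - the schema here (a "nodes" array of name/slots objects) is entirely made up, since SchedMD hasn't defined the format yet:

```c
/* Rough sketch of consuming a JSON allocation with libjansson.  The
 * schema - a "nodes" array of {"name", "slots"} objects - is invented
 * here purely for illustration; SchedMD's actual format is TBD. */
#include <stdio.h>
#include <jansson.h>

static int parse_allocation(const char *text)
{
    json_error_t err;
    json_t *root = json_loads(text, 0, &err);
    if (NULL == root) {
        fprintf(stderr, "JSON parse error (line %d): %s\n", err.line, err.text);
        return -1;
    }

    json_t *nodes = json_object_get(root, "nodes");
    if (!json_is_array(nodes)) {
        json_decref(root);
        return -1;
    }

    size_t i;
    json_t *node;
    json_array_foreach(nodes, i, node) {
        const char *name = json_string_value(json_object_get(node, "name"));
        json_int_t slots = json_integer_value(json_object_get(node, "slots"));
        printf("node %s: %lld slots\n", name ? name : "(unknown)",
               (long long)slots);
    }

    json_decref(root);
    return 0;
}

int main(void)
{
    const char *sample =
        "{\"nodes\": [{\"name\": \"node001\", \"slots\": 8},"
        " {\"name\": \"node002\", \"slots\": 8}]}";
    return 0 == parse_allocation(sample) ? 0 : 1;
}
```

Builds against jansson with -ljansson.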

Personally, I have to confess to being amused at the first request since it represents another modification of our hard-coded srun cmd line, coming almost immediately after being promised that --external-launcher would be the "forever" solution. We would of course have to check the value of the MCA param to ensure we don't add the flag a second time, since that envar will still be out there. A little tricky, but doable if you want to agree to the request.

JSON for the allocation is workable, although I'm not wild about it. I'll have to wait to see their proposal next week to understand why we would pass JSON to the launch API - that one isn't intuitive to me.

Please come prepared to offer your advice and questions next week. If some of you want to be involved in this process with SchedMD, let me know - there is no intent to restrict involvement.

@rhc54
Contributor Author

rhc54 commented May 2, 2024

So we have a bit of a problem. Slurm 24.05 has been released with the --external-launcher option now taking the optional argument. However, if we pass that argument to an earlier version of srun, it will error out - so we have to somehow determine when we can/should provide the option and when we should not.

SchedMD has proposed that we test-run srun to make the determination instead of trying to parse the srun --version output. So we would modify the slurm launcher to first try srun --external-launcher=foo --version to see if that fails. If it fails, then try srun --external-launcher --version to see if --external-launcher is supported at all. Then we would go ahead and launch the daemons with the correct version of the cmd line.
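A sketch of that probe logic, under the assumption that the exit status alone is enough to tell whether srun accepted the flag (spawning via system() here purely for brevity):

```c
/* Sketch of the proposed runtime probe: try the option-with-argument
 * form first, fall back to the bare flag, and finally to neither.
 * Output is discarded; only the exit status matters.  The exact flag
 * spelling follows the discussion above and should be verified
 * against the target Slurm release. */
#include <stdio.h>
#include <stdlib.h>

typedef enum {
    EXT_LAUNCHER_WITH_ARG,  /* srun accepts --external-launcher=<value> */
    EXT_LAUNCHER_BARE,      /* srun accepts --external-launcher only    */
    EXT_LAUNCHER_NONE       /* older srun: omit the flag entirely       */
} ext_launcher_mode_t;

static int srun_accepts(const char *flags)
{
    char cmd[256];
    snprintf(cmd, sizeof(cmd),
             "srun %s --version > /dev/null 2>&1", flags);
    return 0 == system(cmd);
}

static ext_launcher_mode_t probe_external_launcher(void)
{
    if (srun_accepts("--external-launcher=foo")) {
        return EXT_LAUNCHER_WITH_ARG;
    }
    if (srun_accepts("--external-launcher")) {
        return EXT_LAUNCHER_BARE;
    }
    return EXT_LAUNCHER_NONE;
}

int main(void)
{
    printf("probe result: %d\n", (int)probe_external_launcher());
    return 0;
}
```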

My concern here is that we are potentially executing multiple srun calls to launch the daemons. Do folks find this of concern? Any other suggestions on how we might determine what cmd line to use?

@naughtont3
Contributor

I wonder if we should add some configure-stage checks for this and set a flag internally to avoid having to deal with this at runtime. I guess one possible problem could be if you end up executing on a system different from what you used at configure/build time (e.g., binary packages), but maybe that case is covered by better version checking across the binary packages being mixed.

@rhc54
Contributor Author

rhc54 commented May 3, 2024

We do have a prte_check_slurm.m4 that checks for srun in the PATH, but doesn't attempt to execute anything. However, that only happens at build time, and it is entirely possible someone would execute this under a different Slurm version. So I suspect this does have to be a runtime check of some type.

Tim had talked about maybe adding an envar to help steer us, but apparently that didn't happen. So we're kinda stuck with having to directly test srun, I think. We could, I suppose, just try to launch with what we think is the most likely cmd line and then fall back down the chain until the daemon launch succeeds. After all, srun will immediately fail if it doesn't recognize the option, so we wouldn't have to worry about daemons hanging around - I don't see what doing an initial test buys us.

On a separate note: the Slurm support will eventually pick up a new dependency - namely, you will need to have a JSON library available. The API library will be separate, so we can check for it too...but building Slurm support is going to undergo some changes. At the very least, we probably need to advise people to add --with-slurm to the configure line to ensure the build fails if all the dependencies aren't found - otherwise, we will silently not build the Slurm components even if srun is found, which is a new behavior.

Tim has suggested creating new slurm2 components to handle the API library - might help, need to see. Would allow the current components to build if JSON isn't around - but then we are back to the confused cmd line and envars again, so we could well have malfunctioning components. Not sure that's any better.

@hppritcha
Member

could use sinfo to get version info. not sure that's really any different though.

@rhc54
Contributor Author

rhc54 commented May 3, 2024

Maybe the best answer is to simply defer this until we get the API? Yeah, it causes problems - but I'm not sure that the solution isn't going to be flaky as well. Trying to adapt across multiple Slurm variants feels kinda iffy. Perhaps we just add it to the docs and advise people to move off those Slurm releases, either downwards or (once released) up to 24.11 and above? Of course, moving up implies that OMPI also has to move up to integrate with those changes.

Doesn't feel like there is any really good answer 🤷‍♂️

@rhc54
Contributor Author

rhc54 commented May 7, 2024

Thought about this some more and I believe there really are only three options at this point:

  • detect the Slurm version and error out if it is 23.11 or 24.05. Give them a cmd line option to override if they know what they are doing.
  • output a warning if it is 23.11 or 24.05 and run. Give them a cmd line option to disable the warning.
  • just document that these are bad Slurm versions and recommend people avoid them.

I'm inclined to implement the 3rd option as it is the easiest to do and this hasn't been a total mess - at least, so far. Definitely don't want to try and mess with the srun cmd line to chase Slurm options - that seems a futile exercise.

@wickberg

wickberg commented May 8, 2024

As some background: our goal with injecting --external-launcher is to avoid issues from the srun command inheriting other job-specific launch attributes that do not work alongside mpirun/mpiexec. Most notable are newer job layout options like --tres-per-task, but this has also been a problem in the past around the interpretation of --exclusive and other options.

When set, this flag gives srun / slurmctld a clear indication that Slurm should - as much as possible - get out of the way, ignore almost all other environment variables, and just launch prted/orted/hydra in a straightforward manner.

I had reached out to Ralph privately back in January '23 while we were exploring this approach, and believed tacit approval had been given when he noted we'd need to set PRTE_MCA_plm_slurm_args alongside OMPI_MCA_plm_slurm_args to influence newer OMPI releases. But, in retrospect, we obviously weren't on the same page, and we're now working through the impact of that decision.

I will say that, so far, outside of one issue that was addressed before Slurm 23.11.2, this approach does seem to be working. I've been asking for any specific issues that this has introduced and (outside of openpmix/prrte#1960, which doesn't seem to be directly related to this) I haven't seen specific problems reported here.

I agree that, architecturally, we're stepping on each other's toes here, but that's been an ongoing issue for years, as Ralph alludes to. If there are outstanding problems I am happy to take ownership of those issues if/when they arise - our intent was not to complicate the already-complicated relationship between MPI runtimes and the WLM.

@wickberg

wickberg commented May 8, 2024

Now, on to what I'd suggest as a path forward: in Slurm 24.11 we've agreed to provide a dedicated library, akin to slurm/libslurm_pmi.so which is used with PMI-1, that a new plm module can use going forward to handle this bootstrapping issue. I wish I could ship this in Slurm 24.05 this month, but I don't want to rush that, as we need to make sure the ABI remains permanently unchanged for this to be a viable approach.

I would prefer this be shipped as a separate PLM module for at least a few OMPI releases, though. OMPI is going to be run under older Slurm versions for a while, just as older OMPI versions commonly run under Slurm, and dropping support for the existing mode of operation would, I think, be a net detriment to both communities, as certain platforms (as was previously noted) can't use SSH for bootstrapping.

I had briefly talked with Ralph about exposing an envvar alongside the optional argument to --external-launcher, but the root cause of all of these issues comes down to environment variable inheritance, and I'd rather avoid that. The other approach discussed - which we can develop patches for if desired - is runtime detection of support for the --external-launcher=version flag. (I have to apologize here - the 24.05 release close-out process has run on a bit, and I haven't had as much time to tackle this as expected. That will wrap up in the next couple of weeks.)

Of Ralph's proposed approaches, I do think that documentation noting this behavior would be appropriate, and I would be happy to take a stab at submitting something suitable.

@rhc54
Contributor Author

rhc54 commented May 8, 2024

Thanks @wickberg for the very good review. It is difficult to assess how much trouble there is out there - as you know, not everything gets reported every time it is encountered. From the contacts I've had, people have heard enough that they either revert to an earlier unaffected Slurm version or find a workaround if/when they encounter problems from this change. I think in general there is a feeling that filing issues for known problems is simply "piling on" for no benefit.

Regardless, I have no reliable estimate of the magnitude of the issue in the community. I only know it is hitting at least some people.

I don't see any problem with retaining the current PLM module in parallel with a new one. We can use a configure-time check for the new API library, and the component simply won't load if the library isn't present on the target system. It would be cleaner if the library was not licensed as GPL.

Not entirely sure on where to put the documentation. Ultimately, it has to go in both PRRTE and OMPI, so perhaps in PRRTE for now and let OMPI pick it up from there? Probably the right way to start and we can figure out the mechanics between the projects later.
