
MPICH support #478

Closed · wants to merge 12 commits

Conversation

@sheevy (Contributor) commented Oct 11, 2022

This is my initial attempt at getting MPICH working on mpi-operator. Bear in mind I'm a noob when it comes to MPI and Kubernetes.

This is to close #477, which @alculquicondor offered to help with.

I've demonstrated that the pi example works when using MPICH.

There's a single TODO in the commit, because I'm not sure what the value of one of the constants should be for MPICH.

I've changed some of the test files, but not all. I'd appreciate a hint on how to progress on that front.

Also, the autogenerated files were in fact edited by me manually...

@google-cla bot commented Oct 11, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@@ -115,6 +115,7 @@ const (

openMPISlotsEnv  = "OMPI_MCA_orte_set_default_slots"
intelMPISlotsEnv = "I_MPI_PERHOST"
mpichSlotsEnv    = "TODO"
@sheevy (Contributor, author):

Does anyone know how to figure out what should be provided here?
Interestingly, the pi example works as is, i.e. with mpichSlotsEnv == "TODO".

@alculquicondor (Collaborator):

This should be the name of the environment variable for configuring the number of workers per host.

It works in the sense that it runs, because this would just set an environment variable called TODO. But it's not achieving the intended outcome.
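For context, a minimal Go sketch of how such a slots constant is meant to be used: the operator exports it as an environment variable on the launcher so each host runs the configured number of processes. The envVar type and slotsEnvVar helper here are simplified stand-ins, not the operator's actual API; with mpichSlotsEnv == "TODO" the launcher just receives a harmless variable named TODO, which is why the pi example still runs.

```go
package main

import "fmt"

// envVar is a simplified stand-in for corev1.EnvVar.
type envVar struct{ Name, Value string }

// slotsEnvVar builds the per-host slots setting for a given implementation's
// environment variable name (e.g. the constants in the diff above).
func slotsEnvVar(name string, slots int) envVar {
	return envVar{Name: name, Value: fmt.Sprint(slots)}
}

func main() {
	fmt.Println(slotsEnvVar("OMPI_MCA_orte_set_default_slots", 2)) // OpenMPI
	fmt.Println(slotsEnvVar("I_MPI_PERHOST", 2))                   // Intel MPI
	fmt.Println(slotsEnvVar("TODO", 2))                            // MPICH: no-op
}
```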

@sheevy (Contributor, author):

There's a -ppn command-line option (for both Intel MPI and MPICH), but I don't think it can be controlled via an environment variable.

@alculquicondor (Collaborator):

In general, it's hard to overwrite the command line, because users might be using wrappers around mpiexec. That's why environment variables are better. Another option, which OpenMPI and Intel MPI support, is to set the slots in the hostfile.

Actually, it might be useful to use that in general, because of #445.
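For reference, the hostfile line formats being discussed differ per implementation: OpenMPI uses "host slots=N", while Hydra-based launchers (MPICH, Intel MPI) use "host:N". A small illustrative sketch; the helper is hypothetical, not part of this PR, but the formats themselves are standard.

```go
package main

import "fmt"

// hostfileLine formats one hostfile entry with a slot count.
func hostfileLine(openMPI bool, host string, slots int) string {
	if openMPI {
		return fmt.Sprintf("%s slots=%d", host, slots)
	}
	// Hydra syntax, used by MPICH and Intel MPI.
	return fmt.Sprintf("%s:%d", host, slots)
}

func main() {
	fmt.Println(hostfileLine(true, "pi-worker-0", 2))  // pi-worker-0 slots=2
	fmt.Println(hostfileLine(false, "pi-worker-0", 2)) // pi-worker-0:2
}
```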

@alculquicondor (Collaborator):

Thanks, this PR is still on my radar, but I don't think I'll have a chance to look at it in the next 2 weeks.

@sheevy (Contributor, author) commented Oct 17, 2022

Thanks. Are there any actions on me at this point?

@alculquicondor (Collaborator):

Did you figure out a solution for this?

mpichSlotsEnv    = "TODO"

@sheevy (Contributor, author) commented Nov 3, 2022

Not yet. I think I misunderstood the purpose of the -ppn option. I've just asked a follow-up question in the MPICH repo: pmodels/mpich#6209

@sheevy (Contributor, author) commented Nov 3, 2022

Thanks for the discussion in the other thread. So, with my limited understanding, the next step for me would be to remove the part that sets up the mpichSlotsEnv env variable, and instead use SlotsPerWorker from the spec to drive the hostfile generation. And it'll be on the user to know that they can skip the -n argument.
Is that a correct understanding?

@alculquicondor (Collaborator):

That is correct. It's up to the user to set -n if they want to, but they should be able to skip it.
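To make the agreed direction concrete, here is a sketch of deriving the hostfile from the spec. SlotsPerWorker is a real field of the MPIJob API; buildHostfile and the worker names are hypothetical. With slots encoded in the hostfile, mpiexec can infer the total process count, which is why -n becomes optional.

```go
package main

import (
	"fmt"
	"strings"
)

// buildHostfile renders one "<host>:<slots>" line per worker, the Hydra
// (MPICH) hostfile syntax, using the job's SlotsPerWorker value.
func buildHostfile(workers []string, slotsPerWorker int32) string {
	var b strings.Builder
	for _, w := range workers {
		fmt.Fprintf(&b, "%s:%d\n", w, slotsPerWorker)
	}
	return b.String()
}

func main() {
	fmt.Print(buildHostfile([]string{"pi-worker-0", "pi-worker-1"}, 2))
	// Output:
	// pi-worker-0:2
	// pi-worker-1:2
}
```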

@alculquicondor (Collaborator):

And if you have a chance, maybe we could be consistent across the MPI implementations and always use the hostfile instead of an environment variable.

That would fix #445 too

@alculquicondor (Collaborator):

@sheevy are you still working on this?

@google-oss-prow bot:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor (Collaborator):

Can you rebase?

@sheevy (Contributor, author) commented Jun 7, 2023

Oopsie, sorry for the noise, this was meant to be a local change in my fork. I forgot the PR is still open.

I started reapplying the change from scratch onto master, as rebasing this branch was just too much work. Too much has changed in the time this PR was dormant. But I now have time to finally close it.

As I understand it, the recent addition of slots to the hostfile is exactly what was missing. So now it's just a case of porting the change from this PR onto master, as everything else is now in place.

@sheevy mentioned this pull request Jun 8, 2023
@tenzen-y (Member) commented Jun 9, 2023

@sheevy Do we still need this PR? If not, we can close it in favor of #562.

@alculquicondor (Collaborator):

/close
I suspect not

@google-oss-prow bot closed this Jun 9, 2023
@google-oss-prow bot:
@alculquicondor: Closed this PR.

In response to this:

/close
I suspect not

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close the issue "MPICH question" (#477).
3 participants