mpi4py bug #417

Open
aowen87 opened this issue Apr 5, 2023 · 5 comments

aowen87 commented Apr 5, 2023

Description

Hello,

I'm working on an LLNL project that uses maestro to manage ML workflows, and we've recently encountered an odd bug. If the following conditions are met, the job will hang indefinitely:

  1. We pass a p-gen file to maestro that imports mpi4py directly or indirectly (through another imported module).
  2. Maestro launches a job using more than one processor.
  3. The job being launched also imports mpi4py.

Reproducer

I've included files to reproduce the issue below. There is one yaml file, one python script that maestro will launch with srun, and three parameter generation files. One of the parameter generation files works fine because it doesn't import mpi4py; the other two import mpi4py directly or indirectly and cause the job to hang.

Here are commands to reproduce each scenario:

This works: maestro run -p param_gen.py mpi_bug.yaml

This causes the job to hang: maestro run -p mpi_param_gen.py mpi_bug.yaml

This causes the job to hang: maestro run -p kosh_param_gen.py mpi_bug.yaml

Files to reproduce:

mpi_bug.yaml:

batch:
  bank: wbronze
  host: rzgenie
  queue: pdebug
  type: slurm
description:
  description: Reproduces mpi4py bug
  name: bug_demo
env:
  variables:
    nodes: 1
    procs: 4
    walltime: '00:10:00'
    script: /path/to/hello_world.py
study:
- description: Launch a simple script using srun
  name: hello_world
  run:
    cmd: "#SBATCH --ntasks $(procs)\n\n $(LAUNCHER) python $(script)"
    nodes: $(nodes)
    procs: $(procs)
    walltime: $(walltime)

hello_world.py:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print(f"{rank}: hello world!")

param_gen.py:

from maestrowf.datastructures.core import ParameterGenerator

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen

mpi_param_gen.py:

from maestrowf.datastructures.core import ParameterGenerator
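# This module-level mpi4py import is what triggers the hang described above.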
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen

kosh_param_gen.py:

from maestrowf.datastructures.core import ParameterGenerator
#
# Kosh relies on mpi4py. This also causes maestro to hang.
#
import kosh

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen
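
For context (not part of the original report): mpi4py initializes MPI at import time by default, so even an indirect import in a pgen file pulls an initialized MPI runtime into the maestro process. A hypothetical, untested pgen variant that defers that initialization through mpi4py's standard rc options could look like this:

deferred_mpi_param_gen.py (hypothetical):

# Ask mpi4py not to call MPI_Init automatically; this must be set before
# the MPI submodule is imported.
import mpi4py
mpi4py.rc.initialize = False

from mpi4py import MPI  # no longer initializes MPI on import
from maestrowf.datastructures.core import ParameterGenerator

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen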

@jwhite242 (Collaborator)

Well thanks for the detailed reproducer on this! Will take a look and see if we can get this sorted out for you.

@jwhite242 (Collaborator)

@aowen87 I've a few more questions for you on this:

  1. What's the environment in which you're running this (login node/batch job/something else)?
  2. How's it being launched -> i.e. are you launching maestro with mpirun/exec/srun?

aowen87 commented Apr 18, 2023

> @aowen87 I've a few more questions for you on this:
>
>   1. What's the environment in which you're running this (login node/batch job/something else)?
>   2. How's it being launched -> i.e. are you launching maestro with mpirun/exec/srun?

I'm running this from the login node, and the commands I'm using are the exact commands shown above (no srun/mpirun/etc., just maestro).

jwhite242 commented May 3, 2023

Ok, @aowen87, I finally made some headway here. Part of the problem appears to be that calling MPI in pgen ends up setting a bunch of MPI-related env vars, which confuses the batch job in an unintuitive way: when --export=[opts] is missing from the slurm batch headers, the default slurm configuration treats it as --export=ALL, which exports every env var in the current environment when calling sbatch. This is why you don't have to explicitly point your job steps at the virtual environment python that maestro is installed into. However, it also lets some of the things MPI sets slip through. I was able to get a version with MPI in both pgen and the job step to work just fine by hardwiring the --export=NONE option in maestro. So a potential solution here is adding some hooks in the spec somewhere to control this option, along with a few more potential ways of handling it (see the sketch after this list):

  • explicitly purge some env vars from what's passed to slurm, then use --export to reset what's left
  • try perturbing the subprocess env so it doesn't inherit as much from the parent python process (an initial trial didn't work here..)
  • add some user hooks for automatically injecting a bashrc to source in steps, or let you, the user, do that explicitly (I've seen the latter done using the env block to define the path to said rc file)
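
A rough sketch of the first two options, assuming maestro submits through subprocess (the variable prefixes and script name below are illustrative, not maestro's actual internals):

import os
import subprocess

# Illustrative prefixes for MPI/PMI-related variables an mpi4py import can leave behind.
MPI_ENV_PREFIXES = ("PMI_", "PMIX_", "OMPI_", "I_MPI_", "MPICH_")

def scrubbed_env():
    """Return a copy of the current environment with MPI-related variables removed."""
    return {k: v for k, v in os.environ.items()
            if not k.startswith(MPI_ENV_PREFIXES)}

# Submit with the filtered environment; slurm's default --export=ALL then only
# forwards what survived the filter. "batch_script.sh" is a placeholder name.
subprocess.run(["sbatch", "batch_script.sh"], env=scrubbed_env(), check=True)

Hardwiring --export=NONE (the change verified to work above) sidesteps the filtering entirely, at the cost of the step having to re-establish any environment it actually needs.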

aowen87 commented May 3, 2023

Great info! Thanks for digging into this!
