
HPC/Slurm clusters: gprMax using GPU(s) fails to start if the first GPU is already occupied. #241

Open
LaanstraGJ opened this issue Mar 10, 2020 · 9 comments


@LaanstraGJ

On our cluster, users have started to use gprMax to run simulations.
Other tools that use the GPUs run fine; these nodes contain multiple GPUs.

When running gprMax with GPU support (pycuda), we've noticed that if the first GPU is already claimed by another user, an additional job on the second, third, etc. GPU won't start, due to an error.

There is probably a conflict between the environment variable set by the Slurm scheduler and the ordinal (or ordinal list) required by (py)cuda.

$CUDA_VISIBLE_DEVICES contains the list of physical devices on the node as allocated by Slurm.
So if device 2 is granted, $CUDA_VISIBLE_DEVICES contains "2".
(Py)cuda, however, requires the ordinal to start from 0 for the first visible device, and so on.
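
A minimal sketch of the mismatch, assuming pycuda is installed and Slurm has exported CUDA_VISIBLE_DEVICES="2" for this job (the concrete values are illustrative):

import os
import pycuda.driver as drv

drv.init()
print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # "2" - the physical device granted by Slurm
print(drv.Device.count())                      # 1  - CUDA re-numbers visible devices from 0

dev = drv.Device(0)  # works: ordinal 0 is the only device visible inside this job
dev = drv.Device(2)  # fails with "invalid device ordinal": there is no ordinal 2 in this job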

Solution:
Either:

  • Unset CUDA_VISIBLE_DEVICES
  • Modify CUDA_VISIBLE_DEVICES to 0 [or 0,1 or 0,1,2 or ... etc.]
  • Remove the following lines (388-390) from utilities.py:
    elif 'CUDA_VISIBLE_DEVICES' in os.environ:
        deviceIDsavail = os.environ.get('CUDA_VISIBLE_DEVICES')
        deviceIDsavail = [int(s) for s in deviceIDsavail.split(',')]
  • Modify utilities.py so that it uses the following code to generate the correct deviceIDsavail (see the sketch after this list):
    deviceIDsavail = range(drv.Device.count())
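
A minimal sketch of that fourth option, assuming pycuda.driver is imported as drv in utilities.py as in the existing code above (the surrounding gprMax code is not reproduced here):

import pycuda.driver as drv

drv.init()
# CUDA only enumerates the devices left visible by CUDA_VISIBLE_DEVICES and
# always numbers them 0..count-1, so these ordinals are valid regardless of
# which physical GPUs Slurm allocated to the job.
deviceIDsavail = list(range(drv.Device.count()))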
@craig-warren
Member

@LaanstraGJ interesting... I assumed (perhaps incorrectly) that the Slurm scheduler sets $CUDA_VISIBLE_DEVICES to whatever GPUs are available solely for that user's job. Therefore the GPU resource couldn't be in conflict with another user.

I don't think pycuda requires the ordinal to start from 0; you can supply any valid deviceID.

@LaanstraGJ
Author

According to the Slurm documentation, CUDA_VISIBLE_DEVICES is set correctly;
see https://slurm.schedmd.com/gres.html (section "GPU Management").
I suppose pycuda.driver.Device uses an ordinal range over the available GPUs instead of the actual IDs set in CUDA_VISIBLE_DEVICES.
The third and fourth options are the suggested changes; the first two are just quick-and-dirty fixes to confirm the diagnosis.

@craig-warren
Member

@LaanstraGJ pycuda.driver.Device(number) just takes the PCI bus ID of the device you want to run on. This should be what is given in CUDA_VISIBLE_DEVICES. I am still confused by what you said about a GPU device being used by another user. When a user launches a job, Slurm will set CUDA_VISIBLE_DEVICES to the available GPUs for that user's job (which no one else should be able to use?), then gprMax will read the list in CUDA_VISIBLE_DEVICES and use that to set pycuda.driver.Device.

@LaanstraGJ
Author

The problem is that pycuda.driver.Device always uses id=0 for the first available card, id=1 for the second, and so on,
even if CUDA_VISIBLE_DEVICES contains a list such as [2,3] (where 0 and 1 are already in use by another job).

@craig-warren
Member

@LaanstraGJ I don't think so; pycuda.driver.Device will take whatever integer number you give it (but it should be a valid PCI bus ID), see https://documen.tician.de/pycuda/driver.html#pycuda.driver.Device and:

dev = drv.Device(G.gpu.deviceID)

@LaanstraGJ
Author

It's pycuda.driver.Device(number) or pycuda.driver.Device(pci_bus_id) see https://documen.tician.de/pycuda/driver.html#pycuda.driver.Device
gprMax uses the number method (utilities.py)
gprMax/utilities.py#L403

@craig-warren
Member

Yes, I think we are saying the same thing here. So I'm wondering if the problem you are seeing is related to the following:

When possible, Slurm automatically determines the GPUs on the system using NVML. NVML (which powers the nvidia-smi tool) numbers GPUs in order by their PCI bus IDs. For this numbering to match the numbering reported by CUDA, the CUDA_DEVICE_ORDER environmental variable must be set to CUDA_DEVICE_ORDER=PCI_BUS_ID.

GPU device files (e.g. /dev/nvidia1) are based on the Linux minor number assignment, while NVML's device numbers are assigned via PCI bus ID, from lowest to highest. Mapping between these two is indeterministic and system dependent, and could vary between boots after hardware or OS changes. For the most part, this assignment seems fairly stable. However, an after-bootup check is required to guarantee that a GPU device is assigned to a specific device file.
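
A minimal sketch of how that CUDA_DEVICE_ORDER setting could be applied from Python (illustrative only, not part of gprMax); it only takes effect if set before the first CUDA call in the process:

import os

# Must be set before CUDA is initialised (e.g. before drv.init() or before
# importing pycuda.autoinit) for the PCI-bus-ID ordering to take effect.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'

import pycuda.driver as drv
drv.init()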

@craig-warren
Member

@LaanstraGJ I've been thinking about this some more. A couple of questions:

  1. Are you using the MPI task farm with GPUs? i.e. requesting multiple GPUs?
  2. How are you providing the GPU deviceIDs when you call gprMax?

I think this second question may be the root of the issue. The code in the utilities module is just a checker; it still relies on you to give valid deviceIDs when you call gprMax, and this in itself depends on what is available, and therefore offered by Slurm via CUDA_VISIBLE_DEVICES. So, in essence, if you are not passing a list containing CUDA_VISIBLE_DEVICES when you call gprMax, that is likely why it is not working properly.

I think what would be useful is to change the code in gprMax so that if the -mpi and -gpu flags are given then it automatically looks up and sets the GPU deviceIDs based on CUDA_VISIBLE_DEVICES. What do you think?
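
A minimal sketch of what that automatic lookup could look like; the function name and the integration point are hypothetical, not gprMax's actual API:

import os

def gpu_ids_from_cuda_visible_devices():
    """Map the GPUs Slurm granted to the ordinals CUDA exposes inside the job."""
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if not visible:
        return []
    # Slurm lists the physical devices it granted, e.g. "2,3"; inside the job
    # CUDA re-numbers them from 0, so the usable ordinals are simply 0..n-1.
    return list(range(len(visible.split(','))))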

@craig-warren
Member

When I checked how I'd been using gprMax on an HPC with GPUs, I found the code below (for 8 traces and 2 GPUs):

devIDs="$(srun echo $CUDA_VISIBLE_DEVICES | sed 's/,/ /g')"
conda activate gprMax
srun -n 3 python -m gprMax user_models/cylinder_Bscan_2D.in -n 8 -mpi 3 -gpu $devIDs
