
HPC/Slurm clusters: gprMax using GPU(s) fails to start if the first GPU is already occupied. #241

Open
LaanstraGJ opened this issue Mar 10, 2020 · 9 comments


@LaanstraGJ

On our cluster, users have started to use gprMax to run simulations.
Other tools that use the GPUs run fine; these nodes contain multiple GPUs.

When running gprMax with GPU support (pycuda), we've noticed that if the first GPU is already claimed by another user, an additional job on the second, third, etc. GPU won't start, due to an error.

There is probably a conflict between the environment variable set by the Slurm scheduler and the ordinal (or ordinal list) required by (py)cuda.

$CUDA_VISIBLE_DEVICES contains the list of physical devices on the node as allocated by Slurm.
So if device 2 is granted, $CUDA_VISIBLE_DEVICES contains "2".
(Py)cuda, however, requires the ordinal to start from 0 for the first visible device, and so on.
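
A minimal sketch of the mismatch, assuming pycuda is installed and Slurm has exported CUDA_VISIBLE_DEVICES="2" for this job (the concrete values are illustrative):

import os
import pycuda.driver as drv

drv.init()
print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # "2" - the physical device granted by Slurm
print(drv.Device.count())                      # 1  - CUDA re-numbers visible devices from 0

dev = drv.Device(0)  # works: ordinal 0 is the only device visible inside this job
dev = drv.Device(2)  # fails with "invalid device ordinal": there is no ordinal 2 in this job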

Solution:
Either:

  • Unset CUDA_VISIBLE_DEVICES
  • Modify CUDA_VISIBLE_DEVICES to 0 [or 0,1 or 0,1,2 or ... etc.]
  • Remove the following lines (388-390) from utilities.py:
    elif 'CUDA_VISIBLE_DEVICES' in os.environ:
        deviceIDsavail = os.environ.get('CUDA_VISIBLE_DEVICES')
        deviceIDsavail = [int(s) for s in deviceIDsavail.split(',')]
  • Modify utilities.py so that it uses the following code to generate the correct deviceIDsavail (see the sketch after this list):
    deviceIDsavail = range(drv.Device.count())
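
A minimal sketch of that fourth option, assuming pycuda.driver is imported as drv in utilities.py as in the existing code above (the surrounding gprMax code is not reproduced here):

import pycuda.driver as drv

drv.init()
# CUDA only enumerates the devices left visible by CUDA_VISIBLE_DEVICES and
# always numbers them 0..count-1, so these ordinals are valid regardless of
# which physical GPUs Slurm allocated to the job.
deviceIDsavail = list(range(drv.Device.count()))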
@craig-warren
Member

@LaanstraGJ interesting... I assumed (perhaps incorrectly) that the Slurm scheduler sets $CUDA_VISIBLE_DEVICES to whatever GPUs are available solely for that user's job. Therefore the GPU resource couldn't be in conflict with another user.

I don't think pycuda requires the ordinal to start from 0; you can supply any valid deviceID.

@LaanstraGJ
Author

According to the Slurm documentation, CUDA_VISIBLE_DEVICES is set correctly;
see https://slurm.schedmd.com/gres.html (section "GPU Management").
I suppose pycuda.driver.Device uses an ordinal range over the available GPUs instead of the actual IDs set in CUDA_VISIBLE_DEVICES.
The third and fourth options are the suggested changes; the first two are just quick-and-dirty fixes to confirm the diagnosis.

@craig-warren
Member

@LaanstraGJ pycuda.driver.Device(number) just takes the PCI bus ID of the device you want to run on. This should be what is given in CUDA_VISIBLE_DEVICES. I am still confused by what you said about a GPU device being used by another user. When a user launches a job, Slurm will set CUDA_VISIBLE_DEVICES to the available GPUs for that user's job (which no one else should be able to use?), then gprMax will read the list in CUDA_VISIBLE_DEVICES and use that to set pycuda.driver.Device.

@LaanstraGJ
Author

The problem is that pycuda.driver.Device always uses id=0 for the first available card, id=1 for the second, and so on,
even if CUDA_VISIBLE_DEVICES contains a list such as [2,3] (where 0 and 1 are already in use by another job).

@craig-warren
Member

@LaanstraGJ I don't think so; pycuda.driver.Device will take whatever integer number you give it (but it should be a valid PCI bus ID), see https://documen.tician.de/pycuda/driver.html#pycuda.driver.Device and:

dev = drv.Device(G.gpu.deviceID)

@LaanstraGJ
Author

It's pycuda.driver.Device(number) or pycuda.driver.Device(pci_bus_id) see https://documen.tician.de/pycuda/driver.html#pycuda.driver.Device
gprMax uses the number method (utilities.py)
gprMax/utilities.py#L403

@craig-warren
Member

Yes, I think we are saying the same thing here. So I'm wondering if the problem you are seeing is related to the following:

When possible, Slurm automatically determines the GPUs on the system using NVML. NVML (which powers the nvidia-smi tool) numbers GPUs in order by their PCI bus IDs. For this numbering to match the numbering reported by CUDA, the CUDA_DEVICE_ORDER environmental variable must be set to CUDA_DEVICE_ORDER=PCI_BUS_ID.

GPU device files (e.g. /dev/nvidia1) are based on the Linux minor number assignment, while NVML's device numbers are assigned via PCI bus ID, from lowest to highest. Mapping between these two is indeterministic and system dependent, and could vary between boots after hardware or OS changes. For the most part, this assignment seems fairly stable. However, an after-bootup check is required to guarantee that a GPU device is assigned to a specific device file.
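
A minimal sketch of how that CUDA_DEVICE_ORDER setting could be applied from Python (illustrative only, not part of gprMax); it only takes effect if set before the first CUDA call in the process:

import os

# Must be set before CUDA is initialised (e.g. before drv.init() or before
# importing pycuda.autoinit) for the PCI-bus-ID ordering to take effect.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'

import pycuda.driver as drv
drv.init()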

@craig-warren
Member

@LaanstraGJ I've been thinking about this some more. A couple of questions:

  1. Are you using the MPI task farm with GPUs? i.e. requesting multiple GPUs?
  2. How are you providing the GPU deviceIDs when you call gprMax?

I think this second question may be the root of the issue. The code in the utilities module is just a checker; it still relies on you to give valid deviceIDs when you call gprMax, and this in itself depends on what is available, and therefore offered by Slurm via CUDA_VISIBLE_DEVICES. So, in essence, if you are not passing a list containing CUDA_VISIBLE_DEVICES when you call gprMax, that is likely why it is not working properly.

I think what would be useful is to change the code in gprMax so that if the -mpi and -gpu flags are given then it automatically looks up and sets the GPU deviceIDs based on CUDA_VISIBLE_DEVICES. What do you think?
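
A minimal sketch of what that automatic lookup could look like; the function name and the integration point are hypothetical, not gprMax's actual API:

import os

def gpu_ids_from_cuda_visible_devices():
    """Map the GPUs Slurm granted to the ordinals CUDA exposes inside the job."""
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if not visible:
        return []
    # Slurm lists the physical devices it granted, e.g. "2,3"; inside the job
    # CUDA re-numbers them from 0, so the usable ordinals are simply 0..n-1.
    return list(range(len(visible.split(','))))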

@craig-warren
Member

When I checked how I'd been using gprMax on an HPC with GPUs, I found the code below (for 8 traces and 2 GPUs):

devIDs="$(srun echo $CUDA_VISIBLE_DEVICES | sed 's/,/ /g')"
conda activate gprMax
srun -n 3 python -m gprMax user_models/cylinder_Bscan_2D.in -n 8 -mpi 3 -gpu $devIDs
