HPC/Slurm clusters: gprMax using GPU(s) fails to start if the first GPU is already occupied #241
Comments
@LaanstraGJ Interesting.... I assumed (perhaps incorrectly) that the Slurm scheduler sets $CUDA_VISIBLE_DEVICES to whatever GPUs are available solely for that user's job, so the GPU resource couldn't be in conflict with another user. I don't think pycuda requires the ordinal to start from 0; you can supply any deviceID that is valid.
According to the Slurm documentation, CUDA_VISIBLE_DEVICES is set correctly.
@LaanstraGJ pycuda.driver.Device(number) just takes the PCI bus ID of the device you want to run on. This should be what is given in CUDA_VISIBLE_DEVICES. I am still confused by what you said about a GPU device being used by another user. When a user launches a job, Slurm will set CUDA_VISIBLE_DEVICES to the available GPUs for that user's job (which no one else should be able to use?), then gprMax will read the list in CUDA_VISIBLE_DEVICES and use that to set pycuda.driver.Device.
The problem is that pycuda.driver.Device always uses id=0 for the first available card, id=1 for the second, and so on.
@LaanstraGJ I don't think so; pycuda.driver.Device will take whatever integer number you give it (but it should be a valid PCI bus ID), see https://documen.tician.de/pycuda/driver.html#pycuda.driver.Device and gprMax/gprMax/model_build_run.py (line 501 at 856afd4).
It's pycuda.driver.Device(number) or pycuda.driver.Device(pci_bus_id); see https://documen.tician.de/pycuda/driver.html#pycuda.driver.Device
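For reference, a minimal sketch of the two constructor forms being discussed (assuming a working pycuda installation; the variable names dev and same_dev are just illustrative):

```python
import pycuda.driver as drv

drv.init()

# By ordinal: valid values are 0 .. drv.Device.count() - 1 as seen by this
# process, i.e. after any CUDA_VISIBLE_DEVICES filtering has been applied.
dev = drv.Device(0)

# By PCI bus ID string (a different identifier from the Slurm/physical index).
same_dev = drv.Device(dev.pci_bus_id())

print(dev.name(), dev.pci_bus_id(), drv.Device.count())
```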
Yes, I think we are saying the same thing here. So I'm wondering if the problem you are seeing is related to the following:
@LaanstraGJ I've been thinking about this some more. A couple of questions:
I think this second question may be the root of the issue. The code in the utilities module is just a checker: it still relies on you to give valid deviceIDs when you call gprMax, and these in turn depend on what is available and therefore offered by Slurm via CUDA_VISIBLE_DEVICES. I think what would be useful is to change the code in gprMax so that, if CUDA_VISIBLE_DEVICES is set, it is used to determine the available deviceIDs.
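A rough sketch of the kind of check being described (an illustration only, not gprMax's actual code; the function name and error message are made up):

```python
import os

import pycuda.driver as drv


def check_gpu_device_ids(requested):
    """Validate user-supplied deviceIDs against the GPUs visible to this job."""
    drv.init()

    if 'CUDA_VISIBLE_DEVICES' in os.environ:
        # Slurm exposes only the allocated GPUs; the values are the physical
        # indices it granted, e.g. "2" or "1,3".
        available = [int(s) for s in os.environ['CUDA_VISIBLE_DEVICES'].split(',')]
    else:
        available = list(range(drv.Device.count()))

    invalid = [d for d in requested if d not in available]
    if invalid:
        raise ValueError('GPU deviceID(s) {} not available; available: {}'.format(invalid, available))
    return requested
```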
When I checked how I'd been using gprMax on an HPC with GPUs I found the code below (for 8 traces and 2 GPUs):
On our cluster, users have started to use gprMax to run simulations.
Other tools using the GPUs run fine; these nodes contain multiple GPUs.
When running gprMax with GPU support (pycuda) we've noticed that if the first GPU is already claimed by another user, an additional job on the second, third (etc.) GPU won't start, due to an error.
There is probably a conflict between the environment variable set by the Slurm scheduler and the ordinal, or list of ordinals, required by (Py)CUDA.
$CUDA_VISIBLE_DEVICES contains the list of physical devices on the node as allocated by Slurm.
So if device 2 is allocated, $CUDA_VISIBLE_DEVICES contains "2".
(Py)CUDA, however, requires the ordinal to start from 0 for the first visible device, and so on.
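To illustrate the mismatch, a minimal sketch, assuming Slurm has allocated only physical GPU 2 to the job (so CUDA_VISIBLE_DEVICES=2); the exact error text may vary:

```python
import pycuda.driver as drv

# Assumption for this sketch: Slurm has set CUDA_VISIBLE_DEVICES=2 for this job.
drv.init()

print(drv.Device.count())  # 1: only the allocated GPU is visible to the process
dev = drv.Device(0)        # works: ordinals are renumbered from 0 inside the job
dev = drv.Device(2)        # fails ("invalid device ordinal"), even though "2" is
                           # exactly what CUDA_VISIBLE_DEVICES contains
```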
Solution:
Either take the available deviceIDs from CUDA_VISIBLE_DEVICES when it is set, or fall back to enumerating all devices:

```python
elif 'CUDA_VISIBLE_DEVICES' in os.environ:
    deviceIDsavail = os.environ.get('CUDA_VISIBLE_DEVICES')
    deviceIDsavail = [int(s) for s in deviceIDsavail.split(',')]
else:
    deviceIDsavail = range(drv.Device.count())
```