
CalledProcessError: 9 #854

Open
raitis-b opened this issue May 17, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

@raitis-b

Hi,

I tried to run the OpenFE tutorial on my laptop and everything worked just fine, but when I tried to run it on our cluster I ran into an issue. On a GPU node it gave an error that has been mentioned before (GPU in 'Exclusive_Process' mode (or Prohibited), one context is allowed per device. This may prevent some openmmtools features from working. GPU must be in 'Default' compute mode). While we fix that, I wanted to run without the GPU, but this led to another error:

$ openfe quickrun transformations/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent.json -o results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node.json -d results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node

Loading file...
Planning simulations for this edge...
Starting the simulations for this edge...
Done with all simulations! Analyzing the results....
Here is the result:
dG = None ± None

Error: The protocol unit 'lig_ejm_31 to lig_ejm_46 repeat 2 generation 0' failed with the error message:
CalledProcessError: 9

Details provided in output.

The only output is the attached .json file.

Cheers,
Raitis

easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_no_gpu.json

@mikemhenry added the bug (Something isn't working) label on May 17, 2024
@mikemhenry
Contributor

@raitis-b

Thank you for the bug report! Looking at the JSON file and cleaning it up a bit (I just used Firefox to view it; it does a decent job of rendering these JSON files), it looks like

Traceback (most recent call last):
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/gufe/protocols/protocolunit.py", line 320, in execute
    outputs = self._execute(context, **inputs)
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 1127, in _execute
    log_system_probe(logging.INFO, paths=[ctx.scratch])
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 502, in log_system_probe
    sysinfo = _probe_system(pl_paths)['system information']
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 471, in _probe_system
    gpu_info = _get_gpu_info()
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 340, in _get_gpu_info
    nvidia_smi_output = subprocess.check_output(
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.

the nvidia-smi command failed. Could you run nvidia-smi on the same machine/node where you ran the simulation and report back what it does? Code 9 is SIGKILL, so I think that command got killed by some other process.

Regardless, we want to make sure this command doesn't prevent a simulation from running, so we need to enhance our error handling of it.
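For illustration, a defensive wrapper along these lines would let the GPU probe degrade gracefully instead of failing the protocol unit. The function name and return shape here are assumptions made for the sketch, not the actual openfe code:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def _get_gpu_info_safe() -> dict:
    """Query nvidia-smi, but never let a failure propagate.

    Hypothetical sketch: returns an empty dict when nvidia-smi is missing,
    the driver is not loaded, or the call fails for any other reason.
    """
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=gpu_uuid,gpu_name,compute_mode",
             "--format=csv"],
            text=True,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        logger.info("nvidia-smi not found; skipping GPU probe")
        return {}
    except subprocess.CalledProcessError as err:
        logger.info("nvidia-smi exited with status %d; skipping GPU probe",
                    err.returncode)
        return {}

    # Parse the CSV output as needed; parsing is omitted in this sketch.
    return {"raw": output}
```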

@raitis-b
Author

When I don't request a GPU in the queuing script and run only on the CPU, the nvidia-smi output is:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
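For reference, Python's subprocess module reports a child killed by a signal with a negative return code (e.g. -9 for SIGKILL), while a positive code such as 9 is the program's own exit status, so the traceback above is also consistent with nvidia-smi itself exiting with an error once it cannot reach the driver. A quick sketch illustrating the difference (POSIX only):

```python
import signal
import subprocess

# A child that exits with status 9 on its own is reported as returncode == 9 ...
p1 = subprocess.run(["sh", "-c", "exit 9"])
print(p1.returncode)   # 9

# ... whereas a child killed by SIGKILL is reported with a negative returncode.
p2 = subprocess.Popen(["sleep", "60"])
p2.send_signal(signal.SIGKILL)
print(p2.wait())       # -9
```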
