
CalledProcessError: 9 #854

Open
raitis-b opened this issue May 17, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

@raitis-b

Hi,

I tried to run the OpenFE tutorial on my laptop and everything worked just fine, but when I tried to run it on our cluster I ran into an issue. On a GPU node it gave an error that has been mentioned before (GPU in 'Exclusive_Process' mode (or Prohibited), one context is allowed per device. This may prevent some openmmtools features from working. GPU must be in 'Default' compute mode). While we fix that, I wanted to run without the GPU, but this led to another error:

$ openfe quickrun transformations/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent.json -o results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node.json -d results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node

Loading file...
Planning simulations for this edge...
Starting the simulations for this edge...
Done with all simulations! Analyzing the results....
Here is the result:
dG = None ± None

Error: The protocol unit 'lig_ejm_31 to lig_ejm_46 repeat 2 generation 0' failed with the error message:
CalledProcessError: 9

Details provided in output.

The only output is the attached .json file.

Cheers,
Raitis

easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_no_gpu.json

@mikemhenry added the bug (Something isn't working) label on May 17, 2024
@mikemhenry
Contributor

@raitis-b

Thank you for the bug report! Looking at the JSON file and cleaning it up a bit (I just used Firefox to view it; it does a decent job of rendering these JSON files), it looks like

Traceback (most recent call last):
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/gufe/protocols/protocolunit.py", line 320, in execute
    outputs = self._execute(context, **inputs)
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 1127, in _execute
    log_system_probe(logging.INFO, paths=[ctx.scratch])
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 502, in log_system_probe
    sysinfo = _probe_system(pl_paths)['system information']
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 471, in _probe_system
    gpu_info = _get_gpu_info()
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 340, in _get_gpu_info
    nvidia_smi_output = subprocess.check_output(
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.

the nvidia-smi command failed. Could you run nvidia-smi on the same machine/node where you ran the simulation and report back what it does? Code 9 is SIGKILL, so I think that command got killed by some other process.

Regardless, we want to make sure this command doesn't prevent a simulation from running, so we need to enhance our error handling of it.
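For illustration, a defensive wrapper along these lines would let the GPU probe degrade gracefully instead of failing the protocol unit. The function name and return shape here are assumptions made for the sketch, not the actual openfe code:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def _get_gpu_info_safe() -> dict:
    """Query nvidia-smi, but never let a failure propagate.

    Hypothetical sketch: returns an empty dict when nvidia-smi is missing,
    the driver is not loaded, or the call fails for any other reason.
    """
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=gpu_uuid,gpu_name,compute_mode",
             "--format=csv"],
            text=True,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        logger.info("nvidia-smi not found; skipping GPU probe")
        return {}
    except subprocess.CalledProcessError as err:
        logger.info("nvidia-smi exited with status %d; skipping GPU probe",
                    err.returncode)
        return {}

    # Parse the CSV output as needed; parsing is omitted in this sketch.
    return {"raw": output}
```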

@raitis-b
Author

When I don't request a GPU in the queuing script and run only on the CPU, the nvidia-smi output is:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
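For reference, Python's subprocess module reports a child killed by a signal with a negative return code (e.g. -9 for SIGKILL), while a positive code such as 9 is the program's own exit status, so the traceback above is also consistent with nvidia-smi itself exiting with an error once it cannot reach the driver. A quick sketch illustrating the difference (POSIX only):

```python
import signal
import subprocess

# A child that exits with status 9 on its own is reported as returncode == 9 ...
p1 = subprocess.run(["sh", "-c", "exit 9"])
print(p1.returncode)   # 9

# ... whereas a child killed by SIGKILL is reported with a negative returncode.
p2 = subprocess.Popen(["sleep", "60"])
p2.send_signal(signal.SIGKILL)
print(p2.wait())       # -9
```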
