ssh to launch fails #380

Open
aowen87 opened this issue Nov 5, 2021 · 5 comments
Labels
bug Description of reproducible unexpected behavior.

aowen87 commented Nov 5, 2021

I have a simple YAML file that I use to launch a study. The details of the study aren't that important, and I don't think they relate to this problem.

Here's the problem:

I'm on rztopaz, and I can launch the study by simply running maestro run -y path/to/study.yaml. That works fine.

What I really need, though, is to launch the study on rzansel from rztopaz. So, I tried running something very similar:
ssh rzansel 'cd /where/I/should/be; <activate environment>; maestro run -y path/to/study.yaml'

When I do this, it says the study launched successfully, but it clearly dies while trying to set things up. Here's the error I get in the log file:

[2021-11-05 08:24:18,424: INFO] Checking DAG status at 2021-11-05 08:24:18.424257
[2021-11-05 08:24:18,427: ERROR] Error code '127' seen. Unexpected behavior encountered.
[2021-11-05 08:24:18,427: ERROR] Unknown Error (Code = JobStatusCode.ERROR)
[2021-11-05 08:24:18,428: ERROR] Job status check failed -- Aborting.
[2021-11-05 08:24:18,428: ERROR] ('Job status check failed -- Aborting.',)
Traceback (most recent call last):
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/conductor.py", line 382, in main
    completion_status = conductor.monitor_study()
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/conductor.py", line 352, in monitor_study
    completion_status = dag.execute_ready_steps()
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/datastructures/core/executiongraph.py", line 690, in execute_ready_steps
    raise RuntimeError(msg)
RuntimeError: Job status check failed -- Aborting.
[2021-11-05 08:24:18,429: INFO] Study exiting, cleaning up...
[2021-11-05 08:24:18,429: INFO] Squeaky clean!
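A note on the `Error code '127'` line above: in POSIX shells, exit status 127 conventionally means the command could not be found. The log does not say which command was being run when the status check failed, so the following is only a minimal sketch of how that status arises. The command name `no_such_scheduler_command_xyz` is hypothetical and chosen so that it does not exist on `PATH`.

```python
import subprocess

# Hypothetical command name (an assumption for illustration only);
# it is chosen so that the shell cannot find it on PATH.
MISSING = "no_such_scheduler_command_xyz"


def shell_exit_code(command: str) -> int:
    """Run `command` through the system shell and return its exit status."""
    result = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    # A POSIX shell reports 127 when it cannot find the command.
    print("missing command:", shell_exit_code(MISSING))
    # A command that exists and succeeds reports 0.
    print("existing command:", shell_exit_code("true"))
```

If the same status shows up only under ssh, one plausible reading (not confirmed in this thread) is that a command available in the interactive session is not on `PATH` in the ssh session.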

If I run the following command from rzansel, everything works fine:
cd /where/I/should/be; <activate environment>; maestro run -y path/to/study.yaml

This makes me think that the issue comes from the ssh call. Unfortunately, the ssh is required for my particular study. Any ideas here?
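When a command is passed directly to ssh, the remote shell is typically non-interactive and not a login shell, so login profile scripts such as `/etc/profile` may not be sourced. One thing to try is to wrap the remote command in an explicit login shell. The sketch below builds such an ssh invocation; it assumes `bash` is available on the remote host, the helper name `build_ssh_command` is hypothetical, and the activation command in the example is a placeholder rather than the real one. Whether this resolves the failure is not confirmed in this thread.

```python
import shlex
from typing import List


def build_ssh_command(host: str, workdir: str, activate: str, study: str) -> List[str]:
    """Build an ssh argv that runs `maestro run -y` inside a remote login shell.

    `activate` is passed through verbatim, since it is itself a shell command.
    """
    # Chain with && so that a failed cd or activation stops the launch.
    inner = (
        f"cd {shlex.quote(workdir)} && {activate} && "
        f"maestro run -y {shlex.quote(study)}"
    )
    # -l asks bash to behave as a login shell; -c runs the given command string.
    remote = f"bash -lc {shlex.quote(inner)}"
    return ["ssh", host, remote]


if __name__ == "__main__":
    cmd = build_ssh_command(
        host="rzansel",
        workdir="/where/I/should/be",
        activate="source venv/bin/activate",  # placeholder, not the real command
        study="path/to/study.yaml",
    )
    print(cmd)
    # To actually launch: subprocess.run(cmd, check=True)
```

Quoting the inner command with `shlex.quote` keeps it as a single argument to `bash -lc` after the remote shell parses it.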

FrankD412 added the bug label Nov 5, 2021

aowen87 commented Nov 5, 2021

I think this might have to do with interactive vs. non-interactive environments. Setting some of the environment variables that an interactive session has, but the ssh session is missing, gets me further. It still fails eventually, but the study is actually launched.
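To find which variables differ between the two sessions, one option is to save the output of `env` from an interactive login on rzansel and from `ssh rzansel env`, then compare the two. The sketch below is a minimal version of that comparison. The function names and the sample values are hypothetical, and the parser assumes one `NAME=value` pair per line, so multi-line values would not be handled correctly.

```python
from typing import Dict, List, Tuple


def parse_env(text: str) -> Dict[str, str]:
    """Parse `env` output (one NAME=value per line) into a dict."""
    env: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # skip lines without a NAME= prefix
            env[name] = value
    return env


def compare_envs(
    interactive: Dict[str, str], noninteractive: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, changed) variable names relative to the interactive env."""
    missing = sorted(k for k in interactive if k not in noninteractive)
    changed = sorted(
        k
        for k in interactive
        if k in noninteractive and interactive[k] != noninteractive[k]
    )
    return missing, changed


if __name__ == "__main__":
    # Made-up sample data; real input would come from the two saved `env` dumps.
    interactive = parse_env(
        "PATH=/usr/bin:/opt/example/bin\nMODULEPATH=/opt/modules\nHOME=/home/u\n"
    )
    noninteractive = parse_env("PATH=/usr/bin\nHOME=/home/u\n")
    missing, changed = compare_envs(interactive, noninteractive)
    print("missing under ssh:", missing)
    print("different under ssh:", changed)
```

Variables reported as missing or changed, `PATH` in particular, are the first candidates to set explicitly in the ssh command.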

FrankD412 (Member) commented

Hi @aowen87 -- thanks for the bug report; I was thinking it might be something with the interactive environment. Do you happen to have an error for the one you got launched but that ended up failing?

aowen87 commented Nov 9, 2021

I'm pretty sure that second error was actually my fault. I've gotten a bit distracted and haven't had a chance to look at this again, but I'll post more info later if I'm wrong about this.

FrankD412 (Member) commented

@aowen87 -- No worries and no rush; I just wanted to make sure that you got all the support you needed. :-)

aowen87 commented Nov 9, 2021

Thanks! I appreciate it!
