SSH improvement phase2 #390
base: master
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
- Coverage   53.99%   53.86%   -0.14%
==========================================
  Files          36       36
  Lines       12017    12051      +34
  Branches     3708     3709       +1
==========================================
+ Hits         6489     6491       +2
- Misses       4554     4586      +32
  Partials      974      974
==========================================
```
Continue to review full report at Codecov.
This pull request fixes 4 alerts when merging e7226ba into 8956356 (view on LGTM.com).
Thanks for the contribution! I added some minor comments.
```diff
 else:
-    return check_job_status(self.job_id)
+    server_status = check_job_status(self.job_id)
+    # `server_status` should only be 'initializing' / 'running' / 'errored' / 'done'
```
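As context for the snippet above, `check_job_status` is expected to map the scheduler's raw job state onto the four statuses listed in the comment. A minimal illustrative mapping (the SLURM state codes and the function shape here are assumptions, not the actual ARC code, which takes a job ID and queries the server):

```python
def check_job_status(raw_slurm_state: str) -> str:
    """Map a raw SLURM job state onto the four statuses used by ARC.

    Illustrative stand-in: the real check_job_status() takes a job ID
    and queries the scheduler; here we only show the state mapping.
    """
    mapping = {
        'PD': 'initializing',  # pending
        'CF': 'initializing',  # node is being configured
        'R': 'running',
        'CG': 'running',       # completing
        'CD': 'done',          # completed
        'F': 'errored',        # failed
    }
    # anything unrecognized is treated as an error state
    return mapping.get(raw_slurm_state, 'errored')
```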
don't we need to know the node for troubleshooting by switching nodes?
Based on the current workflow, changing the node actually happens in the function `Job.troubleshoot_server()` (line 1535 in arc/job/job.py). A (possibly) available node is picked by `trsh_job_on_server` in this method. We don't need to know the current node, since it is saved in `self.server_nodes`, and we don't need to know in advance which nodes are available, since `trsh_job_on_server` will find one. The `server_status` here is only used to assign a value to `self.job_status[0]` and does not play a role in switching nodes. However, if I missed something, please let me know. Thanks!
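The node-switching flow described above can be sketched roughly as follows. This is a simplified stand-in: `trsh_job_on_server` and `server_nodes` are named after the real ARC pieces, but the bodies here are illustrative only (the real `trsh_job_on_server` also inspects the server configuration):

```python
def trsh_job_on_server(available_nodes, tried_nodes):
    """Pick a node that has not been tried yet (simplified stand-in)."""
    for node in available_nodes:
        if node not in tried_nodes:
            return node
    return None  # no untried node is left


def troubleshoot_server(server_nodes, available_nodes):
    """Sketch of Job.troubleshoot_server(): record the newly picked node
    in the tried list so it will not be retried, then return it."""
    node = trsh_job_on_server(available_nodes, server_nodes)
    if node is not None:
        server_nodes.append(node)
    return node
```

Note that the caller never needs to know the node in advance: the tried nodes accumulate in `server_nodes`, and the picker simply skips them.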
```diff
-logger.debug(
-    f'{remote_file_path} does not exist on {self.server}.')
+raise InputError(f'{remote_file_path} does not exist on {self.server}.')
```
I think we should raise `ServerError`. Are these errors being captured, or will they crash ARC?
I think this is a good question. The answer depends on how we regard the context of this error. If we regard this function as a standalone API used to download files, then a non-existing `remote_file_path` simply means we passed invalid arguments, since the failure is not due to a malfunction of this method/class/connection. In an ARC job, however, this function is called because we believe there should be a file at that path and we want to download it, yet the path does not actually exist. Still, the failure is not due to a malfunction of this method/class/connection, so I think it is inappropriate to call it a `ServerError`. More likely, some previous step went in an undesired direction. I would suggest letting an upper-level function catch the `InputError` and re-raise it as a `JobError`, and this is what I did in `Job` (in `Job._download_output_file`). The re-raised `JobError` will then be caught by `job.determine_job_status` (two levels up), which will analyze whether we can fix the problem. If we can, we submit a new job; if we cannot, we mark this job as `errored`.
```
Get the additional information of stdout and stderr according to the
queue software outputs.

Returns:
```
while revisiting this method, please add a type hint for the returned value
Yeah, I think a later commit did. The returned item is a `str`, which can be either an empty string `''` or a very long string that combines all of the lines from the downloaded files (basically files like *.out and *.err).
@xiaoruiDong, setting up a dummy server for the tests is an interesting point. If you have time, could you look into it? Perhaps we can see how paramiko tests itself.
@xiaoruiDong, what's the status of this PR?
@alongd Hi Alon, I am sorry that I haven't had time to look at the dummy server. As for the branch, I've been using it to run my jobs for a while, and it works okay.
OK, please rebase.
Both Travis tests fail due to an RMG issue (#2011, ReactionMechanismGenerator/RMG-Py#2011), which is not related to the changes in this PR. The tests need to be rerun once that issue is solved.
The server status can be confusing if a job is cancelled but still marked as 'done'. Add more comments and a TODO, and clean up the code.
Previously, this error was deliberately not raised, because paramiko would not raise an IOError and would simply create an empty file locally. However, this behavior has been observed to change since a certain version. Better to update the code dealing with non-existing files consistently.
If we cannot find a downloaded file, the download cannot be regarded as successful. Add a check for that.
Previously, these errors were not effectively caught. There could be cases where the download fails and raises an error without ever reaching the file-existence check. After standardizing `ssh.download`, we allow better error handling.
'download' is not accurate, and the returns are not indicated.
Correct the error handling related to downloading. Refactor the method into 3 parts: (1) setting the files to read from, (2) downloading if necessary, and (3) reading through the files. Ideally, if any further changes are needed, developers can focus on the first section.
A node failure does not have well-determined results: sometimes the output file is available remotely, sometimes not. However, a node-failure warning is recorded in stdout/stderr. This commit allows catching node-failure issues comprehensively.
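The standardized download with the post-download existence check mentioned in the commit messages above could be sketched as follows. This is a hedged illustration under the assumption that `sftp` is a connected paramiko `SFTPClient`; the wrapper name and `ServerError` usage here are illustrative, not the exact ARC code:

```python
import os


class ServerError(Exception):
    """Raised when a file could not be retrieved from the server."""


def download_file(sftp, remote_file_path, local_file_path):
    """Download a remote file and verify it actually landed locally.

    `sftp` is assumed to be a connected paramiko SFTPClient. Newer
    paramiko versions raise IOError for a missing remote path; older
    versions could silently create an empty local file, so we always
    verify the local file exists after the attempt.
    """
    try:
        sftp.get(remotepath=remote_file_path, localpath=local_file_path)
    except IOError:
        pass  # fall through to the existence check below
    if not os.path.isfile(local_file_path):
        raise ServerError(f'Could not download file {remote_file_path} from the server.')
```

This way the caller sees a single, consistent error regardless of which paramiko behavior is in play.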
@xiaoruiDong, please rebase again and tag me so I'll merge. Thanks.
…y change, SSH File Download Correction, Null Byte Read, Remote Remove Files
1. Adapter: Submit script format updated to use the user-provided username {un}, convert memory to an 'int', and provide the option of {server_nodes} if required
2. Adapter: Total submit memory adjusted to ensure that, when troubleshooting a job, it never attempts to go OVER the maximum allowed submission memory of the node/server
3. Adapter: SLURM submit memory: now using `#SBATCH --mem` as the parameter, since it defines the TOTAL memory of the submission
4. Adapter: SSH file download: we do not expect to always download or upload certain files via SSH, depending on the scheduler. This change allows recognising whether certain files will be uploaded or downloaded depending on the user's scheduler choice
5. Adapter: Null bytes can appear in files; more specifically, if QChem has an error, it can produce null bytes in the out.txt/err.txt files, which requires a different way of reading the file contents. This is not 100% foolproof, though, and may need extra work
6. Adapters: In the #390 branch, SSH had improvements that were not merged. I have brought forward an improvement from that branch where remote files are removed once they are downloaded to the local computer
Inspired by branch #390
1. SSH: Updated decorator to use the correct connect function
2. SSH: If the user provides a server that is not in servers.keys() and the server is also not None, an error is raised to inform the user that they need to fix up the server keys
3. SSH: An error that can occur is a submission to a scheduler that includes an incorrect memory specification; there is now a warning to the user that the requested memory is not supported and needs to be checked. May need to make this a ValueError instead of a logger warning
4. SSH: Slight adjustment to checking whether there is stdout after a submission attempt
5. SSH: Some servers require private keys. Originally the code was incorrectly adding the private key to the SSH paramiko function. It has now been changed so that system keys are loaded, and then, if the user provides a private key, it is included in the connect function
6. SSH: Updated default arguments to `get_last_modified_time`
7. SSH: Changed the lowering of the cluster soft
8. SSH: Added a function to remove a directory on the remote server
9. SSH: Azure SLURM has an extra status called 'CF', which means the node is being configured. This can take 10-15 minutes or so before the node is online. We now ensure to capture this. HOWEVER, a node can get stuck in 'CF' status. We now check this by checking how long the node has been active, splitting up the time correctly (different time formats are possible), and, if it is above 15 minutes, running the command `scontrol show node {node_id}`. If the stdout includes the phrase 'NOT_RESPONDING', we return 'errored'
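The 'CF'-status handling in item 9 above could be sketched like this. The time-format parsing and the `scontrol` stdout check are assumptions based on the description, not the exact ARC code:

```python
def parse_slurm_elapsed_minutes(elapsed: str) -> float:
    """Parse a SLURM elapsed-time string into minutes.

    Handles the common formats 'MM:SS', 'HH:MM:SS', and 'D-HH:MM:SS'
    (assumed set of formats; SLURM can emit any of these).
    """
    days = 0
    if '-' in elapsed:
        day_str, elapsed = elapsed.split('-')
        days = int(day_str)
    parts = [int(p) for p in elapsed.split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes)
    hours, minutes, seconds = parts
    return days * 24 * 60 + hours * 60 + minutes + seconds / 60


def check_cf_status(elapsed: str, scontrol_stdout: str) -> str:
    """Return 'errored' if a node has been stuck in 'CF' for over 15 minutes
    and `scontrol show node` reports it as NOT_RESPONDING; otherwise keep
    treating the job as 'initializing'."""
    if parse_slurm_elapsed_minutes(elapsed) > 15 and 'NOT_RESPONDING' in scontrol_stdout:
        return 'errored'
    return 'initializing'
```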
This is the second phase of the SSH and server-communication improvement. A major impetus is that I found the behavior of `paramiko` is slightly different now. In a previous version, when I tried to download a non-existing file with `sftp.get()`, no error was raised, but an empty file was created at the `localpath` I assigned. In the version I use now (I recently updated my ARC environment; it is the latest stable `2.7`), the behavior of `paramiko` has changed: it raises an `IOError` (no such file). We tried to catch `IOError` in ARC, but some of the intermediate steps were problematic (raising another type of error while catching this one, or having no catch at all), so we could only deal with part of the issues. Besides, if ARC dies unexpectedly (say, ARC has created an input file but hasn't submitted it to the server when my computer crashes), the current ARC cannot restart the job. Anyway, this is a patch to the previous SSH improvement to make the system more robust.
However, this PR leaves out the case when the quota is reached. (Proposal: we can delete finished jobs within the scope of the project from the server.) This is aimed to be solved in a future PR.
For reviewers: wait until the first SSH improvement is finished. But if you are interested, you can start reviewing from
Minor: consistently return `errored` in _check_job_server_status
Another proposal: is there a way to set up a toy / virtual server so that we can test SSH- and server-communication-related actions?