SSH improvement phase2 #390

xiaoruiDong · 2020-05-06T17:43:40Z

This is the second phase of SSH and server communication improvement. A major impetus is that I found the behavior of paramiko is slightly different now.

In a previous version, when I tried to download a non-existing file with sftp.get(), there will be no error raised but an empty file at the localpath I assigned. For the version I used (I recently updated my arc environment, it is now the latest stable2.7), the behavior of paramiko is changed. It will raise an IOError (no such file). We tried to catch IOError in ARC, but due to some of the intermediate step is problematic (raise A another type of Error when catching this error, or have no catches at all), but we can only deal with part of the issues. Besides, if the ARC dead weirdly (say ARC has created an input file but hasn't submitted it to the server when my computer crashed), the current ARC cannot restart the job.

Anyway, this is a patch to the previous SSH Improvement to make the system more robust.

However, this PR leaves the case when the quota is reached. (Proposal: We can delete finished jobs done within the scope of the project from the server). It is aimed to solve in the future PR.

For reviewer: Wait until the first SSH improvement finished. But if you are interested, you can start reviewing from Minor: consistently return `errored` in _check_job_server_status

Another proposal: is there a way to setup a toy / virtual server so that we can test SSH and server-communication-related actions.

codecov · 2020-05-06T18:13:57Z

Codecov Report

Merging #390 into master will decrease coverage by 0.13%.
The diff coverage is 2.97%.

@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
- Coverage   53.99%   53.86%   -0.14%     
==========================================
  Files          36       36              
  Lines       12017    12051      +34     
  Branches     3708     3709       +1     
==========================================
+ Hits         6489     6491       +2     
- Misses       4554     4586      +32     
  Partials      974      974

Impacted Files	Coverage Δ
arc/job/trsh.py	`42.17% <0.00%> (ø)`
arc/scheduler.py	`19.94% <0.00%> (-0.02%)`	⬇️
arc/job/job.py	`21.13% <2.46%> (-0.34%)`	⬇️
arc/job/ssh.py	`22.26% <6.66%> (-0.81%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b22df49...4156e8a. Read the comment docs.

lgtm-com · 2020-05-09T23:32:46Z

This pull request fixes 4 alerts when merging e7226ba into 8956356 - view on LGTM.com

fixed alerts:

3 for Except block handles 'BaseException'
1 for Accepting unknown SSH host keys when using Paramiko

alongd

Thanks for the contribution! I added some minor comments

alongd · 2020-06-07T00:14:58Z

arc/job/job.py

        else:
-            return check_job_status(self.job_id)
+            server_status = check_job_status(self.job_id)
+        # `server_status` should only be 'initializing' / 'running' / 'errored' / 'done'


don't we need to know the node for troubleshooting by switching nodes?

Based on the current workflow, changing node actually happens in the function Job.troubleshoot_server(). (line 1535 in arc/job/job.py). A (possible) available is picked by trsh_job_on_server in this method. We don't need to know the current node, since it is saved in self.server_nodes, we don't need to know what nodes are available in prior, since trsh_job_on_server will find one. The server_status here is only used to assign a value to self.job_status[0] and does not play a role in switching nodes. However, if I miss something, please let me know. Thanks!

alongd · 2020-06-07T00:18:46Z

arc/job/ssh.py

            logger.debug(
                f'{remote_file_path} does not exist on {self.server}.')
+            raise InputError(f'{remote_file_path} does not exist on {self.server}.')


I think we should raise ServerError
Are these errors being captured, or will they crash ARC?

I think this is a good question. The answer depends on how we regard the context of this error. If we regard this function as a single API, and we are using this API to download stuff. Then if remote_file_path does not exist on the server, it simply means we input invalid arguments, since it is not due to a malfunction of this method/class/connection. However, in an ARC job, this function is run since we think there should be a path there, and we want to download it, but the path actually does not exist. Still, I would say this is an error that is not due to a malfunction of this method/class/connection, so I think it is inappropriate to say it is a ServerError. It is more likely that some previous step goes into an undesired way. I would suggest that let some upper-level function catch the InputError and re-raise this error as a JobError. And this is what I did in the Job (in Job._download_output_file). Then, the re-raised JobError will be caught by job.determine_job_status (two levels upper), and job.determine_job_status will analyze whether we can fix this or not. If we can, we submit a new job; if we cannot, we mark this job as errored.

arc/job/ssh.py

alongd · 2020-06-07T00:23:18Z

arc/job/job.py

+        Get the additional information of stdout and stderr according to the 
+        queue software outputs.
+
+        Returns:


while revisiting this method, please add a type hint for the returned value

Yeah, I think one latter commit did. The returned item is a str which can be either empty str '' or very long str which combines all of the lines from the files downloaded (basically, files like *.out and *.err).

arc/job/job.py

arc/job/trsh.py

alongd · 2020-06-07T00:28:55Z

@xiaoruiDong, setting up a dummy server for the tests is an interesting point. If you have time, could you look into it? Perhaps we can see how paramiko tests itself.

alongd · 2020-06-30T16:34:38Z

@xiaoruiDong, what's the status of this PR?

xiaoruiDong · 2020-06-30T16:53:36Z

@xiaoruiDong, what's the status of this PR?

@alongd Hi Alon, I am sorry that I haven't had time look at the dummy server things. For the branch, I've been using this branch to run my jobs for a while. And it works okay.

alongd · 2020-06-30T17:04:47Z

OK, please rebase.
It could be nice to add these in the future if any of us ever have some free time... I know we're all tied up.

xiaoruiDong · 2020-08-20T22:31:01Z

Both Travis tests fail due to RMG issue (#2011, ReactionMechanismGenerator/RMG-Py#2011), which is not related to changes in this PR. Tests need to rerun when that issue is solved.

The server status can be confusing if a job is cancelled but still marked as 'done'. Add more comments and todo and clean the codes.

Previously, an error is chosen to be not raised, due to the fact that paramiko may not raise an IOError and simply create an empty file locally. However, the behavior is witnessed to be changed since a certain version. Better to update the codes dealing with non-existing-file consistently.

If we cannot find a downloaded file, the downloading process can not be regarded as succeeded. Add a check for that.

Previously, the errors are not effectively caught. You may have cases where downloading fails, and raise an error, and never run into the check of file existence. After we standardize `ssh.download`, we allow a better error handling.

'download' is not accurate, and 'returns' are not indicated.

Correct the error handling related to downloading. Refactoring the method to 3 parts, (1) setting files to read from, (2) downloading if necessary, and (3) read through files. Ideally, if any further changes needed, developer can focus on the first section.

The node failure has not well determined results: sometimes, output file is available remotely, sometimes not. However, node failure warning is recorded in the stdout/stderr. This commit allows to catch the node failure issue comprehensively.

alongd · 2020-09-22T20:17:20Z

@xiaoruiDong , please rebase again and tag me so I'll merge. Thanks.

…y change, SSH File Download Correction, Null Byte Read, Remote Remove Files 1. Adapter: Submit Script Format updated with using user provided username {un}, also convert memory to an 'int', and also provide the option of {server_nodes} if required 2. Adapter: Total Submit Memory adjusted to now ensure that when troubleshooting a job, it never attempts to go OVER the maximum memory of the allowed submission memory of the node/server 3. Adapter: SLURM Submit Memory - Using `#SBATCH --mem` as the parameter now as it defines the TOTAL memory of the submission 4. Adapter: SSH File Download - We do not expect to always download or upload certain files depending on the scheduler via SSH. This change allows for recognising if certain files will be uploaded or downloaded depending on the user's scheduler choice 5. Adapter: Null Bytes can appear in files, or rather more specifically, if QChem has an error, it can produce Null Bytes in the out.txt/err.txt files and thus requires a different type of reading of the file contents. This is not 100% full proof though and may need extra work 6. Adapters: In #390 branch, SSH had improvements but were not merged. I have brought forth an improvement from this branch were Remote Files are removed once they are download to the local computer

Inspired by branch #390 1. SSH: Updated decorator to use the correct connect function 2. SSH: If the user provides a server that is not in servers.keys() and server is also not None, then an error is raised to informat the user that they need to fix up the server keys 3. SSH: An error that can occur is when a submission to a scheduler includes an incorrect memory specification, then there is warning to the user that the requested memory is not supported and needs to be checked. May need to make this a ValueError instead of a logger warning 4. SSH: Slight adjustment to checking if there is an stdout after submission attempt 5. SSH: Some servers require private keys. Originally the code was incorrectly adding the private key to the SSH paramiko function. It has now been changed so that system keys are loaded and then if the user provides a private key, it is included in the connect function 6. SSH: Updated default arguments to `get_last_modified_time` 7. SSH: Changed the lowering of the cluster soft 8. SSH: Added a function to remove the directory on the remote server 9. SSH: Azure SLURM has an extra status called 'CF' which means configuring (for the node). This can take 10-15 mins or so before the node is online. We now ensure to caputre this. HOWEVER, a node can get stuck in 'CF' status. Now we check this via checking the current time the node has been active, splitting the time up correctly (different formats of time are possible), and then if it is above 15 minutes, we run the command `scontrol show node {node_id}`. If the stdout includes the phrase 'NOT_RESPONDING' then we return 'errored'

…y change, SSH File Download Correction, Null Byte Read, Remote Remove Files 1. Adapter: Submit Script Format updated with using user provided username {un}, also convert memory to an 'int', and also provide the option of {server_nodes} if required 2. Adapter: Total Submit Memory adjusted to now ensure that when troubleshooting a job, it never attempts to go OVER the maximum memory of the allowed submission memory of the node/server 3. Adapter: SLURM Submit Memory - Using `#SBATCH --mem` as the parameter now as it defines the TOTAL memory of the submission 4. Adapter: SSH File Download - We do not expect to always download or upload certain files depending on the scheduler via SSH. This change allows for recognising if certain files will be uploaded or downloaded depending on the user's scheduler choice 5. Adapter: Null Bytes can appear in files, or rather more specifically, if QChem has an error, it can produce Null Bytes in the out.txt/err.txt files and thus requires a different type of reading of the file contents. This is not 100% full proof though and may need extra work 6. Adapters: In #390 branch, SSH had improvements but were not merged. I have brought forth an improvement from this branch were Remote Files are removed once they are download to the local computer

Inspired by branch #390 1. SSH: Updated decorator to use the correct connect function 2. SSH: If the user provides a server that is not in servers.keys() and server is also not None, then an error is raised to informat the user that they need to fix up the server keys 3. SSH: An error that can occur is when a submission to a scheduler includes an incorrect memory specification, then there is warning to the user that the requested memory is not supported and needs to be checked. May need to make this a ValueError instead of a logger warning 4. SSH: Slight adjustment to checking if there is an stdout after submission attempt 5. SSH: Some servers require private keys. Originally the code was incorrectly adding the private key to the SSH paramiko function. It has now been changed so that system keys are loaded and then if the user provides a private key, it is included in the connect function 6. SSH: Updated default arguments to `get_last_modified_time` 7. SSH: Changed the lowering of the cluster soft 8. SSH: Added a function to remove the directory on the remote server 9. SSH: Azure SLURM has an extra status called 'CF' which means configuring (for the node). This can take 10-15 mins or so before the node is online. We now ensure to caputre this. HOWEVER, a node can get stuck in 'CF' status. Now we check this via checking the current time the node has been active, splitting the time up correctly (different formats of time are possible), and then if it is above 15 minutes, we run the command `scontrol show node {node_id}`. If the stdout includes the phrase 'NOT_RESPONDING' then we return 'errored'

xiaoruiDong added the Do not merge label May 6, 2020

xiaoruiDong force-pushed the ssh_phase2 branch from ec69370 to e7226ba Compare May 9, 2020 23:11

xiaoruiDong force-pushed the ssh_phase2 branch 2 times, most recently from 79778b5 to 19ae9a1 Compare May 10, 2020 04:39

xiaoruiDong requested a review from alongd May 10, 2020 18:17

xiaoruiDong self-assigned this May 10, 2020

xiaoruiDong added Module: Job Module: SSH Topic: Server Type: Risk of Error and removed Do not merge labels May 10, 2020

xiaoruiDong force-pushed the ssh_phase2 branch from 19ae9a1 to 0f8baa9 Compare May 19, 2020 16:39

alongd reviewed Jun 7, 2020

View reviewed changes

xiaoruiDong force-pushed the ssh_phase2 branch from 0f8baa9 to ff8c3a8 Compare August 20, 2020 22:15

xiaoruiDong added 9 commits September 17, 2020 17:50

Minor: consistently return errored in _check_job_server_status

2b436a3

Update server status in determine_job_status and add more comments

92a44f6

The server status can be confusing if a job is cancelled but still marked as 'done'. Add more comments and todo and clean the codes.

More comprehensive error handling regarding downloading files

43dfd8e

Add a check of local file existence after downloading a file

b2b1619

If we cannot find a downloaded file, the downloading process can not be regarded as succeeded. Add a check for that.

Correct the error handling in _download_output_file

74f03d6

Previously, the errors are not effectively caught. You may have cases where downloading fails, and raise an error, and never run into the check of file existence. After we standardize `ssh.download`, we allow a better error handling.

Correctly catch the JobError when having trouble downloading a file

606d079

Correct the docstring of _get_additional_job_info

e331c92

'download' is not accurate, and 'returns' are not indicated.

xiaoruiDong added 8 commits September 17, 2020 17:50

Minor: slightly formatting change to handling time limit issue

fa94768

Add support for unknown issues in determine_job_status

fea93f3

Minor: add a process description inside the determine_job_status

ccd8f2f

BugFix: correct trsh_job_on_server()

0f4b4a2

Correctly handle bad downloading issue in the scheduler

95cfb4c

handling the case if the job is never submitted

09e8554

Remove remote files after the job is finished

4156e8a

xiaoruiDong force-pushed the ssh_phase2 branch from ff8c3a8 to 4156e8a Compare September 17, 2020 21:51

calvinp0 mentioned this pull request Jul 30, 2023

Azure server #689

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SSH improvement phase2 #390

SSH improvement phase2 #390

xiaoruiDong commented May 6, 2020 •

edited

codecov bot commented May 6, 2020 •

edited

lgtm-com bot commented May 9, 2020

alongd left a comment

alongd Jun 7, 2020

xiaoruiDong Aug 20, 2020

alongd Jun 7, 2020

xiaoruiDong Aug 20, 2020 •

edited

alongd Jun 7, 2020

xiaoruiDong Aug 20, 2020 •

edited

alongd commented Jun 7, 2020

alongd commented Jun 30, 2020

xiaoruiDong commented Jun 30, 2020

alongd commented Jun 30, 2020

xiaoruiDong commented Aug 20, 2020

alongd commented Sep 22, 2020

SSH improvement phase2 #390

Are you sure you want to change the base?

SSH improvement phase2 #390

Conversation

xiaoruiDong commented May 6, 2020 • edited

codecov bot commented May 6, 2020 • edited

Codecov Report

lgtm-com bot commented May 9, 2020

alongd left a comment

Choose a reason for hiding this comment

alongd Jun 7, 2020

Choose a reason for hiding this comment

xiaoruiDong Aug 20, 2020

Choose a reason for hiding this comment

alongd Jun 7, 2020

Choose a reason for hiding this comment

xiaoruiDong Aug 20, 2020 • edited

Choose a reason for hiding this comment

alongd Jun 7, 2020

Choose a reason for hiding this comment

xiaoruiDong Aug 20, 2020 • edited

Choose a reason for hiding this comment

alongd commented Jun 7, 2020

alongd commented Jun 30, 2020

xiaoruiDong commented Jun 30, 2020

alongd commented Jun 30, 2020

xiaoruiDong commented Aug 20, 2020

alongd commented Sep 22, 2020

xiaoruiDong commented May 6, 2020 •

edited

codecov bot commented May 6, 2020 •

edited

xiaoruiDong Aug 20, 2020 •

edited

xiaoruiDong Aug 20, 2020 •

edited