
SSH improvement phase2 #390

Open · wants to merge 17 commits into master
Conversation

@xiaoruiDong (Contributor) commented May 6, 2020

This is the second phase of the SSH and server communication improvement. A major impetus is that I found that paramiko's behavior is slightly different now.

In a previous version, when I tried to download a non-existing file with sftp.get(), no error was raised; instead, an empty file appeared at the localpath I assigned. In the version I use now (I recently updated my arc environment; it is the latest stable 2.7), paramiko's behavior has changed: it raises an IOError (no such file). We tried to catch IOError in ARC, but some of the intermediate steps were problematic (raising another type of error while catching this one, or having no catches at all), so we could only deal with part of the issues. Besides, if ARC dies oddly (say, ARC has created an input file but hasn't submitted it to the server when my computer crashes), the current ARC cannot restart the job.
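Both failure modes can be handled uniformly by catching the error at the download call and verifying the local file afterwards. A minimal sketch, assuming paramiko's current behavior of raising IOError for a missing remote file (the `download_file` wrapper is illustrative, not ARC's actual code):

```python
import os

def download_file(sftp, remote_file_path, local_file_path):
    """Hypothetical wrapper around paramiko's SFTPClient.get()."""
    try:
        # Recent paramiko raises IOError (FileNotFoundError on Python 3)
        # when remote_file_path does not exist on the server.
        sftp.get(remotepath=remote_file_path, localpath=local_file_path)
    except IOError as e:
        raise IOError(f'{remote_file_path} does not exist on the server.') from e
    # Guard against the old silent-failure mode, where an empty or missing
    # local file was left behind without any error being raised.
    if not os.path.isfile(local_file_path):
        raise IOError(f'Could not download {remote_file_path}.')
```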

Anyway, this is a patch to the previous SSH Improvement to make the system more robust.

However, this PR leaves out the case where the quota is reached. (Proposal: we can delete finished jobs done within the scope of the project from the server.) This is intended to be solved in a future PR.

For reviewers: wait until the first SSH improvement is finished. But if you are interested, you can start reviewing from "Minor: consistently return `errored` in _check_job_server_status".

Another proposal: is there a way to set up a toy / virtual server so that we can test SSH and server-communication-related actions?

codecov bot commented May 6, 2020

Codecov Report

Merging #390 into master will decrease coverage by 0.13%.
The diff coverage is 2.97%.


@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
- Coverage   53.99%   53.86%   -0.14%     
==========================================
  Files          36       36              
  Lines       12017    12051      +34     
  Branches     3708     3709       +1     
==========================================
+ Hits         6489     6491       +2     
- Misses       4554     4586      +32     
  Partials      974      974              
Impacted Files Coverage Δ
arc/job/trsh.py 42.17% <0.00%> (ø)
arc/scheduler.py 19.94% <0.00%> (-0.02%) ⬇️
arc/job/job.py 21.13% <2.46%> (-0.34%) ⬇️
arc/job/ssh.py 22.26% <6.66%> (-0.81%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b22df49...4156e8a

lgtm-com bot commented May 9, 2020

This pull request fixes 4 alerts when merging e7226ba into 8956356.

Fixed alerts:

  • 3 for Except block handles 'BaseException'
  • 1 for Accepting unknown SSH host keys when using Paramiko

@alongd (Member) left a comment

Thanks for the contribution! I added some minor comments

         else:
-            return check_job_status(self.job_id)
+            server_status = check_job_status(self.job_id)
+            # `server_status` should only be 'initializing' / 'running' / 'errored' / 'done'
Member


Don't we need to know the node for troubleshooting by switching nodes?

Contributor Author


Based on the current workflow, changing nodes actually happens in the function Job.troubleshoot_server() (line 1535 in arc/job/job.py). A (possibly) available node is picked by trsh_job_on_server in this method. We don't need to know the current node, since it is saved in self.server_nodes, and we don't need to know beforehand which nodes are available, since trsh_job_on_server will find one. The server_status here is only used to assign a value to self.job_status[0] and does not play a role in switching nodes. However, if I missed something, please let me know. Thanks!
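The flow described above can be sketched roughly as follows (all class and method bodies here are illustrative stand-ins, not ARC's actual implementation):

```python
class Job:
    def __init__(self):
        self.server_nodes = []                    # nodes already tried for this job
        self.job_status = ['initializing', {}]

    def determine_job_status(self):
        # server_status is only bookkeeping: 'initializing'/'running'/'errored'/'done'
        server_status = self.check_job_server_status()
        self.job_status[0] = server_status
        if server_status == 'errored':
            self.troubleshoot_server()

    def check_job_server_status(self):
        return 'errored'                          # stub for the SSH status query

    def troubleshoot_server(self):
        # trsh_job_on_server is expected to pick an untried node on its own;
        # the caller never needs to know which nodes are available in advance.
        node = self.trsh_job_on_server(self.server_nodes)
        self.server_nodes.append(node)

    def trsh_job_on_server(self, tried):
        candidates = ['node01', 'node02', 'node03']   # hypothetical node pool
        return next(n for n in candidates if n not in tried)
```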

-            logger.debug(
-                f'{remote_file_path} does not exist on {self.server}.')
+            raise InputError(f'{remote_file_path} does not exist on {self.server}.')
Member


I think we should raise ServerError
Are these errors being captured, or will they crash ARC?

Contributor Author


I think this is a good question. The answer depends on how we regard the context of this error. If we regard this function as a single API, and we are using this API to download stuff, then if remote_file_path does not exist on the server, it simply means we passed invalid arguments, since the failure is not due to a malfunction of this method/class/connection. In an ARC job, however, this function is run because we think there should be a path there and we want to download it, yet the path actually does not exist. Still, I would say this error is not due to a malfunction of the method/class/connection, so I think it is inappropriate to call it a ServerError. It is more likely that some previous step went an undesired way. I would suggest letting some upper-level function catch the InputError and re-raise it as a JobError, and this is what I did in Job (in Job._download_output_file). The re-raised JobError will then be caught by job.determine_job_status (two levels up), which will analyze whether we can fix this or not. If we can, we submit a new job; if we cannot, we mark this job as errored.
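The re-raise chain described above can be sketched like this (the function bodies are illustrative; only the InputError/JobError names come from the discussion):

```python
class InputError(Exception):
    """Raised when a caller passes invalid arguments, e.g. a non-existing remote path."""

class JobError(Exception):
    """Raised when a job-level step fails."""

def download(remote_file_path, file_exists=False):
    # Stands in for the SSH download API; here the remote file never exists.
    if not file_exists:
        raise InputError(f'{remote_file_path} does not exist on the server.')

def download_output_file(path):
    # Upper-level job method: translate the API-level error into a job-level one.
    try:
        download(path)
    except InputError as e:
        raise JobError(f'Could not download the output file: {e}') from e

def determine_job_status(path):
    # Two levels up: decide whether the failure is fixable.
    try:
        download_output_file(path)
    except JobError:
        return 'errored'   # or resubmit a new job if the error is deemed fixable
    return 'done'
```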

Get the additional information of stdout and stderr according to the
queue software outputs.

Returns:
Member


while revisiting this method, please add a type hint for the returned value

Contributor Author


Yeah, I think a later commit did. The returned item is a str, which can be either an empty str '' or a very long str combining all of the lines from the downloaded files (basically, files like *.out and *.err).
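A minimal illustration of the hint in question (the stub below is hypothetical; the real method in arc/job/ssh.py reads the downloaded *.out / *.err files):

```python
def get_additional_info(lines: list) -> str:
    """Combine all lines from the downloaded stdout/stderr files into one
    string; an empty str '' means there was nothing to report."""
    return ''.join(lines)
```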

@alongd (Member) commented Jun 7, 2020

@xiaoruiDong, setting up a dummy server for the tests is an interesting point. If you have time, could you look into it? Perhaps we can see how paramiko tests itself.
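Short of a real toy server, one common approach is to fake the paramiko client in unit tests with unittest.mock; a sketch (the helper function and the squeue output are hypothetical, not ARC's test setup):

```python
from unittest import mock

def check_job_status_over_ssh(ssh_client, job_id):
    """Hypothetical SSH-backed status query against a SLURM scheduler."""
    stdin, stdout, stderr = ssh_client.exec_command(f'squeue -j {job_id}')
    return 'running' if str(job_id) in stdout.read().decode() else 'done'

def test_check_job_status():
    # Fake the paramiko SSHClient so no network or server is needed.
    fake_client = mock.Mock()
    fake_stdout = mock.Mock()
    fake_stdout.read.return_value = b'1234  debug  opt_a1  user  R  5:02  1  node01'
    fake_client.exec_command.return_value = (mock.Mock(), fake_stdout, mock.Mock())
    assert check_job_status_over_ssh(fake_client, 1234) == 'running'
```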

@alongd (Member) commented Jun 30, 2020

@xiaoruiDong, what's the status of this PR?

@xiaoruiDong (Contributor, Author) replied:

@xiaoruiDong, what's the status of this PR?

@alongd Hi Alon, I'm sorry I haven't had time to look into the dummy-server idea. As for the branch, I've been using it to run my jobs for a while, and it works okay.

@alongd (Member) commented Jun 30, 2020

OK, please rebase.
It would be nice to add these in the future if any of us ever has some free time... I know we're all tied up.

@xiaoruiDong (Contributor, Author):

Both Travis tests fail due to an RMG issue (#2011, ReactionMechanismGenerator/RMG-Py#2011), which is not related to the changes in this PR. The tests need to be rerun once that issue is solved.

The server status can be confusing if a job is cancelled but still marked as 'done'. Add more comments and todos and clean up the code.
Previously, an error was deliberately not raised, because paramiko would not raise an IOError and would simply create an empty file locally. However, this behavior has changed since a certain version. Better to update the code dealing with non-existing files consistently.
If we cannot find a downloaded file, the download cannot be regarded as successful. Add a check for that.
Previously, the errors were not effectively caught: downloading could fail and raise an error without ever reaching the check for file existence. Standardizing `ssh.download` allows better error handling.
'download' is not accurate, and 'returns' are not indicated.
Correct the error handling related to downloading. Refactor the method into 3 parts: (1) setting the files to read from, (2) downloading if necessary, and (3) reading through the files. Ideally, if any further changes are needed, developers can focus on the first section.
@alongd (Member) commented Sep 22, 2020

@xiaoruiDong , please rebase again and tag me so I'll merge. Thanks.

calvinp0 added a commit that referenced this pull request Jul 30, 2023
…y change, SSH File Download Correction, Null Byte Read, Remote Remove Files

1.  Adapter: Submit script format updated to use the user-provided username {un}, convert memory to an int, and provide the option of {server_nodes} if required.
2.  Adapter: Total submit memory adjusted to ensure that when troubleshooting a job, it never attempts to go OVER the maximum submission memory allowed on the node/server.
3.  Adapter: SLURM submit memory - now using `#SBATCH --mem` as the parameter, since it defines the TOTAL memory of the submission.
4.  Adapter: SSH file download - we do not always expect to download or upload certain files via SSH, depending on the scheduler. This change allows recognising whether certain files will be uploaded or downloaded depending on the user's scheduler choice.
5.  Adapter: Null bytes can appear in files - more specifically, if QChem hits an error, it can produce null bytes in the out.txt/err.txt files, which requires a different way of reading the file contents. This is not 100% foolproof, though, and may need extra work.
6.  Adapters: The #390 branch contained SSH improvements that were never merged. I have brought forward one of them, where remote files are removed once they are downloaded to the local computer.
calvinp0 added a commit that referenced this pull request Jul 30, 2023
Inspired by branch #390

1. SSH: Updated the decorator to use the correct connect function.
2. SSH: If the user provides a server that is not in servers.keys() and the server is also not None, an error is raised to inform the user that they need to fix their server keys.
3. SSH: An error can occur when a submission to a scheduler includes an incorrect memory specification; there is now a warning to the user that the requested memory is not supported and needs to be checked. This may need to become a ValueError instead of a logger warning.
4. SSH: Slight adjustment to checking whether there is stdout after a submission attempt.
5. SSH: Some servers require private keys. Originally, the code incorrectly added the private key to the paramiko SSH function. It has been changed so that system keys are loaded, and if the user provides a private key, it is included in the connect call.
6. SSH: Updated the default arguments of `get_last_modified_time`.
7. SSH: Changed the lowercasing of the cluster software string.
8. SSH: Added a function to remove a directory on the remote server.
9. SSH: Azure SLURM has an extra status, 'CF', which means the node is configuring. This can take 10-15 minutes or so before the node is online, and we now make sure to capture it. HOWEVER, a node can get stuck in 'CF' status. We now check this by reading how long the node has been active (splitting the time correctly, since different time formats are possible); if it is above 15 minutes, we run `scontrol show node {node_id}`, and if the stdout includes the phrase 'NOT_RESPONDING', we return 'errored'.
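The 'CF' check in point 9 can be sketched as follows (the time-format handling and the 15-minute threshold follow the description above; the function name and exact parsing are illustrative, and the `scontrol` call requires a SLURM environment):

```python
import subprocess

def node_status_from_cf(node_id, elapsed):
    """elapsed is a squeue-style time: 'MM:SS', 'HH:MM:SS', or 'D-HH:MM:SS'."""
    # Split off the day count if present, then normalize to [H, M, S].
    days, _, rest = elapsed.partition('-') if '-' in elapsed else ('0', '', elapsed)
    parts = [int(p) for p in rest.split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)
    minutes = int(days) * 24 * 60 + parts[0] * 60 + parts[1]
    if minutes <= 15:
        return 'initializing'          # still within the normal configuring window
    # Stuck in 'CF' for too long: ask SLURM whether the node is responding.
    result = subprocess.run(['scontrol', 'show', 'node', node_id],
                            capture_output=True, text=True)
    if 'NOT_RESPONDING' in result.stdout:
        return 'errored'
    return 'initializing'
```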
@calvinp0 calvinp0 mentioned this pull request Jul 30, 2023
calvinp0 added a commit that referenced this pull request Mar 18, 2024
…y change, SSH File Download Correction, Null Byte Read, Remote Remove Files

calvinp0 added a commit that referenced this pull request Mar 18, 2024
Inspired by branch #390

2 participants