Bug: .h5 file of remote job is transferred twice, but it is deleted after the first time #1526
Comments
@hujay2019 Thank you for testing pyiron. We might be a little slow to respond during the Christmas break. To understand a bit better what is causing your issue, can you post your …
@jan-janssen Thanks for your response.
Of course. In the source file:

```python
def wait_for_job(job, interval_in_s=5, max_iterations=100):
    """
    Sleep until the job is finished but maximum interval_in_s * max_iterations seconds.

    Args:
        job (pyiron_base.job.utils.GenericJob): Job to wait for
        interval_in_s (int): interval when the job status is queried from the database - default 5 sec.
        max_iterations (int): maximum number of iterations - default 100

    Raises:
        ValueError: max_iterations reached, job still running
    """
    if job.status.string not in job_status_finished_lst:
        if (
            state.queue_adapter is not None
            and state.queue_adapter.remote_flag
            and job.server.queue is not None
        ):
            finished = False
            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )
        else:
            finished = False
            for _ in range(max_iterations):
                if state.database.database_is_disabled:
                    job.project.db.update()
                job.refresh_job_status()
                if job.status.string in job_status_finished_lst:
                    finished = True
                    break
                elif isinstance(job.server.future, Future):
                    job.server.future.result(timeout=interval_in_s)
                    finished = job.server.future.done()
                    break
                else:
                    time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )
```

In this loop:

```python
for _ in range(max_iterations):
    if not queue_check_job_is_waiting_or_running(item=job):
        state.queue_adapter.transfer_file_to_remote(
            file=job.project_hdf5.file_name,
            transfer_back=True,
        )
        status_hdf5 = job.project_hdf5["status"]
        job.status.string = status_hdf5
    else:
        status_hdf5 = job.status.string
    if status_hdf5 in job_status_finished_lst:
        job.transfer_from_remote()
        finished = True
        break
    time.sleep(interval_in_s)
```

When the status of the remote job changes to a finished state, the `.h5` file is transferred back twice: once by `transfer_file_to_remote(..., transfer_back=True)` and again by `transfer_from_remote()`. I think that's probably where the mistake was made.
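To make the failure sequence concrete, here is a toy, self-contained simulation (all names hypothetical; this is not pyiron/pysqa code) of a transfer that deletes the remote file and then runs a second time:

```python
# Toy model of the double transfer; not pyiron/pysqa code.
remote = {"job.h5": b"HDF5 contents"}  # simulated remote filesystem
local = {}                             # simulated local filesystem

def transfer_back(name, delete_file_on_remote=True):
    # Mimics the reported behavior: a missing remote file still produces
    # an (empty) local file, like paramiko's sftp.get discussed below.
    local[name] = remote.get(name, b"")
    if delete_file_on_remote and name in remote:
        del remote[name]

transfer_back("job.h5")   # first transfer: succeeds, deletes the remote copy
print(local["job.h5"])    # b'HDF5 contents'
transfer_back("job.h5")   # second transfer: the remote file is gone
print(local["job.h5"])    # b'' -> the local .h5 is now empty
```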
If I comment out these lines of code:

```python
# state.queue_adapter.transfer_file_to_remote(
#     file=self.project_hdf5.file_name,
#     transfer_back=True,
# )
```

the error no longer happens.
@hujay2019 Thank you for your feedback - I am happy you got it working. I am still confused about why this happens, and when the behaviour changed. Basically, when the local job has a queuing system ID, there should never be a new transfer of the local job to the remote location as is happening here. But I need to take a deeper look at this, and that might take a bit more time.
I made sure that the remote task is executed correctly and the status is `finished`. In `pysqa/ext/remote.py`:

```python
try:
    sftp_client.get(file_dst, file_src)
except FileNotFoundError:
    pass
```

If the remote `.h5` was previously deleted, `sftp_client.get` raises `FileNotFoundError` but still leaves an empty local file behind, overwriting the existing `.h5`. I changed the code to:

```python
try:
    sftp_client.stat(file_dst)
    sftp_client.get(file_dst, file_src)
except FileNotFoundError:
    pass
```

It works. The `stat` call raises `FileNotFoundError` before `get` can touch the local file. In summary, the solution is skipping the files that do not exist on the remote. I think there should be solutions that don't add extra time overhead (the extra `stat` is one more round trip per file), but I haven't found one yet.
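One way to avoid the extra `stat` round trip might be to download into a temporary file and only replace the local target on success. A minimal sketch using the same variable names as above (`safe_get` itself is hypothetical, not pysqa's implementation):

```python
import os
import tempfile

def safe_get(sftp_client, file_dst, file_src):
    # Hypothetical helper, not pysqa's implementation: download into a
    # temporary file and only replace the local target on success, so a
    # missing remote file can no longer truncate the existing local copy.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(file_src) or ".")
    os.close(fd)
    try:
        sftp_client.get(file_dst, tmp_path)
    except FileNotFoundError:
        os.remove(tmp_path)  # discard the empty file paramiko left behind
    else:
        os.replace(tmp_path, file_src)  # atomic rename on POSIX
```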
It's a behavior of `paramiko`. I tested:

```python
import paramiko.client

client = paramiko.SSHClient()
client.connect(*******)
sftp = client.open_sftp()
try:
    sftp.get("/home/**/1.txt", "1.txt")
except FileNotFoundError:
    print("File Not Found")
```

So `sftp.get` raises `FileNotFoundError` when the remote file is missing, but it still creates an empty local `1.txt`.
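For what it's worth, `SFTPClient.get` opens the local file for writing before it fetches the remote file, so the empty file already exists when the exception is raised. Continuing the test above:

```python
import os

# after the failed sftp.get above:
print(os.path.exists("1.txt"))   # True -> an empty local file was created
print(os.path.getsize("1.txt"))  # 0
```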
That sounds like a good idea. Do you want to open a pull request to prevent this issue in the future?
This also sounds like a bug in paramiko to me. Consider reporting just this snippet to them as well.
@jan-janssen I opened a pull request: pyiron/pysqa#248
@pmrv Yes, I'll consider reporting it to `paramiko` as well.
Summary
I am new to pyiron and started with the official installation guide: https://pyiron.readthedocs.io/en/latest/source/installation.html
I ran into some trouble in the section "Installing pyiron so you can submit to remote HPCs from a local machine".
These steps worked correctly: …

But there was an error here: …

The error is:

```
OSError: Unable to synchronously open file (file signature not found)
```
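For context, this `OSError` is HDF5 reporting that the file does not begin with the HDF5 magic bytes, which is exactly what happens with an empty file. A quick check (the file path is a placeholder):

```python
# A valid HDF5 file starts with the 8-byte signature b"\x89HDF\r\n\x1a\n";
# an empty or truncated file fails this check, producing the error above.
with open("job.h5", "rb") as f:
    print(f.read(8) == b"\x89HDF\r\n\x1a\n")
```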
I noticed that the local `.h5` was empty. Then I found out that this is caused by the remote `.h5` being deleted before the last transfer from remote to local: for the `.h5` file, `transfer_file` in `pysqa/ext/remote.py` is executed twice. The first time, the remote `.h5` file is deleted, and the second time, the local `.h5` file becomes empty.

Finally, I set the `ssh_delete_file_on_remote` parameter in the `queue.yaml` file to `False`. This fixed the error. I'm not sure if I made mistakes at some steps or if this is indeed a bug.
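For reference, that workaround corresponds to a setting like the following in `queue.yaml` (the surrounding keys are illustrative placeholders, not a complete configuration):

```yaml
# illustrative fragment of a pysqa queue.yaml for a remote queue adapter
queue_type: REMOTE
ssh_host: hpc.example.com            # placeholder
ssh_username: myuser                 # placeholder
ssh_delete_file_on_remote: False     # keep remote files after transferring back
```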
pyiron Version and Platform
pyiron-0.5.1
OS: Manjaro Linux x86_64
Kernel: 6.1.68-1-MANJARO