
Bug: .h5 file of remote job is transferred twice, but it is deleted after the first time #1526

Open
hujay2019 opened this issue Dec 23, 2023 · 12 comments
Labels
bug Category: Something does not work

Comments

@hujay2019

Summary

I am new to pyiron and started with the official installation guide https://pyiron.readthedocs.io/en/latest/source/installation.html.

I ran into some trouble in the section "Installing pyiron so you can submit to remote HPCs from a local machine".

These steps worked correctly:

from pyiron_atomistics import Project
import os

pr = Project("test_lammps")
job = pr.create_job(job_type=pr.job_type.Lammps, job_name='Al_T800K_remote')

basis = pr.create.structure.bulk('Al', cubic=True)
supercell_3x3x3 = basis.repeat([3, 3, 3])
job.structure = supercell_3x3x3

pot = job.list_potentials()[0]
print('Selected potential: ', pot)
job.potential = pot

job.calc_md(temperature=800, pressure=0, n_ionic_steps=10000)

job.server.queue = "work"
job.server.cores = 2
job.server.memory_limit = 2

job.run(run_mode="queue", delete_existing_job=True)

But there was an error here:

pr = Project("test_lammps")
job_name = "Al_T800K_remote"
pr.wait_for_job(pr.load(job_specifier=job_name))

The error is:
OSError: Unable to synchronously open file (file signature not found)

I noticed that the local .h5 file was empty. It turned out that the remote .h5 file is deleted before the last transfer from remote to local: the file transfer in pysqa/ext/remote.py is executed twice for the .h5 file. The first transfer deletes the remote .h5 file, and the second transfer leaves the local .h5 file empty.

Finally, I set the ssh_delete_file_on_remote parameter in the queue.yaml file to False. This fixed the error.
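For reference, the workaround is this single line in the local queue.yaml (it appears commented out in the full file I post below):

ssh_delete_file_on_remote: False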

I'm not sure if I made mistakes at some steps or if this is indeed a bug.

pyiron Version and Platform

pyiron-0.5.1
OS: Manjaro Linux x86_64
Kernel: 6.1.68-1-MANJARO

hujay2019 added the bug label on Dec 23, 2023
@jan-janssen
Member

@hujay2019 Thank you for testing pyiron. We might be a little slow to respond during the Christmas break.

To understand a bit better what is causing your issue, can you post your queue.yaml files for both your local workstation and the HPC? I would also suggest logging in to the remote cluster via SSH and submitting a calculation directly on the login node; once that works, try the same from your local workstation. This helps us identify whether the file is corrupted during the transfer or whether something goes wrong before that.

@hujay2019
Author

@jan-janssen Thanks for your response.
queue.yaml on the local workstation:

queue_type: REMOTE
queue_primary: slurm
ssh_host: *******
ssh_port: ****
ssh_username: hujie
known_hosts: /home/hu/.ssh/known_hosts
ssh_key: /home/hu/.ssh/***
ssh_remote_config_dir: /home/hujie/pyiron/resources/queues/
ssh_remote_path: /home/hujie/pyiron/pro/
ssh_local_path: /home/hu/pyiron/projects/
ssh_continous_connection: True
#ssh_delete_file_on_remote: False
queues:
    slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440}

queue.yaml on the HPC:

queue_type: SLURM
queue_primary: slurm
queues:
    slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440, script: slurm.sh}

Of course, job.server.queue = "slurm" and job.server.cores = 64.
Submitting calculations on the login node works fine.

@hujay2019
Author

In the source file pyiron_base/jobs/job/extension/server/queuestatus.py:

def wait_for_job(job, interval_in_s=5, max_iterations=100):
    """
    Sleep until the job is finished but maximum interval_in_s * max_iterations seconds.

    Args:
        job (pyiron_base.job.utils.GenericJob): Job to wait for
        interval_in_s (int): interval when the job status is queried from the database - default 5 sec.
        max_iterations (int): maximum number of iterations - default 100

    Raises:
        ValueError: max_iterations reached, job still running
    """
    if job.status.string not in job_status_finished_lst:
        if (
            state.queue_adapter is not None
            and state.queue_adapter.remote_flag
            and job.server.queue is not None
        ):
            finished = False
            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )
        else:
            finished = False
            for _ in range(max_iterations):
                if state.database.database_is_disabled:
                    job.project.db.update()
                job.refresh_job_status()
                if job.status.string in job_status_finished_lst:
                    finished = True
                    break
                elif isinstance(job.server.future, Future):
                    job.server.future.result(timeout=interval_in_s)
                    finished = job.server.future.done()
                    break
                else:
                    time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )

In this loop:

            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)

When the status of the remote job changes to finished, state.queue_adapter.transfer_file_to_remote() is called in the first if statement, and the remote .h5 file is deleted. Then job.transfer_from_remote() is called in the second if statement, but the remote .h5 file has already been deleted.

I think that's probably where the mistake was made.
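Schematically (as far as I can tell), the failing sequence with ssh_delete_file_on_remote left at its default of True is:

# 1. wait_for_job() calls transfer_file_to_remote(file=..., transfer_back=True):
#    the remote .h5 file is copied to the local machine and then deleted remotely
# 2. job.transfer_from_remote() internally calls transfer_file_to_remote()
#    with transfer_back=True a second time
# 3. the remote .h5 file no longer exists, and the second transfer truncates
#    the local .h5 file that was correctly received in step 1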

@hujay2019
Author

If I comment out the following lines in transfer_from_remote() in pyiron_base/jobs/job/generic.py:

        # state.queue_adapter.transfer_file_to_remote(
        #     file=self.project_hdf5.file_name,
        #     transfer_back=True,
        # )

the error no longer occurs.

@jan-janssen
Member

@hujay2019 Thank you for your feedback. I am happy you got it working. I am still confused about why this happens and when the behaviour changed. Basically, once the local job has a queuing system ID, there should never be another transfer of the local job to the remote location, as is happening here. I need to take a deeper look at this, which might take a bit more time.

@jan-janssen
Member

As the transfer_back parameter is set to True, it should already copy the correct job, so I am surprised that the file seems to be empty. Can you try disabling the deletion on the remote location in your .pyiron settings file and check the content of the HDF5 file on the remote location? I just want to make sure the job is executed correctly on the remote location and to see whether the file gets corrupted during the transfer.

@hujay2019
Author

hujay2019 commented Dec 24, 2023

I verified that the remote job is executed correctly, its status is finished, and the remote .h5 file can be parsed by h5py. I think the remote .h5 file was deleted before the second transfer, so the correctly received local .h5 file was overwritten with empty content.

In the _transfer_files method of pysqa/ext/remote.py:

                try:
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

If the remote .h5 file was previously deleted, sftp_client.get(file_dst, file_src) raises a FileNotFoundError. However, before the error is raised, the local .h5 file is opened with mode='wb', which truncates it. So by the time the FileNotFoundError is raised, the local .h5 file has already been emptied.
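For context, this is roughly what paramiko's SFTPClient.get() does internally (paraphrased from the paramiko source, so details may vary between versions):

def get(self, remotepath, localpath, callback=None):
    # the local file is opened for writing, and therefore truncated,
    # *before* the remote file is read
    with open(localpath, "wb") as fl:
        size = self.getfo(remotepath, fl, callback)
    # getfo() raises FileNotFoundError if remotepath does not exist,
    # leaving the already-truncated local file behind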

I changed the code to:

                try:
                    sftp_client.stat(file_dst)
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

It works: the FileNotFoundError is now raised by sftp_client.stat(file_dst) before sftp_client.get(file_dst, file_src) runs, so the local .h5 file is never opened. The downside is some extra time overhead, since each file needs an additional SFTP round trip to check whether it exists on the remote.

In summary, the fix is to skip files that do not exist on the remote. There are probably solutions that avoid the extra overhead (see the sketch below), but I haven't worked one out yet.
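One way to avoid the extra round trip (a minimal sketch only, not tested against pysqa; safe_get is a hypothetical helper) is to download into a temporary file and only replace the local file once the download has succeeded:

import os
import tempfile

def safe_get(sftp_client, file_dst, file_src):
    # download the remote file_dst into a temporary file next to file_src
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(file_src) or ".")
    os.close(fd)
    try:
        sftp_client.get(file_dst, tmp_path)
    except FileNotFoundError:
        os.remove(tmp_path)  # remote file is gone: keep the local copy untouched
    else:
        os.replace(tmp_path, file_src)  # rename onto the target only on success

This keeps a single SFTP operation per file and never truncates an existing local file.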

@hujay2019
Author

It's a behavior of paramiko.

I tested paramiko:

import paramiko.client

client = paramiko.SSHClient()
client.connect(*******)
sftp = client.open_sftp()

try:
    sftp.get("/home/**/1.txt", "1.txt")
except FileNotFoundError:
    print("File Not Found")
1. Remote non-empty file `1.txt` exists and local non-empty `1.txt` exists. Local `1.txt` is overwritten by remote `1.txt`. Correct.

2. Delete remote `1.txt`, while local non-empty `1.txt` exists. The `FileNotFoundError` is caught, and local `1.txt` becomes empty. Unexpected behavior.

So pyiron needs some handling to avoid overwriting local files with empty content.

@jan-janssen
Member

I changed the code to:

                try:
                    sftp_client.stat(file_dst)
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

That sounds like a good idea. Do you want to open a pull request to prevent this issue in the future?

@pmrv
Contributor

pmrv commented Dec 24, 2023

It's a behavior of paramiko. [...] So pyiron needs some handling to avoid overwriting local files with empty content.

This also sounds like a bug in paramiko to me. Consider reporting just this snippet to them as well.

@hujay2019
Author

@jan-janssen I opened a pull request. pyiron/pysqa#248

@hujay2019
Author

@pmrv Yes, I'll consider reporting it to paramiko.
