
Bug: .h5 file of remote job is transferred twice, but it is deleted after the first time #1526

Open
hujay2019 opened this issue Dec 23, 2023 · 12 comments
Labels
bug Category: Something does not work

Comments

@hujay2019

Summary

I am new to pyiron and started with the official installation guide https://pyiron.readthedocs.io/en/latest/source/installation.html.

I ran into some trouble in the section "Installing pyiron so you can submit to remote HPCs from a local machine".

These steps worked correctly:

from pyiron_atomistics import Project
import os

pr = Project("test_lammps")
job = pr.create_job(job_type=pr.job_type.Lammps, job_name='Al_T800K_remote')

basis = pr.create.structure.bulk('Al', cubic=True)
supercell_3x3x3 = basis.repeat([3, 3, 3])
job.structure = supercell_3x3x3

pot = job.list_potentials()[0]
print('Selected potential: ', pot)
job.potential = pot

job.calc_md(temperature=800, pressure=0, n_ionic_steps=10000)

job.server.queue = "work"
job.server.cores = 2
job.server.memory_limit = 2

job.run(run_mode="queue", delete_existing_job=True)

But there was an error here:

pr = Project("test_lammps")
job_name = "Al_T800K_remote"
pr.wait_for_job(pr.load(job_specifier=job_name))

The error is:
OSError: Unable to synchronously open file (file signature not found)

I noticed that the local .h5 file was empty. It turned out that the remote .h5 file is deleted before the last transfer from remote to local: the file transfer in pysqa/ext/remote.py is executed twice for the .h5 file. The first transfer deletes the remote .h5 file, and the second transfer leaves the local .h5 file empty.

Finally, I set the ssh_delete_file_on_remote parameter in the queue.yaml file to False. This fixed the error.
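For reference, the workaround is this single line in the local queue.yaml (it appears commented out in the full file I post below):

ssh_delete_file_on_remote: False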

I'm not sure if I made mistakes at some steps or if this is indeed a bug.

pyiron Version and Platform

pyiron-0.5.1
OS: Manjaro Linux x86_64
Kernel: 6.1.68-1-MANJARO

hujay2019 added the bug label on Dec 23, 2023
@jan-janssen
Member

@hujay2019 Thank you for testing pyiron. We might be a little slow to respond during the Christmas break.

To understand a bit better what is causing your issue, can you post your queue.yaml files for both your local workstation and the HPC? I would also suggest logging in to the remote cluster via SSH and submitting a calculation directly on the login node; once that works, try the same from your local workstation. This helps us identify whether the file is corrupted during the transfer or whether something goes wrong before that.

@hujay2019
Author

@jan-janssen Thanks for your response.
queue.yaml on the local workstation:

queue_type: REMOTE
queue_primary: slurm
ssh_host: *******
ssh_port: ****
ssh_username: hujie
known_hosts: /home/hu/.ssh/known_hosts
ssh_key: /home/hu/.ssh/***
ssh_remote_config_dir: /home/hujie/pyiron/resources/queues/
ssh_remote_path: /home/hujie/pyiron/pro/
ssh_local_path: /home/hu/pyiron/projects/
ssh_continous_connection: True
#ssh_delete_file_on_remote: False
queues:
    slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440}

queue.yaml on the HPC:

queue_type: SLURM
queue_primary: slurm
queues:
    slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440, script: slurm.sh}

Of course, job.server.queue = "slurm" and job.server.cores = 64.
Submitting calculations on the login node works fine.

@hujay2019
Author

In the source file pyiron_base/jobs/job/extension/server/queuestatus.py:

def wait_for_job(job, interval_in_s=5, max_iterations=100):
    """
    Sleep until the job is finished but maximum interval_in_s * max_iterations seconds.

    Args:
        job (pyiron_base.job.utils.GenericJob): Job to wait for
        interval_in_s (int): interval when the job status is queried from the database - default 5 sec.
        max_iterations (int): maximum number of iterations - default 100

    Raises:
        ValueError: max_iterations reached, job still running
    """
    if job.status.string not in job_status_finished_lst:
        if (
            state.queue_adapter is not None
            and state.queue_adapter.remote_flag
            and job.server.queue is not None
        ):
            finished = False
            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )
        else:
            finished = False
            for _ in range(max_iterations):
                if state.database.database_is_disabled:
                    job.project.db.update()
                job.refresh_job_status()
                if job.status.string in job_status_finished_lst:
                    finished = True
                    break
                elif isinstance(job.server.future, Future):
                    job.server.future.result(timeout=interval_in_s)
                    finished = job.server.future.done()
                    break
                else:
                    time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )

In this loop:

            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)

When the status of the remote job changes to finished, state.queue_adapter.transfer_file_to_remote() is called in the first if statement, and the remote .h5 file is deleted. Then job.transfer_from_remote() is called in the second if statement, but the remote .h5 file has already been deleted.

I think that's probably where the mistake was made.
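Schematically (as far as I can tell), the failing sequence with ssh_delete_file_on_remote left at its default of True is:

# 1. wait_for_job() calls transfer_file_to_remote(file=..., transfer_back=True):
#    the remote .h5 file is copied to the local machine and then deleted remotely
# 2. job.transfer_from_remote() internally calls transfer_file_to_remote()
#    with transfer_back=True a second time
# 3. the remote .h5 file no longer exists, and the second transfer truncates
#    the local .h5 file that was correctly received in step 1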

@hujay2019
Author

If I comment out the following lines in transfer_from_remote() in pyiron_base/jobs/job/generic.py:

        # state.queue_adapter.transfer_file_to_remote(
        #     file=self.project_hdf5.file_name,
        #     transfer_back=True,
        # )

the error no longer occurs.

@jan-janssen
Member

@hujay2019 Thank you for your feedback. I am happy you got it working. I am still confused about why this happens and when the behaviour changed. Basically, once the local job has a queuing system ID, there should never be another transfer of the local job to the remote location, as is happening here. I need to take a deeper look at this, which might take a bit more time.

@jan-janssen
Member

As the transfer_back parameter is set to True, it should already copy the correct job, so I am surprised that the file seems to be empty. Can you try disabling the deletion on the remote location in your .pyiron settings file and check the content of the HDF5 file on the remote location? I just want to make sure the job is executed correctly on the remote location and to see whether the file gets corrupted during the transfer.

@hujay2019
Author

hujay2019 commented Dec 24, 2023

I verified that the remote job is executed correctly, its status is finished, and the remote .h5 file can be parsed by h5py. I think the remote .h5 file was deleted before the second transfer, so the correctly received local .h5 file was overwritten with empty content.

In the _transfer_files method of pysqa/ext/remote.py:

                try:
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

If the remote .h5 file was previously deleted, sftp_client.get(file_dst, file_src) raises a FileNotFoundError. However, before the error is raised, the local .h5 file is opened with mode='wb', which truncates it. So by the time the FileNotFoundError is raised, the local .h5 file has already been emptied.
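For context, this is roughly what paramiko's SFTPClient.get() does internally (paraphrased from the paramiko source, so details may vary between versions):

def get(self, remotepath, localpath, callback=None):
    # the local file is opened for writing, and therefore truncated,
    # *before* the remote file is read
    with open(localpath, "wb") as fl:
        size = self.getfo(remotepath, fl, callback)
    # getfo() raises FileNotFoundError if remotepath does not exist,
    # leaving the already-truncated local file behind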

I changed the code to:

                try:
                    sftp_client.stat(file_dst)
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

It works: the FileNotFoundError is now raised by sftp_client.stat(file_dst) before sftp_client.get(file_dst, file_src) runs, so the local .h5 file is never opened. The downside is some extra time overhead, since each file needs an additional SFTP round trip to check whether it exists on the remote.

In summary, the fix is to skip files that do not exist on the remote. There are probably solutions that avoid the extra overhead (see the sketch below), but I haven't worked one out yet.
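One way to avoid the extra round trip (a minimal sketch only, not tested against pysqa; safe_get is a hypothetical helper) is to download into a temporary file and only replace the local file once the download has succeeded:

import os
import tempfile

def safe_get(sftp_client, file_dst, file_src):
    # download the remote file_dst into a temporary file next to file_src
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(file_src) or ".")
    os.close(fd)
    try:
        sftp_client.get(file_dst, tmp_path)
    except FileNotFoundError:
        os.remove(tmp_path)  # remote file is gone: keep the local copy untouched
    else:
        os.replace(tmp_path, file_src)  # rename onto the target only on success

This keeps a single SFTP operation per file and never truncates an existing local file.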

@hujay2019
Author

It's a behavior of paramiko.

I tested paramiko:

import paramiko.client

client = paramiko.SSHClient()
client.connect(*******)
sftp = client.open_sftp()

try:
    sftp.get("/home/**/1.txt", "1.txt")
except FileNotFoundError:
    print("File Not Found")
1. Remote non-empty file `1.txt` exists and local non-empty `1.txt` exists. Local `1.txt` is overwritten by remote `1.txt`. Correct.

2. Delete remote `1.txt`, while local non-empty `1.txt` exists. The `FileNotFoundError` is caught, and local `1.txt` becomes empty. Unexpected behavior.

So pyiron needs some handling to avoid overwriting local files with empty content.

@jan-janssen
Member

I changed the code to:

                try:
                    sftp_client.stat(file_dst)
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

That sounds like a good idea. Do you want to open a pull request to prevent this issue in the future?

@pmrv
Contributor

pmrv commented Dec 24, 2023

It's a behavior of paramiko. [...] So pyiron needs some handling to avoid overwriting local files with empty content.

This also sounds like a bug in paramiko to me. Consider reporting just this snippet to them as well.

@hujay2019
Author

@jan-janssen I opened a pull request. pyiron/pysqa#248

@hujay2019
Author

@pmrv Yes, I'll consider reporting it to paramiko.
