Why do some records have error set? #772

Open · peastman opened this issue Oct 14, 2023 · 17 comments

@peastman

While running calculations, I find that some records have their error field set, even though the status field does not indicate an error. Some are running and others are waiting. In all cases the value of the error field is

{'error_type': 'unknown_error', 'error_message': 'QCEngine Unknown Error: Unknown error, error message is not found, possibly segfaulted', 'extras': None}

What does this mean?
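
For reference, a record's status and stored error can be pulled from the client along these lines; a minimal sketch assuming qcportal's PortalClient, where the server address is a placeholder and the record ID is just an example taken from later in this thread:

# Minimal sketch: inspect the fields discussed above for one record.
# The server address is a placeholder; the record ID is an example from this thread.
from qcportal import PortalClient

client = PortalClient("https://your.qcfractal.server")
record = client.get_records([119038388])[0]

print(record.status)   # waiting/running here, even though an error is stored
print(record.error)    # the error dict shown above
print(record.stdout)   # None in these cases
print(record.stderr)   # None in these cases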

@peastman
Author

It appears that of all my managers running on different nodes, only one is successfully completing tasks. All the others fail every task. The log gives no indication of why:

[2023-10-13 16:47:47 PDT]     INFO: ComputeManager: Task Stats: Total finished=0, Failed=0, Success=0, Rejected=0
[2023-10-13 16:47:47 PDT]     INFO: ComputeManager: Worker Stats (est.): Core Hours Used=0.00
[2023-10-13 16:47:47 PDT]     INFO: ComputeManager: Executor local_executor has 3 active tasks and 0 open slots
[2023-10-13 16:48:06 PDT]     INFO: parsl.dataflow.dflow: Task 0 completed (launched -> exec_done)
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Successfully return tasks to the fractal server
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Accepted task ids: 18909312
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Executor local_executor: Processed 1 tasks: 0 success / 1 failed
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Executor local_executor: Task ids, submission status, calculation status below
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager:     Task 18909312 : sent / failed
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Task Stats: Total finished=1, Failed=1, Success=0, Rejected=0
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Worker Stats (est.): Core Hours Used=46.78
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Executor local_executor has 2 active tasks and 1 open slots
[2023-10-13 16:48:48 PDT]     INFO: ComputeManager: Acquired 1 new tasks.

Querying the records, the error field has the value shown above, and stdout and stderr are both None.

@bennybp
Contributor

bennybp commented Oct 14, 2023

For the first question: if the error field is set but the status isn't error, then the record was reset at some point so that it could run again. This could have been done automatically (the server has an auto-resetting feature, which will reset a failed record a few times before giving up). You can check the whole history with record.compute_history.
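
For example, a minimal sketch (the server address is a placeholder, and the exact fields exposed by each history entry are an assumption):

# Sketch: walk the compute history of a record that was reset and re-run.
from qcportal import PortalClient

client = PortalClient("https://your.qcfractal.server")  # placeholder address
record = client.get_records([119038388])[0]              # record ID discussed below

for attempt in record.compute_history:
    # one entry per attempt; failed attempts keep their status (and error)
    print(attempt.status, attempt.modified_on)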

For the second question: Task 18909312 corresponds to record 119038388. For its error I see:

Traceback (most recent call last):
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/schema_wrapper.py", line 459, in run_qcschema
    ret_data = run_json_qcschema(input_model.dict(), clean, False, keep_wfn=keep_wfn)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/schema_wrapper.py", line 618, in run_json_qcschema
    val, wfn = methods_dict_[json_data["driver"]](method, **kwargs)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/driver.py", line 639, in gradient
    wfn = procedures['gradient'][lowername](lowername, molecule=molecule, **kwargs)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/procrouting/proc.py", line 93, in select_scf_gradient
    return func(name, **kwargs)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/procrouting/proc.py", line 2674, in run_scf_gradient
    ref_wfn = run_scf(name, **kwargs)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/procrouting/proc.py", line 2574, in run_scf
    scf_wfn = scf_helper(name, post_scf=False, **kwargs)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/procrouting/proc.py", line 1873, in scf_helper
    e_scf = scf_wfn.compute_energy()
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/procrouting/scf_proc/scf_iterator.py", line 85, in scf_compute_energy
    self.iterations()
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/psi4/driver/procrouting/scf_proc/scf_iterator.py", line 304, in scf_iterate
    self.form_G()
RuntimeError: 
Fatal Error: WRITE failed. Error description from the OS: Cannot allocate memory
Error in PSIO::wt_toclen()! Cannot write TOC length, unit 64.
PSIO_ERROR: 12 (error writing to file)

Error occurred in file: /home/conda/feedstock_root/build_artifacts/psi4_1691021555030/work/psi4/src/psi4/libpsio/error.cc on line: 135
The most recent 5 function calls were:

psi::PsiException::PsiException(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char const*, int)
psi::PSIO::wt_toclen(unsigned long, unsigned long)
psi::PSIO::tocwrite(unsigned long)

The task ID vs. record ID distinction trips up a lot of people (and rightly so). I will make a PR soon that makes the difference explicit in the manager logs (for example, by printing both the task ID and the record ID).

@peastman
Author

Fatal Error: WRITE failed. Error description from the OS: Cannot allocate memory

Does that mean the root problem is that it ran out of memory? Is there anything I can do about that? These are large systems but not huge, 90 atoms. And it's running on a node with 256 GB of memory.

I've noticed from the log that it acquires three tasks at a time and processes them (I assume?) in parallel. Is there a way to tell it to attempt fewer tasks at a time?

@bennybp
Contributor

bennybp commented Oct 14, 2023

It depends on how many jobs you are running in parallel. Could you post the executors section of your configuration?

The manager will pull down more tasks than it can compute at once, so that there is a buffer (since there is a delay between tasks finishing and the manager fetching more).

Looking at that error a little, I'm not sure I can make sense of it (it says both that it cannot allocate memory and that it cannot write to disk). Let me ask the psi4 developers.

I will take a look at the others in this dataset and see if there's any pattern, or if psi4 just has an issue with that one.

@peastman
Author

executors:
  local_executor:
    type: local
    max_workers: $MAX_WORKERS     # max number of workers to spawn
    cores_per_worker: 32          # cores per worker
    memory_per_worker: 216        # memory per worker, in GiB
    scratch_directory: "$L_SCRATCH/$SLURM_JOBID"
    queue_tags:
      - spice-psi4-181
    environments:
      use_manager_environment: False
      conda:
        - qcfractal-worker-psi4-18.1      # name of conda env used by worker; see below for example
    worker_init:
      - source /home/users/peastman/worker_init.sh

The one node where it works has 128 cores and 1 TB of memory. On that node, $MAX_WORKERS is set to 4. All the other nodes have 32 cores and 256 GB of memory. On those it's set to 1.

@jchodera

@dotsdl Can you try spinning up some lilac workers for @peastman to see if they have the same issue?

@peastman
Author

Here is the submission script:

#! /usr/bin/bash
#SBATCH --job-name=qcfractal
#SBATCH --partition=normal
#SBATCH -t 2-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=7gb

## USAGE

# Make sure to run bashrc
source $HOME/.bashrc

# Don't limit stack size
ulimit -s unlimited

# Activate QCFractal conda env
conda activate qcfractalcompute

# Create a YAML file with specific substitutions
export MAX_WORKERS=1
envsubst < qcfractal-manager-config.yml > configs/config.${SLURM_JOBID}.yml

# Run qcfractal-compute-manager
qcfractal-compute-manager --config configs/config.${SLURM_JOBID}.yml

Some things I tried that didn't help:

  • Reducing memory_per_worker to 192 to give a bigger margin between what psi4 thinks it can use and what slurm allows it to use.
  • Reducing cores_per_worker to 24 in case the memory use is scaling with the number of cores.

@bennybp
Contributor

bennybp commented Oct 17, 2023

#! /usr/bin/bash
#SBATCH --job-name=qcfractal
#SBATCH --partition=normal
#SBATCH -t 2-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=7gb

I'm not positive (I'm not a SLURM expert), but I believe this might limit things to 1 core/7 GB of memory (because ntasks is largely for MPI). I am trying to verify that, but SLURM is one of those things I have to re-learn every time I use it.

Could you try the following (it just swaps ntasks and cpus-per-task)?

#! /usr/bin/bash
#SBATCH --job-name=qcfractal
#SBATCH --partition=normal
#SBATCH -t 2-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=7gb
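
To see what the job actually received, it may help to print the SLURM environment variables and CPU affinity from inside the batch job; a minimal sketch (which SLURM_* variables are set depends on the sbatch options and SLURM version):

# Sketch: run inside the batch job to see what SLURM actually allocated.
import os

for var in ("SLURM_NTASKS", "SLURM_CPUS_PER_TASK", "SLURM_MEM_PER_CPU", "SLURM_JOB_CPUS_PER_NODE"):
    print(var, "=", os.environ.get(var, "<not set>"))

# CPUs this process may actually use (respects cgroup/affinity limits on Linux)
print("usable CPUs:", len(os.sched_getaffinity(0)))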

@peastman
Author

I submitted a job with those settings. I'll let you know what happens.

@peastman
Author

It still failed. Check task 18909693.

@jchodera

@bennybp : Any chance you are able to look at this?

@bennybp
Contributor

bennybp commented Oct 20, 2023

I will try running the 18909693 calculation locally. I do see that task 18909312/record 119038388 has now completed, though.

@peastman
Author

It presumably completed on a node with 1 TB of memory. I can run these calculations on those nodes, but not on ones with 256 GB.

@bennybp
Contributor

bennybp commented Oct 21, 2023

We might have to debug this live over zoom.

There are a few possibilities. One is that a ramdisk is being used for storage (though I'm not sure how, since $L_SCRATCH appears to be a local disk). The particular calculation I was looking at created a 105 GB scratch file, which seems dangerous if $L_SCRATCH only has ~150 GB, as described here.
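
One way to check the scratch-size possibility from inside a job is to report the capacity and free space of the scratch path; a minimal sketch (it reads the $L_SCRATCH variable used in the config above, falling back to /tmp):

# Sketch: report capacity and free space of the worker scratch filesystem.
import os
import shutil

scratch = os.environ.get("L_SCRATCH", "/tmp")  # falls back to /tmp if unset
total, used, free = shutil.disk_usage(scratch)
gib = 1024 ** 3
print(f"{scratch}: total={total / gib:.0f} GiB, free={free / gib:.0f} GiB")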

I just ran it interactively on one of our nodes and there wasn't any problem. I ran several at the same time in order to fill up the local scratch. This caused an error, but it was not the PSIO error we've seen. It could have caused the "unknown errors" though.

I submitted the errored task to a private instance to make sure the managers behave correctly. I submitted a manager job pretty much identical to yours, and it all behaves as expected, with only one psi4 job running using 32 cores. One difference is that I submitted mine with --exclusive, because our nodes have 128 cores/256 GB of memory.

If you're interested in a live debugging session, send me an email. I will have time next week.

@peastman
Author

I'll try using $SCRATCH (100 TB quota) instead of $L_SCRATCH. Live debugging could be difficult. After I submit a job, it can sometimes take a few hours for it to start running, and then I have to let it run for about an hour before it fails.

That dataset has now finished, computed only on the 1 TB, 128-core nodes. The dataset I'm working on right now has smaller molecules (50 atoms maximum) that can run successfully on the smaller nodes. But one of the other datasets I'll be computing later has even bigger ones, up to 110 atoms.

@jchodera

This still sounds like a local cluster batch submission script configuration issue, right?

@dotsdl @mikemhenry: Is there any way for us to try running workers on our local cluster (lilac) as well, where we are more confident we have the correct configuration settings, so they can monitor whether those fail too? Eventually we will want to pass our own QCFractal responsibilities to @chrisiacovella too, so this might be a good option for training.

@peastman
Author

It's still failing when using $SCRATCH. Check these records: 119213804, 119213805, 119213806.

How did you find the error log shown above? As before, all three records have None for stdout and stderr, and error is set to {'error_type': 'unknown_error', 'error_message': 'QCEngine Unknown Error: Unknown error, error message is not found, possibly segfaulted', 'extras': None}.
