KeyError: checkpoint #1244

Closed

openpaul opened this issue Nov 1, 2021 · 12 comments
Labels
bug Something isn't working

Comments

@openpaul

openpaul commented Nov 1, 2021

Snakemake version
Snakemake 6.10.0

Describe the bug
I am using checkpoints, and when a job fails, Snakemake aborts with a KeyError instead of gracefully finishing the rest of the pipeline. I created a minimal example below.

Let me know whether this is caused by Snakemake or by me abusing checkpoints and directories.

An error message looks like this:

Traceback (most recent call last):
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/__init__.py", line 699, in snakemake
    success = workflow.execute(
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/workflow.py", line 1069, in execute
    success = self.scheduler.schedule()
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 440, in schedule
    self._finish_jobs()
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 534, in _finish_jobs
    self._handle_error(job)
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 615, in _handle_error
    self.running.remove(job)
KeyError: rule_A

Logs

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job          count    min threads    max threads
---------  -------  -------------  -------------
all              1              1              1
cpA              6              1              1
rule_A           6              1              1
summarize        6              1              1
total           19              1              1

Select jobs to execute...

[Mon Nov  1 11:10:26 2021]
checkpoint rule_A:
    output: data/ruleA/A/failed
    jobid: 9
    wildcards: project=A, sample=failed
    resources: tmpdir=/tmp
Downstream jobs will be updated after completion.

Waiting at most 5 seconds for missing files.
MissingOutputException in line 17 of /home/paul/tmp/key/Snakefile:
Job Missing files after 5 seconds:
data/ruleA/A/failed
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 9 completed successfully, but some output files are missing. 9
Job failed, going on with independent jobs.
Select jobs to execute...

[Mon Nov  1 11:10:31 2021]
checkpoint rule_A:
    output: data/ruleA/B/b
    jobid: 15
    wildcards: project=B, sample=b
    resources: tmpdir=/tmp
Downstream jobs will be updated after completion.

Waiting at most 5 seconds for missing files.
MissingOutputException in line 17 of /home/paul/tmp/key/Snakefile:
Job Missing files after 5 seconds:
data/ruleA/A/failed
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 9 completed successfully, but some output files are missing. 9
Traceback (most recent call last):
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/dag.py", line 493, in check_and_touch_output
    wait_for_files(
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 825, in wait_for_files
    raise IOError(
OSError: Missing files after 5 seconds:
data/ruleA/A/failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 529, in _finish_jobs
    self.get_executor(job).handle_job_success(job)
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 608, in handle_job_success
    super().handle_job_success(job)
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 265, in handle_job_success
    job.postprocess(
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/jobs.py", line 1011, in postprocess
    self.dag.check_and_touch_output(
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/dag.py", line 500, in check_and_touch_output
    raise MissingOutputException(
snakemake.exceptions.MissingOutputException: Job Missing files after 5 seconds:
data/ruleA/A/failed
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 9 completed successfully, but some output files are missing. 9

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/__init__.py", line 699, in snakemake
    success = workflow.execute(
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/workflow.py", line 1069, in execute
    success = self.scheduler.schedule()
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 440, in schedule
    self._finish_jobs()
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 534, in _finish_jobs
    self._handle_error(job)
  File "/home/paul/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 615, in _handle_error
    self.running.remove(job)
KeyError: rule_A

Minimal example
I reduced my workflow to this minimal non-working example. If you omit the failed sample, the workflow works. Upon failure, the pipeline does not even create the files it still could create; in my opinion it should do so.

import os



rule all:
    input:
        expand("data/done/{project}/{sample}/done",
            project = ['A', 'B'],
            sample = ['a', 'b', 'failed'])

# rule_A and rule_B deliberately produce no output for sample "failed",
# which triggers the MissingOutputException seen in the log above.
checkpoint rule_A:
    output:
        out = directory("data/ruleA/{project}/{sample}/")
    shell:
        """
        if [[ "{wildcards.sample}" != "failed" ]]; then
            mkdir -p {output.out}
            NAME=$(hexdump -n 4 -v -e '/1 "%02X"'  /dev/urandom)
            touch {output.out}/${{NAME}}_1.fa
            touch {output.out}/${{NAME}}_2.fa
            touch {output.out}/${{NAME}}_3.fa
        fi
        """

checkpoint rule_B:
    output:
        out = directory("data/ruleB/{project}/{sample}/")
    shell:
        """
        if [[ "{wildcards.sample}" != "failed" ]]; then
            mkdir -p {output.out}
            NAME=$(hexdump -n 4 -v -e '/1 "%02X"' /dev/urandom)
            touch {output.out}/${{NAME}}_1.fa
            touch {output.out}/${{NAME}}_2.fa
            touch {output.out}/${{NAME}}_3.fa
        fi
        """

def load_folder(wc):
    from glob import glob
    files = []
    folder_A = checkpoints.rule_A.get(project = wc.project, sample = wc.sample).output.out
    folder_B = checkpoints.rule_B.get(project = wc.project, sample = wc.sample).output.out
    for p in [folder_A, folder_B]:
        files.extend(glob(os.path.join(p,"*.fa")))
    return files


checkpoint cpA:
    input:
        load_folder
    output:
        out = directory("data/combined/{project}/{sample}")
    shell:
        """
        mkdir -p {output}
        cp {input} {output}
        """

def get_stats(wc):
    from glob import glob
    folder = checkpoints.cpA.get(project = wc.project, sample = wc.sample).output.out
    fasta = [os.path.basename(x).rsplit(".", 1)[0] for x in glob(os.path.join(folder, "*.fa"))]
    return expand("data/stats/{project}/{sample}/{file}.csv",
            project = wc.project,
            sample = wc.sample,
            file=fasta)

rule make_stats:
    input:
        out = "data/combined/{project}/{sample}/{file}.fa"
    output:
        "data/stats/{project}/{sample}/{file}.csv"
    shell:
        """
        touch {output}
        """

rule summarize:
    input:
        get_stats
    output:
        touch("data/done/{project}/{sample}/done")
    shell:
        """
        echo "Done"
        """
openpaul added the bug label Nov 1, 2021
@openpaul
Author

openpaul commented Nov 1, 2021

The error message looks a lot like #1024, but since that bug has been fixed, I assume this one is similar but distinct.

@openpaul
Author

openpaul commented Nov 1, 2021

This seems to be a regression introduced between 6.8.0 and 6.8.1.

To be exact, it appears to have been introduced by commit 41a5071, as everything works as expected before that commit.

Reverting commit 41a5071 on main (git revert 41a5071b) solves the issue for me, so I am fairly sure the cause is somewhere in there.

@iamh2o
Contributor

iamh2o commented Nov 15, 2021

On 6.10.0 I am experiencing what appears to be the same bug (except that I don't use checkpoints, so I don't know if it's exactly the same, but the same lines of code are complaining). For me it occurs unpredictably and mostly hits when I run really large workflows (big meaning 10,000-100,000 jobs). It seems to be correlated with releasing jobs from the DAG, largely due to error states (though sometimes successful jobs trigger the same kind of failure). This pipeline has an expected high failure rate, so it might be particularly vulnerable.

The error happens both when running on a single local box and via an Ubuntu SGE cluster setup or a CentOS AWS-hosted UGE cluster, on Python 3.8 and 3.9. It's a pretty serious issue to work around, given the time and cost of re-running these jobs several times hoping to squeak through.

The most common case is preceded by the missing-files error:

MissingOutputException in line 451 of  marigold-pipes/workflow/rules/glimpse.smk:     
Job Missing files after 3 seconds:                                                                         
   ....                                                                                        
This might be due to filesystem latency. If that is the case, consider to increase the wait time with ...

Which is fine, since I've set -k, but not a single workflow bigger than ~10K jobs has completed without a version of what is described above happening and crashing the whole workflow (usually many hours into the run).

Here is an example of the most common killer double exception:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lib/python3.8/site-packages/snakemake/scheduler.py", line 529, in _finish_jobs
    self.get_executor(job).handle_job_success(job)
  File "lib/python3.8/site-packages/snakemake/executors/__init__.py", line 870, in handle_job_success
    super().handle_job_success(
  File "lib/python3.8/site-packages/snakemake/executors/__init__.py", line 265, in handle_job_success
    job.postprocess(
  File "lib/python3.8/site-packages/snakemake/jobs.py", line 1011, in postprocess
    self.dag.check_and_touch_output(
  File "lib/python3.8/site-packages/snakemake/dag.py", line 500, in check_and_touch_output

I also get the same crash from what seem to be successfully completed jobs failing to be removed.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lib/python3.8/site-packages/snakemake/__init__.py", line 699
, in snakemake
    success = workflow.execute(
  File "lib/python3.8/site-packages/snakemake/workflow.py", line 1069, in execute
    success = self.scheduler.schedule()
  File "lib/python3.8/site-packages/snakemake/scheduler.py", line 440, in schedule
    self._finish_jobs()
  File "/lib/python3.8/site-packages/snakemake/scheduler.py", line 534, in _finish_jobs
    self._handle_error(job)
File "lib/python3.8/site-packages/snakemake/scheduler.py", line 615, in _handle_error                                                                                        
    self.running.remove(job) 
KeyError: pre_prep_raw_fq

The error KeyError: pre_prep_raw_fq implies that the rule name is the job name, which would be problematic when a rule is parallelized. I can't imagine the name presented is actually the key, but for completeness I'm mentioning it because, at the very least, the error is not very helpful in discriminating which job of that rule was the problem.
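
As an aside on why the rule name shows up at all: Python's set.remove() raises a KeyError whose message is the repr of the element being removed, so removing the same job object twice surfaces as "KeyError: <rule name>". A toy sketch of that mechanism (not Snakemake's actual Job class; the repr behavior is an assumption):

class Job:
    def __init__(self, rule):
        self.rule = rule

    def __repr__(self):
        # assumption: Snakemake's Job repr boils down to the rule name
        return self.rule


running = set()
job = Job("pre_prep_raw_fq")
running.add(job)

running.remove(job)      # first removal succeeds
try:
    running.remove(job)  # removing the same job a second time
except KeyError as err:
    print(f"KeyError: {err}")  # prints: KeyError: pre_prep_raw_fq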

I've put an ipython embed breakpoint at the point where it happens, to see if anything jumped out at me when I poked around in the code frozen in the exception state, or if I could figure out a hacky additional exception handler to prevent the crashes; I could not :-/ And in the ipython sessions, the job object seems intact, or at least it has what seem to be sensible properties.

It also seems to me that something is off with how these missing-file exceptions are handled... are they processed serially, single-threaded? From what I'm seeing, if 4 of these exceptions happen and I have the latency wait set to 60, they all finish waiting one after the other, so 4 minutes of holding, not 1; this seems like a good place for a queue to form and get overloaded. But I've not yet found where that handling happens, so it's just a hunch.

I tried squashing the exception in a variety of ways to see if things would proceed; unsurprisingly, they all caused more problems than they solved.

I've also tried slowing down the rate of submitted jobs and limiting the total number (which can be quite high).
There seems to be little correlation with the data itself, just with the amount of it and the number of jobs.
I'm right now testing setting the latency to zero while adding global sleep statements to the end of all of my rules... I'll report back on that.

I'll try reverting the commit that helped openpaul, but I may not be able to, depending on whether I've adopted a newly rolled-out feature that would be impacted (and that is assuming these two issues even have the same root cause).

Is there anything I could dig up from the next occurrence that would be helpful? I'm not very familiar with this codebase, so I am a little lost, but happy to take some direction.

thanks-- jem

@cjops
Contributor

cjops commented Dec 20, 2021

I believe I am experiencing the same or a similar bug on Snakemake 6.12.1.

I can reproduce the error when these two conditions are met:

  • Running Snakemake with the --keep-going flag
  • Using a slightly faulty --cluster-sync submission script that returns exit code 0 even when the submitted job has been aborted and did not produce the expected output (a minimal sketch of such a wrapper follows below)
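
For concreteness, a minimal sketch of such a faulty wrapper (hypothetical script, not the one actually used here; it assumes Snakemake appends the jobscript path as the last argument of the --cluster-sync command):

#!/usr/bin/env python3
# faulty_sync.py -- illustration only. A correct --cluster-sync wrapper should
# block until the job finishes and exit with the job's status; this one always
# exits 0, so Snakemake thinks the job succeeded even when it was aborted and
# produced no output.
import subprocess
import sys

jobscript = sys.argv[-1]              # jobscript path appended by Snakemake
subprocess.run(["bash", jobscript])   # a real wrapper would submit, e.g., via "qsub -sync y"
sys.exit(0)                           # bug: swallow the job's true exit status

Invoked roughly as: snakemake --keep-going --cluster-sync "python faulty_sync.py" ...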

Even though the job is initially reported as a success, Snakemake checks for missing files and catches the failure. I get the following output and it appears that all is well:

Waiting at most 60 seconds for missing files.
MissingOutputException in line 237 of /u/project/gandalm/jops/X/Snakefile:
Job Missing files after 60 seconds:
results/a/a.hla.top
results/a/a.hla.all
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 108 completed successfully, but some output files are missing. 108
Removing output files of failed job bwa_postalt since they might be corrupted:
results/a/a.postalt.bam
Job failed, going on with independent jobs.

However, later on Snakemake appears to check for missing files a second time, and I get a KeyError as it presumably tries to remove the job from the set of running jobs again.

Waiting at most 60 seconds for missing files.
MissingOutputException in line 237 of /u/project/gandalm/jops/X/Snakefile:
Job Missing files after 60 seconds:
results/a/a.postalt.bam
results/a/a.hla.top
results/a/a.hla.all
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 108 completed successfully, but some output files are missing. 108
Traceback (most recent call last):
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/dag.py", line 494, in check_and_touch_output
    wait_for_files(
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 825, in wait_for_files
    raise IOError(
OSError: Missing files after 60 seconds:
results/a/a.postalt.bam
results/a/a.hla.top
results/a/a.hla.all

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 529, in _finish_jobs
    self.get_executor(job).handle_job_success(job)
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 870, in handle_job_success
    super().handle_job_success(
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 265, in handle_job_success
    job.postprocess(
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/jobs.py", line 1017, in postprocess
    self.dag.check_and_touch_output(
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/dag.py", line 501, in check_and_touch_output
    raise MissingOutputException(
snakemake.exceptions.MissingOutputException: Job Missing files after 60 seconds:
results/a/a.postalt.bam
results/a/a.hla.top
results/a/a.hla.all
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 108 completed successfully, but some output files are missing. 108

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/__init__.py", line 699, in snakemake
    success = workflow.execute(
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/workflow.py", line 1073, in execute
    success = self.scheduler.schedule()
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 440, in schedule
    self._finish_jobs()
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 534, in _finish_jobs
    self._handle_error(job)
  File "/u/project/gandalm/jops/conda/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 615, in _handle_error
    self.running.remove(job)
KeyError: bwa_postalt

Delving into the codebase, I believe the problem does indeed seem to have been introduced by commit 41a5071, mentioned by openpaul above. Because the submitted job initially appears to have been a success, the _finish_jobs function of scheduler.py is at play. This function returns after handling a RuleException or WorkflowError (such as a MissingOutputException). However, I think that when Snakemake is run with the --keep-going flag, a continue statement would be more appropriate, so that the _tofinish queue is cleared.
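
In other words (paraphrased sketch only, not the actual snakemake/scheduler.py source; attribute names such as self._tofinish and self.keepgoing are stand-ins), the control flow looks roughly like this, and the suggestion is to continue instead of return when --keep-going is set:

from snakemake.exceptions import RuleException, WorkflowError  # real exception classes

def _finish_jobs(self):
    for job in self._tofinish:
        try:
            self.get_executor(job).handle_job_success(job)
        except (RuleException, WorkflowError):
            # e.g. the MissingOutputException raised by check_and_touch_output()
            self._handle_error(job)   # this already removes job from self.running
            if self.keepgoing:
                continue              # suggested: keep draining the _tofinish queue
            return                    # current: the rest of _tofinish stays queued, so the
                                      # same job can be revisited and removed a second time,
                                      # which is what raises the KeyError
        self.running.remove(job)
    self._tofinish.clear()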

@iamh2o
Contributor

iamh2o commented Jan 14, 2022

Ah! I don't think I made it clear that this problem presents for me when using --cluster qsub_custom_script (with status and job scripts too), and also since I realized I could use DRMAA (SO AWESOME); the problem seems even worse using --drmaa (worse = 20% more frequent?).

It has continued through the recent minor upgrades. This bug, occurring in upwards of 60% of my Snakemake runs, has become untenable to manage; the time needed to recover from such a large percentage of often large runs is substantial. I would have moved to trying Nextflow, but it does not support DRMAA, which, at least for my cluster, is the only stable way to manage 100,000 jobs. --cluster-status being hit hundreds to thousands of times a minute == sad face. Reducing the frequency of status checks == different sad face... if you add a non-trivial delay to job cleanup and the next launch at these orders of magnitude, it can add a day to your analysis.

One hacky, very sub-ideal workaround was to add my own 120-second sleep at the end of every rule and then, immediately after, test for the presence of all {output} and exit >= 1 if any were missing. This largely helped, but as resources are more in use and I parallelize less, the cost of this hack is now driving me toward other solutions, which was going to be exporting to CWL (ugh) and seeing if a CWL executor would help me.

BUT!!! Thanks to @cjops, I have a ray of hope :-). I'm going to go grab this fix branch and give it a go right now. I will let you know what I see (though it could be a day or so). One worrisome observation: I find it surprising there is not more of an uproar about this bug considering its impact, and I still wonder if somehow I'm exacerbating the problem on my end. We'll see -- thanks again @cjops -- and I'll go drop a quick note on your PR.

@Maarten-vd-Sande
Contributor

@iamh2o you're not exacerbating anything, this is super annoying 😄. It's a pain that the workflow doesn't end gracefully!

@cjops
Contributor

cjops commented Jan 14, 2022

@iamh2o Thanks for testing! I'm curious to see if it fixes the problem in your case too. I don't have much experience with the DRMAA scheduler, so I can't tell you what may be exacerbating the problem on your end. But maybe this bug is overlooked because most people don't rely on the --keep-going flag, I don't know. I found it annoying too.

@iamh2o
Contributor

iamh2o commented Jan 15, 2022

Looks good to me!
This is awesome. I pulled your PR fix branch and am using it. I have not had a crash yet (and I'm well past where it would usually have taken the whole DAG down). I also forced a long-running job to miss a file every time for an early rule in a smaller DAG, while other work for the samples (this was 47 samples and ~2000 rules/jobs) went on in parallel**, and that has not triggered the problem... I'm pretty psyched. This appears to have resolved the bug I was seeing.

Well, maybe one of the bugs... It had seemed to me like this problem might have two drivers: one in the scheduler, which you seem to have fixed, and a less common case that always seemed to involve the benchmark file being the missing file. I had gotten around that one by touching the file, and I'm not testing it now since it had a quick fix; I'm mostly mentioning it for future web searchers. I googled for benchmark files being related to this behavior and found very little -- but maybe not everyone creates 100,000 benchmark files at a pop? I'll report that separately if it recurs (after I process this backlog).

**I say in parallel, but when the failing rule's errors began, the scheduler stopped adding new tasks, despite the failure not impacting 80% of the remaining work. I had set the latency wait time to 100 seconds. Well, it turns out this wait is not parallelized, and the failing task was parallelized, so ~300 jobs needed to crash and clear, and they queue serially when they overlap. And as long as there is a queue, the scheduler does not submit new jobs.

All of the failed jobs died within minutes of each other, but it took 9 hours to process the latency-wait missing-files errors one at a time; after 30 minutes of this, the cluster had no jobs running, despite almost all of the remaining work not being blocked by the failed node. I can see this not being urgent to fix, if at all, since probably no one else is failing this frequently while operationalizing things, but it seemed like strange behavior and I figured I'd mention it in case it is in fact unexpected.

But my jobs are rolling and I am free from wrangling with the export-to-CWL features -- thanks @cjops!

jem

@iamh2o
Contributor

iamh2o commented Jan 23, 2022

I've still had a few failures of the same style, but massively reduced in frequency. This is definitely a winner. And I've now tested it on a CentOS UGE cluster hosted on AWS as well. Please roll this one into the next release :-)

jem

@iamh2o
Contributor

iamh2o commented Feb 20, 2022

The patch is simple enough to apply manually to new versions of Snakemake, so I have kept up to date, and the problem's severity is greatly reduced. However, this is unfortunately not a complete fix, as the error is still occurring, though at a much reduced frequency. It also feels related to #1323, which I have also observed.

I continue to use my hack to further ameliorate the problem, which is defining global_latency_wait = 'sleep 60' at the beginning of the Snakefile and, at the end of every rule's shell block, running {global_latency_wait}; ls -lth {output} || exit 33; to force a hard rule failure rather than going into the dragged-out 'waiting for files' cycle. The sleep also seems to help mitigate the cases where a rule finishes before Snakemake can notice, which leaves the whole workflow hung.
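
For future searchers, a minimal Snakefile sketch of that hack (the rule name, the echo command, and exit code 33 are just placeholders/conventions of this workaround, nothing Snakemake-specific):

global_latency_wait = "sleep 60"

rule some_rule:
    output:
        "results/{sample}.txt"
    shell:
        """
        echo "work done" > {output}
        # wait out filesystem latency, then fail hard if the declared
        # outputs are still not visible
        {global_latency_wait}; ls -lth {output} || exit 33
        """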

This suggests a feature that could be useful: similar to onstart:, onerror:, and onsuccess:, a global prerule: and postrule: directive that would add the contained code to the beginning or end of every rule (sketched below).
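
Purely as an illustration of that suggestion (hypothetical syntax -- neither directive exists in Snakemake today):

prerule:
    "echo 'starting {rule}'"

postrule:
    "sleep 60; ls -lth {output} || exit 33"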

@johanneskoester
Contributor

johanneskoester commented Feb 21, 2022

I believe this is fixed by PR #1332 and #1416. Please reopen if I am wrong.

@xapple

xapple commented May 24, 2023

I'm running version 7.25.3 and just got this bug when using --immediate-submit and thus submitting more jobs than usual to SLURM. Here is the traceback:

Traceback (most recent call last):
  File "~/miniconda3/envs/snake/lib/python3.9/site-packages/snakemake/__init__.py", line 757, in snakemake
    success = workflow.execute(
  File "~/miniconda3/envs/snake/lib/python3.9/site-packages/snakemake/workflow.py", line 1100, in execute
    raise e
  File "~/miniconda3/envs/snake/lib/python3.9/site-packages/snakemake/workflow.py", line 1096, in execute
    success = self.scheduler.schedule()
  File "~/miniconda3/envs/snake/lib/python3.9/site-packages/snakemake/scheduler.py", line 516, in schedule
    self._finish_jobs()
  File "~/miniconda3/envs/snake/lib/python3.9/site-packages/snakemake/scheduler.py", line 642, in _finish_jobs
    self.dag.finish(job, update_dynamic=self.update_dynamic)
  File "~/miniconda3/envs/snake/lib/python3.9/site-packages/snakemake/dag.py", line 1609, in finish
    self._running.remove(job)
KeyError: run_mapping
