KeyError: checkpoint #1244
The error message looks a lot like #1024, but since that bug has been fixed, I assume this one is similar but distinct.
This seems to be a regression introduced between 6.8.0 and 6.8.1. To be exact, it appears to have been introduced with commit 41a5071, as everything works as expected before that commit; reverting commit 41a5071 on main resolves the issue for me.
On 6.10.0 I am experiencing what appears to be the same bug (except I don't use checkpoints, so I can't be sure it's exactly the same, but the same lines of code are complaining). For me it occurs unpredictably and mostly hits when I run really large workflows (10,000-100,000 jobs). It seems to be correlated with releasing jobs from the DAG, largely due to error states (though sometimes successful jobs trigger the same kind of failure). This pipeline has an expected high failure rate, so it might be particularly vulnerable. The error happens whether I run on a single local box, an Ubuntu SGE cluster, or a CentOS UGE cluster hosted on AWS, with Python 3.8 and 3.9. It's a pretty serious issue to work around, given the time and cost of re-running these jobs several times hoping to squeak through. The most common case is preceded by the missing-files error.
Which is fine; I've set -k, but not a single workflow bigger than ~10K jobs has completed without a version of what is described above happening and crashing the whole workflow, usually many hours into the run. Here is an example of the most common killer double exception:
I also get the same crash when jobs that seem to have completed successfully fail to be removed.
I've put an ipython embed breakpoint at the point where the error happens, to see if anything jumped out at me when I poked around in the code frozen in the exception state, or if I could figure out a hacky additional exception handler to prevent the crashes. I could not :-/ In the ipython sessions, the job object seems intact, or at least it has what seem to be sensible properties. It seems to me that there is something going on with how these missing-file exceptions are handled... are they serialized on a single thread? From what I'm seeing, if 4 of these exceptions happen and I have latency-wait set to 60, they all finish waiting one after the other, so 4 minutes of holding, not 1. This seems like a good place for a queue to form and get overloaded, but I've not yet found where that is happening in the code, so it's just a hunch. I tried squashing the exception in a variety of ways to see if things would proceed, which, unsurprisingly, all caused more problems than they solved. I've also tried slowing down the rate of job submission and limiting the total number of submitted jobs (which can be quite high). I'll try reverting the commit that helped openpaul, but may not be able to, depending on whether I've adopted a newly released feature that would be impacted (and that is assuming these two issues even have the same root cause). Is there anything I could dig up from the next instance that would be helpful? I'm not very familiar with this codebase, so I am a little lost, but happy to take some direction. thanks -- jem
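To make that hunch concrete, here is a toy sketch of how serialized latency waits would stack up. The function below is a hypothetical stand-in, not Snakemake's actual code:

```python
import os
import time

LATENCY_WAIT = 60  # seconds, as with --latency-wait 60

def wait_for_files(files, timeout=LATENCY_WAIT):
    # Hypothetical stand-in for a latency-wait check: poll until the
    # files appear or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(os.path.exists(f) for f in files):
            return True
        time.sleep(1)
    return False

# If four jobs with missing outputs are handled one after another on a
# single thread, the scheduler stalls for 4 * 60s in total, not 60s:
for job_files in (["a.txt"], ["b.txt"], ["c.txt"], ["d.txt"]):
    wait_for_files(job_files)  # each call blocks for the full timeout
```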
I believe I am experiencing the same or a similar bug on Snakemake 6.12.1. I can reproduce the error when these two conditions are met:
Even though the job is initially reported as a success, Snakemake checks for missing files and catches the failure. I get the following output and it appears that all is well:
However, later on Snakemake appears to check for missing files a second time, and I get a KeyError as it presumably tries to remove the job from the set of running jobs again.
Delving into the codebase, I believe the problem was indeed introduced by commit 41a5071, mentioned by openpaul above. Because the submitted job initially appears to have been a success, it is removed from the set of running jobs right away; when the missing-file check later flags the job as failed, the second removal attempt on that set raises the KeyError.
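A minimal sketch of that failure mode (the names below are illustrative, not Snakemake's actual internals): removing an element from a plain set twice raises KeyError, whereas a guarded removal or discard() does not.

```python
# Illustrative only: these names do not mirror Snakemake's internals.
running = {"job_a", "job_b"}

def release(job):
    # Remove a finished job from the set of running jobs.
    running.remove(job)  # raises KeyError if the job was already removed

release("job_a")  # first removal: fine

try:
    release("job_a")  # second removal of the same job: crashes
except KeyError as e:
    print("KeyError:", e)

# A tolerant variant would use discard(), a no-op when absent:
running.discard("job_a")
```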
Ah! I do not think I made it clear that this problem presents for me using DRMAA, and it has continued through the recent minor upgrades. This bug, occurring in upwards of 60% of my Snakemake runs, has become untenable to manage; the time needed to recover from such a large percentage of often large runs is substantial. I would have moved to trying Nextflow, but they do not support DRMAA, which, at least for my cluster, is the only stable way to manage 100,000 jobs. One hacky, very sub-ideal workaround was to set my own 120-second sleep at the end of every rule, then immediately after test for the presence of all {output} and exit >= 1 if any were missing. This largely helped, but as resources are more in use and I parallelize less, the cost of this hack is now also driving me toward other solutions, which was going to be exporting to CWL (ugh) and seeing if a CWL executor would help me. BUT!!! Thanks to @cjops, I have a ray of hope :-). I'm going to go grab this fix branch and give it a go right now. Will let you know what I see (though it could be a day or so). One worrisome observation: I find it surprising there is not more of an uproar about this bug considering its impact, and I still wonder if somehow I'm exacerbating the problem on my end. We'll see -- thanks again @cjops -- and I'll go drop a quick note on your PR.
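For reference, the workaround looks roughly like the sketch below; the rule name and the run_tool command are placeholders, not my actual pipeline:

```python
rule example:
    output:
        "results/{sample}.out"
    shell:
        """
        run_tool {wildcards.sample} > {output}

        # Workaround: give shared storage time to catch up, then verify
        # that every declared output exists before the job is reported
        # as a success.
        sleep 120
        for f in {output}; do
            [ -e "$f" ] || exit 1
        done
        """
```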
@iamh2o you're not exacerbating, this is super annoying 😄. It's a pain that the workflow doesn't end gracefully!
@iamh2o Thanks for testing! Curious to see if it fixes the problem in your case too. I don't have much experience with the DRMAA scheduler, so I can't tell you what may be exacerbating the problem on your end. But maybe this bug is overlooked because most people don't rely on the DRMAA executor.
Looks good to me!
But my jobs are rolling, and I am free from wrangling with the export-to-CWL features -- thanks @cjops! jem
I've still had a few failures of the same style, but massively reduced in frequency. This is definitely a winner. And I've now tested it on a CentOS UGE cluster hosted on AWS as well. Please roll this one into the next release :-) jem
The patch is simple enough to apply manually to new versions of Snakemake, so I have kept up to date, and the problem's severity is greatly reduced. However, this is unfortunately not a complete fix, as the error still occurs, though at a much lower frequency. It also feels related to #1323, which I have also observed. I continue to use my hack of defining a sleep-and-verify step at the end of every rule to further ameliorate the problem. This suggests a feature that could be useful: similar to onstart:, onerror:, and onsuccess:, a global prerule: and postrule: directive that would add the contained code to the beginning or end of every rule (sketched below).
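As purely hypothetical syntax (Snakemake has no such directives today), the idea would be something like:

```python
# Hypothetical feature sketch: prerule:/postrule: do not exist in
# Snakemake; this only illustrates the suggestion.
postrule:
    shell:
        """
        # Appended to every rule: wait out storage latency, then fail
        # loudly if any declared output is missing.
        sleep 120
        for f in {output}; do
            [ -e "$f" ] || exit 1
        done
        """
```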
Running version 7.25.3 and just got this bug when using …
Snakemake version
Snakemake 6.10.0
Describe the bug
I am using checkpoints, and if a job fails, instead of gracefully finishing the rest of the pipeline, Snakemake fails with a KeyError. I created a minimal example below.
Let me know if this is caused by Snakemake or by me abusing checkpoints and directories.
An error message looks like this:
Logs
Minimal example
I reduced my workflow to this minimal non-working example. If you omit the failing sample, the workflow works. Upon failure, the pipeline does not even create the files it still could create; in my opinion, it should do so.
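The original example code did not survive here; a hypothetical sketch of the pattern described (a checkpoint producing a directory, with one sample whose job always fails) might look like the following, where all rule, file, and sample names are stand-ins:

```python
# Hypothetical reconstruction, not the reporter's actual Snakefile.
SAMPLES = ["good", "failed"]

rule all:
    input:
        expand("results/{sample}.done", sample=SAMPLES)

# A checkpoint whose output is a directory, so downstream inputs are
# only known after it has run.
checkpoint split:
    output:
        directory("splits/{sample}")
    shell:
        "mkdir -p {output} && touch {output}/part1.txt"

rule process:
    input:
        "splits/{sample}/{part}.txt"
    output:
        "processed/{sample}/{part}.txt"
    shell:
        # The "failed" sample simulates a job that always errors out.
        "test {wildcards.sample} != failed && cp {input} {output}"

def gather_parts(wildcards):
    # Re-evaluate the DAG once the checkpoint for this sample is done.
    outdir = checkpoints.split.get(sample=wildcards.sample).output[0]
    parts = glob_wildcards(outdir + "/{part}.txt").part
    return expand("processed/{sample}/{part}.txt",
                  sample=wildcards.sample, part=parts)

rule aggregate:
    input:
        gather_parts
    output:
        "results/{sample}.done"
    shell:
        "touch {output}"
```

Per the report, running this with the failing sample included crashes with the KeyError instead of finishing the jobs that could still succeed.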