Key error with rule name when running large DRMAA jobs #1392

Closed
mrvollger opened this issue Feb 10, 2022 · 1 comment

Labels
bug Something isn't working

Comments

@mrvollger
Contributor

mrvollger commented Feb 10, 2022

Snakemake version
Any version >=6.8.0

Describe the bug
When running large DRMAA workflows, if a job fails (for example, because it was killed), Snakemake raises a missing-file error and then exits ungracefully with a KeyError naming the rule.

Logs
Output of Snakemake as the jobs fails

MissingOutputException in line 25 of https://github.com/mrvollger/Rhodonite/raw/v0.12-alpha/workflow/rules/trf.smk:
Job Missing files after 60 seconds:
results/HG01978.pat/trf/183-of-200/183-of-200.dat
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 8509 completed successfully, but some output files are missing. 8509
Traceback (most recent call last):
  File "/net/eichler/vol26/15000/nobackups/mvollger/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/__init__.py", line 699, in snakemake
    success = workflow.execute(
  File "/net/eichler/vol26/15000/nobackups/mvollger/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/workflow.py", line 1073, in execute
    success = self.scheduler.schedule()
  File "/net/eichler/vol26/15000/nobackups/mvollger/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 440, in schedule
    self._finish_jobs()
  File "/net/eichler/vol26/15000/nobackups/mvollger/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/scheduler.py", line 540, in _finish_jobs
    self.running.remove(job)
KeyError: run_split_trf

Log file of the job says it was killed (not enough memory).
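
For readers unfamiliar with this failure mode, here is a minimal sketch of how a call like `self.running.remove(job)` can end in `KeyError: run_split_trf`. It is not Snakemake's code: the `Job` class, the helper names, and the idea that the same job gets finished twice are all assumptions made for illustration. It only shows that `set.remove` raises when the element is already gone, and that the error message is the job's `repr`.

```python
# Hypothetical stand-in for a scheduler's bookkeeping; NOT Snakemake's code.
class Job:
    def __init__(self, rule_name):
        self.rule_name = rule_name

    def __repr__(self):
        # Assumption: the job prints as its rule name, which would explain
        # why the KeyError message is just "run_split_trf".
        return self.rule_name


def finish_strict(running, job):
    """Mirror the failing call: set.remove raises KeyError if job is absent."""
    running.remove(job)


def finish_tolerant(running, job):
    """set.discard is a no-op when the job was already removed."""
    running.discard(job)


def demo():
    job = Job("run_split_trf")
    running = {job}
    finish_strict(running, job)      # first completion: job is removed
    try:
        finish_strict(running, job)  # assumed second completion of the same job
    except KeyError as err:
        message = f"KeyError: {err}"
    else:
        message = "no error"
    finish_tolerant(running, job)    # never raises, even though job is gone
    return message


if __name__ == "__main__":
    print(demo())
```

Whether the real bug is a double removal (for instance, a job handled once by the error path and once by the success path) is not established by the traceback; the sketch only narrows down what kind of state makes that line fail.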

Minimal example
This only seems to happen when I submit a very large number of jobs (more than roughly 5000), so a minimal example is hard to create.

Additional context
If I downgrade to Snakemake 6.7, the problem seems to go away. I wonder if PR #1156 may have something to do with it?

@mrvollger mrvollger added the bug Something isn't working label Feb 10, 2022
@johanneskoester
Contributor

Thanks for reporting. I believe that this has been fixed with the 7.0 release. Please reopen if I am wrong.
