New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
QC read files deleted too early #132
Comments
This is expected behaviour. The rule 'circularize' is downstream of all the circularization methods. Because it no longer needs temp reads, it deletes them because all rules that use these reads have been running. |
My understanding is that Snakemake rebuilds the DAG every time it is run. So when you go |
This makes sense to me --> it can't find the QCed reads --> says I need to make the QCed reads --> the new QCed reads affect the following steps (in case the read polishing is non-deterministic) --> re-does everything post assembly. |
@LeeBergstrand Sorry, I think I didn't explain the problem clearly. You are right that it would be normal for snakemake to delete the short QC read files after reaching the end of the So: Does that make more sense? I think it is fundamentally a snakemake bug, but I think we could work around it temporarily by making the QC reads non-temp by default. However, I suspect that the bug will not appear if you have the pipeline run fully to the end. Let me know your thoughts here -- thanks! |
@jmtsuji Thanks for the explanation. I'm testing on my machine to confirm the behaviour encountered. To clarify, does this error only occur if you run |
Is there a separate rule for this polishing step? or is it calling polishing rules again somehow (is that why there is a |
Let's do this until we move to Snakemake 8. |
@LeeBergstrand Thanks for testing -- I am also running a few more test on my end with different run params. For now, I assume the error only occurs if I run Using |
It's calling the Polypolish rule again -- this is why there is a |
OK, sounds good. I am running a couple tests on my end to confirm that this fix works. |
Sounds good. Thanks for giving me an overview of how checkpointing works and how it applies to our software. I send you the test results when I get them. |
@jmtsuji I had to run
|
@jmtsuji I get the above error every time I run with |
@jmtsuji Changing it to keep the qc reads fixes the above error. For me, the pipeline, when run end to end, still passes even with the |
@LeeBergstrand My tests are also finished :-) Here are my results, using commit 864ca98 (pypolca branch): Main rotary run with test datasetIf
|
I got the same effect. |
@jmtsuji I recommend we do the following:
|
@jmtsuji I noticed that you use a Symlinking paradigm throughout the pipeline. For example: # Conditional based on whether short read polishing was performed
rule pre_coverage_filter:
input:
"{sample}/polish/medaka/{sample}_consensus.fasta" if POLISH_WITH_SHORT_READS == False else "{sample}/polish/polca/{sample}_polca.fasta"
output:
temp("{sample}/polish/cov_filter/{sample}_pre_filtered.fasta")
run:
source_relpath = os.path.relpath(str(input),os.path.dirname(str(output)))
os.symlink(source_relpath,str(output)) This was quite troublesome when I made the Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake is having trouble tracking what files are used by what rules. I'm not sure. I would bring up this symlinking issue later, but I wanted to put it here now to ensure you know about it. |
@jmtsuji It might also rerun the entire pipeline if the config file is modified. My tests aren't really supporting that but I would also look into it. I'm getting the following for rule:
Something is causing the forced execution of the very first rule. |
@LeeBergstrand Thanks for the thoughts -- I agree. Second issue for checkpointing is already created at #134 , and |
I've also run into symlinking issues while editing the pipeline and agree, it would be better to streamline these parts, if possible. Reducing the number of steps in the analysis this way might help with consistent DAG construction. (Just for reference for the future: for the polish steps, I think it's necessary to either copy or symlink the files before running the polish rules in order for the rules to be compatible with inputs from multiple modules. This is not the case for other rules, I think, so symlinking steps can probably be streamlined in many other cases.) I'll move this to a separate issue. |
Also interesting -- I haven't seen "Forced execution" as a reason to re-run |
Run / system info
I'm running rotary commit 40381cc on Ubuntu 22.04.3, within the snakemake argument
--until circularize
(to run everything up until the start of the annotation module).Problem description
After reaching
checkpoint split_circular_and_linear_contigs
, I encountered the following error:When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as
temp()
-- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back viatouch
(with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.Based on a quick skim of the GitHub issue flagged in the Snakemake error message (snakemake/snakemake#823), I suspect that the issue I am running into has to do with a bug in checkpointing. After hitting the
split_circular_and_linear_contigs
checkpoint, if my understanding is correct, Snakemake re-evaluates the rest of the DAG for the run. Until that point, snakemake doesn't know if the QC'ed reads will be needed again... it depends on if there are any circular contigs that get passed into thecircularization
module. It seems that Snakemake pre-emptively deleted thetemp()
QC read files before knowing if it was really safe to do so. As a result, I think the pipeline crashed once it realized that the QC read files were actually necessary files but were missing for continuing the run.Possible solution
Set the final QC read files as non-temp until we can find a better solution. An upgrade to Snakemake 8 might help fix this problem, or we might need to eventually contact the Snakemake authors about it.
The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.
@LeeBergstrand Any thoughts?
The text was updated successfully, but these errors were encountered: