Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QC read files deleted too early #132

Open
jmtsuji opened this issue Feb 15, 2024 · 24 comments
Open

QC read files deleted too early #132

jmtsuji opened this issue Feb 15, 2024 · 24 comments
Assignees
Labels
bug Something isn't working

Comments

@jmtsuji
Copy link
Collaborator

jmtsuji commented Feb 15, 2024

Run / system info

I'm running rotary commit 40381cc on Ubuntu 22.04.3, within the snakemake argument --until circularize (to run everything up until the start of the annotation module).

Problem description

After reaching checkpoint split_circular_and_linear_contigs, I encountered the following error:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - search_contig_start: S18/circularize/identify/S18_circular.faa, S18/circularize/identify/S18_circular.ffn, S18/circularize/identify/S18_circular.gff, S18/circularize/identify/S18_hmmsearch_hits.txt, S18/circularize/identify/S18_hmmsearch_hits_no_comments.txt
 - finalize_circular_contig_rotation: S18/circularize/combine/S18_circular.fasta
 - combine_circular_and_linear_contigs: S18/circularize/combine/S18_combined.fasta
 - process_start_genes: S18/circularize/identify/S18_start_genes.list
 - polish_polypolish: S18/circularize/polypolish/S18_R1.clean.sam, S18/circularize/polypolish/S18_R2.clean.sam, S18/circularize/polypolish/S18_polypolish.fasta, S18/circularize/polypolish/polypolish.debug.log, S18/stats/circularize/polypolish_changes.log
 - map_short_reads_for_polishing: S18/circularize/polypolish/S18_R1.sam, S18/circularize/polypolish/S18_R2.sam, S18/circularize/polypolish/input/S18_input.fasta.0123, S18/circularize/polypolish/input/S18_input.fasta.amb, S18/circularize/polypolish/input/S18_input.fasta.ann, S18/circularize/polypolish/input/S18_input.fasta.bwt.2bit.64, S18/circularize/polypolish/input/S18_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: S18/circularize/identify/S18_start_gene.ffn
 - prepare_polypolish_circularize_input: S18/circularize/polypolish/input/S18_input.fasta
 - run_circlator: S18/circularize/circlator/S18_rotated.fasta, S18/circularize/circlator, S18/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: S18/circularize/S18_circularize.fasta
 - get_polished_contigs: S18/circularize/filter/S18_circular.fasta
Complete log: .snakemake/log/2024-02-15T154852.164572.snakemake.log
Executing: snakemake --snakefile /2022_08_01_rotary_testing/rotary/rotary/rules/rotary.smk --configfile /2022_08_01_rotary_testing/output_dir/config.yaml --directory /2022_08_01_rotary_testing/output_dir --conda-prefix /2022_08_01_rotary_testing/rotary_db_dir/rotary_conda_env --conda-frontend mamba --use-conda --reason --rerun-incomplete --printshellcmds --jobs 48 --until circularize
Command 'snakemake --snakefile /2022_08_01_rotary_testing/rotary/rotary/rules/rotary.smk --configfile /2022_08_01_rotary_testing/output_dir/config.yaml --directory /2022_08_01_rotary_testing/output_dir --conda-prefix /2022_08_01_rotary_testing/rotary_db_dir/rotary_conda_env --conda-frontend mamba --use-conda --reason --rerun-incomplete --printshellcmds --jobs 48 --until circularize' returned non-zero exit status 1.

When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.

Based on a quick skim of the GitHub issue flagged in the Snakemake error message (snakemake/snakemake#823), I suspect that the issue I am running into has to do with a bug in checkpointing. After hitting the split_circular_and_linear_contigs checkpoint, if my understanding is correct, Snakemake re-evaluates the rest of the DAG for the run. Until that point, snakemake doesn't know if the QC'ed reads will be needed again... it depends on if there are any circular contigs that get passed into the circularization module. It seems that Snakemake pre-emptively deleted the temp() QC read files before knowing if it was really safe to do so. As a result, I think the pipeline crashed once it realized that the QC read files were actually necessary files but were missing for continuing the run.

Possible solution

Set the final QC read files as non-temp until we can find a better solution. An upgrade to Snakemake 8 might help fix this problem, or we might need to eventually contact the Snakemake authors about it.

The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

@LeeBergstrand Any thoughts?

@jmtsuji jmtsuji added the bug Something isn't working label Feb 15, 2024
@LeeBergstrand
Copy link
Collaborator

When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.

This is expected behaviour. The rule 'circularize' is downstream of all the circularization methods. Because it no longer needs temp reads, it deletes them because all rules that use these reads have been running.

@LeeBergstrand
Copy link
Collaborator

rotary run -s'--dag --until circularize' | grep -v 'Executing' | dot -Tpdf > until_circ_dag.pdf

until_circ_dag

@LeeBergstrand
Copy link
Collaborator

My understanding is that Snakemake rebuilds the DAG every time it is run. So when you go --until circularize, it builds the DAG until the rule species by the --until without regard to the next steps. As you can see from the DAG graph above, it isn't aware of any of the steps, so it deletes the temp files it doesn't need.

@LeeBergstrand
Copy link
Collaborator

When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.

This makes sense to me --> it can't find the QCed reads --> says I need to make the QCed reads --> the new QCed reads affect the following steps (in case the read polishing is non-deterministic) --> re-does everything post assembly.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 16, 2024

@LeeBergstrand Sorry, I think I didn't explain the problem clearly. You are right that it would be normal for snakemake to delete the short QC read files after reaching the end of the --until step specified. However, in my case, the pipeline is unable to finish the specified circularize rule. It gets stuck at the checkpoint a few steps before that and errors out early. I think the error is because temp() is deleting the QC read files too early.

So:
I start the pipeline and tell it --until circularize
The pipeline runs until the end of polish module normally (before the circularize module) -->
The pipeline doesn't know if short QC reads will be needed or not at the end of the circularize module due to the checkpoint in the middle of the module -->
The pipeline pre-emptively deletes the short QC reads (this is the problem, I think) -->
The pipeline evaluates the checkpoint partway through the circularize module and re-builds the DAG -->
The pipeline sees that short QC reads are actually needed to finish the circularize module -->
The pipeline produces the error shown above and refuses to finish the circularize module, because there are no short QC reads.

Does that make more sense? I think it is fundamentally a snakemake bug, but I think we could work around it temporarily by making the QC reads non-temp by default. However, I suspect that the bug will not appear if you have the pipeline run fully to the end. Let me know your thoughts here -- thanks!

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 16, 2024

Specifically, in the DAG figure you inserted (thanks for adding it!), here is the circularization module:
スクリーンショット 2024-02-16 12 46 08

split_circular_and_linear_contigs is a checkpoint. At that point, the DAG gets re-evaluated.

If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.

@LeeBergstrand
Copy link
Collaborator

@jmtsuji Thanks for the explanation. I'm testing on my machine to confirm the behaviour encountered. To clarify, does this error only occur if you run --until circularize and then restart the pipeline? Isn't --until circularize a corner case? I noticed you used it as an example in the README.

@LeeBergstrand
Copy link
Collaborator

If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.

Is there a separate rule for this polishing step? or is it calling polishing rules again somehow (is that why there is a {step} in some of the polishing rule strings)?

@LeeBergstrand
Copy link
Collaborator

The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

Let's do this until we move to Snakemake 8.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 16, 2024

@jmtsuji Thanks for the explanation. I'm testing on my machine to confirm the behaviour encountered. To clarify, does this error only occur if you run --until circularize and then restart the pipeline? Isn't --until circularize a corner case? I noticed you used it as an example in the README.

@LeeBergstrand Thanks for testing -- I am also running a few more test on my end with different run params. For now, I assume the error only occurs if I run --until with something after the checkpoint. No re-start of the pipeline is needed -- the error occurs partway through the run.

Using --until circularize gets you the genome without any annotations. This is a nice way to avoid the most resource and time-intensive parts of the pipeline, e.g., for testing.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 16, 2024

If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.

Is there a separate rule for this polishing step? or is it calling polishing rules again somehow (is that why there is a {step} in some of the polishing rule strings)?

It's calling the Polypolish rule again -- this is why there is a {step} wildcard (which can be either polish or circularize).

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 16, 2024

The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

Let's do this until we move to Snakemake 8.

OK, sounds good. I am running a couple tests on my end to confirm that this fix works.

@LeeBergstrand
Copy link
Collaborator

The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

Let's do this until we move to Snakemake 8.

OK, sounds good. I am running a couple tests on my end to confirm that this fix works.

Sounds good. Thanks for giving me an overview of how checkpointing works and how it applies to our software. I send you the test results when I get them.

@LeeBergstrand
Copy link
Collaborator

@jmtsuji I had to run --until circularize twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with --until circularize again:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - process_start_genes: cram/circularize/identify/cram_start_genes.list
 - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta
 - get_polished_contigs: blam/circularize/filter/blam_circular.fasta
 - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: blam/circularize/identify/blam_start_gene.ffn
 - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log
 - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt
 - get_polished_contigs: cram/circularize/filter/cram_circular.fasta
 - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta
 - symlink_circularization: blam/circularize/blam_circularize.fasta
 - process_start_genes: blam/circularize/identify/blam_start_genes.list
 - get_start_genes: cram/circularize/identify/cram_start_gene.ffn
 - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta
 - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: cram/circularize/cram_circularize.fasta
 - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac
 - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt
 - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log
 - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta
 - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta
 - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta
Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log

@LeeBergstrand
Copy link
Collaborator

@jmtsuji I had to run --until circularize twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with --until circularize again:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - process_start_genes: cram/circularize/identify/cram_start_genes.list
 - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta
 - get_polished_contigs: blam/circularize/filter/blam_circular.fasta
 - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: blam/circularize/identify/blam_start_gene.ffn
 - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log
 - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt
 - get_polished_contigs: cram/circularize/filter/cram_circular.fasta
 - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta
 - symlink_circularization: blam/circularize/blam_circularize.fasta
 - process_start_genes: blam/circularize/identify/blam_start_genes.list
 - get_start_genes: cram/circularize/identify/cram_start_gene.ffn
 - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta
 - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: cram/circularize/cram_circularize.fasta
 - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac
 - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt
 - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log
 - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta
 - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta
 - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta
Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log

@jmtsuji I get the above error every time I run with --until circularize.

@LeeBergstrand
Copy link
Collaborator

@jmtsuji I had to run --until circularize twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with --until circularize again:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - process_start_genes: cram/circularize/identify/cram_start_genes.list
 - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta
 - get_polished_contigs: blam/circularize/filter/blam_circular.fasta
 - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: blam/circularize/identify/blam_start_gene.ffn
 - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log
 - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt
 - get_polished_contigs: cram/circularize/filter/cram_circular.fasta
 - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta
 - symlink_circularization: blam/circularize/blam_circularize.fasta
 - process_start_genes: blam/circularize/identify/blam_start_genes.list
 - get_start_genes: cram/circularize/identify/cram_start_gene.ffn
 - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta
 - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: cram/circularize/cram_circularize.fasta
 - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac
 - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt
 - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log
 - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta
 - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta
 - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta
Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log

@jmtsuji I get the above error every time I run with --until circularize.

Does that make more sense? I think it is fundamentally a snakemake bug, but I think we could work around it temporarily by making the QC reads non-temp by default. However, I suspect that the bug will not appear if you have the pipeline run fully to the end. Let me know your thoughts here -- thanks!

@jmtsuji Changing it to keep the qc reads fixes the above error. For me, the pipeline, when run end to end, still passes even with the keep qced reads set to false in the config.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 19, 2024

@LeeBergstrand My tests are also finished :-) Here are my results, using commit 864ca98 (pypolca branch):

Main rotary run with test dataset

If keep_final_qc_read_files is False (current default):

  • Whole pipeline: finishes successfully
  • --until circularize: FAILS with the bug above

If keep_final_qc_read_files as True (proposed temporary fix):

  • Whole pipeline: finishes successfully
  • --until circularize: finishes successfully

These results seem to match your tests results -- good.

Pipeline re-run

Interestingly, I also checked what would happen if I tried to re-run the pipeline after it finished for all 4 tests:

If keep_final_qc_read_files is False (current default):

  • Whole pipeline: tries to re-analyze the data, starting from long/short read QC
  • --until circularize: tries to re-analyze the data, starting from long/short read QC

If keep_final_qc_read_files is True (proposed temporary fix):

  • Whole pipeline: acknowledges the pipeline is finished and does not re-analyze the data (just re-generates the final checkpoints) <- this is the expected behaviour
  • --until circularize: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing file S18/qc/long/S18_nanopore_qc.fastq.gz (an early temp file from long read QC)

Summary

Setting keep_final_qc_read_files fixes two issues:

  • Allows the pipeline to run successfully to rule circularize
  • Allows full pipeline runs to acknowledge they are completely finished and do not need to re-run

However, the full pipeline still finishes, like you mentioned, even if keep_final_qc_read_files is False

Moving forward

I could go either way with changing keep_final_qc_read_files to True as default. Because the full pipeline still works with keep_final_qc_read_files as False, and discarding the QC read files saves a lot of space, I could see how keeping this feature as default could be nice. We could add a note to the README that keep_final_qc_read_files must be True to run --until circularize (i.e., to skip genome annotation). On the other hand, we could change our default to keep_final_qc_read_files as True to avoid people running into errors.

The re-run tests I did suggest there is a secondary bug with our temp file setup that is preventing the pipeline from properly acknowledging long read QC files. I'll open a second issue for this.

@LeeBergstrand What are your thoughts about how we ought to set keep_final_qc_read_files by default?

@LeeBergstrand
Copy link
Collaborator

If keep_final_qc_read_files is False (current default):

  • Whole pipeline: tries to re-analyze the data, starting from long/short read QC
  • --until circularize: tries to re-analyze the data, starting from long/short read QC

If keep_final_qc_read_files is True (proposed temporary fix):

  • Whole pipeline: acknowledges the pipeline is finished and does not re-analyze the data (just re-generates the final checkpoints) <- this is the expected behaviour
  • --until circularize: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing file S18/qc/long/S18_nanopore_qc.fastq.gz (an early temp file from long read QC)

I got the same effect.

@LeeBergstrand
Copy link
Collaborator

Moving forward

I could go either way with changing keep_final_qc_read_files to True as default. Because the full pipeline still works with keep_final_qc_read_files as False, and discarding the QC read files saves a lot of space, I could see how keeping this feature as default could be nice. We could add a note to the README that keep_final_qc_read_files must be True to run --until circularize (i.e., to skip genome annotation). On the other hand, we could change our default to keep_final_qc_read_files as True to avoid people running into errors.

The re-run tests I did suggest there is a secondary bug with our temp file setup that is preventing the pipeline from properly acknowledging long read QC files. I'll open a second issue for this.

@LeeBergstrand What are your thoughts about how we ought to set keep_final_qc_read_files by default?

@jmtsuji I recommend we do the following:

  • Create two separate issues for this.
  • Set If keep_final_qc_read_files to True by default.
  • Move to Snakemake 8 and reassess this issue.

@LeeBergstrand
Copy link
Collaborator

@jmtsuji I noticed that you use a Symlinking paradigm throughout the pipeline. For example:

# Conditional based on whether short read polishing was performed
rule pre_coverage_filter:
    input:
        "{sample}/polish/medaka/{sample}_consensus.fasta" if POLISH_WITH_SHORT_READS == False else "{sample}/polish/polca/{sample}_polca.fasta"
    output:
        temp("{sample}/polish/cov_filter/{sample}_pre_filtered.fasta")
    run:
        source_relpath = os.path.relpath(str(input),os.path.dirname(str(output)))
        os.symlink(source_relpath,str(output))

This was quite troublesome when I made the temp() file path as Snakemake would start deleting files that the symlinks would point to. I have found that Snakemake doesn't like symlinking because it relies on the presence and absence of files.

Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake is having trouble tracking what files are used by what rules. I'm not sure. I would bring up this symlinking issue later, but I wanted to put it here now to ensure you know about it.

@LeeBergstrand
Copy link
Collaborator

If keep_final_qc_read_files is False (current default):

  • Whole pipeline: tries to re-analyze the data, starting from long/short read QC
  • --until circularize: tries to re-analyze the data, starting from long/short read QC

If keep_final_qc_read_files is True (proposed temporary fix):

  • Whole pipeline: acknowledges the pipeline is finished and does not re-analyze the data (just re-generates the final checkpoints) <- this is the expected behaviour
  • --until circularize: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing file S18/qc/long/S18_nanopore_qc.fastq.gz (an early temp file from long read QC)

I got the same effect.

@jmtsuji It might also rerun the entire pipeline if the config file is modified. My tests aren't really supporting that but I would also look into it. I'm getting the following for rule: nanopore_qc_filter

Reason: Forced execution

Something is causing the forced execution of the very first rule.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 22, 2024

@jmtsuji I recommend we do the following:

  • Create two separate issues for this.
  • Set If keep_final_qc_read_files to True by default.
  • Move to Snakemake 8 and reassess this issue.

@LeeBergstrand Thanks for the thoughts -- I agree. Second issue for checkpointing is already created at #134 , and keep_final_qc_read_files is now set as True by default in PR #138 (thanks for this). Let's keep this issue open and re-assess this bug after moving to Snakemake 8.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 22, 2024

This was quite troublesome when I made the temp() file path as Snakemake would start deleting files that the symlinks would point to. I have found that Snakemake doesn't like symlinking because it relies on the presence and absence of files.

Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake is having trouble tracking what files are used by what rules. I'm not sure. I would bring up this symlinking issue later, but I wanted to put it here now to ensure you know about it.

I've also run into symlinking issues while editing the pipeline and agree, it would be better to streamline these parts, if possible. Reducing the number of steps in the analysis this way might help with consistent DAG construction. (Just for reference for the future: for the polish steps, I think it's necessary to either copy or symlink the files before running the polish rules in order for the rules to be compatible with inputs from multiple modules. This is not the case for other rules, I think, so symlinking steps can probably be streamlined in many other cases.) I'll move this to a separate issue.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Feb 22, 2024

@jmtsuji It might also rerun the entire pipeline if the config file is modified. My tests aren't really supporting that but I would also look into it. I'm getting the following for rule: nanopore_qc_filter

Reason: Forced execution

Something is causing the forced execution of the very first rule.

Also interesting -- I haven't seen "Forced execution" as a reason to re-run nanopore_qc_filter in my tests, but I was running with a fresh install and config of rotary. I think snakemake also sometimes forces re-runs if it notices a snakefile has been updated compared to the last run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants