feat: for bwa, auto infer block size, extra tests, code cleanup and add docs (#1774)

### Description

- Automated infer of block size
- extra tests
- improve docs
- code clean up

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).
snakemake · Oct 30, 2023 · 66940e3
1 parent 1cc3e00
Showing 6 changed files with 83 additions and 33 deletions.
diff --git a/bio/bwa/aln/meta.yaml b/bio/bwa/aln/meta.yaml
@@ -1,4 +1,15 @@
 name: "bwa aln"
 description: Map reads with bwa aln. For more information about BWA see `BWA documentation <http://bio-bwa.sourceforge.net/bwa.shtml>`_.
+url: https://github.com/lh3/bwa
 authors:
   - Julian de Ruiter
+  - Filipe G. Vieira
+input:
+  - fastq: FASTQ file(s)
+  - idx: BWA reference genome index
+output:
+  - SAI file
+notes: |
+  * The `extra` param allows for additional arguments for bwa aln.
+  * The `sorting` param allows to enable sorting, and can be either 'none', 'samtools' or 'picard'.
+  * The `sort_extra` allows for extra arguments for samtools/picard
diff --git a/bio/bwa/index/meta.yaml b/bio/bwa/index/meta.yaml
@@ -1,4 +1,14 @@
 name: "bwa index"
-description: Creates a BWA index. For more information about BWA see `BWA documentation <http://bio-bwa.sourceforge.net/bwa.shtml>`_.
+description: Creates a BWA index.
+url: https://github.com/lh3/bwa
 authors:
   - Patrik Smeds
+  - Filipe G. Vieira
+input:
+  - fasta file
+output:
+  - BWA index files
+params:
+  - extra: additional program arguments
+notes: |
+  * Wrapper automatically calculates `block_size`.
diff --git a/bio/bwa/index/test/Snakefile b/bio/bwa/index/test/Snakefile
@@ -2,10 +2,10 @@ rule bwa_index:
     input:
         "{genome}.fasta",
     output:
-        idx=multiext("{genome}", ".amb", ".ann", ".bwt", ".pac", ".sa"),
+        idx=multiext("{genome}.{alg}", ".amb", ".ann", ".bwt", ".pac", ".sa"),
     log:
-        "logs/bwa_index/{genome}.log",
+        "logs/bwa_index/{genome}.{alg}.log",
     params:
-        algorithm="bwtsw",
+        extra=lambda w: f"-a {w.alg}",
     wrapper:
         "master/bio/bwa/index"
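
The test rule above now embeds the algorithm in the index file names (`{genome}.{alg}`), and the wrapper recovers the `-p` prefix by stripping the last extension from the first output file. A minimal sketch of that naming round trip, using only the standard library; `index_files` and `index_prefix` are hypothetical helpers for illustration, and `index_files` is only a stand-in for what Snakemake's `multiext` would expand to:

```python
from os.path import splitext

# The five files written by `bwa index`, as listed in the test Snakefile.
BWA_INDEX_EXTS = (".amb", ".ann", ".bwt", ".pac", ".sa")


def index_files(genome, alg):
    """Hypothetical stand-in for multiext(f"{genome}.{alg}", *BWA_INDEX_EXTS)."""
    return [f"{genome}.{alg}{ext}" for ext in BWA_INDEX_EXTS]


def index_prefix(first_output):
    """Strip only the final extension, as the wrapper does with splitext()."""
    return splitext(first_output)[0]


files = index_files("genome", "bwtsw")
print(files)
print(index_prefix(files[0]))
```

Because `splitext` removes only the final extension, the algorithm suffix stays part of the prefix, so indexes built with different algorithms do not overwrite each other.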
diff --git a/bio/bwa/index/wrapper.py b/bio/bwa/index/wrapper.py
@@ -4,29 +4,26 @@
 __license__ = "MIT"
 
 from os.path import splitext
-
+from pathlib import Path
 from snakemake.shell import shell
 
 log = snakemake.log_fmt_shell(stdout=False, stderr=True)
-
-# Check inputs/arguments.
-if len(snakemake.input) == 0:
-    raise ValueError("A reference genome has to be provided!")
-elif len(snakemake.input) > 1:
-    raise ValueError("Only one reference genome can be inputed!")
+extra = snakemake.params.get("extra", "")
 
 # Prefix that should be used for the database
-prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0])
-
-if len(prefix) > 0:
-    prefix = "-p " + prefix
-
-# Contrunction algorithm that will be used to build the database, default is bwtsw
-construction_algorithm = snakemake.params.get("algorithm", "")
-
-if len(construction_algorithm) != 0:
-    construction_algorithm = "-a " + construction_algorithm
-
-shell(
-    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
-)
+prefix = snakemake.params.get("prefix", splitext(snakemake.output[0])[0])
+
+# Block size should be a 10th of the reference length (https://github.com/lh3/bwa/issues/104)
+block_size = Path(snakemake.input[0]).stat().st_size / 1024 / 1024 / 10
+# If GZip, assume a 4-fold compression rate:
+# - https://scfbm.biomedcentral.com/articles/10.1186/s13029-019-0073-5/tables/3
+# - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7336184/
+# - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866555/bin/supp_btt594_supplement-rev2.pdf
+# - https://softpanorama.org/HPC/DNA_sequencing/Genomic_data_compression/index.shtml
+if snakemake.input[0].endswith(".gz"):
+    block_size *= 4
+
+# Ensure minimum (10 Mb as BWA default) and maximum (50Gb since no apparent gain and to limit memory usage) block size
+block_size = min(50 * 1024, max(10, int(block_size)))
+
+shell("bwa index -b {block_size}M -p {prefix} {extra} {snakemake.input} {log}")
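
The block-size heuristic in this hunk can be checked in isolation. The sketch below repeats the same arithmetic as a pure function; the name `infer_block_size_mb` and its parameters are hypothetical and not part of the wrapper, which reads the file size from disk and checks for a `.gz` suffix itself:

```python
def infer_block_size_mb(size_bytes, gzipped=False):
    """Return a block size in megabytes following the wrapper's heuristic.

    Hypothetical helper: size_bytes is the input file size on disk and
    gzipped states whether the file name ends in ".gz".
    """
    # One tenth of the file size, expressed in MiB.
    block = size_bytes / 1024 / 1024 / 10
    # Compressed input: assume the 4-fold compression ratio used in the wrapper.
    if gzipped:
        block *= 4
    # Clamp between 10 (lower bound) and 50 * 1024 (upper bound).
    return min(50 * 1024, max(10, int(block)))


# A 3 GiB uncompressed FASTA, an 800 MiB gzipped FASTA, and a tiny test genome.
print(infer_block_size_mb(3 * 1024**3))
print(infer_block_size_mb(800 * 1024**2, gzipped=True))
print(infer_block_size_mb(1000))
```

The wrapper passes the resulting integer to `bwa index` as `-b {block_size}M`, so a very small test genome always falls back to the lower bound.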
diff --git a/bio/bwa/samxe/meta.yaml b/bio/bwa/samxe/meta.yaml
@@ -3,9 +3,9 @@ description: Map paired-end reads with either bwa samse or sampe.
 authors:
   - Filipe G. Vieira
 input:
-  - FASTQ file(s)
-  - SAI file(s)
-  - reference genome
+  - fastq: FASTQ file(s)
+  - sai: SAI file(s)
+  - idx: BWA reference genome index
 output:
   - SAM/BAM alignment file
 notes: |

diff --git a/test.py b/test.py
@@ -2114,11 +2114,43 @@ def test_bwa_index():
             "snakemake",
             "--cores",
             "1",
-            "genome.amb",
-            "genome.ann",
-            "genome.bwt",
-            "genome.pac",
-            "genome.sa",
+            "genome.bwtsw.amb",
+            "genome.bwtsw.ann",
+            "genome.bwtsw.bwt",
+            "genome.bwtsw.pac",
+            "genome.bwtsw.sa",
+            "--use-conda",
+            "-F",
+        ],
+    )
+
+    run(
+        "bio/bwa/index",
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "genome.is.amb",
+            "genome.is.ann",
+            "genome.is.bwt",
+            "genome.is.pac",
+            "genome.is.sa",
+            "--use-conda",
+            "-F",
+        ],
+    )
+
+    run(
+        "bio/bwa/index",
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "genome.rb2.amb",
+            "genome.rb2.ann",
+            "genome.rb2.bwt",
+            "genome.rb2.pac",
+            "genome.rb2.sa",
             "--use-conda",
             "-F",
         ],
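
The three `run(...)` calls in this hunk differ only in the algorithm name embedded in the target files. As a sketch of how the argument lists relate, the helper below builds the same command for any algorithm; `index_test_command` is a hypothetical function for illustration and is not part of `test.py`:

```python
# The five files written by `bwa index`, as requested in the test targets.
BWA_INDEX_EXTS = (".amb", ".ann", ".bwt", ".pac", ".sa")


def index_test_command(alg, genome="genome"):
    """Build the snakemake argument list used for one algorithm (hypothetical helper)."""
    targets = [f"{genome}.{alg}{ext}" for ext in BWA_INDEX_EXTS]
    return ["snakemake", "--cores", "1", *targets, "--use-conda", "-F"]


# The algorithms exercised by the tests in this commit.
for alg in ("bwtsw", "is", "rb2"):
    print(" ".join(index_test_command(alg)))
```

Keeping the calls written out, as the commit does, matches the style of the surrounding tests; the helper only makes the pattern explicit.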