fix: Bwa mem threads (#1743)

### Description Currently, [master/bio/bwa-mem2/mem](https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bwa-mem2/mem.html) uses more threads than the number requested by user: for `x` threads requested by user, `x` is used in bwa-mem. Then either `x` are used by samtools (view or sort), or `1` by Picard. So for `x` threads requested by user, from `x+1` to `2x` threads are used. This leads to bwa-mem jobs using official wrappers being killed by cluster manager for not respecting fair use rules where I work. Coworkers asked for a fix. In this PR, I introduce a function to split the total number of threads between bwa-mem in one hand, and samtools/picard in the other hand. Ultimately, the introduced changes also allow an uncompressed SAM output. Tests have been modified, since this wrapper now requires at least two threads when samtools/picard are used. No more when sam output is requested. ### QC  * [X] I confirm that: For all wrappers added by this PR, * there is a test case which covers any introduced changes, * `input:` and `output:` file paths in the resulting rule can be changed arbitrarily, * either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default, * rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads), * all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071), * wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`), * all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.), * `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool, * temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works), * the `meta.yaml` contains a link to the documentation of the respective tool or command, * `Snakefile`s pass the linting (`snakemake --lint`), * `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt), * Python wrapper scripts are formatted with [black](https://black.readthedocs.io). * Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays). --------- Co-authored-by: tdayris <tdayris@gustaveroussy.fr> Co-authored-by: tdayris <thibault.dayris@gustaveroussy.fr> Co-authored-by: Johannes Köster <johannes.koester@uni-due.de> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com> Co-authored-by: Felix Mölder <felix.moelder@uni-due.de> Co-authored-by: Christopher Schröder <christopher.schroeder@tu-dortmund.de>
snakemake · Aug 25, 2023 · e35e312 · e35e312
1 parent 2f76671
commit e35e312
Show file tree

Hide file tree

Showing 4 changed files with 69 additions and 16 deletions.
diff --git a/bio/bwa-mem2/mem/meta.yaml b/bio/bwa-mem2/mem/meta.yaml
@@ -5,9 +5,10 @@ authors:
   - Christopher Schröder
   - Johannes Köster
   - Julian de Ruiter
+  - Thibault Dayris
 input:
-  - FASTQ file(s)
-  - reference genome
+  - reads: List of path(s) to FASTQ file(s)
+  - idx: List of paths to indexed reference genome files
 output:
   - SAM/BAM/CRAM file
 notes: |

diff --git a/bio/bwa-mem2/mem/test/Snakefile b/bio/bwa-mem2/mem/test/Snakefile
@@ -9,9 +9,28 @@ rule bwa_mem2_mem:
         "logs/bwa_mem2/{sample}.log",
     params:
         extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
-        sort="none",  # Can be 'none', 'samtools' or 'picard'.
+        sort="none",  # Can be 'none', 'samtools', or 'picard'.
         sort_order="coordinate",  # Can be 'coordinate' (default) or 'queryname'.
-        sort_extra="",  # Extra args for samtools/picard.
+        sort_extra="",  # Extra args for samtools/picard sorts.
+    threads: 8
+    wrapper:
+        "master/bio/bwa-mem2/mem"
+
+
+rule bwa_mem2_mem_sam:
+    input:
+        reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"],
+        # Index can be a list of (all) files created by bwa, or one of them
+        idx=multiext("genome.fasta", ".amb", ".ann", ".bwt.2bit.64", ".pac"),
+    output:
+        "mapped/{sample}.sam",
+    log:
+        "logs/bwa_mem2/{sample}.log",
+    params:
+        extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
+        sort="none",  # Can be 'none', 'samtools', or 'picard'.
+        sort_order="coordinate",  # Can be 'coordinate' (default) or 'queryname'.
+        sort_extra="",  # Extra args for samtools/picard sorts.
     threads: 8
     wrapper:
         "master/bio/bwa-mem2/mem"
diff --git a/bio/bwa-mem2/mem/wrapper.py b/bio/bwa-mem2/mem/wrapper.py
@@ -19,9 +19,13 @@
 sort = snakemake.params.get("sort", "none")
 sort_order = snakemake.params.get("sort_order", "coordinate")
 sort_extra = snakemake.params.get("sort_extra", "")
-samtools_opts = get_samtools_opts(snakemake, param_name="sort_extra")
+samtools_opts = get_samtools_opts(
+    snakemake, parse_threads=False, param_name="sort_extra"
+)
 java_opts = get_java_opts(snakemake)
 
+bwa_threads = snakemake.threads
+samtools_threads = snakemake.threads - 1
 
 index = snakemake.input.get("index", "")
 if isinstance(index, str):
@@ -42,31 +46,56 @@
 
 # Determine which pipe command to use for converting to bam or sorting.
 if sort == "none":
-    # Simply convert to bam using samtools view.
-    pipe_cmd = "samtools view {samtools_opts}"
+    # Correctly assign number of threads according to user request
+    if samtools_threads >= 1:
+        samtools_opts += f" --threads {samtools_threads} "
+
+    if str(snakemake.output[0]).lower().endswith(("bam", "cram")):
+        # Simply convert to bam using samtools view.
+        pipe_cmd = " | samtools view {samtools_opts} > {snakemake.output[0]}"
+    else:
+        # Do not perform any sort nor compression, output raw sam
+        pipe_cmd = " > {snakemake.output[0]} "
+
 
 elif sort == "samtools":
-    # Sort alignments using samtools sort.
-    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"
+    # Correctly assign number of threads according to user request
+    if samtools_threads >= 1:
+        samtools_opts += f" --threads {samtools_threads} "
 
     # Add name flag if needed.
     if sort_order == "queryname":
         sort_extra += " -n"
 
+    # Sort alignments using samtools sort.
+    pipe_cmd = " | samtools sort {samtools_opts} {sort_extra} -T {tmpdir} > {snakemake.output[0]}"
+
 elif sort == "picard":
+    # Correctly assign number of threads according to user request
+    bwa_threads = bwa_threads - 1
+    if bwa_threads <= 0:
+        raise ValueError(
+            "Not enough threads requested. This wrapper requires exactly one more."
+        )
+
     # Sort alignments using picard SortSam.
-    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"
+    pipe_cmd = (
+        " | picard SortSam {java_opts} {sort_extra} "
+        "--INPUT /dev/stdin "
+        "--TMP_DIR {tmpdir} "
+        "--SORT_ORDER {sort_order} "
+        "--OUTPUT {snakemake.output[0]}"
+    )
 
 else:
     raise ValueError(f"Unexpected value for params.sort ({sort})")
 
-
 with tempfile.TemporaryDirectory() as tmpdir:
     shell(
         "(bwa-mem2 mem"
-        " -t {snakemake.threads}"
+        " -t {bwa_threads}"
         " {extra}"
         " {index}"
         " {snakemake.input.reads}"
-        " | " + pipe_cmd + ") {log}"
+        " " + pipe_cmd + " ) {log}"
     )
diff --git a/test.py b/test.py
@@ -2160,7 +2160,11 @@ def test_bwa_samse_sort_picard():
 def test_bwa_mem2_mem():
     run(
         "bio/bwa-mem2/mem",
-        ["snakemake", "--cores", "1", "mapped/a.bam", "--use-conda", "-F"],
+        ["snakemake", "--cores", "2", "mapped/a.bam", "--use-conda", "-F"],
+    )
+    run(
+        "bio/bwa-mem2/mem",
+        ["snakemake", "--cores", "2", "mapped/a.sam", "--use-conda", "-F"],
     )
 
 
@@ -2171,7 +2175,7 @@ def test_bwa_mem2_sort_samtools():
         [
             "snakemake",
             "--cores",
-            "1",
+            "2",
             "mapped/a.bam",
             "--use-conda",
             "-F",
@@ -2203,7 +2207,7 @@ def test_bwa_mem2_sort_picard():
         [
             "snakemake",
             "--cores",
-            "1",
+            "2",
             "mapped/a.bam",
             "--use-conda",
             "-F",