feat: automate inference of index name (#1169)

### Description

Added automated inference of the index name, and the option to call
`MarkDuplicatesWithMateCigar` instead of the default `MarkDuplicates`,
since the two tools are quite similar.
These changes render the wrapper
`bio/picard/markduplicateswithmatecigar` redundant.
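
To make the new behaviour concrete, here is a condensed, illustrative sketch of the inference logic; it mirrors the changes to `bio/picard/markduplicates/wrapper.py` further down in this diff (`infer_out_format` and the two params are real names from the wrapper, the helper functions themselves are only illustrative):

```python
# Condensed, illustrative sketch of the wrapper's new decision logic.
from snakemake_wrapper_utils.samtools import infer_out_format


def pick_tool(withmatecigar: bool = False) -> str:
    # Both subcommands share their command line, so one wrapper can drive either.
    return "MarkDuplicatesWithMateCigar" if withmatecigar else "MarkDuplicates"


def index_handling(output_path: str, idx_requested: bool) -> str:
    fmt = infer_out_format(output_path)  # inferred from the file extension
    if fmt == "CRAM":
        # CRAM output is piped through samtools view; its exact options
        # (output format, reference, possibly index writing) come from
        # get_samtools_opts in the wrapper below.
        return "samtools view"
    if fmt == "BAM" and idx_requested:
        # picard --CREATE_INDEX writes <prefix>.bai next to the BAM; the
        # wrapper renames it if a different idx: path was requested.
        return "picard --CREATE_INDEX"
    return "no index"
```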

### QC

* [x] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tool's purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works; see the sketch after this list),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal number of channels, in the recommended
ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults
(conda-forge should have the highest priority, and the defaults channel is
usually not needed because most packages are in conda-forge nowadays).
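
A minimal sketch of the temporary-file and logging pattern the checklist describes, following the conventions of the wrapper changed in this PR (`sometool` and its flags are placeholders, and the `snakemake` object is injected into wrapper scripts by Snakemake, so this is a skeleton rather than a standalone script):

```python
# Minimal wrapper-script skeleton; `sometool` is a hypothetical command.
import tempfile
from snakemake.shell import shell

# Redirect the tool's stdout/stderr to the rule's log file.
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# tempfile.TemporaryDirectory() lives under tempfile.gettempdir(), not the
# working directory, and is removed automatically when the block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(sometool"
        " --tmp-dir {tmpdir}"
        " --input {snakemake.input}"
        " --output {snakemake.output}) {log}"
    )
```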

---------

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>
fgvieira and dlaehnemann committed Mar 29, 2023
1 parent 00b9b1c commit 0d2c92a
Showing 14 changed files with 89 additions and 110 deletions.
4 changes: 2 additions & 2 deletions bio/picard/markduplicates/environment.yaml
@@ -3,6 +3,6 @@ channels:
- bioconda
- nodefaults
dependencies:
- picard =2.27.4
- picard =3.0.0
- samtools =1.16.1
- snakemake-wrapper-utils =0.5.2
- snakemake-wrapper-utils =0.5.3
12 changes: 8 additions & 4 deletions bio/picard/markduplicates/meta.yaml
@@ -1,15 +1,19 @@
name: picard MarkDuplicates
description: |
Mark PCR and optical duplicates with picard tools. For more information about MarkDuplicates see `picard documentation <https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates>`_.
Mark PCR and optical duplicates with picard tools.
url: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
authors:
- Johannes Köster
- Christopher Schröder
- Filipe G. Vieira
input:
- bam/cram file(s)
output:
- bam/cram file with marked or removed duplicates
params:
- java_opts: allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).
- extra: allows for additional program arguments.
- embed_ref: allows to embed the fasta reference into the cram
- withmatecigar: allows to run `MarkDuplicatesWithMateCigar <https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar>`_ instead.
notes: |
* The `java_opts` param allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).
* The `extra` param allows for additional program arguments.
* `--TMP_DIR` is automatically set by `resources.tmpdir`
* For more information, see https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
51 changes: 32 additions & 19 deletions bio/picard/markduplicates/test/Snakefile
@@ -1,14 +1,14 @@
rule mark_duplicates:
rule markduplicates_bam:
input:
bams="mapped/{sample}.bam",
# optional to specify a list of BAMs; this has the same effect
# of marking duplicates on separate read groups for a sample
# and then merging
output:
bam="dedup/{sample}.bam",
metrics="dedup/{sample}.metrics.txt",
bam="dedup_bam/{sample}.bam",
metrics="dedup_bam/{sample}.metrics.txt",
log:
"logs/picard/dedup/{sample}.log",
"logs/dedup_bam/{sample}.log",
params:
extra="--REMOVE_DUPLICATES true",
# optional specification of memory usage of the JVM that snakemake will respect with global
@@ -21,26 +21,39 @@ rule mark_duplicates:
"master/bio/picard/markduplicates"


rule mark_duplicates_cram:
use rule markduplicates_bam as markduplicateswithmatecigar_bam with:
output:
bam="dedup_bam/{sample}.matecigar.bam",
idx="dedup_bam/{sample}.mcigar.bai",
metrics="dedup_bam/{sample}.matecigar.metrics.txt",
log:
"logs/dedup_bam/{sample}.matecigar.log",
params:
withmatecigar=True,
extra="--REMOVE_DUPLICATES true",


use rule markduplicates_bam as markduplicates_sam with:
output:
bam="dedup_sam/{sample}.sam",
metrics="dedup_sam/{sample}.metrics.txt",
log:
"logs/dedup_sam/{sample}.log",
params:
extra="--REMOVE_DUPLICATES true",


use rule markduplicates_bam as markduplicates_cram with:
input:
bams="mapped/{sample}.bam",
ref="ref/genome.fasta",
# optional to specify a list of BAMs; this has the same effect
# of marking duplicates on separate read groups for a sample
# and then merging
output:
bam="dedup/{sample}.cram",
metrics="dedup/{sample}.metrics.txt",
bam="dedup_cram/{sample}.cram",
idx="dedup_cram/{sample}.cram.crai",
metrics="dedup_cram/{sample}.metrics.txt",
log:
"logs/picard/dedup/{sample}.log",
"logs/dedup_cram/{sample}.log",
params:
extra="--REMOVE_DUPLICATES true",
embed_ref=True, # set true if the fasta reference should be embedded into the cram
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024,
wrapper:
"master/bio/picard/markduplicates"
withmatecigar=False,
44 changes: 32 additions & 12 deletions bio/picard/markduplicates/wrapper.py
@@ -5,40 +5,60 @@


import tempfile
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts, infer_out_format

log = snakemake.log_fmt_shell()

log = snakemake.log_fmt_shell()
extra = snakemake.params.get("extra", "")
# the --SORTING_COLLECTION_SIZE_RATIO default of 0.25 might
# indicate a JVM memory overhead of that proportion
java_opts = get_java_opts(snakemake, java_mem_overhead_factor=0.3)
samtools_opts = get_samtools_opts(snakemake)


tool = "MarkDuplicates"
if snakemake.params.get("withmatecigar", False):
tool = "MarkDuplicatesWithMateCigar"


bams = snakemake.input.bams
if isinstance(bams, str):
bams = [bams]
bams = list(map("--INPUT {}".format, bams))

if snakemake.output.bam.endswith(".cram"):

output = snakemake.output.bam
output_fmt = infer_out_format(output)
convert = ""
if output_fmt == "CRAM":
output = "/dev/stdout"
if snakemake.params.embed_ref:
view_options = "-O cram,embed_ref"
else:
view_options = "-O cram"
convert = f" | samtools view -@ {snakemake.threads} {view_options} --reference {snakemake.input.ref} -o {snakemake.output.bam}"
else:
output = snakemake.output.bam
convert = ""

# NOTE: output format inference should be done by snakemake-wrapper-utils. Keeping it here for backwards compatibility.
if snakemake.params.get("embed_ref", False):
samtools_opts += ",embed_ref"

convert = f" | samtools view {samtools_opts}"
elif output_fmt == "BAM" and snakemake.output.get("idx"):
extra += " --CREATE_INDEX"


with tempfile.TemporaryDirectory() as tmpdir:
shell(
"(picard MarkDuplicates" # Tool and its subcommand
"(picard {tool}" # Tool and its subcommand
" {java_opts}" # Automatic java option
" {extra}" # User defined parmeters
" {bams}" # Input bam(s)
" --TMP_DIR {tmpdir}"
" --OUTPUT {output}" # Output bam
" --METRICS_FILE {snakemake.output.metrics}" # Output metrics
" {convert} ) {log}" # Logging
" {convert}) {log}" # Logging
)


output_prefix = Path(snakemake.output.bam).with_suffix("")
if snakemake.output.get("idx"):
if output_fmt == "BAM" and snakemake.output.idx != str(output_prefix) + ".bai":
shell("mv {output_prefix}.bai {snakemake.output.idx}")
7 changes: 0 additions & 7 deletions bio/picard/markduplicateswithmatecigar/environment.yaml

This file was deleted.

15 changes: 0 additions & 15 deletions bio/picard/markduplicateswithmatecigar/meta.yaml

This file was deleted.

18 changes: 0 additions & 18 deletions bio/picard/markduplicateswithmatecigar/test/Snakefile

This file was deleted.

26 changes: 0 additions & 26 deletions bio/picard/markduplicateswithmatecigar/wrapper.py

This file was deleted.

22 changes: 15 additions & 7 deletions test.py
@@ -2832,26 +2832,34 @@ def test_picard_addorreplacegroups():


@skip_if_not_modified
def test_picard_markduplicates():
def test_picard_markduplicates_bam():
run(
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup/a.bam", "--use-conda", "-F"],
["snakemake", "--cores", "1", "dedup_bam/a.bam", "--use-conda", "-F"],
)


@skip_if_not_modified
def test_picard_markduplicates_cram():
def test_picard_markduplicateswithmatecigar_bam():
run(
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup/a.cram", "--use-conda", "-F"],
["snakemake", "--cores", "1", "dedup_bam/a.matecigar.bam", "dedup_bam/a.mcigar.bai","--use-conda", "-F"],
)


@skip_if_not_modified
def test_picard_markduplicateswithmatecigar():
def test_picard_markduplicates_sam():
run(
"bio/picard/markduplicateswithmatecigar",
["snakemake", "--cores", "1", "dedup/a.bam", "--use-conda", "-F"],
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup_sam/a.sam", "--use-conda", "-F"],
)


@skip_if_not_modified
def test_picard_markduplicates_cram():
run(
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup_cram/a.cram", "dedup_cram/a.cram.crai", "--use-conda", "-F"],
)


