feat: automate inference of index name (#1169)

### Description

Added automated inference of the index name, and the option to call
`MarkDuplicatesWithMateCigar` instead of the default `MarkDuplicates`,
since the two tools are quite similar.
These changes render the wrapper
`bio/picard/markduplicateswithmatecigar` redundant.
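
To make the new behaviour concrete, here is a condensed, illustrative sketch of the inference logic; it mirrors the changes to `bio/picard/markduplicates/wrapper.py` further down in this diff (`infer_out_format` and the two params are real names from the wrapper, the helper functions themselves are only illustrative):

```python
# Condensed, illustrative sketch of the wrapper's new decision logic.
from snakemake_wrapper_utils.samtools import infer_out_format


def pick_tool(withmatecigar: bool = False) -> str:
    # Both subcommands share their command line, so one wrapper can drive either.
    return "MarkDuplicatesWithMateCigar" if withmatecigar else "MarkDuplicates"


def index_handling(output_path: str, idx_requested: bool) -> str:
    fmt = infer_out_format(output_path)  # inferred from the file extension
    if fmt == "CRAM":
        # CRAM output is piped through samtools view; its exact options
        # (output format, reference, possibly index writing) come from
        # get_samtools_opts in the wrapper below.
        return "samtools view"
    if fmt == "BAM" and idx_requested:
        # picard --CREATE_INDEX writes <prefix>.bai next to the BAM; the
        # wrapper renames it if a different idx: path was requested.
        return "picard --CREATE_INDEX"
    return "no index"
```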

### QC

* [x] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tool's purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works; see the sketch after this list),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal number of channels, in the recommended
ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults
(conda-forge should have the highest priority, and the defaults channel is
usually not needed because most packages are in conda-forge nowadays).
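
A minimal sketch of the temporary-file and logging pattern the checklist describes, following the conventions of the wrapper changed in this PR (`sometool` and its flags are placeholders, and the `snakemake` object is injected into wrapper scripts by Snakemake, so this is a skeleton rather than a standalone script):

```python
# Minimal wrapper-script skeleton; `sometool` is a hypothetical command.
import tempfile
from snakemake.shell import shell

# Redirect the tool's stdout/stderr to the rule's log file.
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# tempfile.TemporaryDirectory() lives under tempfile.gettempdir(), not the
# working directory, and is removed automatically when the block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(sometool"
        " --tmp-dir {tmpdir}"
        " --input {snakemake.input}"
        " --output {snakemake.output}) {log}"
    )
```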

---------

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>
fgvieira and dlaehnemann committed Mar 29, 2023
1 parent 00b9b1c commit 0d2c92a
Showing 14 changed files with 89 additions and 110 deletions.
4 changes: 2 additions & 2 deletions bio/picard/markduplicates/environment.yaml
@@ -3,6 +3,6 @@ channels:
- bioconda
- nodefaults
dependencies:
- picard =2.27.4
- picard =3.0.0
- samtools =1.16.1
- snakemake-wrapper-utils =0.5.2
- snakemake-wrapper-utils =0.5.3
12 changes: 8 additions & 4 deletions bio/picard/markduplicates/meta.yaml
@@ -1,15 +1,19 @@
name: picard MarkDuplicates
description: |
Mark PCR and optical duplicates with picard tools. For more information about MarkDuplicates see `picard documentation <https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates>`_.
Mark PCR and optical duplicates with picard tools.
url: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
authors:
- Johannes Köster
- Christopher Schröder
- Filipe G. Vieira
input:
- bam/cram file(s)
output:
- bam/cram file with marked or removed duplicates
params:
- java_opts: allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).
- extra: allows for additional program arguments.
- embed_ref: allows to embed the fasta reference into the cram
- withmatecigar: allows to run `MarkDuplicatesWithMateCigar <https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar>`_ instead.
notes: |
* The `java_opts` param allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).
* The `extra` param allows for additional program arguments.
* `--TMP_DIR` is automatically set by `resources.tmpdir`
* For more information, see https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
51 changes: 32 additions & 19 deletions bio/picard/markduplicates/test/Snakefile
@@ -1,14 +1,14 @@
rule mark_duplicates:
rule markduplicates_bam:
input:
bams="mapped/{sample}.bam",
# optional to specify a list of BAMs; this has the same effect
# of marking duplicates on separate read groups for a sample
# and then merging
output:
bam="dedup/{sample}.bam",
metrics="dedup/{sample}.metrics.txt",
bam="dedup_bam/{sample}.bam",
metrics="dedup_bam/{sample}.metrics.txt",
log:
"logs/picard/dedup/{sample}.log",
"logs/dedup_bam/{sample}.log",
params:
extra="--REMOVE_DUPLICATES true",
# optional specification of memory usage of the JVM that snakemake will respect with global
@@ -21,26 +21,39 @@ rule mark_duplicates:
"master/bio/picard/markduplicates"


rule mark_duplicates_cram:
use rule markduplicates_bam as markduplicateswithmatecigar_bam with:
output:
bam="dedup_bam/{sample}.matecigar.bam",
idx="dedup_bam/{sample}.mcigar.bai",
metrics="dedup_bam/{sample}.matecigar.metrics.txt",
log:
"logs/dedup_bam/{sample}.matecigar.log",
params:
withmatecigar=True,
extra="--REMOVE_DUPLICATES true",


use rule markduplicates_bam as markduplicates_sam with:
output:
bam="dedup_sam/{sample}.sam",
metrics="dedup_sam/{sample}.metrics.txt",
log:
"logs/dedup_sam/{sample}.log",
params:
extra="--REMOVE_DUPLICATES true",


use rule markduplicates_bam as markduplicates_cram with:
input:
bams="mapped/{sample}.bam",
ref="ref/genome.fasta",
# optional to specify a list of BAMs; this has the same effect
# of marking duplicates on separate read groups for a sample
# and then merging
output:
bam="dedup/{sample}.cram",
metrics="dedup/{sample}.metrics.txt",
bam="dedup_cram/{sample}.cram",
idx="dedup_cram/{sample}.cram.crai",
metrics="dedup_cram/{sample}.metrics.txt",
log:
"logs/picard/dedup/{sample}.log",
"logs/dedup_cram/{sample}.log",
params:
extra="--REMOVE_DUPLICATES true",
embed_ref=True, # set true if the fasta reference should be embedded into the cram
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024,
wrapper:
"master/bio/picard/markduplicates"
withmatecigar=False,
44 changes: 32 additions & 12 deletions bio/picard/markduplicates/wrapper.py
@@ -5,40 +5,60 @@


import tempfile
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts, infer_out_format

log = snakemake.log_fmt_shell()

log = snakemake.log_fmt_shell()
extra = snakemake.params.get("extra", "")
# the --SORTING_COLLECTION_SIZE_RATIO default of 0.25 might
# indicate a JVM memory overhead of that proportion
java_opts = get_java_opts(snakemake, java_mem_overhead_factor=0.3)
samtools_opts = get_samtools_opts(snakemake)


tool = "MarkDuplicates"
if snakemake.params.get("withmatecigar", False):
tool = "MarkDuplicatesWithMateCigar"


bams = snakemake.input.bams
if isinstance(bams, str):
bams = [bams]
bams = list(map("--INPUT {}".format, bams))

if snakemake.output.bam.endswith(".cram"):

output = snakemake.output.bam
output_fmt = infer_out_format(output)
convert = ""
if output_fmt == "CRAM":
output = "/dev/stdout"
if snakemake.params.embed_ref:
view_options = "-O cram,embed_ref"
else:
view_options = "-O cram"
convert = f" | samtools view -@ {snakemake.threads} {view_options} --reference {snakemake.input.ref} -o {snakemake.output.bam}"
else:
output = snakemake.output.bam
convert = ""

# NOTE: output format inference should be done by snakemake-wrapper-utils. Keeping it here for backwards compatibility.
if snakemake.params.get("embed_ref", False):
samtools_opts += ",embed_ref"

convert = f" | samtools view {samtools_opts}"
elif output_fmt == "BAM" and snakemake.output.get("idx"):
extra += " --CREATE_INDEX"


with tempfile.TemporaryDirectory() as tmpdir:
shell(
"(picard MarkDuplicates" # Tool and its subcommand
"(picard {tool}" # Tool and its subcommand
" {java_opts}" # Automatic java option
" {extra}" # User defined parmeters
" {bams}" # Input bam(s)
" --TMP_DIR {tmpdir}"
" --OUTPUT {output}" # Output bam
" --METRICS_FILE {snakemake.output.metrics}" # Output metrics
" {convert} ) {log}" # Logging
" {convert}) {log}" # Logging
)


output_prefix = Path(snakemake.output.bam).with_suffix("")
if snakemake.output.get("idx"):
if output_fmt == "BAM" and snakemake.output.idx != str(output_prefix) + ".bai":
shell("mv {output_prefix}.bai {snakemake.output.idx}")
7 changes: 0 additions & 7 deletions bio/picard/markduplicateswithmatecigar/environment.yaml

This file was deleted.

15 changes: 0 additions & 15 deletions bio/picard/markduplicateswithmatecigar/meta.yaml

This file was deleted.

18 changes: 0 additions & 18 deletions bio/picard/markduplicateswithmatecigar/test/Snakefile

This file was deleted.

26 changes: 0 additions & 26 deletions bio/picard/markduplicateswithmatecigar/wrapper.py

This file was deleted.

22 changes: 15 additions & 7 deletions test.py
@@ -2832,26 +2832,34 @@ def test_picard_addorreplacegroups():


@skip_if_not_modified
def test_picard_markduplicates():
def test_picard_markduplicates_bam():
run(
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup/a.bam", "--use-conda", "-F"],
["snakemake", "--cores", "1", "dedup_bam/a.bam", "--use-conda", "-F"],
)


@skip_if_not_modified
def test_picard_markduplicates_cram():
def test_picard_markduplicateswithmatecigar_bam():
run(
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup/a.cram", "--use-conda", "-F"],
["snakemake", "--cores", "1", "dedup_bam/a.matecigar.bam", "dedup_bam/a.mcigar.bai","--use-conda", "-F"],
)


@skip_if_not_modified
def test_picard_markduplicateswithmatecigar():
def test_picard_markduplicates_sam():
run(
"bio/picard/markduplicateswithmatecigar",
["snakemake", "--cores", "1", "dedup/a.bam", "--use-conda", "-F"],
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup_sam/a.sam", "--use-conda", "-F"],
)


@skip_if_not_modified
def test_picard_markduplicates_cram():
run(
"bio/picard/markduplicates",
["snakemake", "--cores", "1", "dedup_cram/a.cram", "dedup_cram/a.cram.crai", "--use-conda", "-F"],
)


