Skip to content

Commit

Permalink
feat: Mutect2 additional parameters (#516)
Browse files Browse the repository at this point in the history
<!-- Ensure that the PR title follows conventional commit style (<type>:
<description>)-->
<!-- Possible types are here:
https://github.com/commitizen/conventional-commit-types/blob/master/index.json
-->

### Description

This PR handles bed intervals, germline vcf, panel of normals input
files and f1r2 output file.

I also introduced how to enable multi threading in some special cases,
since I have witnessed issues on several computing clusters last year.

### QC
<!-- Make sure that you can tick the boxes below. -->

For all wrappers added by this PR, I made sure that

* [x] there is a test case which covers any introduced changes,
* [x] `input:` and `output:` file paths in the resulting rule can be
changed arbitrarily,
* [x] either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* [x] rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tools purpose or name (e.g.,
`map_reads` for a step that maps reads),
* [x] all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* [x] wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* [x] all fields of the example rules in the `Snakefile`s and their
entries are explained via comments (`input:`/`output:`/`params:` etc.),
* [x] `stderr` and/or `stdout` are logged correctly (`log:`), depending
on the wrapped tool,
* [x] temporary files are either written to a unique hidden folder in
the working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* [x] the `meta.yaml` contains a link to the documentation of the
respective tool or command,
* [x] `Snakefile`s pass the linting (`snakemake --lint`),
* [x] `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* [x] Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).

---------

Co-authored-by: tdayris <tdayris@gustaveroussy.fr>
Co-authored-by: tdayris <thibault.dayris@gustaveroussy.fr>
Co-authored-by: Filipe G. Vieira <fgarrettvieira@gmail.com>
Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
  • Loading branch information
5 people committed Mar 22, 2023
1 parent f7914c4 commit f2b70ca
Show file tree
Hide file tree
Showing 5 changed files with 82 additions and 11 deletions.
20 changes: 14 additions & 6 deletions bio/gatk/mutect/meta.yaml
@@ -1,14 +1,22 @@
name: GATK Mutect2
url: https://gatk.broadinstitute.org/hc/en-us/articles/9570422171291-Mutect2
description: Call somatic SNVs and indels via local assembly of haplotypes
url: https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2
authors:
- Thibault Dayris
- Filipe G. Vieira
input:
- Mapped reads (SAM/BAM/CRAM)
- Reference Fasta file
- map: Mapped reads (SAM/BAM/CRAM)
- fasta: Reference Fasta file
- intervals: Optional path to a BED interval file
- pon: Optional path to Panel of Normals (flagged as BETA)
- germline: Optional path to known germline variants
output:
- Variant file
notes: |
* The `java_opts` param allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).
* The `extra` param allows for additional program arguments.
- vcf: Path to variant file
- bam: Optional path to output bam file
- f1r2: Optional path to f1r2 count file
params:
- extra: Optional parameters for GATK Mutect2
- use_parallelgc: Automatically add "-XX:ParallelGCThreads={snakemake.threads}" to your command line. Set to `True` if your architecture supports ParallelGCThreads.
- use_omp: Automatically set `OMP_NUM_THREADS` environment variable. Set to `True` if your java architecture uses OMP threads.
- java_opts: allows for additional arguments to be passed to the java compiler (not for `-XmX` or `-Djava.io.tmpdir`, `-XX:ParallelGCThreads`, since they are handled automatically).
20 changes: 20 additions & 0 deletions bio/gatk/mutect/test/Snakefile
Expand Up @@ -31,3 +31,23 @@ rule mutect2_bam:
"logs/mutect_{sample}.log",
wrapper:
"master/bio/gatk/mutect"


rule mutect2_complete:
input:
fasta="genome/genome.fasta",
map="mapped/{sample}.bam",
intervals="genome/intervals.bed",
output:
vcf="variant_complete/{sample}.vcf",
bam="variant_complete/{sample}.bam",
f1r2="counts/{sample}.f1r2.tar.gz",
message:
"Testing Mutect2 with {wildcards.sample}"
threads: 1
resources:
mem_mb=1024,
log:
"logs/mutect_{sample}.log",
wrapper:
"master/bio/gatk/mutect"
2 changes: 2 additions & 0 deletions bio/gatk/mutect/test/genome/intervals.bed
@@ -0,0 +1,2 @@
ref 3 44
ref2 8 31
47 changes: 42 additions & 5 deletions bio/gatk/mutect/wrapper.py
Expand Up @@ -5,28 +5,65 @@
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell
from snakemake.utils import makedirs
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

bam_output = "--bam-output"
if snakemake.output.get("bam", None) is not None:
bam_output = bam_output + " " + snakemake.output.bam
else:
bam_output = ""
# On non-omp systems, and in case OMP_NUM_THREADS
# was not set, define OMP_NUM_THREADS through python
if "OMP_NUM_THREADS" not in os.environ.keys() and snakemake.params.get(
"use_omp", False
):
os.environ["OMP_NUM_THREADS"] = snakemake.threads


bam_output = snakemake.output.get("bam", "")
if bam_output:
bam_output = f"--bam-output {bam_output }"


germline_resource = snakemake.input.get("germline", "")
if germline_resource:
germline_resource = f"--germline-resource {germline_resource}"

intervals = snakemake.input.get("intervals", "")
if intervals:
intervals = f"--intervals {intervals}"

f1r2 = snakemake.output.get("f1r2", "")
if f1r2:
f1r2 = f"--f1r2-tar-gz {f1r2}"

pon = snakemake.input.get("pon", "")
if pon:
pon = f"--panel-of-normals {pon}"

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

# In case Java execution environment suits GC parallel
# calls, these must be given as optional java parameters
if snakemake.params.get("use_parallelgc", False):
if "UseParallelGC" not in java_opts:
java_opts += " -XX:+UseParallelGC "
if "ParallelGCThreads" not in java_opts:
java_opts += f" -XX:ParallelGCThreads={snakemake.threads}"


with tempfile.TemporaryDirectory() as tmpdir:
shell(
"gatk --java-options '{java_opts}' Mutect2" # Tool and its subprocess
" --native-pair-hmm-threads {snakemake.threads}"
" --input {snakemake.input.map}" # Path to input mapping file
" --reference {snakemake.input.fasta}" # Path to reference fasta file
" {f1r2}" # Optional path to output f1r2 count file
" {germline_resource}" # Optional path to optional germline resource VCF
" {intervals}" # Optional path to optional bed intervals
" {pon} " # Optional path to panel of normals
" {extra}" # Extra parameters
" --tmp-dir {tmpdir}"
" --output {snakemake.output.vcf}" # Path to output vcf file
Expand Down
4 changes: 4 additions & 0 deletions test.py
Expand Up @@ -4060,6 +4060,10 @@ def test_gatk_mutect():
"bio/gatk/mutect",
["snakemake", "--cores", "1", "variant/a.vcf", "--use-conda", "-F"],
)
run(
"bio/gatk/mutect",
["snakemake", "--cores", "1", "variant_complete/a.vcf", "--use-conda", "-F"],
)


@skip_if_not_modified
Expand Down

0 comments on commit f2b70ca

Please sign in to comment.