Skip to content

Commit

Permalink
feat: Gatk short-variants-calling meta-wrapper (#1778)
Browse files Browse the repository at this point in the history
<!-- Ensure that the PR title follows conventional commit style (<type>:
<description>)-->
<!-- Possible types are here:
https://github.com/commitizen/conventional-commit-types/blob/master/index.json
-->

### Description

I have been personally contacted by someone telling me they could not
make Mutect2 work with Snakemake. I've been sending them this as an
example.

This PR adds GATK Mutect2 short variant calling meta-wrapper, following
best practices described in their official web-site. This does not
include neither hard-filtering, nor vcf annotation.

### QC
<!-- Make sure that you can tick the boxes below. -->

* [X] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tools purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended
ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as
conda-forge should have highest priority and defaults channels are
usually not needed because most packages are in conda-forge nowadays).

---------

Co-authored-by: tdayris <tdayris@gustaveroussy.fr>
Co-authored-by: tdayris <thibault.dayris@gustaveroussy.fr>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: Felix Mölder <felix.moelder@uni-due.de>
Co-authored-by: Christopher Schröder <christopher.schroeder@tu-dortmund.de>
  • Loading branch information
8 people committed Nov 8, 2023
1 parent a492255 commit 4099e4b
Show file tree
Hide file tree
Showing 10 changed files with 216 additions and 0 deletions.
29 changes: 29 additions & 0 deletions meta/bio/gatk_mutect2_calling/meta.yaml
@@ -0,0 +1,29 @@
name: GATK Variant calling best practice workflow
url: https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows
description: >
Call short variants (SNP+INDEL) with GATK's Mutect2:
+----------------+---------------------------+-------------------------------------------------------------+
| Step | Tool | Reason |
+================+===========================+=============================================================+
+----------------+---------------------------+-------------------------------------------------------------+
| Indexing | Picard | Create genome sequence dictionnary |
+----------------+---------------------------+-------------------------------------------------------------+
| Indexing | Samtools | Index fasta genome sequence |
+----------------+---------------------------+-------------------------------------------------------------+
| Grouping | Picard | Add or replace possible missing read groups |
+----------------+---------------------------+-------------------------------------------------------------+
| Indexing | Sambamba | Index re-grouped BAM-formatted alignments |
+----------------+---------------------------+-------------------------------------------------------------+
| Calling | Mutect2 | Call short variants with Mutect2 |
+----------------+---------------------------+-------------------------------------------------------------+
| Contaminations | GetPileupSummaries | Tabulates pileup metrics for inferring contamination |
+----------------+---------------------------+-------------------------------------------------------------+
| Contaminations | CalculateContamination | Estimate cross sample contamination |
+----------------+---------------------------+-------------------------------------------------------------+
| Orientation | LearnReadOrientationModel | Search for sequencing artifacts based on read orientation |
+----------------+---------------------------+-------------------------------------------------------------+
| Filtering | FilterMutectCalls | Use previously estimated biases and filter variants |
+----------------+---------------------------+-------------------------------------------------------------+
authors:
- Thibault Dayris
158 changes: 158 additions & 0 deletions meta/bio/gatk_mutect2_calling/test/Snakefile
@@ -0,0 +1,158 @@
rule create_dict:
input:
"genome.fasta",
output:
"genome.dict",
threads: 1
resources:
mem_mb=1024,
log:
"logs/picard/create_dict.log",
params:
extra="",
wrapper:
"master/bio/picard/createsequencedictionary"


rule samtools_index:
input:
"genome.fasta",
output:
"genome.fasta.fai",
log:
"logs/genome_index.log",
params:
extra="", # optional params string
wrapper:
"master/bio/samtools/faidx"


rule picard_replace_read_groups:
input:
"mapped/{sample}.bam",
output:
"picard/{sample}.bam",
threads: 1
resources:
mem_mb=1024,
log:
"logs/picard/replace_rg/{sample}.log",
params:
# Required for GATK
extra="--RGLB lib1 --RGPL illumina --RGPU {sample} --RGSM {sample}",
wrapper:
"master/bio/picard/addorreplacereadgroups"


rule sambamba_index_picard_bam:
input:
"picard/{sample}.bam",
output:
"picard/{sample}.bam.bai",
threads: 1
log:
"logs/sambamba/index/{sample}.log",
params:
extra="",
wrapper:
"master/bio/sambamba/index"


rule mutect2_call:
input:
fasta="genome.fasta",
fasta_dict="genome.dict",
fasta_fai="genome.fasta.fai",
map="picard/{sample}.bam",
map_idx="picard/{sample}.bam.bai",
intervals="regions.bed",
output:
vcf="variant/{sample}.vcf",
bam="variant/{sample}.bam",
f1r2="counts/{sample}.f1r2.tar.gz",
threads: 1
resources:
mem_mb=1024,
params:
extra=" --tumor-sample {sample} ",
log:
"logs/mutect/{sample}.log",
wrapper:
"master/bio/gatk/mutect"


rule gatk_get_pileup_summaries:
input:
bam="picard/{sample}.bam",
bai_bai="picard/{sample}.bam.bai",
variants="known.vcf.gz",
variants_tbi="known.vcf.gz.tbi",
intervals="regions.bed",
output:
temp("summaries/{sample}.table"),
threads: 1
resources:
mem_mb=1024,
params:
extra="",
log:
"logs/summary/{sample}.log",
wrapper:
"master/bio/gatk/getpileupsummaries"


rule gatk_calculate_contamination:
input:
tumor="summaries/{sample}.table",
output:
temp("contamination/{sample}.pileups.table"),
threads: 1
resources:
mem_mb=1024,
log:
"logs/contamination/{sample}.log",
params:
extra="",
wrapper:
"master/bio/gatk/calculatecontamination"


rule gatk_learn_read_orientation_model:
input:
f1r2="counts/{sample}.f1r2.tar.gz",
output:
temp("artifacts_prior/{sample}.tar.gz"),
threads: 1
resources:
mem_mb=1024,
params:
extra="",
log:
"logs/learnreadorientationbias/{sample}.log",
wrapper:
"master/bio/gatk/learnreadorientationmodel"


rule filter_mutect_calls:
input:
vcf="variant/{sample}.vcf",
ref="genome.fasta",
ref_dict="genome.dict",
ref_fai="genome.fasta.fai",
bam="picard/{sample}.bam",
bam_bai="picard/{sample}.bam.bai",
contamination="contamination/{sample}.pileups.table",
f1r2="artifacts_prior/{sample}.tar.gz",
output:
vcf="variant/{sample}.filtered.vcf.gz",
vcf_idx="variant/{sample}.filtered.vcf.gz.tbi",
threads: 1
resources:
mem_mb=1024,
log:
"logs/gatk/filter/{sample}.log",
params:
extra="--create-output-variant-index --min-median-mapping-quality 35 --max-alt-allele-count 3",
java_opts="",
wrapper:
"master/bio/gatk/filtermutectcalls"
3 changes: 3 additions & 0 deletions meta/bio/gatk_mutect2_calling/test/genome.fasta
@@ -0,0 +1,3 @@
>FakeChrom
ATCGATGCTGCTATAGCTATAGCTCTAGATCTGATCTGCTAGTATCGATCCGATTATTAGCGCGATTATAGCGTAGTCA
ATCGATCTCGATATCGCGCTACGATCCGCGCGCGCGCGCGCATTATATCGATCGACGATGCTGCTAGCTAGCTGCTAGC
Binary file added meta/bio/gatk_mutect2_calling/test/known.vcf.gz
Binary file not shown.
Binary file added meta/bio/gatk_mutect2_calling/test/known.vcf.gz.tbi
Binary file not shown.
Binary file not shown.
Binary file not shown.
1 change: 1 addition & 0 deletions meta/bio/gatk_mutect2_calling/test/regions.bed
@@ -0,0 +1 @@
FakeChrom 3 157 region1
10 changes: 10 additions & 0 deletions meta/bio/gatk_mutect2_calling/used_wrappers.yaml
@@ -0,0 +1,10 @@
wrappers:
- bio/samtools/faidx
- bio/picard/createsequencedictionary
- bio/sambamba/index
- bio/picard/addorreplacereadgroups
- bio/gatk/mutect
- bio/gatk/getpileupsummaries
- bio/gatk/calculatecontamination
- bio/gatk/learnreadorientationmodel
- bio/gatk/filtermutectcalls
15 changes: 15 additions & 0 deletions test.py
Expand Up @@ -6156,6 +6156,21 @@ def test_collapse_reads_to_fragments_bam():
)


@skip_if_not_modified
def test_gatk_mutect2_calling_meta():
run(
"meta/bio/gatk_mutect2_calling",
[
"snakemake",
"--cores",
"2",
"--use-conda",
"-F",
"variant/Sample1.filtered.vcf.gz.tbi",
]
)


@skip_if_not_modified
def test_calc_consensus_reads():
run(
Expand Down

0 comments on commit 4099e4b

Please sign in to comment.