Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: Add GATK VariantsToTable (#989)
<!-- Ensure that the PR title follows conventional commit style (<type>: <description>)--> <!-- Possible types are here: https://github.com/commitizen/conventional-commit-types/blob/master/index.json --> ### Description <!-- Add a description of your PR here--> Add GATK VariantsToTable wrapper ### QC <!-- Make sure that you can tick the boxes below. --> * [x] I confirm that: For all wrappers added by this PR, * there is a test case which covers any introduced changes, * `input:` and `output:` file paths in the resulting rule can be changed arbitrarily, * either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default, * rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads), * all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071), * wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`), * all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.), * `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool, * temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works), * the `meta.yaml` contains a link to the documentation of the respective tool or command, * `Snakefile`s pass the linting (`snakemake --lint`), * `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt), * Python wrapper scripts are formatted with [black](https://black.readthedocs.io). * Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).
- Loading branch information
1 parent
ff9573f
commit e3654ce
Showing
6 changed files
with
171 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
channels: | ||
- conda-forge | ||
- bioconda | ||
- nodefaults | ||
dependencies: | ||
- gatk4 =4.3.0.0 | ||
- snakemake-wrapper-utils =0.5.0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
name: gatk VariantsToTable | ||
url: https://gatk.broadinstitute.org/hc/en-us/articles/360036896892-VariantsToTable | ||
description: | | ||
Run gatk VariantsToTable | ||
authors: | ||
- Dmitry Bespiatykh | ||
input: | ||
- A VCF file to convert to a table | ||
output: | ||
- A tab-delimited file containing the values of the requested fields in the VCF file | ||
notes: | | ||
* The `java_opts` param allows for additional arguments to be passed to the java compiler, e.g. `-XX:ParallelGCThreads=10` (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically). | ||
* The `extra` param allows for additional program arguments. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
rule gatk_variantstotable: | ||
input: | ||
vcf="calls/snvs.vcf", | ||
# intervals="intervals.bed", | ||
output: | ||
tab="calls/snvs.tab", | ||
log: | ||
"logs/gatk/varintstotable.log", | ||
params: | ||
extra="-F CHROM -F POS -F TYPE -GF AD", | ||
java_opts="", # optional | ||
resources: | ||
mem_mb=1024, | ||
wrapper: | ||
"master/bio/gatk/variantstotable" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
##fileformat=VCFv4.2 | ||
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> | ||
##FILTER=<ID=LowQual,Description="Low quality"> | ||
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> | ||
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> | ||
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> | ||
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> | ||
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> | ||
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"> | ||
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"> | ||
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> | ||
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> | ||
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --emit-ref-confidence GVCF --output calls/a.g.vcf --input mapped/a.bam --reference genome.fasta --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --indel-size-to-eliminate-in-ref-model 10 --use-alleles-trigger false --disable-optimizations false --just-determine-active-regions false --dont-genotype false --max-mnp-distance 0 --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --recover-dangling-heads false --do-not-recover-dangling-branches false --min-dangling-branch-length 4 --consensus false --max-num-haplotypes-in-population 128 --error-correct-kmers false --min-pruning 2 --debug-graph-transformations false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --debug false --use-filtered-reads-for-annotations false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --capture-assembly-failure-bam false --error-correct-reads false --do-not-run-physical-phasing false --min-base-quality-score 10 --smith-waterman JAVA --use-new-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 10.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotyping-mode DISCOVERY --genotype-filtered-alleles false --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version=4.0.5.1,Date="July 3, 2018 11:28:04 AM CEST"> | ||
##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive) | ||
##GVCFBlock1-2=minGQ=1(inclusive),maxGQ=2(exclusive) | ||
##GVCFBlock10-11=minGQ=10(inclusive),maxGQ=11(exclusive) | ||
##GVCFBlock11-12=minGQ=11(inclusive),maxGQ=12(exclusive) | ||
##GVCFBlock12-13=minGQ=12(inclusive),maxGQ=13(exclusive) | ||
##GVCFBlock13-14=minGQ=13(inclusive),maxGQ=14(exclusive) | ||
##GVCFBlock14-15=minGQ=14(inclusive),maxGQ=15(exclusive) | ||
##GVCFBlock15-16=minGQ=15(inclusive),maxGQ=16(exclusive) | ||
##GVCFBlock16-17=minGQ=16(inclusive),maxGQ=17(exclusive) | ||
##GVCFBlock17-18=minGQ=17(inclusive),maxGQ=18(exclusive) | ||
##GVCFBlock18-19=minGQ=18(inclusive),maxGQ=19(exclusive) | ||
##GVCFBlock19-20=minGQ=19(inclusive),maxGQ=20(exclusive) | ||
##GVCFBlock2-3=minGQ=2(inclusive),maxGQ=3(exclusive) | ||
##GVCFBlock20-21=minGQ=20(inclusive),maxGQ=21(exclusive) | ||
##GVCFBlock21-22=minGQ=21(inclusive),maxGQ=22(exclusive) | ||
##GVCFBlock22-23=minGQ=22(inclusive),maxGQ=23(exclusive) | ||
##GVCFBlock23-24=minGQ=23(inclusive),maxGQ=24(exclusive) | ||
##GVCFBlock24-25=minGQ=24(inclusive),maxGQ=25(exclusive) | ||
##GVCFBlock25-26=minGQ=25(inclusive),maxGQ=26(exclusive) | ||
##GVCFBlock26-27=minGQ=26(inclusive),maxGQ=27(exclusive) | ||
##GVCFBlock27-28=minGQ=27(inclusive),maxGQ=28(exclusive) | ||
##GVCFBlock28-29=minGQ=28(inclusive),maxGQ=29(exclusive) | ||
##GVCFBlock29-30=minGQ=29(inclusive),maxGQ=30(exclusive) | ||
##GVCFBlock3-4=minGQ=3(inclusive),maxGQ=4(exclusive) | ||
##GVCFBlock30-31=minGQ=30(inclusive),maxGQ=31(exclusive) | ||
##GVCFBlock31-32=minGQ=31(inclusive),maxGQ=32(exclusive) | ||
##GVCFBlock32-33=minGQ=32(inclusive),maxGQ=33(exclusive) | ||
##GVCFBlock33-34=minGQ=33(inclusive),maxGQ=34(exclusive) | ||
##GVCFBlock34-35=minGQ=34(inclusive),maxGQ=35(exclusive) | ||
##GVCFBlock35-36=minGQ=35(inclusive),maxGQ=36(exclusive) | ||
##GVCFBlock36-37=minGQ=36(inclusive),maxGQ=37(exclusive) | ||
##GVCFBlock37-38=minGQ=37(inclusive),maxGQ=38(exclusive) | ||
##GVCFBlock38-39=minGQ=38(inclusive),maxGQ=39(exclusive) | ||
##GVCFBlock39-40=minGQ=39(inclusive),maxGQ=40(exclusive) | ||
##GVCFBlock4-5=minGQ=4(inclusive),maxGQ=5(exclusive) | ||
##GVCFBlock40-41=minGQ=40(inclusive),maxGQ=41(exclusive) | ||
##GVCFBlock41-42=minGQ=41(inclusive),maxGQ=42(exclusive) | ||
##GVCFBlock42-43=minGQ=42(inclusive),maxGQ=43(exclusive) | ||
##GVCFBlock43-44=minGQ=43(inclusive),maxGQ=44(exclusive) | ||
##GVCFBlock44-45=minGQ=44(inclusive),maxGQ=45(exclusive) | ||
##GVCFBlock45-46=minGQ=45(inclusive),maxGQ=46(exclusive) | ||
##GVCFBlock46-47=minGQ=46(inclusive),maxGQ=47(exclusive) | ||
##GVCFBlock47-48=minGQ=47(inclusive),maxGQ=48(exclusive) | ||
##GVCFBlock48-49=minGQ=48(inclusive),maxGQ=49(exclusive) | ||
##GVCFBlock49-50=minGQ=49(inclusive),maxGQ=50(exclusive) | ||
##GVCFBlock5-6=minGQ=5(inclusive),maxGQ=6(exclusive) | ||
##GVCFBlock50-51=minGQ=50(inclusive),maxGQ=51(exclusive) | ||
##GVCFBlock51-52=minGQ=51(inclusive),maxGQ=52(exclusive) | ||
##GVCFBlock52-53=minGQ=52(inclusive),maxGQ=53(exclusive) | ||
##GVCFBlock53-54=minGQ=53(inclusive),maxGQ=54(exclusive) | ||
##GVCFBlock54-55=minGQ=54(inclusive),maxGQ=55(exclusive) | ||
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive) | ||
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive) | ||
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive) | ||
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive) | ||
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive) | ||
##GVCFBlock6-7=minGQ=6(inclusive),maxGQ=7(exclusive) | ||
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive) | ||
##GVCFBlock7-8=minGQ=7(inclusive),maxGQ=8(exclusive) | ||
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive) | ||
##GVCFBlock8-9=minGQ=8(inclusive),maxGQ=9(exclusive) | ||
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive) | ||
##GVCFBlock9-10=minGQ=9(inclusive),maxGQ=10(exclusive) | ||
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive) | ||
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive) | ||
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> | ||
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> | ||
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> | ||
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> | ||
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> | ||
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity"> | ||
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> | ||
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> | ||
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> | ||
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> | ||
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> | ||
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality"> | ||
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> | ||
##contig=<ID=ref,length=45> | ||
##contig=<ID=ref2,length=40> | ||
##source=HaplotypeCaller | ||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mysample | ||
ref 1 . A <NON_REF> . . END=45 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 | ||
ref2 1 . A <NON_REF> . . END=40 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
"""Snakemake wrapper for GATK VariantsToTable""" | ||
|
||
__author__ = "Dmitry Bespiatykh" | ||
__copyright__ = "Copyright 2023, Dmitry Bespiatykh" | ||
__license__ = "MIT" | ||
|
||
|
||
import tempfile | ||
from snakemake.shell import shell | ||
from snakemake_wrapper_utils.java import get_java_opts | ||
|
||
extra = snakemake.params.get("extra", "") | ||
java_opts = get_java_opts(snakemake) | ||
log = snakemake.log_fmt_shell(stdout=True, stderr=True) | ||
|
||
|
||
intervals = snakemake.input.get("intervals", "") | ||
if not intervals: | ||
intervals = snakemake.params.get("intervals", "") | ||
if intervals: | ||
intervals = "--intervals {}".format(intervals) | ||
|
||
|
||
with tempfile.TemporaryDirectory() as tmpdir: | ||
shell( | ||
"gatk --java-options '{java_opts}' VariantsToTable" | ||
" --variant {snakemake.input.vcf}" | ||
" {intervals}" | ||
" {extra}" | ||
" --tmp-dir {tmpdir}" | ||
" --output {snakemake.output.tab}" | ||
" {log}" | ||
) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters