feat: update arriba and star_arriba meta-wrapper (#963)

### Description

This updates arriba to version 2.3 and also updates the meta-wrapper, as there were some changes to the command-line parameters. In addition, the meta-wrapper was using a hard-coded path even though the referenced file is defined in the input, so the path is now taken from the input instead.

I would also like to discuss whether the test Snakefile should be more generic. The arriba rule currently defines a blacklist file as a parameter although it is not part of the input. The contigs are also limited to chromosomes 1 and 2. I would expect a meta-wrapper to work out of the box with default parameters when imported as a module, so it seems reasonable to me to remove these parameters (as long as this does not make the CI run forever).

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tool's purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io),
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).

snakemake · Dec 12, 2022 · f75d997
1 parent a8333a3
commit f75d997

Showing 5 changed files with 73 additions and 22 deletions.
diff --git a/bio/arriba/environment.yaml b/bio/arriba/environment.yaml
@@ -3,4 +3,4 @@ channels:
   - bioconda
   - nodefaults
 dependencies:
-  - arriba ==1.1.0
+  - arriba =2.3.0
diff --git a/bio/arriba/meta.yaml b/bio/arriba/meta.yaml
@@ -3,6 +3,7 @@ url: https://github.com/suhrig/arriba
 description: Detect gene fusions from chimeric STAR output
 authors:
   - Jan Forster
+  - Felix Mölder
 input:
   - bam: Path to bam formatted alignment file from STAR
   - genome: Path to fasta formatted genome sequence

diff --git a/bio/arriba/test/Snakefile b/bio/arriba/test/Snakefile
@@ -6,22 +6,26 @@ rule arriba:
         genome="genome.fasta",
         # path to annotation gtf
         annotation="annotation.gtf",
+        # optional arriba blacklist file
+        custom_blacklist=[],
     output:
         # approved gene fusions
         fusions="fusions/{sample}.tsv",
         # discarded gene fusions
-        discarded="fusions/{sample}.discarded.tsv" # optional
+        discarded="fusions/{sample}.discarded.tsv",  # optional
     log:
-        "logs/arriba/{sample}.log"
+        "logs/arriba/{sample}.log",
     params:
-        # arriba blacklist file
-        blacklist="blacklist.tsv", # strongly recommended, see https://arriba.readthedocs.io/en/latest/input-files/#blacklist
-        # file containing known fusions
-        known_fusions="", # optional
+        # required when blacklist or known_fusions is set 
+        genome_build="GRCh38",
+        # strongly recommended, see https://arriba.readthedocs.io/en/latest/input-files/#blacklist
+        # only set blacklist input-file or blacklist-param
+        default_blacklist=False,  # optional
+        default_known_fusions=True,  # optional
         # file containing information from structural variant analysis
-        sv_file="", # optional
+        sv_file="",  # optional
         # optional parameters
-        extra="-T -P -i 1,2"
+        extra="-i 1,2",
     threads: 1
     wrapper:
         "master/bio/arriba"
diff --git a/bio/arriba/wrapper.py b/bio/arriba/wrapper.py
@@ -5,6 +5,7 @@
 
 
 import os
+import json
 from snakemake.shell import shell
 
 extra = snakemake.params.get("extra", "")
@@ -16,21 +17,59 @@
 else:
     discarded_cmd = ""
 
-blacklist = snakemake.params.get("blacklist")
-if blacklist:
-    blacklist_cmd = "-b " + blacklist
+database_dir = os.path.join(os.environ["CONDA_PREFIX"], "var/lib/arriba")
+build = snakemake.params.get("genome_build", None)
+
+blacklist_input = snakemake.input.get("custom_blacklist")
+default_blacklist = snakemake.params.get("default_blacklist", False)
+
+default_known_fusions = snakemake.params.get("default_known_fusions", False)
+
+if default_blacklist or default_known_fusions:
+    if not build:
+        raise ValueError(
+            "Please provide a genome build when using blacklist- or known_fusion-filtering"
+        )
+    arriba_vers = [
+        entry["version"]
+        for entry in json.load(os.popen("conda list --json"))
+        if entry["name"] == "arriba"
+    ][0]
+
+
+if blacklist_input and not default_blacklist:
+    blacklist_cmd = "-b " + blacklist_input
+elif not blacklist_input and default_blacklist:
+    blacklist_dict = {
+        "GRCh37": f"blacklist_hg19_hs37d5_GRCh37_v{arriba_vers}.tsv.gz",
+        "GRCh38": f"blacklist_hg38_GRCh38_v{arriba_vers}.tsv.gz",
+        "GRCm38": f"blacklist_mm10_GRCm38_v{arriba_vers}.tsv.gz",
+        "GRCm39": f"blacklist_mm39_GRCm39_v{arriba_vers}.tsv.gz",
+    }
+    blacklist_path = os.path.join(database_dir, blacklist_dict[build])
+    blacklist_cmd = "-b " + blacklist_path
+elif not blacklist_input and not default_blacklist:
+    blacklist_cmd = "-f blacklist"
 else:
-    blacklist_cmd = ""
+    raise ValueError(
+        "custom_blacklist input file and default_blacklist parameter option defined. Please set only one of both."
+    )
 
-known_fusions = snakemake.params.get("known_fusions")
-if known_fusions:
-    known_cmd = "-k" + known_fusions
+if default_known_fusions:
+    fusions_dict = {
+        "GRCh37": f"known_fusions_hg19_hs37d5_GRCh37_v{arriba_vers}.tsv.gz",
+        "GRCh38": f"known_fusions_hg38_GRCh38_v{arriba_vers}.tsv.gz",
+        "GRCm38": f"known_fusions_mm10_GRCm38_v{arriba_vers}.tsv.gz",
+        "GRCm39": f"known_fusions_mm39_GRCm39_v{arriba_vers}.tsv.gz",
+    }
+    known_fusions_path = os.path.join(database_dir, fusions_dict[build])
+    known_cmd = "-k " + known_fusions_path
 else:
     known_cmd = ""
 
 sv_file = snakemake.params.get("sv_file")
 if sv_file:
-    sv_cmd = "-d" + sv_file
+    sv_cmd = "-d " + sv_file
 else:
     sv_cmd = ""
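
The blacklist handling in the hunk above distinguishes three valid cases (custom file, bundled default, no blacklist) and rejects the fourth. A minimal sketch of that decision as a pure function follows. The helper name `blacklist_args`, its signature, and the relative `database_dir` default are hypothetical and only serve the illustration; the file name patterns and the `-b` / `-f blacklist` arguments are taken from the diff.

```python
import os

# Hypothetical helper, not part of the wrapper: it mirrors the branch logic of the
# diff so that the four input combinations can be checked in isolation.
def blacklist_args(custom_blacklist, default_blacklist, build=None, version=None,
                   database_dir="var/lib/arriba"):
    """Return the blacklist-related arriba arguments as a string."""
    if custom_blacklist and default_blacklist:
        # Both sources given: ambiguous, so refuse instead of silently picking one.
        raise ValueError("Set either custom_blacklist or default_blacklist, not both.")
    if custom_blacklist:
        return f"-b {custom_blacklist}"
    if default_blacklist:
        if not build:
            raise ValueError("A genome build is required for the default blacklist.")
        # Assembly labels as they appear in the file names used in the diff.
        labels = {
            "GRCh37": "hg19_hs37d5_GRCh37",
            "GRCh38": "hg38_GRCh38",
            "GRCm38": "mm10_GRCm38",
            "GRCm39": "mm39_GRCm39",
        }
        if build not in labels:
            raise ValueError(f"Unsupported genome build: {build}")
        filename = f"blacklist_{labels[build]}_v{version}.tsv.gz"
        return "-b " + os.path.join(database_dir, filename)
    # Neither given: disable the blacklist filter, as the wrapper does.
    return "-f blacklist"
```

Unlike the wrapper, this sketch raises a `ValueError` for an unknown build instead of letting the dictionary lookup fail with a `KeyError`; that is a design choice of the example, not something the commit does.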
 

diff --git a/meta/bio/star_arriba/test/Snakefile b/meta/bio/star_arriba/test/Snakefile
@@ -6,7 +6,7 @@ rule star_index:
         directory("resources/star_genome"),
     threads: 4
     params:
-        extra="--sjdbGTFfile resources/genome.gtf --sjdbOverhang 100",
+        extra=lambda wc, input: f"--sjdbGTFfile {input.annotation} --sjdbOverhang 100",
     log:
         "logs/star_index_genome.log",
     cache: True  # mark as eligible for between workflow caching
@@ -21,6 +21,7 @@ rule star_align:
         fq1="reads/{sample}_R1.1.fastq",
         fq2="reads/{sample}_R2.1.fastq",  #optional
         idx="resources/star_genome",
+        annotation="resources/genome.gtf",
     output:
         # see STAR manual for additional output files
         aln="star/{sample}/Aligned.out.bam",
@@ -29,7 +30,7 @@ rule star_align:
         "logs/star/{sample}.log",
     params:
         # specific parameters to work well with arriba
-        extra="--quantMode GeneCounts --sjdbGTFfile resources/genome.gtf"
+        extra=lambda wc, input: f"--quantMode GeneCounts --sjdbGTFfile {input.annotation}"
         " --outSAMtype BAM Unsorted --chimSegmentMin 10 --chimOutType WithinBAM SoftClip"
         " --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0"
         " --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3",
@@ -43,13 +44,19 @@ rule arriba:
         bam="star/{sample}/Aligned.out.bam",
         genome="resources/genome.fasta",
         annotation="resources/genome.gtf",
+        # optional: # A custom tsv containing identified artifacts, such as read-through fusions of neighbouring genes.
+        # default blacklists are selected via blacklist parameter
+        # see https://arriba.readthedocs.io/en/latest/input-files/#blacklist
+        custom_blacklist=[],
     output:
         fusions="results/arriba/{sample}.fusions.tsv",
         discarded="results/arriba/{sample}.fusions.discarded.tsv",
     params:
-        # A tsv containing identified artifacts, such as read-through fusions of neighbouring genes, see https://arriba.readthedocs.io/en/latest/input-files/#blacklist
-        blacklist="arriba_blacklist.tsv",
-        extra="-T -P -i 1,2",  # -i describes the wanted contigs, remove if you want to use all hg38 chromosomes
+        # required if blacklist or known_fusions is set
+        genome_build="GRCh38",
+        default_blacklist=True,
+        default_known_fusions=True,
+        extra="",
     log:
         "logs/arriba/{sample}.log",
     threads: 1
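
The `extra=lambda wc, input: ...` params in the `star_index` and `star_align` hunks replace the hard-coded GTF path with one read from the rule's input. The sketch below shows how such a callable resolves to a string. `SimpleNamespace` is only a stand-in for the input object that Snakemake passes in, and the `None` wildcards argument is likewise an assumption made for the illustration.

```python
from types import SimpleNamespace

# Stand-in for the rule's input object; in a real workflow Snakemake supplies it.
rule_input = SimpleNamespace(annotation="resources/genome.gtf")

# Same shape as the params function in the diff: (wildcards, input) -> str.
extra = lambda wc, input: f"--sjdbGTFfile {input.annotation} --sjdbOverhang 100"

# The wildcards argument is unused by this function, so None is enough here.
resolved = extra(None, rule_input)
print(resolved)  # --sjdbGTFfile resources/genome.gtf --sjdbOverhang 100
```

Changing the `annotation` input of the rule then changes the STAR argument automatically, which is what makes the meta-wrapper usable when its paths are overridden in a module import.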