feat: Salmon-Tximport meta-wrapper (#1270)
<!-- Ensure that the PR title follows conventional commit style (<type>:
<description>)-->
<!-- Possible types are here:
https://github.com/commitizen/conventional-commit-types/blob/master/index.json
-->

### Description

This PR adds the Salmon-Tximport meta-wrapper, including the creation
of a decoy-aware gentrome index.

### QC
<!-- Make sure that you can tick the boxes below. -->

(most of the points below do not concern meta-wrappers or are
easily answered via the original wrappers themselves)

* [X] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tool's purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal number of channels, in the recommended
order. E.g., for bioconda, use `conda-forge`, `bioconda`, `nodefaults`:
conda-forge should have the highest priority, and the defaults channel is
usually not needed because most packages are in conda-forge nowadays.
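
Two of the checklist items above lend themselves to a short illustration: keeping temporary files under `tempfile.gettempdir()` and inferring options from file extensions. The following is a minimal Python sketch, not code taken from any wrapper in this PR; the helper names `infer_compression` and `write_scratch_file` and the extension-to-mode mapping are hypothetical.

```python
import tempfile
from pathlib import Path


def infer_compression(output_path: str) -> str:
    """Guess a compression mode from the file extension (hypothetical mapping)."""
    suffix = Path(output_path).suffix
    return {".gz": "gzip", ".bz2": "bzip2"}.get(suffix, "none")


def write_scratch_file(content: str) -> str:
    """Write content to a scratch file and read it back before cleanup."""
    # TemporaryDirectory() creates its folder under tempfile.gettempdir() by
    # default and deletes it again when the with-block exits.
    with tempfile.TemporaryDirectory() as tmpdir:
        scratch = Path(tmpdir) / "scratch.txt"
        scratch.write_text(content)
        return scratch.read_text()
```

Relying on the `tempfile` defaults means the scratch location follows whatever temporary directory the executing system is configured to use.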

---------

Co-authored-by: tdayris <tdayris@gustaveroussy.fr>
Co-authored-by: tdayris <thibault.dayris@gustaveroussy.fr>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: Felix Mölder <felix.moelder@uni-due.de>
Co-authored-by: Christopher Schröder <christopher.schroeder@tu-dortmund.de>
8 people committed May 12, 2023
1 parent 25d8a07 commit 1e31da2
Showing 12 changed files with 182 additions and 0 deletions.
16 changes: 16 additions & 0 deletions meta/bio/salmon_tximport/meta.yaml
@@ -0,0 +1,16 @@
name: Salmon Tximport
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6178912.2/
description: >
  +----------------+----------+----------------------------------------------------------------------------------------------+
  | Step           | Tool     | Reason                                                                                       |
  +================+==========+==============================================================================================+
  | Indexation     | Bash     | Identify decoy sequences                                                                     |
  +                +----------+----------------------------------------------------------------------------------------------+
  |                | Salmon   | Create decoy aware gentrome (genome + transcriptome) index                                   |
  +----------------+----------+----------------------------------------------------------------------------------------------+
  | Quantification | Salmon   | Quantify sequenced reads                                                                     |
  +                +----------+----------------------------------------------------------------------------------------------+
  |                | Tximport | Import counts and inferential replicates in R as a ready-to-use SummarizedExperiment object. |
  +----------------+----------+----------------------------------------------------------------------------------------------+
authors:
- Thibault Dayris
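
The "Identify decoy sequences" step in the table above can be pictured with a small Python sketch. It assumes the usual Salmon convention that the decoy list holds the genome sequence names and that the gentrome places the transcripts before the genome sequences. The function `build_gentrome` is hypothetical and is not the implementation of the `bio/salmon/decoys` wrapper, which the table lists as a Bash step.

```python
from typing import List, Tuple


def build_gentrome(transcriptome_fasta: str, genome_fasta: str) -> Tuple[str, List[str]]:
    """Return (gentrome FASTA text, decoy names) from two FASTA strings."""
    # Decoy names are the genome sequence identifiers: the header text after
    # '>' up to the first whitespace.
    decoys = [
        line[1:].split()[0]
        for line in genome_fasta.splitlines()
        if line.startswith(">")
    ]
    # Transcripts come first, followed by the genome sequences used as decoys.
    gentrome = (
        transcriptome_fasta.rstrip("\n") + "\n" + genome_fasta.rstrip("\n") + "\n"
    )
    return gentrome, decoys


# Tiny made-up inputs shaped like the test resources in this commit.
transcriptome = ">transcript1\nCCAGG\n>transcript2\nCATCT\n"
genome = ">chromosome1 test\nNNNNCTAG\n"
gentrome, decoys = build_gentrome(transcriptome, genome)
```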
117 changes: 117 additions & 0 deletions meta/bio/salmon_tximport/test/Snakefile
@@ -0,0 +1,117 @@
rule salmon_decoy_sequences:
    input:
        transcriptome="resources/transcriptome.fasta",
        genome="resources/genome.fasta",
    output:
        gentrome=temp("resources/gentrome.fasta"),
        decoys=temp("resources/decoys.txt"),
    threads: 1
    log:
        "decoys.log",
    wrapper:
        "master/bio/salmon/decoys"


rule salmon_index_gentrome:
    input:
        sequences="resources/gentrome.fasta",
        decoys="resources/decoys.txt",
    output:
        multiext(
            "salmon/transcriptome_index/",
            "complete_ref_lens.bin",
            "ctable.bin",
            "ctg_offsets.bin",
            "duplicate_clusters.tsv",
            "info.json",
            "mphf.bin",
            "pos.bin",
            "pre_indexing.log",
            "rank.bin",
            "refAccumLengths.bin",
            "ref_indexing.log",
            "reflengths.bin",
            "refseq.bin",
            "seq.bin",
            "versionInfo.json",
        ),
    cache: True
    log:
        "logs/salmon/transcriptome_index.log",
    threads: 2
    params:
        # optional parameters
        extra="",
    wrapper:
        "master/bio/salmon/index"


rule salmon_quant_reads:
    input:
        r="reads/{sample}.fastq.gz",
        index=multiext(
            "salmon/transcriptome_index/",
            "complete_ref_lens.bin",
            "ctable.bin",
            "ctg_offsets.bin",
            "duplicate_clusters.tsv",
            "info.json",
            "mphf.bin",
            "pos.bin",
            "pre_indexing.log",
            "rank.bin",
            "refAccumLengths.bin",
            "ref_indexing.log",
            "reflengths.bin",
            "refseq.bin",
            "seq.bin",
            "versionInfo.json",
        ),
        gtf="resources/annotation.gtf",
    output:
        quant=temp("pseudo_mapping/{sample}/quant.sf"),
        quant_gene=temp("pseudo_mapping/{sample}/quant.genes.sf"),
        lib=temp("pseudo_mapping/{sample}/lib_format_counts.json"),
        aux_info=temp(directory("pseudo_mapping/{sample}/aux_info")),
        cmd_info=temp("pseudo_mapping/{sample}/cmd_info.json"),
        libparams=temp(directory("pseudo_mapping/{sample}/libParams")),
        logs=temp(directory("pseudo_mapping/{sample}/logs")),
    log:
        "logs/salmon/{sample}.log",
    params:
        # optional parameters
        libtype="A",
        extra="--numBootstraps 32",
    threads: 2
    wrapper:
        "master/bio/salmon/quant"


rule tximport:
    input:
        quant=expand(
            "pseudo_mapping/{sample}/quant.sf", sample=["S1", "S2", "S3", "S4"]
        ),
        lib=expand(
            "pseudo_mapping/{sample}/lib_format_counts.json",
            sample=["S1", "S2", "S3", "S4"],
        ),
        aux_info=expand(
            "pseudo_mapping/{sample}/aux_info", sample=["S1", "S2", "S3", "S4"]
        ),
        cmd_info=expand(
            "pseudo_mapping/{sample}/cmd_info.json", sample=["S1", "S2", "S3", "S4"]
        ),
        libparams=expand(
            "pseudo_mapping/{sample}/libParams", sample=["S1", "S2", "S3", "S4"]
        ),
        logs=expand("pseudo_mapping/{sample}/logs", sample=["S1", "S2", "S3", "S4"]),
        tx_to_gene="resources/tx2gene.tsv",
    output:
        txi="tximport/SummarizedExperimentObject.RDS",
    params:
        extra="type='salmon'",
    log:
        "logs/tximport.log"
    wrapper:
        "master/bio/tximport"
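
The Salmon index file list and the sample list are each repeated several times in the Snakefile above. As a rough plain-Python sketch of what those `multiext(...)` and `expand(...)` calls evaluate to in these simple cases (the prefix joined with each file name, and one path per sample), the lists could be built once and reused. The names `INDEX_FILES`, `SAMPLES`, `salmon_index_paths`, and `per_sample` are hypothetical, and the helpers only approximate Snakemake's own functions.

```python
from typing import List

# File names that make up the Salmon index, as listed in the Snakefile.
INDEX_FILES = [
    "complete_ref_lens.bin",
    "ctable.bin",
    "ctg_offsets.bin",
    "duplicate_clusters.tsv",
    "info.json",
    "mphf.bin",
    "pos.bin",
    "pre_indexing.log",
    "rank.bin",
    "refAccumLengths.bin",
    "ref_indexing.log",
    "reflengths.bin",
    "refseq.bin",
    "seq.bin",
    "versionInfo.json",
]

# Sample names used by the test case.
SAMPLES = ["S1", "S2", "S3", "S4"]


def salmon_index_paths(prefix: str = "salmon/transcriptome_index/") -> List[str]:
    """Approximate multiext(prefix, *INDEX_FILES): prefix + each file name."""
    return [prefix + name for name in INDEX_FILES]


def per_sample(pattern: str, samples: List[str] = SAMPLES) -> List[str]:
    """Approximate expand(pattern, sample=samples) for a single wildcard."""
    return [pattern.format(sample=sample) for sample in samples]
```

In a Snakefile, the same effect could be had by assigning the `multiext(...)` result and the sample list to module-level variables and referencing them from each rule.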
Binary file added meta/bio/salmon_tximport/test/reads/S1.fastq.gz
Binary file not shown.
Binary file added meta/bio/salmon_tximport/test/reads/S2.fastq.gz
Binary file not shown.
Binary file added meta/bio/salmon_tximport/test/reads/S3.fastq.gz
Binary file not shown.
Binary file added meta/bio/salmon_tximport/test/reads/S4.fastq.gz
Binary file not shown.
11 changes: 11 additions & 0 deletions meta/bio/salmon_tximport/test/resources/annotation.gtf
@@ -0,0 +1,11 @@
#!genome-build ManuallyMadeForExample
#!genome-version MMFE01
#!genome-date 2023-04
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2023-04
chromosome1 manually_made gene 160 208 . + . gene_id "ENMG01"; gene_version "1"; gene_name "ManuallyMadeGene1"; gene_source "manually_made"; gene_biotype "protein_coding";
chromosome1 manually_made transcript 160 208 . + . gene_id "ENMG01"; gene_version "1"; transcript_id "transcript1"; transcript_version "1"; gene_name "ManuallyMadeGene1"; gene_source "manually_made"; gene_biotype "protein_coding"; transcript_name "ENMT01"; transcript_source "manually_made"; transcript_biotype "processed_transcript"; tag "basic";
chromosome1 manually_made exon 160 208 . + . gene_id "ENMG01"; gene_version "1"; transcript_id "transcript1"; transcript_version "1"; exon_number "1"; gene_name "ManuallyMadeGene1"; gene_source "manually_made"; gene_biotype "protein_coding"; transcript_name "ENMT01"; transcript_source "manually_made"; transcript_biotype "processed_transcript"; tag "basic"; exon_id "ENEX01"; exon_version "1";
chromosome1 manually_made gene 160 240 . + . gene_id "ENMG02"; gene_version "1"; gene_name "ManuallyMadeGene2"; gene_source "manually_made"; gene_biotype "protein_coding";
chromosome1 manually_made transcript 160 240 . + . gene_id "ENMG02"; gene_version "1"; transcript_id "transcript2"; transcript_version "1"; gene_name "ManuallyMadeGene2"; gene_source "manually_made"; gene_biotype "protein_coding"; transcript_name "ENMT02"; transcript_source "manually_made"; transcript_biotype "processed_transcript"; tag "basic";
chromosome1 manually_made exon 160 240 . + . gene_id "ENMG02"; gene_version "1"; transcript_id "transcript2"; transcript_version "1"; exon_number "1"; gene_name "ManuallyMadeGene2"; gene_source "manually_made"; gene_biotype "protein_coding"; transcript_name "ENMT02"; transcript_source "manually_made"; transcript_biotype "processed_transcript"; tag "basic"; exon_id "ENEX02"; exon_version "1";
13 changes: 13 additions & 0 deletions meta/bio/salmon_tximport/test/resources/genome.fasta
@@ -0,0 +1,13 @@
>chromosome1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNCTAGTAATACACGGATCTCCTCGGCGGAAGATTCCTACCGAAGCATCATCGTAACTTAATTACGTGATGTG
CCAGGCTCGTATGTACATCGCTCCTCAAAGTGAGGGGAAGTCCTAATCGGATACCGATTGGACTCTTGAGTACCGGCCC
TGTCGTACCGCTTCCCCCTTGAGCCGCTAGTAATCGATGCTCTACGAATAGGGCACCATCCTCGTTGTGCGCTACCACG
GATTAGGCGCATCTCCCTGAGTCGGTTTAAAGATTGTTACCGTCCACCGTTGTCATATCAATATTATTAACAAGTTCGG
TGGTAGGCATCTTATGGAAGGCTTACGGTTGCACCTTCCCTCAATCTCTTGCGACCATACTGTTATTCGGCGGGAACAC
CGGTCTAACTGCGGTTAAGATAAGATTGCTAAGAATATTGTCGACTGGGATCCGGTTTATTATAGGATCTTCAGCTGTG
GTTCCGCGACCACAACATCTAGCATGGGGGGCTCCGTGTGTTTCGAAGCGCCCATCATTTCGTAGCCACATATTGGAAT
TAGCTGCCTTCAGAGTGATAATTAATCGCATAGGTAGGAGCACCCTCGTGAGGTCTTACTTGCCGGCCCGGTTTCATTC
CAGAATCTGAGTTACCCGTGTTATGTCATGATCCTTGTATGCGTACTCTTGATAGGTAACCCGGAGTGCCCACCACGCA
AGTTTATATAATCCCCGGGGAACAGGCTGTTGCCCAAAAGACTAGGCCCGTGTAGCTTTGCCCCGGATTCTCGTTAGTC
GAGCGTTATGCTTTATATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
4 changes: 4 additions & 0 deletions meta/bio/salmon_tximport/test/resources/transcriptome.fasta
@@ -0,0 +1,4 @@
>transcript1
CCAGGCTCGTATGTACATCGCTCCTCAAAGTGAGGGGAAGTCCTAAT
>transcript2
CATCTCCCTGAGTCGGTTTAAAGATTGTCTTGTATGCGTACTCTTGATAGGTAACCCG
2 changes: 2 additions & 0 deletions meta/bio/salmon_tximport/test/resources/tx2gene.tsv
@@ -0,0 +1,2 @@
transcript1 ManuallyMadeGene1
transcript2 ManuallyMadeGene2
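
The `tx2gene.tsv` table pairs each `transcript_id` from `annotation.gtf` with the matching `gene_name`. This PR ships the table as a static file; purely as an illustration, a small Python sketch could derive such a mapping from GTF transcript records. The function name `tx2gene_from_gtf` is hypothetical, and the sketch assumes tab-separated GTF columns with `key "value";` attributes.

```python
import re
from typing import Dict

# Matches one GTF attribute of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def tx2gene_from_gtf(gtf_text: str) -> Dict[str, str]:
    """Map transcript_id to gene_name using the 'transcript' records of a GTF."""
    mapping: Dict[str, str] = {}
    for line in gtf_text.splitlines():
        # Skip blank lines and header/comment lines such as '#!genome-build'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Column 3 is the feature type, column 9 holds the attributes.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        mapping[attrs["transcript_id"]] = attrs["gene_name"]
    return mapping


# Shortened records modelled on the test annotation in this commit.
example_gtf = (
    "#!genome-build ManuallyMadeForExample\n"
    "chromosome1\tmanually_made\ttranscript\t160\t208\t.\t+\t.\t"
    'gene_id "ENMG01"; transcript_id "transcript1"; gene_name "ManuallyMadeGene1";\n'
    "chromosome1\tmanually_made\texon\t160\t208\t.\t+\t.\t"
    'gene_id "ENMG01"; transcript_id "transcript1"; gene_name "ManuallyMadeGene1";\n'
    "chromosome1\tmanually_made\ttranscript\t160\t240\t.\t+\t.\t"
    'gene_id "ENMG02"; transcript_id "transcript2"; gene_name "ManuallyMadeGene2";\n'
)
tx2gene = tx2gene_from_gtf(example_gtf)
```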
5 changes: 5 additions & 0 deletions meta/bio/salmon_tximport/used_wrappers.yaml
@@ -0,0 +1,5 @@
wrappers:
- bio/salmon/decoys
- bio/salmon/index
- bio/salmon/quant
- bio/tximport
14 changes: 14 additions & 0 deletions test.py
@@ -708,6 +708,20 @@ def test_open_cravat_module():
    )


@skip_if_not_modified
def test_salmon_tximport_meta():
    run(
        "meta/bio/salmon_tximport",
        [
            "snakemake",
            "--cores",
            "2",
            "--use-conda",
            "tximport/SummarizedExperimentObject.RDS",
        ],
    )


@skip_if_not_modified
def test_dada2_se_meta():
    run(