feat: Salmon update (#482)
* [fix] (template): Missing code in wrappers' doc. Error #187

* update salmon version, wrappers and documentation

* update salmon index, since RapMap indexes are not accepted anymore

* clean dev pipes

* snakefmt changes

* removed direct reference to resources for 
#482 (comment)

* Use of f-strings and implicit string to bool conversion
#482 (comment)

* List all files through multiext
#482 (comment)

* snakefmt trailing comma addition

* accept salmon index file list

* salmon index wrapper now accepts either a list of files, or a single file

* salmon quant now accepts either an index dir or a list of files

* salmon quant now accepts gzipped files and raw fastq files automatically

* bz2 support and threading error

* formatting

* adding bzip2 and gzip support in environment.yaml

* Remove unnecessary line #482 (comment)

* remove remaining dev print

Co-authored-by: tdayris <tdayris@gustaveroussy.fr>
tdayris and tdayris committed May 23, 2022
1 parent 32fd0e0 commit 3684276
Showing 38 changed files with 352 additions and 122 deletions.
2 changes: 1 addition & 1 deletion bio/salmon/index/environment.yaml
@@ -3,4 +3,4 @@ channels:
   - conda-forge
   - defaults
 dependencies:
-  - salmon ==0.14.1
+  - salmon ==1.8.0
11 changes: 8 additions & 3 deletions bio/salmon/index/meta.yaml
@@ -1,9 +1,14 @@
 name: salmon_index
+url: https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode
 description: |
-  Index a transcriptome assembly with salmon
+  Index a transcriptome assembly with salmon
 authors:
   - Tessa Pierce
+  - Thibault Dayris
 input:
-  - assembly fasta
+  - sequences: Path to sequences to index with Salmon. This can be transcriptome sequences or gentrome.
+  - decoys: Optional path to decoy sequences name, in case the above `sequence` was a gentrome.
 output:
-  - indexed assembly
+  - indexed assembly
+params:
+  - extra: Optional parameters besides `--tmpdir`, `--threads`, and IO.
25 changes: 21 additions & 4 deletions bio/salmon/index/test/Snakefile
@@ -1,13 +1,30 @@
 rule salmon_index:
     input:
-        "assembly/transcriptome.fasta"
+        sequences="assembly/transcriptome.fasta",
     output:
-        directory("salmon/transcriptome_index")
+        multiext(
+            "salmon/transcriptome_index/",
+            "complete_ref_lens.bin",
+            "ctable.bin",
+            "ctg_offsets.bin",
+            "duplicate_clusters.tsv",
+            "info.json",
+            "mphf.bin",
+            "pos.bin",
+            "pre_indexing.log",
+            "rank.bin",
+            "refAccumLengths.bin",
+            "ref_indexing.log",
+            "reflengths.bin",
+            "refseq.bin",
+            "seq.bin",
+            "versionInfo.json",
+        ),
     log:
-        "logs/salmon/transcriptome_index.log"
+        "logs/salmon/transcriptome_index.log",
     threads: 2
     params:
         # optional parameters
-        extra=""
+        extra="",
     wrapper:
         "master/bio/salmon/index"
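For readers unfamiliar with `multiext`, the output list in the test Snakefile above amounts to concatenating one prefix with each listed file name. The following is a minimal plain-Python sketch of that idea only; the helper name `expand_prefix` is hypothetical and it is not Snakemake's own implementation.

```python
def expand_prefix(prefix, *suffixes):
    """Hypothetical stand-in for the effect of multiext: prefix + each suffix."""
    return [prefix + suffix for suffix in suffixes]


# Two of the index files listed in the rule, for illustration.
index_files = expand_prefix(
    "salmon/transcriptome_index/",
    "info.json",
    "versionInfo.json",
)
print(index_files)
```

Listing the files explicitly lets downstream rules depend on individual index files instead of on a whole directory.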
13 changes: 13 additions & 0 deletions bio/salmon/index/test/Snakefile_dir
@@ -0,0 +1,13 @@
+rule salmon_index:
+    input:
+        sequences="assembly/transcriptome.fasta",
+    output:
+        directory("salmon/transcriptome_index/"),
+    log:
+        "logs/salmon/transcriptome_index.log",
+    threads: 2
+    params:
+        # optional parameters
+        extra="",
+    wrapper:
+        "master/bio/salmon/index"
25 changes: 21 additions & 4 deletions bio/salmon/index/wrapper.py
@@ -5,12 +5,29 @@
 __email__ = "ntpierce@gmail.com"
 __license__ = "MIT"

+from os.path import dirname
 from snakemake.shell import shell
+from tempfile import TemporaryDirectory

 log = snakemake.log_fmt_shell(stdout=True, stderr=True)
 extra = snakemake.params.get("extra", "")

-shell(
-    "salmon index -t {snakemake.input} -i {snakemake.output} "
-    " --threads {snakemake.threads} {extra} {log}"
-)
+decoys = snakemake.input.get("decoys", "")
+if decoys:
+    decoys = f"--decoys {decoys}"
+
+output = snakemake.output
+if len(output) > 1:
+    output = dirname(snakemake.output[0])
+
+with TemporaryDirectory() as tempdir:
+    shell(
+        "salmon index "
+        "--transcripts {snakemake.input.sequences} "
+        "--index {output} "
+        "--threads {snakemake.threads} "
+        "--tmpdir {tempdir} "
+        "{decoys} "
+        "{extra} "
+        "{log}"
+    )
4 changes: 3 additions & 1 deletion bio/salmon/quant/environment.yaml
@@ -3,4 +3,6 @@ channels:
   - conda-forge
   - defaults
 dependencies:
-  - salmon ==0.14.1
+  - salmon ==1.8.0
+  - gzip ==1.11
+  - bzip2 ==1.0.8
22 changes: 18 additions & 4 deletions bio/salmon/quant/meta.yaml
@@ -1,9 +1,23 @@
-name: salmon_quant
+name: salmon quant
+url: https://salmon.readthedocs.io/en/latest/salmon.html#quantifying-in-mapping-based-mode
 description: |
-  Quantify transcripts with salmon
+  Quantify transcripts with salmon
 authors:
   - Tessa Pierce
+  - Thibault Dayris
 input:
-  - assembly index, fastq files
+  - index: Path to Salmon indexed sequences, see `bio/salmon/index`
+  - gtf: Optional path to a GTF formatted genome annotation
+  - r: Path to unpaired reads
+  - r1: Path to upstream reads file.
+  - r2: Path to downstream reads file.
 output:
-  - quantification files
+  - Path to quantification file
+  - bam: Path to pseudo-bam file
+params:
+  - libType: Format string describing the library type, see `official documentation on Library Types <https://salmon.readthedocs.io/en/latest/library_type.html>`_ for list of accepted values.
+  - extra: Optional command line parameters, besides IO parameters and threads.
+notes: |
+  Salmon accepted either a list of unpaired reads (`r` parameter), or two lists
+  of the same length containing paired reads (`r1` and `r2` parameters). Not
+  both.
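The note above describes a constraint on the read inputs: either `r`, or `r1` together with `r2` in lists of equal length, never both. The quant wrapper itself is not part of this view, so the following is only a hypothetical sketch of how such a check could be written; the function name `classify_reads` and the use of a plain dict in place of the Snakemake input object are assumptions.

```python
def as_list(value):
    """Accept either a single path or a list of paths."""
    return [value] if isinstance(value, str) else list(value)


def classify_reads(inputs):
    """Return 'single' or 'paired', or raise ValueError on an invalid mix."""
    has_single = "r" in inputs
    has_r1 = "r1" in inputs
    has_r2 = "r2" in inputs

    if has_single and (has_r1 or has_r2):
        raise ValueError("Provide either `r`, or `r1` and `r2`, not both.")
    if has_single:
        return "single"
    if has_r1 and has_r2:
        if len(as_list(inputs["r1"])) != len(as_list(inputs["r2"])):
            raise ValueError("`r1` and `r2` must contain the same number of files.")
        return "paired"
    raise ValueError("No reads provided: expected `r`, or `r1` and `r2`.")


print(classify_reads({"r": "reads/a.fq.gz"}))
print(classify_reads({"r1": ["a_1.fq.gz", "b_1.fq.gz"], "r2": ["a_2.fq.gz", "b_2.fq.gz"]}))
```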
17 changes: 8 additions & 9 deletions bio/salmon/quant/test/Snakefile
@@ -2,19 +2,18 @@ rule salmon_quant_reads:
     input:
         # If you have multiple fastq files for a single sample (e.g. technical replicates)
         # use a list for r1 and r2.
-        r1 = "reads/{sample}_1.fq.gz",
-        r2 = "reads/{sample}_2.fq.gz",
-        index = "salmon/transcriptome_index"
+        r1="reads/{sample}_1.fq.gz",
+        r2="reads/{sample}_2.fq.gz",
+        index="salmon/transcriptome_index",
     output:
-        quant = 'salmon/{sample}/quant.sf',
-        lib = 'salmon/{sample}/lib_format_counts.json'
+        quant="salmon/{sample}/quant.sf",
+        lib="salmon/{sample}/lib_format_counts.json",
     log:
-        'logs/salmon/{sample}.log'
+        "logs/salmon/{sample}.log",
     params:
         # optional parameters
-        libtype ="A",
-        #zip_ext = bz2 # req'd for bz2 files ('bz2'); optional for gz files('gz')
-        extra=""
+        libtype="A",
+        extra="",
     threads: 2
     wrapper:
         "master/bio/salmon/quant"
36 changes: 36 additions & 0 deletions bio/salmon/quant/test/Snakefile_index_list
@@ -0,0 +1,36 @@
+rule salmon_quant_reads:
+    input:
+        # If you have multiple fastq files for a single sample (e.g. technical replicates)
+        # use a list for r1 and r2.
+        r1="reads/{sample}_1.fq.gz",
+        r2="reads/{sample}_2.fq.gz",
+        index=multiext(
+            "salmon/transcriptome_index/",
+            "complete_ref_lens.bin",
+            "ctable.bin",
+            "ctg_offsets.bin",
+            "duplicate_clusters.tsv",
+            "info.json",
+            "mphf.bin",
+            "pos.bin",
+            "pre_indexing.log",
+            "rank.bin",
+            "refAccumLengths.bin",
+            "ref_indexing.log",
+            "reflengths.bin",
+            "refseq.bin",
+            "seq.bin",
+            "versionInfo.json",
+        ),
+    output:
+        quant="salmon/{sample}/quant.sf",
+        lib="salmon/{sample}/lib_format_counts.json",
+    log:
+        "logs/salmon/{sample}.log",
+    params:
+        # optional parameters
+        libtype="A",
+        extra="",
+    threads: 2
+    wrapper:
+        "master/bio/salmon/quant"
17 changes: 8 additions & 9 deletions bio/salmon/quant/test/Snakefile_pe_multi
@@ -2,19 +2,18 @@ rule salmon_quant_reads:
     input:
         # If you have multiple fastq files for a single sample (e.g. technical replicates, flowcells),
         # use a list for multiple fastq files for each sample.
-        r1 = ['reads/a_1.fq.gz','reads/b_1.fq.gz'],
-        r2 = ['reads/a_2.fq.gz','reads/b_2.fq.gz'],
-        index = "salmon/transcriptome_index"
+        r1=["reads/a_1.fq.gz", "reads/b_1.fq.gz"],
+        r2=["reads/a_2.fq.gz", "reads/b_2.fq.gz"],
+        index="salmon/transcriptome_index",
     output:
-        quant = 'salmon/ab_pe_x_transcriptome/quant.sf',
-        lib = 'salmon/ab_pe_x_transcriptome/lib_format_counts.json'
+        quant="salmon/ab_pe_x_transcriptome/quant.sf",
+        lib="salmon/ab_pe_x_transcriptome/lib_format_counts.json",
     log:
-        'logs/salmon/ab_pe_x_transcriptome.log'
+        "logs/salmon/ab_pe_x_transcriptome.log",
     params:
         # optional parameters
-        libtype ="A",
-        #zip_ext = bz2 # req'd for bz2 files ('bz2'); optional for gz files('gz')
-        extra=""
+        libtype="A",
+        extra="",
     threads: 2
     wrapper:
         "master/bio/salmon/quant"
15 changes: 7 additions & 8 deletions bio/salmon/quant/test/Snakefile_se
@@ -1,17 +1,16 @@
 rule salmon_quant_reads:
     input:
-        r = "reads/{sample}.fq.gz",
-        index = "salmon/transcriptome_index"
+        r="reads/{sample}.fq.gz",
+        index="salmon/transcriptome_index",
     output:
-        quant = 'salmon/{sample}_x_transcriptome/quant.sf',
-        lib = 'salmon/{sample}_x_transcriptome/lib_format_counts.json'
+        quant="salmon/{sample}_x_transcriptome/quant.sf",
+        lib="salmon/{sample}_x_transcriptome/lib_format_counts.json",
     log:
-        'logs/salmon/{sample}_x_transcriptome.log'
+        "logs/salmon/{sample}_x_transcriptome.log",
     params:
         # optional parameters
-        libtype ="A",
-        #zip_ext = bz2 # req'd for bz2 files ('bz2'); optional for gz files('gz')
-        extra=""
+        libtype="A",
+        extra="",
     threads: 2
     wrapper:
         "master/bio/salmon/quant"
16 changes: 16 additions & 0 deletions bio/salmon/quant/test/Snakefile_se_bz2
@@ -0,0 +1,16 @@
+rule salmon_quant_reads:
+    input:
+        r="reads/{sample}.fq.bz2",
+        index="salmon/transcriptome_index",
+    output:
+        quant="salmon/{sample}_x_transcriptome/quant.sf",
+        lib="salmon/{sample}_x_transcriptome/lib_format_counts.json",
+    log:
+        "logs/salmon/{sample}_x_transcriptome.log",
+    params:
+        # optional parameters
+        libtype="A",
+        extra="",
+    threads: 2
+    wrapper:
+        "master/bio/salmon/quant"
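The bz2 test case above, together with the removal of the old `zip_ext` parameter, suggests that compression is now detected from the file name. The quant wrapper is not shown in this view, so the snippet below is a hypothetical sketch of extension-based detection, not the wrapper's actual code. The function name `stream_reads` is an assumption, as is the use of shell process substitution with the `gzip` and `bzip2` tools added to `environment.yaml`.

```python
def stream_reads(path):
    """Return a shell fragment that yields `path` as uncompressed FASTQ.

    Assumption: compressed files are decompressed on the fly through
    process substitution; plain files are passed through unchanged.
    """
    if path.endswith(".gz"):
        return f"<( gzip --decompress --stdout {path} )"
    if path.endswith(".bz2"):
        return f"<( bzip2 --decompress --stdout {path} )"
    return path


for reads in ("reads/a.fq", "reads/a.fq.gz", "reads/a.fq.bz2"):
    print(stream_reads(reads))
```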
Binary file added bio/salmon/quant/test/reads/a_se.fq.bz2
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -1 +1 @@
-RetainedTxp DuplicateTxp
+RetainedRef DuplicateRef
Binary file not shown.
14 changes: 0 additions & 14 deletions bio/salmon/quant/test/salmon/transcriptome_index/header.json

This file was deleted.

2 changes: 0 additions & 2 deletions bio/salmon/quant/test/salmon/transcriptome_index/indexing.log

This file was deleted.

22 changes: 22 additions & 0 deletions bio/salmon/quant/test/salmon/transcriptome_index/info.json
@@ -0,0 +1,22 @@
{
"index_version": 4,
"reference_gfa": [
"transcriptome_index"
],
"sampling_type": "dense",
"k": 31,
"num_kmers": 352,
"num_contigs": 2,
"seq_length": 412,
"have_ref_seq": true,
"have_edge_vec": false,
"SeqHash": "8957140ad649436f3db7111f5a1cea7cf5e8ee72600f26443d3861b5f0894325",
"NameHash": "7733b4bd4d5a14d60999c280918c82dc8d1f7cfdd24764e8eef54a4bb30a51a3",
"SeqHash512": "89a7e74f55209605a4fe0823821c8dfbedebcb2639fba589afed3af583c8158d01cafff5ceb5e63d3b95c3635e937869a6d55c67d748d6f5e3ae1aa53fd5ba4b",
"NameHash512": "454d8e37dceb2f27b460b46f3d4724f5cca0b5bd29abe8493484846a759cf7e71db43da5cd7f4afbdb17ce12d46faa4c3326dc795dd1900df0995eb53dceb695",
"DecoySeqHash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"DecoyNameHash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"num_decoys": 0,
"first_decoy_index": 18446744073709551615,
"keep_duplicates": false
}
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,3 @@
[2022-04-29 11:07:36.254] [jLog] [warning] The salmon index is being built without any decoy sequences. It is recommended that decoy sequence (either computed auxiliary decoy sequence or the genome of the organism) be provided during indexing. Further details can be found at https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode.
[2022-04-29 11:07:36.254] [jLog] [info] building index
[2022-04-29 11:07:36.296] [jLog] [info] done building index
12 changes: 0 additions & 12 deletions bio/salmon/quant/test/salmon/transcriptome_index/quasi_index.log

This file was deleted.

Binary file not shown.
Binary file not shown.
5 changes: 0 additions & 5 deletions bio/salmon/quant/test/salmon/transcriptome_index/refInfo.json

This file was deleted.

28 changes: 28 additions & 0 deletions bio/salmon/quant/test/salmon/transcriptome_index/ref_indexing.log
@@ -0,0 +1,28 @@
[2022-04-29 11:07:36.254] [puff::index::jointLog] [info] Running fixFasta
[2022-04-29 11:07:36.255] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[2022-04-29 11:07:36.255] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts
[2022-04-29 11:07:36.256] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2022-04-29 11:07:36.256] [puff::index::jointLog] [info] ntHll estimated 47404 distinct k-mers, setting filter size to 2^20
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Starting the Pufferfish indexing by reading the GFA binary file.
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Setting the index/BinaryGfa directory transcriptome_index
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Done wrapping the rank vector with a rank9sel structure.
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] contig count for validation: 2
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Total # of Contigs : 2
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Total # of numerical Contigs : 2
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Total # of contig vec entries: 2
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] bits per offset entry 2
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Done constructing the contig vector. 3
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] # segments = 2
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] total length = 412
[2022-04-29 11:07:36.268] [puff::index::jointLog] [info] Reading the reference files ...
[2022-04-29 11:07:36.269] [puff::index::jointLog] [info] positional integer width = 9
[2022-04-29 11:07:36.269] [puff::index::jointLog] [info] seqSize = 412
[2022-04-29 11:07:36.269] [puff::index::jointLog] [info] rankSize = 412
[2022-04-29 11:07:36.269] [puff::index::jointLog] [info] edgeVecSize = 0
[2022-04-29 11:07:36.269] [puff::index::jointLog] [info] num keys = 352
[2022-04-29 11:07:36.295] [puff::index::jointLog] [info] mphf size = 0.000961304 MB
[2022-04-29 11:07:36.295] [puff::index::jointLog] [info] chunk size = 412
[2022-04-29 11:07:36.295] [puff::index::jointLog] [info] chunk 0 = [0, 382)
[2022-04-29 11:07:36.295] [puff::index::jointLog] [info] finished populating pos vector
[2022-04-29 11:07:36.295] [puff::index::jointLog] [info] writing index components
[2022-04-29 11:07:36.296] [puff::index::jointLog] [info] finished writing dense pufferfish index
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -1,6 +1,7 @@
 {
-    "indexVersion": 4,
+    "indexVersion": 5,
     "hasAuxIndex": false,
     "auxKmerLength": 31,
-    "indexType": 1
+    "indexType": 2,
+    "salmonVersion": "1.8.0"
 }
