Skip to content

Commit

Permalink
feat: DeseqDataSet from multiple input (#1326)
Browse files Browse the repository at this point in the history
<!-- Ensure that the PR title follows conventional commit style (<type>:
<description>)-->
<!-- Possible types are here:
https://github.com/commitizen/conventional-commit-types/blob/master/index.json
-->

### Description

This PR adds a wrapper to build DESeqDataSets for DESeq2, from multiple
sources (tximport, ranged SE, TSV/R count matrices, already existing
DESeqDataSet), and (on user request) performs both releveling and count
filtering as DESeq2 documentation suggests.

If you think this should be broken into multiple separate wrappers,
please let me know, I'll close this PR and submit the separated
sub-functions one by one.

### QC
<!-- Make sure that you can tick the boxes below. -->

* [X] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tools purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended
ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as
conda-forge should have highest priority and defaults channels are
usually not needed because most packages are in conda-forge nowadays).

---------

Co-authored-by: tdayris <tdayris@gustaveroussy.fr>
Co-authored-by: tdayris <thibault.dayris@gustaveroussy.fr>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: Felix Mölder <felix.moelder@uni-due.de>
Co-authored-by: Christopher Schröder <christopher.schroeder@tu-dortmund.de>
Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
  • Loading branch information
9 people committed May 12, 2023
1 parent 8cc5d91 commit 25d8a07
Show file tree
Hide file tree
Showing 18 changed files with 583 additions and 0 deletions.
9 changes: 9 additions & 0 deletions bio/deseq2/deseqdataset/environment.yaml
@@ -0,0 +1,9 @@
channels:
- conda-forge
- bioconda
- nodefaults
dependencies:
- bioconductor-tximport=1.26.0
- r-readr=2.1.4
- r-jsonlite=1.8.4
- bioconductor-deseq2=1.38.0
37 changes: 37 additions & 0 deletions bio/deseq2/deseqdataset/meta.yaml
@@ -0,0 +1,37 @@
name: DESeqDataSet
url: https://bioconductor.org/packages/3.16/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#input-data
description: |
Create a `DESeqDataSet` object from either, a `tximport` `SummarizedExperiment`, a directory containing `HTSeq` counts, a sample table containing paths to count matrices, or a `RangedSummarizedExperiment` object.
Then optionally run `DESeq2` pre-filtering.
authors:
- Thibault Dayris
input:
- colData: Path to the file describing the experiment design (TSV formatted file). First column contains sample names.
- dds: Path to the DESeqDataSet object (RDS formatted file) *OR*
- txi: Path to the tximport/tximeta SummarizedExperiment object (RDS formatted file) *OR*
- se: Path to the RangedSummarizedExperiment object (RDS formatted file) *OR*
- matrix: Path to the R `matrix(...) ` containing counts. Sample names must be in rownames. (RDS formatted file) *OR*
- counts: Path to the text matrix containing counts. Sample names should be in the first column. (TSV formatted file) *OR*
- htseq_dir: Path to the directory containing HTSeq/FeatureCount count matrices *AND*
- sample_table: Path to the table containing sample names and path to HTSeq/FeatureCount count matrices
output:
- Path to the DESeqDataSet object (RDS formatted file)
params:
- formula: Required.
- reference_level: Optional reference level name, in case relevel is needed
- tested_level: Optional tested level name, in case relevel is needed
- factor: Factor of interest, in case relevel is needed
- min_count: Minimum number of counted/estimated reads threshold (do not filter by default)
- extra: Optional argument passed to DESeq2, apart from `txi`, `colData`, `design`, `htseq`, `directory`, `se`, `sampleTable`, or `tidy`.
note: |
* `dds` input is here to enable filters on already existing `DESeqDataSet` objects.
* User should provide either:
* `dds`
* `txi` and `colData`
* `se`
* `matrix` and `colData`
* `counts` and `colData`
* `htseq_dir` and `sample_table`
* According to `DESeq2` documentation, a `sample_table` should be a TSV file with at least two columns: `sampleName` and `fileName`. Other columns may contain factors and levels information.
* According to `DESeq2` documentation, a `colData` table should be a TSV file with sample names on the first column. Other columns may contain factors and levels information.
116 changes: 116 additions & 0 deletions bio/deseq2/deseqdataset/test/Snakefile
@@ -0,0 +1,116 @@
rule test_DESeqDataSet_filtering:
input:
dds="dataset/dds.RDS",
output:
"dds_minimal.RDS",
threads: 1
log:
"logs/DESeqDataSet/txi.log",
params:
formula="~condition", # Required R statistical formula
factor="condition", # Optionally used for relevel
reference_level="A", # Optionally used for relevel
tested_level="B", # Optionally used for relevel
min_counts=0, # Optionally used to filter low counts
extra="", # Optional parameters provided to import function
wrapper:
"master/bio/deseq2/deseqdataset"


rule test_DESeqDataSet_from_tximport:
input:
txi="dataset/txi.RDS",
colData="coldata.tsv",
output:
"dds_txi.RDS",
threads: 1
log:
"logs/DESeqDataSet/txi.log",
params:
formula="~condition", # Required R statistical formula
# factor="condition", # Optionally used for relevel
# reference_level="A", # Optionally used for relevel
# tested_level="B", # Optionally used for relevel
# min_counts=0, # Optionally used to filter low counts
# extra="", # Optional parameters provided to import function
wrapper:
"master/bio/deseq2/deseqdataset"


rule test_DESeqDataSet_from_ranged_se:
input:
se="dataset/se.RDS",
output:
"dds_se.RDS",
threads: 1
log:
"logs/DESeqDataSet/se.log",
params:
formula="~condition", # Required R statistical formula
# factor="condition", # Optionally used for relevel
# reference_level="A", # Optionally used for relevel
# tested_level="B", # Optionally used for relevel
# min_counts=0, # Optionally used to filter low counts
# extra="", # Optional parameters provided to import function
wrapper:
"master/bio/deseq2/deseqdataset"


rule test_DESeqDataSet_from_r_matrix:
input:
matrix="dataset/matrix.RDS",
colData="coldata.tsv",
output:
"dds_rmatrix.RDS",
threads: 1
log:
"logs/DESeqDataSet/r_matrix.log",
params:
formula="~condition", # Required R statistical formula
# factor="condition", # Optionally used for relevel
# reference_level="A", # Optionally used for relevel
# tested_level="B", # Optionally used for relevel
# min_counts=0, # Optionally used to filter low counts
# extra="", # Optional parameters provided to import function
wrapper:
"master/bio/deseq2/deseqdataset"


rule test_DESeqDataSet_from_tsv_matrix:
input:
counts="dataset/counts.tsv",
colData="coldata.tsv",
output:
"dds_matrix.RDS",
threads: 1
log:
"logs/DESeqDataSet/txt_matrix.log",
params:
formula="~condition", # Required R statistical formula
# factor="condition", # Optionally used for relevel
# reference_level="A", # Optionally used for relevel
# tested_level="B", # Optionally used for relevel
# min_counts=0, # Optionally used to filter low counts
# extra="", # Optional parameters provided to import function
wrapper:
"master/bio/deseq2/deseqdataset"


rule test_DESeqDataSet_from_htseqcount:
input:
htseq_dir="dataset/htseq_dir",
sample_table="sample_table.tsv",
output:
"dds_htseq.RDS",
threads: 1
log:
"logs/DESeqDataSet/txt_matrix.log",
params:
formula="~condition", # Required R statistical formula
# factor="condition", # Optionally used for relevel
# reference_level="A", # Optionally used for relevel
# tested_level="B", # Optionally used for relevel
# min_counts=0, # Optionally used to filter low counts
# extra="", # Optional parameters provided to import function
wrapper:
"master/bio/deseq2/deseqdataset"
7 changes: 7 additions & 0 deletions bio/deseq2/deseqdataset/test/coldata.tsv
@@ -0,0 +1,7 @@
sample condition
sample1 A
sample2 A
sample3 A
sample4 B
sample5 B
sample6 B
12 changes: 12 additions & 0 deletions bio/deseq2/deseqdataset/test/dataset/counts.tsv
@@ -0,0 +1,12 @@
gene_id sample1 sample2 sample3 sample4 sample5 sample6
gene1 18 11 27 14 28 47
gene2 19 12 6 7 1 20
gene3 34 33 28 42 52 25
gene4 36 33 29 26 22 9
gene5 50 64 45 48 61 50
gene6 13 18 26 19 35 9
gene7 3 2 6 3 2 1
gene8 2 1 3 0 1 2
gene9 3 17 23 9 8 0
gene10 28 9 7 5 12 7

Binary file added bio/deseq2/deseqdataset/test/dataset/dds.RDS
Binary file not shown.
11 changes: 11 additions & 0 deletions bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample1.count.tsv
@@ -0,0 +1,11 @@
gene1 18
gene2 19
gene3 34
gene4 36
gene5 50
gene6 13
gene7 3
gene8 2
gene9 3
gene10 28

11 changes: 11 additions & 0 deletions bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample2.count.tsv
@@ -0,0 +1,11 @@
gene1 11
gene2 12
gene3 33
gene4 33
gene5 64
gene6 18
gene7 2
gene8 1
gene9 17
gene10 9

11 changes: 11 additions & 0 deletions bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample3.count.tsv
@@ -0,0 +1,11 @@
gene1 27
gene2 6
gene3 28
gene4 29
gene5 45
gene6 26
gene7 6
gene8 3
gene9 23
gene10 7

11 changes: 11 additions & 0 deletions bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample4.count.tsv
@@ -0,0 +1,11 @@
gene1 14
gene2 7
gene3 42
gene4 26
gene5 48
gene6 19
gene7 3
gene8 0
gene9 9
gene10 5

11 changes: 11 additions & 0 deletions bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample5.count.tsv
@@ -0,0 +1,11 @@
gene1 28
gene2 1
gene3 52
gene4 22
gene5 61
gene6 35
gene7 2
gene8 1
gene9 8
gene10 12

11 changes: 11 additions & 0 deletions bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample6.count.tsv
@@ -0,0 +1,11 @@
gene1 47
gene2 20
gene3 25
gene4 9
gene5 50
gene6 9
gene7 1
gene8 2
gene9 0
gene10 7

Binary file added bio/deseq2/deseqdataset/test/dataset/matrix.RDS
Binary file not shown.
Binary file added bio/deseq2/deseqdataset/test/dataset/se.RDS
Binary file not shown.
Binary file added bio/deseq2/deseqdataset/test/dataset/txi.RDS
Binary file not shown.
7 changes: 7 additions & 0 deletions bio/deseq2/deseqdataset/test/sample_table.tsv
@@ -0,0 +1,7 @@
sampleName fileName condition
sample1 sample1.count.tsv A
sample2 sample2.count.tsv A
sample3 sample3.count.tsv A
sample4 sample4.count.tsv B
sample5 sample5.count.tsv B
sample6 sample6.count.tsv B

0 comments on commit 25d8a07

Please sign in to comment.