Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: DeseqDataSet from multiple input (#1326)
<!-- Ensure that the PR title follows conventional commit style (<type>: <description>)--> <!-- Possible types are here: https://github.com/commitizen/conventional-commit-types/blob/master/index.json --> ### Description This PR adds a wrapper to build DESeqDataSets for DESeq2, from multiple sources (tximport, ranged SE, TSV/R count matrices, already existing DESeqDataSet), and (on user request) performs both releveling and count filtering as DESeq2 documentation suggests. If you think this should be broken into multiple separate wrappers, please let me know, I'll close this PR and submit the separated sub-functions one by one. ### QC <!-- Make sure that you can tick the boxes below. --> * [X] I confirm that: For all wrappers added by this PR, * there is a test case which covers any introduced changes, * `input:` and `output:` file paths in the resulting rule can be changed arbitrarily, * either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default, * rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads), * all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071), * wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`), * all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.), * `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool, * temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works), * the `meta.yaml` contains a link to the documentation of the respective tool or command, * `Snakefile`s pass the linting (`snakemake --lint`), * `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt), * Python wrapper scripts are formatted with [black](https://black.readthedocs.io). * Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays). --------- Co-authored-by: tdayris <tdayris@gustaveroussy.fr> Co-authored-by: tdayris <thibault.dayris@gustaveroussy.fr> Co-authored-by: Johannes Köster <johannes.koester@uni-due.de> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com> Co-authored-by: Felix Mölder <felix.moelder@uni-due.de> Co-authored-by: Christopher Schröder <christopher.schroeder@tu-dortmund.de> Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
- Loading branch information
1 parent
8cc5d91
commit 25d8a07
Showing
18 changed files
with
583 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
channels: | ||
- conda-forge | ||
- bioconda | ||
- nodefaults | ||
dependencies: | ||
- bioconductor-tximport=1.26.0 | ||
- r-readr=2.1.4 | ||
- r-jsonlite=1.8.4 | ||
- bioconductor-deseq2=1.38.0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
name: DESeqDataSet | ||
url: https://bioconductor.org/packages/3.16/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#input-data | ||
description: | | ||
Create a `DESeqDataSet` object from either, a `tximport` `SummarizedExperiment`, a directory containing `HTSeq` counts, a sample table containing paths to count matrices, or a `RangedSummarizedExperiment` object. | ||
Then optionally run `DESeq2` pre-filtering. | ||
authors: | ||
- Thibault Dayris | ||
input: | ||
- colData: Path to the file describing the experiment design (TSV formatted file). First column contains sample names. | ||
- dds: Path to the DESeqDataSet object (RDS formatted file) *OR* | ||
- txi: Path to the tximport/tximeta SummarizedExperiment object (RDS formatted file) *OR* | ||
- se: Path to the RangedSummarizedExperiment object (RDS formatted file) *OR* | ||
- matrix: Path to the R `matrix(...) ` containing counts. Sample names must be in rownames. (RDS formatted file) *OR* | ||
- counts: Path to the text matrix containing counts. Sample names should be in the first column. (TSV formatted file) *OR* | ||
- htseq_dir: Path to the directory containing HTSeq/FeatureCount count matrices *AND* | ||
- sample_table: Path to the table containing sample names and path to HTSeq/FeatureCount count matrices | ||
output: | ||
- Path to the DESeqDataSet object (RDS formatted file) | ||
params: | ||
- formula: Required. | ||
- reference_level: Optional reference level name, in case relevel is needed | ||
- tested_level: Optional tested level name, in case relevel is needed | ||
- factor: Factor of interest, in case relevel is needed | ||
- min_count: Minimum number of counted/estimated reads threshold (do not filter by default) | ||
- extra: Optional argument passed to DESeq2, apart from `txi`, `colData`, `design`, `htseq`, `directory`, `se`, `sampleTable`, or `tidy`. | ||
note: | | ||
* `dds` input is here to enable filters on already existing `DESeqDataSet` objects. | ||
* User should provide either: | ||
* `dds` | ||
* `txi` and `colData` | ||
* `se` | ||
* `matrix` and `colData` | ||
* `counts` and `colData` | ||
* `htseq_dir` and `sample_table` | ||
* According to `DESeq2` documentation, a `sample_table` should be a TSV file with at least two columns: `sampleName` and `fileName`. Other columns may contain factors and levels information. | ||
* According to `DESeq2` documentation, a `colData` table should be a TSV file with sample names on the first column. Other columns may contain factors and levels information. | ||
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
rule test_DESeqDataSet_filtering: | ||
input: | ||
dds="dataset/dds.RDS", | ||
output: | ||
"dds_minimal.RDS", | ||
threads: 1 | ||
log: | ||
"logs/DESeqDataSet/txi.log", | ||
params: | ||
formula="~condition", # Required R statistical formula | ||
factor="condition", # Optionally used for relevel | ||
reference_level="A", # Optionally used for relevel | ||
tested_level="B", # Optionally used for relevel | ||
min_counts=0, # Optionally used to filter low counts | ||
extra="", # Optional parameters provided to import function | ||
wrapper: | ||
"master/bio/deseq2/deseqdataset" | ||
|
||
|
||
rule test_DESeqDataSet_from_tximport: | ||
input: | ||
txi="dataset/txi.RDS", | ||
colData="coldata.tsv", | ||
output: | ||
"dds_txi.RDS", | ||
threads: 1 | ||
log: | ||
"logs/DESeqDataSet/txi.log", | ||
params: | ||
formula="~condition", # Required R statistical formula | ||
# factor="condition", # Optionally used for relevel | ||
# reference_level="A", # Optionally used for relevel | ||
# tested_level="B", # Optionally used for relevel | ||
# min_counts=0, # Optionally used to filter low counts | ||
# extra="", # Optional parameters provided to import function | ||
wrapper: | ||
"master/bio/deseq2/deseqdataset" | ||
|
||
|
||
rule test_DESeqDataSet_from_ranged_se: | ||
input: | ||
se="dataset/se.RDS", | ||
output: | ||
"dds_se.RDS", | ||
threads: 1 | ||
log: | ||
"logs/DESeqDataSet/se.log", | ||
params: | ||
formula="~condition", # Required R statistical formula | ||
# factor="condition", # Optionally used for relevel | ||
# reference_level="A", # Optionally used for relevel | ||
# tested_level="B", # Optionally used for relevel | ||
# min_counts=0, # Optionally used to filter low counts | ||
# extra="", # Optional parameters provided to import function | ||
wrapper: | ||
"master/bio/deseq2/deseqdataset" | ||
|
||
|
||
rule test_DESeqDataSet_from_r_matrix: | ||
input: | ||
matrix="dataset/matrix.RDS", | ||
colData="coldata.tsv", | ||
output: | ||
"dds_rmatrix.RDS", | ||
threads: 1 | ||
log: | ||
"logs/DESeqDataSet/r_matrix.log", | ||
params: | ||
formula="~condition", # Required R statistical formula | ||
# factor="condition", # Optionally used for relevel | ||
# reference_level="A", # Optionally used for relevel | ||
# tested_level="B", # Optionally used for relevel | ||
# min_counts=0, # Optionally used to filter low counts | ||
# extra="", # Optional parameters provided to import function | ||
wrapper: | ||
"master/bio/deseq2/deseqdataset" | ||
|
||
|
||
rule test_DESeqDataSet_from_tsv_matrix: | ||
input: | ||
counts="dataset/counts.tsv", | ||
colData="coldata.tsv", | ||
output: | ||
"dds_matrix.RDS", | ||
threads: 1 | ||
log: | ||
"logs/DESeqDataSet/txt_matrix.log", | ||
params: | ||
formula="~condition", # Required R statistical formula | ||
# factor="condition", # Optionally used for relevel | ||
# reference_level="A", # Optionally used for relevel | ||
# tested_level="B", # Optionally used for relevel | ||
# min_counts=0, # Optionally used to filter low counts | ||
# extra="", # Optional parameters provided to import function | ||
wrapper: | ||
"master/bio/deseq2/deseqdataset" | ||
|
||
|
||
rule test_DESeqDataSet_from_htseqcount: | ||
input: | ||
htseq_dir="dataset/htseq_dir", | ||
sample_table="sample_table.tsv", | ||
output: | ||
"dds_htseq.RDS", | ||
threads: 1 | ||
log: | ||
"logs/DESeqDataSet/txt_matrix.log", | ||
params: | ||
formula="~condition", # Required R statistical formula | ||
# factor="condition", # Optionally used for relevel | ||
# reference_level="A", # Optionally used for relevel | ||
# tested_level="B", # Optionally used for relevel | ||
# min_counts=0, # Optionally used to filter low counts | ||
# extra="", # Optional parameters provided to import function | ||
wrapper: | ||
"master/bio/deseq2/deseqdataset" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
sample condition | ||
sample1 A | ||
sample2 A | ||
sample3 A | ||
sample4 B | ||
sample5 B | ||
sample6 B |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
gene_id sample1 sample2 sample3 sample4 sample5 sample6 | ||
gene1 18 11 27 14 28 47 | ||
gene2 19 12 6 7 1 20 | ||
gene3 34 33 28 42 52 25 | ||
gene4 36 33 29 26 22 9 | ||
gene5 50 64 45 48 61 50 | ||
gene6 13 18 26 19 35 9 | ||
gene7 3 2 6 3 2 1 | ||
gene8 2 1 3 0 1 2 | ||
gene9 3 17 23 9 8 0 | ||
gene10 28 9 7 5 12 7 | ||
|
Binary file not shown.
11 changes: 11 additions & 0 deletions
11
bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample1.count.tsv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
gene1 18 | ||
gene2 19 | ||
gene3 34 | ||
gene4 36 | ||
gene5 50 | ||
gene6 13 | ||
gene7 3 | ||
gene8 2 | ||
gene9 3 | ||
gene10 28 | ||
|
11 changes: 11 additions & 0 deletions
11
bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample2.count.tsv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
gene1 11 | ||
gene2 12 | ||
gene3 33 | ||
gene4 33 | ||
gene5 64 | ||
gene6 18 | ||
gene7 2 | ||
gene8 1 | ||
gene9 17 | ||
gene10 9 | ||
|
11 changes: 11 additions & 0 deletions
11
bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample3.count.tsv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
gene1 27 | ||
gene2 6 | ||
gene3 28 | ||
gene4 29 | ||
gene5 45 | ||
gene6 26 | ||
gene7 6 | ||
gene8 3 | ||
gene9 23 | ||
gene10 7 | ||
|
11 changes: 11 additions & 0 deletions
11
bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample4.count.tsv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
gene1 14 | ||
gene2 7 | ||
gene3 42 | ||
gene4 26 | ||
gene5 48 | ||
gene6 19 | ||
gene7 3 | ||
gene8 0 | ||
gene9 9 | ||
gene10 5 | ||
|
11 changes: 11 additions & 0 deletions
11
bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample5.count.tsv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
gene1 28 | ||
gene2 1 | ||
gene3 52 | ||
gene4 22 | ||
gene5 61 | ||
gene6 35 | ||
gene7 2 | ||
gene8 1 | ||
gene9 8 | ||
gene10 12 | ||
|
11 changes: 11 additions & 0 deletions
11
bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample6.count.tsv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
gene1 47 | ||
gene2 20 | ||
gene3 25 | ||
gene4 9 | ||
gene5 50 | ||
gene6 9 | ||
gene7 1 | ||
gene8 2 | ||
gene9 0 | ||
gene10 7 | ||
|
Binary file not shown.
Binary file not shown.
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
sampleName fileName condition | ||
sample1 sample1.count.tsv A | ||
sample2 sample2.count.tsv A | ||
sample3 sample3.count.tsv A | ||
sample4 sample4.count.tsv B | ||
sample5 sample5.count.tsv B | ||
sample6 sample6.count.tsv B |
Oops, something went wrong.