feat: DeseqDataSet from multiple input (#1326)

### Description This PR adds a wrapper to build DESeqDataSets for DESeq2, from multiple sources (tximport, ranged SE, TSV/R count matrices, already existing DESeqDataSet), and (on user request) performs both releveling and count filtering as DESeq2 documentation suggests. If you think this should be broken into multiple separate wrappers, please let me know, I'll close this PR and submit the separated sub-functions one by one. ### QC  * [X] I confirm that: For all wrappers added by this PR, * there is a test case which covers any introduced changes, * `input:` and `output:` file paths in the resulting rule can be changed arbitrarily, * either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default, * rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads), * all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071), * wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`), * all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.), * `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool, * temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works), * the `meta.yaml` contains a link to the documentation of the respective tool or command, * `Snakefile`s pass the linting (`snakemake --lint`), * `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt), * Python wrapper scripts are formatted with [black](https://black.readthedocs.io). * Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays). --------- Co-authored-by: tdayris <tdayris@gustaveroussy.fr> Co-authored-by: tdayris <thibault.dayris@gustaveroussy.fr> Co-authored-by: Johannes Köster <johannes.koester@uni-due.de> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com> Co-authored-by: Felix Mölder <felix.moelder@uni-due.de> Co-authored-by: Christopher Schröder <christopher.schroeder@tu-dortmund.de> Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
snakemake · May 12, 2023 · 25d8a07 · 25d8a07
1 parent 8cc5d91
commit 25d8a07
Show file tree

Hide file tree

Showing 18 changed files with 583 additions and 0 deletions.
diff --git a/bio/deseq2/deseqdataset/environment.yaml b/bio/deseq2/deseqdataset/environment.yaml
@@ -0,0 +1,9 @@
+channels:
+  - conda-forge
+  - bioconda
+  - nodefaults
+dependencies:
+  - bioconductor-tximport=1.26.0
+  - r-readr=2.1.4
+  - r-jsonlite=1.8.4
+  - bioconductor-deseq2=1.38.0
diff --git a/bio/deseq2/deseqdataset/meta.yaml b/bio/deseq2/deseqdataset/meta.yaml
@@ -0,0 +1,37 @@
+name: DESeqDataSet
+url: https://bioconductor.org/packages/3.16/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#input-data
+description: |
+  Create a `DESeqDataSet` object from either, a `tximport` `SummarizedExperiment`, a directory containing `HTSeq` counts, a sample table containing paths to count matrices, or a `RangedSummarizedExperiment` object. 
+  Then optionally run `DESeq2` pre-filtering.
+authors:
+  - Thibault Dayris
+input:
+  - colData: Path to the file describing the experiment design (TSV formatted file). First column contains sample names.
+  - dds: Path to the DESeqDataSet object (RDS formatted file) *OR*
+  - txi: Path to the tximport/tximeta SummarizedExperiment object (RDS formatted file) *OR*
+  - se: Path to the RangedSummarizedExperiment object (RDS formatted file) *OR*
+  - matrix: Path to the R `matrix(...) ` containing counts. Sample names must be in rownames. (RDS formatted file) *OR*
+  - counts: Path to the text matrix containing counts. Sample names should be in the first column. (TSV formatted file) *OR*
+  - htseq_dir: Path to the directory containing HTSeq/FeatureCount count matrices *AND*
+  - sample_table: Path to the table containing sample names and path to HTSeq/FeatureCount count matrices
+output:
+  - Path to the DESeqDataSet object (RDS formatted file)
+params:
+  - formula: Required.
+  - reference_level: Optional reference level name, in case relevel is needed
+  - tested_level: Optional tested level name, in case relevel is needed
+  - factor: Factor of interest, in case relevel is needed
+  - min_count: Minimum number of counted/estimated reads threshold (do not filter by default)
+  - extra: Optional argument passed to DESeq2, apart from `txi`, `colData`, `design`, `htseq`, `directory`, `se`, `sampleTable`, or `tidy`.
+note: |
+  * `dds` input is here to enable filters on already existing `DESeqDataSet` objects.
+  * User should provide either:
+    * `dds`
+    * `txi` and `colData`
+    * `se`
+    * `matrix` and `colData`
+    * `counts` and `colData`
+    * `htseq_dir` and `sample_table`
+  * According to `DESeq2` documentation, a `sample_table` should be a TSV file with at least two columns: `sampleName` and `fileName`. Other columns may contain factors and levels information.
+  * According to `DESeq2` documentation, a `colData` table should be a TSV file with sample names on the first column. Other columns may contain factors and levels information.
+  
diff --git a/bio/deseq2/deseqdataset/test/Snakefile b/bio/deseq2/deseqdataset/test/Snakefile
@@ -0,0 +1,116 @@
+rule test_DESeqDataSet_filtering:
+    input:
+        dds="dataset/dds.RDS",
+    output:
+        "dds_minimal.RDS",
+    threads: 1
+    log:
+        "logs/DESeqDataSet/txi.log",
+    params:
+        formula="~condition",  # Required R statistical formula
+        factor="condition",  # Optionally used for relevel
+        reference_level="A",  # Optionally used for relevel
+        tested_level="B",  # Optionally used for relevel
+        min_counts=0,  # Optionally used to filter low counts
+        extra="",  # Optional parameters provided to import function
+    wrapper:
+        "master/bio/deseq2/deseqdataset"
+
+
+rule test_DESeqDataSet_from_tximport:
+    input:
+        txi="dataset/txi.RDS",
+        colData="coldata.tsv",
+    output:
+        "dds_txi.RDS",
+    threads: 1
+    log:
+        "logs/DESeqDataSet/txi.log",
+    params:
+        formula="~condition",  # Required R statistical formula
+        # factor="condition", # Optionally used for relevel
+        # reference_level="A", # Optionally used for relevel
+        # tested_level="B", # Optionally used for relevel
+        # min_counts=0, # Optionally used to filter low counts
+        # extra="", # Optional parameters provided to import function
+    wrapper:
+        "master/bio/deseq2/deseqdataset"
+
+
+rule test_DESeqDataSet_from_ranged_se:
+    input:
+        se="dataset/se.RDS",
+    output:
+        "dds_se.RDS",
+    threads: 1
+    log:
+        "logs/DESeqDataSet/se.log",
+    params:
+        formula="~condition",  # Required R statistical formula
+        # factor="condition", # Optionally used for relevel
+        # reference_level="A", # Optionally used for relevel
+        # tested_level="B", # Optionally used for relevel
+        # min_counts=0, # Optionally used to filter low counts
+        # extra="", # Optional parameters provided to import function
+    wrapper:
+        "master/bio/deseq2/deseqdataset"
+
+
+rule test_DESeqDataSet_from_r_matrix:
+    input:
+        matrix="dataset/matrix.RDS",
+        colData="coldata.tsv",
+    output:
+        "dds_rmatrix.RDS",
+    threads: 1
+    log:
+        "logs/DESeqDataSet/r_matrix.log",
+    params:
+        formula="~condition",  # Required R statistical formula
+        # factor="condition", # Optionally used for relevel
+        # reference_level="A", # Optionally used for relevel
+        # tested_level="B", # Optionally used for relevel
+        # min_counts=0, # Optionally used to filter low counts
+        # extra="", # Optional parameters provided to import function
+    wrapper:
+        "master/bio/deseq2/deseqdataset"
+
+
+rule test_DESeqDataSet_from_tsv_matrix:
+    input:
+        counts="dataset/counts.tsv",
+        colData="coldata.tsv",
+    output:
+        "dds_matrix.RDS",
+    threads: 1
+    log:
+        "logs/DESeqDataSet/txt_matrix.log",
+    params:
+        formula="~condition",  # Required R statistical formula
+        # factor="condition", # Optionally used for relevel
+        # reference_level="A", # Optionally used for relevel
+        # tested_level="B", # Optionally used for relevel
+        # min_counts=0, # Optionally used to filter low counts
+        # extra="", # Optional parameters provided to import function
+    wrapper:
+        "master/bio/deseq2/deseqdataset"
+
+
+rule test_DESeqDataSet_from_htseqcount:
+    input:
+        htseq_dir="dataset/htseq_dir",
+        sample_table="sample_table.tsv",
+    output:
+        "dds_htseq.RDS",
+    threads: 1
+    log:
+        "logs/DESeqDataSet/txt_matrix.log",
+    params:
+        formula="~condition",  # Required R statistical formula
+        # factor="condition", # Optionally used for relevel
+        # reference_level="A", # Optionally used for relevel
+        # tested_level="B", # Optionally used for relevel
+        # min_counts=0, # Optionally used to filter low counts
+        # extra="", # Optional parameters provided to import function
+    wrapper:
+        "master/bio/deseq2/deseqdataset"
diff --git a/bio/deseq2/deseqdataset/test/coldata.tsv b/bio/deseq2/deseqdataset/test/coldata.tsv
@@ -0,0 +1,7 @@
+sample	condition
+sample1	A
+sample2	A
+sample3	A
+sample4	B
+sample5	B
+sample6	B
diff --git a/bio/deseq2/deseqdataset/test/dataset/counts.tsv b/bio/deseq2/deseqdataset/test/dataset/counts.tsv
@@ -0,0 +1,12 @@
+gene_id	sample1	sample2	sample3	sample4	sample5	sample6
+gene1	18	11	27	14	28	47
+gene2	19	12	6	7	1	20
+gene3	34	33	28	42	52	25
+gene4	36	33	29	26	22	9
+gene5	50	64	45	48	61	50
+gene6	13	18	26	19	35	9
+gene7	3	2	6	3	2	1
+gene8	2	1	3	0	1	2
+gene9	3	17	23	9	8	0
+gene10	28	9	7	5	12	7
+
diff --git a/bio/deseq2/deseqdataset/test/dataset/dds.RDS b/bio/deseq2/deseqdataset/test/dataset/dds.RDS
diff --git a/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample1.count.tsv b/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample1.count.tsv
@@ -0,0 +1,11 @@
+gene1	18
+gene2	19
+gene3	34
+gene4	36
+gene5	50
+gene6	13
+gene7	3
+gene8	2
+gene9	3
+gene10	28
+
diff --git a/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample2.count.tsv b/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample2.count.tsv
@@ -0,0 +1,11 @@
+gene1	11
+gene2	12
+gene3	33
+gene4	33
+gene5	64
+gene6	18
+gene7	2
+gene8	1
+gene9	17
+gene10	9
+
diff --git a/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample3.count.tsv b/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample3.count.tsv
@@ -0,0 +1,11 @@
+gene1	27
+gene2	6
+gene3	28
+gene4	29
+gene5	45
+gene6	26
+gene7	6
+gene8	3
+gene9	23
+gene10	7
+
diff --git a/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample4.count.tsv b/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample4.count.tsv
@@ -0,0 +1,11 @@
+gene1	14
+gene2	7
+gene3	42
+gene4	26
+gene5	48
+gene6	19
+gene7	3
+gene8	0
+gene9	9
+gene10	5
+
diff --git a/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample5.count.tsv b/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample5.count.tsv
@@ -0,0 +1,11 @@
+gene1	28
+gene2	1
+gene3	52
+gene4	22
+gene5	61
+gene6	35
+gene7	2
+gene8	1
+gene9	8
+gene10	12
+
diff --git a/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample6.count.tsv b/bio/deseq2/deseqdataset/test/dataset/htseq_dir/sample6.count.tsv
@@ -0,0 +1,11 @@
+gene1	47
+gene2	20
+gene3	25
+gene4	9
+gene5	50
+gene6	9
+gene7	1
+gene8	2
+gene9	0
+gene10	7
+
diff --git a/bio/deseq2/deseqdataset/test/dataset/matrix.RDS b/bio/deseq2/deseqdataset/test/dataset/matrix.RDS
diff --git a/bio/deseq2/deseqdataset/test/dataset/se.RDS b/bio/deseq2/deseqdataset/test/dataset/se.RDS
diff --git a/bio/deseq2/deseqdataset/test/dataset/txi.RDS b/bio/deseq2/deseqdataset/test/dataset/txi.RDS
diff --git a/bio/deseq2/deseqdataset/test/sample_table.tsv b/bio/deseq2/deseqdataset/test/sample_table.tsv
@@ -0,0 +1,7 @@
+sampleName	fileName	condition
+sample1	sample1.count.tsv	A
+sample2	sample2.count.tsv	A
+sample3	sample3.count.tsv	A
+sample4	sample4.count.tsv	B
+sample5	sample5.count.tsv	B
+sample6	sample6.count.tsv	B