fix: empty emu combined output (#2899)

I just noticed a bug in the wrapper I made a few weeks ago :( For some inputs it didn't fail, but produced empty files. I have fixed it now and modified the test to cover more flexible filenames.

I don't know how to assert in the test whether the output file is not empty, and I'm not sure that's something we should do in the wrapper, as an empty file is technically a valid output. So far, I have checked manually that it now works as expected.

### QC

* [ ] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tool's purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running `snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io),
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use `conda-forge`, `bioconda`, `nodefaults` (conda-forge should have highest priority, and the defaults channels are usually not needed because most packages are in conda-forge nowadays).
snakemake · Apr 22, 2024 · 7b10806
1 parent 48730cd
commit 7b10806
Showing 8 changed files with 19 additions and 14 deletions.
diff --git a/bio/emu/combine-outputs/test/Snakefile b/bio/emu/combine-outputs/test/Snakefile
@@ -2,7 +2,7 @@ rule combine_outputs:
     input:
         expand("{sample}_rel-abundance.tsv", sample=["sample1", "sample2"]),
     output:
-        abundances="combined_abundances.tsv",
+        abundances=ensure("combined_abundances.tsv", non_empty=True),
     log:
         "logs/emu/combined_abundances.log",
     wrapper:
@@ -11,10 +11,10 @@ rule combine_outputs:
 
 rule combine_outputs_split:
     input:
-        expand("{sample}_rel-abundance.txt", sample=["sample1", "sample2"]),
+        expand("{sample}.txt", sample=["sample1", "sample2"]),
     output:
-        abundances="counts.tsv",
-        taxonomy="taxonomy.tsv",
+        abundances = ensure("counts.tsv", non_empty=True),
+        taxonomy = ensure("taxonomy.tsv", non_empty=True),
     log:
         "logs/emu/combined_split.log",
     params:
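
The `ensure(..., non_empty=True)` markers added in this file ask Snakemake to fail the job when an output file has zero size, which covers the "produced empty files" case from the commit message inside the test itself. For a manual check outside Snakemake, a minimal sketch could look like the following; the helper name `is_non_empty` is hypothetical and not part of the wrapper or of Snakemake.

```python
import os


def is_non_empty(path: str) -> bool:
    """Return True if `path` is an existing regular file with at least one byte.

    Hypothetical helper for manual checks; it only looks at the file size,
    so a file that contains nothing but a header line still counts as non-empty.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

Because the check is size-based, it catches the empty-output symptom but says nothing about whether the combined table has the expected rows.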

diff --git a/bio/emu/combine-outputs/test/sample1.txt b/bio/emu/combine-outputs/test/sample1.txt
@@ -0,0 +1,3 @@
+tax_id	abundance	superkingdom	phylum	class	order	family	genus	species	estimated counts
+1	1.0	Bacteria	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	amygdali;	2.0
+unassigned	0.0								2.0
diff --git a/bio/emu/combine-outputs/test/sample1_rel-abundance.tsv b/bio/emu/combine-outputs/test/sample1_rel-abundance.tsv
@@ -1,3 +1,3 @@
 tax_id	abundance	superkingdom	phylum	class	order	family	genus	species	estimated counts
 1	1.0	Bacteria	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	amygdali;	2.0
-unassigned	0.0								2.0
+unassigned	0.0								2.0
diff --git a/bio/emu/combine-outputs/test/sample1_rel-abundance.txt b/bio/emu/combine-outputs/test/sample1_rel-abundance.txt
diff --git a/bio/emu/combine-outputs/test/sample2.txt b/bio/emu/combine-outputs/test/sample2.txt
@@ -0,0 +1,3 @@
+tax_id	abundance	superkingdom	phylum	class	order	family	genus	species	estimated counts
+1	1.0	Bacteria	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	amygdali;	2.0
+unassigned	0.0								2.0
diff --git a/bio/emu/combine-outputs/test/sample2_rel-abundance.tsv b/bio/emu/combine-outputs/test/sample2_rel-abundance.tsv
@@ -1,3 +1,3 @@
-tax_id	abundance	superkingdom	phylum	class	order	family	genus	species
-1	1.0	Bacteria	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	amygdali;
-unassigned	0.0							
+tax_id	abundance	superkingdom	phylum	class	order	family	genus	species	estimated counts
+1	1.0	Bacteria	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	amygdali;	2.0
+unassigned	0.0								2.0
diff --git a/bio/emu/combine-outputs/test/sample2_rel-abundance.txt b/bio/emu/combine-outputs/test/sample2_rel-abundance.txt
diff --git a/bio/emu/combine-outputs/wrapper.py b/bio/emu/combine-outputs/wrapper.py
@@ -26,10 +26,11 @@
 with tempfile.TemporaryDirectory() as tmpdir:
     for infile in snakemake.input:
         # Files has to end in tsv, and contain rel_abundances
-        temp = os.path.join(tmpdir, os.path.basename(infile))
-        if not temp.endswith("rel_abundances.tsv"):
-            temp = os.path.splitext(infile)[0] + "-rel_abundances.tsv"
-        os.symlink(infile, temp)
+        temp_basename = os.path.basename(infile)
+        if not temp_basename.endswith("_rel-abundance.tsv"):
+            temp_basename = os.path.splitext(infile)[0] + "_rel-abundance.tsv"
+        temp = os.path.join(tmpdir, temp_basename)
+        os.link(infile, temp)
     shell("emu combine-outputs {tmpdir} {rank} {extra} {log}")
     if split and counts:
         shell("mv {tmpdir}/emu-combined-taxonomy-{rank}.tsv {taxonomy}")
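
The staging step in this hunk (give every input a name ending in `_rel-abundance.tsv` inside a temporary directory, then hard-link it there) can be sketched in isolation as below. This is an illustration under stated assumptions, not the wrapper's code: the helper names `staged_name` and `stage_inputs` are hypothetical, and the copy fallback for failed hard links is an addition that the commit does not contain. One difference is deliberate: the committed line calls `os.path.splitext` on `infile` rather than on the base name, so an input given with a directory component would carry that directory into the staged name; the sketch strips the extension from the base name instead.

```python
import os
import shutil
import tempfile

SUFFIX = "_rel-abundance.tsv"  # suffix used in the diff above


def staged_name(infile: str) -> str:
    """Return the file name to use inside the staging directory (hypothetical helper)."""
    base = os.path.basename(infile)
    if not base.endswith(SUFFIX):
        # Split the base name, not the full path, so the result
        # never contains a directory component.
        base = os.path.splitext(base)[0] + SUFFIX
    return base


def stage_inputs(infiles, tmpdir):
    """Hard-link each input into `tmpdir` under its staged name (hypothetical helper)."""
    staged = []
    for infile in infiles:
        target = os.path.join(tmpdir, staged_name(infile))
        try:
            os.link(infile, target)
        except OSError:
            # Assumption, not in the commit: fall back to a copy when a
            # hard link is not possible (for example across filesystems).
            shutil.copy2(infile, target)
        staged.append(target)
    return staged
```

A usage example with the input naming from the test `Snakefile`:

```python
with tempfile.TemporaryDirectory() as tmpdir:
    print(staged_name("sample1.txt"))
    print(staged_name("results/sample2_rel-abundance.tsv"))
```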