feat: Taxonkit wrapper (#2755)

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tool's purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running
`snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal number of channels, in the recommended
order. E.g. for bioconda, use `conda-forge`, `bioconda`, `nodefaults`:
conda-forge should have highest priority, and the defaults channels are
usually not needed because most packages are in conda-forge nowadays.
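
The temporary-file item in the checklist above can be illustrated with a minimal sketch. It is not code from this PR; the function name `scratch_roundtrip` and the file name `intermediate.txt` are hypothetical. It only shows that Python's default `tempfile` behavior places scratch data under the directory returned by `tempfile.gettempdir()`.

```python
import tempfile
from pathlib import Path


def scratch_roundtrip(payload: str) -> tuple[str, Path]:
    """Write and re-read an intermediate file inside a throwaway directory.

    Hypothetical helper: returns the text read back and the parent directory
    in which the temporary directory was created.
    """
    # With no `dir=` argument, TemporaryDirectory creates its directory
    # under tempfile.gettempdir() and removes it when the block exits.
    with tempfile.TemporaryDirectory() as tmpdir:
        scratch = Path(tmpdir) / "intermediate.txt"
        scratch.write_text(payload)
        return scratch.read_text(), Path(tmpdir).parent


text, parent = scratch_roundtrip("2026160944\n")
print(parent)
```

Because the directory is created by `tempfile` itself, a wrapper written this way respects `TMPDIR` and similar settings without any extra parameters.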
fgvieira committed Mar 26, 2024
1 parent 3a4f700 commit 576ddb9
Showing 19 changed files with 1,371 additions and 2 deletions.
1 change: 0 additions & 1 deletion bio/seqkit/environment.yaml
@@ -1,5 +1,4 @@
 channels:
-  - conda-forge
   - bioconda
   - nodefaults
 dependencies:
2 changes: 1 addition & 1 deletion bio/seqkit/meta.yaml
@@ -1,5 +1,5 @@
 name: SeqKit generic wrapper
-url: https://bioinf.shenwei.me/seqkit/usage/
+url: https://bioinf.shenwei.me/seqkit/
 description: |
   Run SeqKit.
 authors:
5 changes: 5 additions & 0 deletions bio/taxonkit/environment.linux-64.pin.txt
@@ -0,0 +1,5 @@
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/bioconda/linux-64/taxonkit-0.16.0-h9ee0642_0.tar.bz2#d58328deecfbc00948320e07d5a885c4
5 changes: 5 additions & 0 deletions bio/taxonkit/environment.yaml
@@ -0,0 +1,5 @@
channels:
  - bioconda
  - nodefaults
dependencies:
  - taxonkit =0.16.0
14 changes: 14 additions & 0 deletions bio/taxonkit/meta.yaml
@@ -0,0 +1,14 @@
name: TaxonKit generic wrapper
url: https://bioinf.shenwei.me/taxonkit/
description: |
  Run TaxonKit.
authors:
  - Filipe G. Vieira
input:
  - input: input file(s)
  - taxdump: taxdump files
output:
  - taxdump: output taxdump files
params:
  - command: TaxonKit command to use.
  - extra: Optional parameters.
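
The wrapper script itself is not shown in this view, so the following is only a sketch of how the `command`, `extra`, `input` and `taxdump` fields described above could be combined into a TaxonKit call. The function `build_taxonkit_command` is hypothetical, and the use of TaxonKit's global `--data-dir`, `--threads` and `--out-file` flags is an assumption to be checked against `taxonkit --help`. It is not the implementation added by this commit.

```python
import shlex
from pathlib import Path


def build_taxonkit_command(
    command, extra="", inputs=(), taxdump=(), out_file="-", threads=1
):
    """Assemble an argv list for a TaxonKit call (illustrative only)."""
    argv = ["taxonkit", command, "--threads", str(threads)]
    if taxdump:
        # Assumption: all taxdump files live in one directory, which is
        # handed to TaxonKit through --data-dir.
        data_dirs = {str(Path(f).parent) for f in taxdump}
        if len(data_dirs) != 1:
            raise ValueError("taxdump files must share a single directory")
        argv += ["--data-dir", data_dirs.pop()]
    # `extra` is a free-form string; shlex.split keeps quoted values together.
    argv += shlex.split(extra)
    argv += ["--out-file", out_file]
    argv += [str(f) for f in inputs]
    return argv


cmd = build_taxonkit_command(
    "lineage",
    extra="--show-status-code",
    inputs=["taxon_ids.txt"],
    taxdump=["test-taxdump/nodes.dmp", "test-taxdump/names.dmp"],
    out_file="out/lineage/a.txt",
    threads=2,
)
print(shlex.join(cmd))
```

Building an argv list rather than a single shell string keeps quoting explicit, which matters for values such as `--separator ','` in the test rules.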
225 changes: 225 additions & 0 deletions bio/taxonkit/test/Snakefile
@@ -0,0 +1,225 @@
rule taxonkit_list_txt:
    input:
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/list/{sample}.txt",
    log:
        "logs/list/{sample}.log",
    params:
        command="list",
        extra="--ids 36846609 --indent '\t' --show-name --show-rank",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_list_json:
    input:
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/list/{sample}.json",
    log:
        "logs/list/{sample}.log",
    params:
        command="list",
        extra="--ids 36846609 --show-name --show-rank",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_lineage:
    input:
        input="taxon_ids.txt",
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/lineage/{sample}.txt",
    log:
        "logs/lineage/{sample}.log",
    params:
        command="lineage",
        extra="--show-status-code",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_reformat:
    input:
        input="taxon_ids.txt",
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/reformat/{sample}.txt",
    log:
        "logs/reformat/{sample}.log",
    params:
        command="reformat",
        extra="--taxid-field 1",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_name2taxid:
    input:
        input="taxon_name.txt",
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/name2taxid/{sample}.txt",
    log:
        "logs/name2taxid/{sample}.log",
    params:
        command="name2taxid",
        extra="--show-rank",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_filter:
    input:
        input="taxon_ids.txt",
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/filter/{sample}.txt",
    log:
        "logs/filter/{sample}.log",
    params:
        command="filter",
        extra="--equal-to species",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_lca:
    input:
        input="taxon_ids.txt",
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/lca/{sample}.txt",
    log:
        "logs/lca/{sample}.log",
    params:
        command="lca",
        extra="--separator ','",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_create_taxdump:
    input:
        input=["lineages1.txt", "lineages2.txt"],
    output:
        taxdump=multiext(
            "out/create-taxdump/{sample}/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    log:
        "logs/create-taxdump/{sample}.log",
    params:
        command="create-taxdump",
        extra="--field-accession 1 --rank-names 'superkingdom,phylum,class,order,family,genus,species'",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_profile2cami:
    input:
        input="abundance.tsv",
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/profile2cami/{sample}.txt",
    log:
        "logs/profile2cami/{sample}.log",
    params:
        command="profile2cami",
        extra="--sample-id sample1 --taxonomy-id 2021-10-01",
    threads: 2
    wrapper:
        "master/bio/taxonkit"


rule taxonkit_cami_filter:
    input:
        input=rules.taxonkit_profile2cami.output[0],
        taxdump=multiext(
            "test-taxdump/",
            "taxid.map",
            "nodes.dmp",
            "names.dmp",
            "merged.dmp",
            "delnodes.dmp",
        ),
    output:
        "out/cami_filter/{sample}.tsv",
    log:
        "logs/cami_filter/{sample}.log",
    params:
        command="cami-filter",
        extra="--taxids 2759",
    threads: 2
    wrapper:
        "master/bio/taxonkit"
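
Every rule in the test Snakefile above declares its taxdump files with `multiext("test-taxdump/", ...)`. As a reading aid, the pure-Python stand-in below mimics what that call expands to: each extension is appended to the prefix. The function `multiext_sketch` is hypothetical and is not Snakemake's implementation; in particular, the real helper also flags the resulting paths internally.

```python
def multiext_sketch(prefix, *extensions):
    """Return prefix + ext for each extension, in the order given."""
    # Mirror Snakemake's rule that extensions may not contain path delimiters;
    # directory components belong in the prefix.
    if any("/" in ext or "\\" in ext for ext in extensions):
        raise ValueError("extensions may not contain path delimiters")
    return [prefix + ext for ext in extensions]


taxdump = multiext_sketch(
    "test-taxdump/",
    "taxid.map",
    "nodes.dmp",
    "names.dmp",
    "merged.dmp",
    "delnodes.dmp",
)
for path in taxdump:
    print(path)
```

Because the prefix ends in `/`, the five names resolve to files inside a single `test-taxdump` directory, which is the layout TaxonKit expects for a taxdump.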
4 changes: 4 additions & 0 deletions bio/taxonkit/test/abundance.tsv
@@ -0,0 +1,4 @@
2824115 0.2 merged to 483329
483329 0.2 absord 2824115
239935 0.5 no change
1657696 0.1 deleted
2 changes: 2 additions & 0 deletions bio/taxonkit/test/lineages1.txt
@@ -0,0 +1,2 @@
2026160944 Archaea Halobacteriota Halobacteria Halobacteriales Haloferacaceae Halobellus Halobellus inordinatus
2088315078 Archaea Halobacteriota Methanosarcinia Methanotrichales Methanotrichaceae Methanothrix_A Methanothrix_A sp001602645
3 changes: 3 additions & 0 deletions bio/taxonkit/test/lineages2.txt
@@ -0,0 +1,3 @@
2090174402 Archaea Thermoproteota Nitrososphaeria_A Caldarchaeales Wolframiiraptoraceae Geocrenenecus Geocrenenecus arthurdayi
2139920326 Archaea Huberarchaeota Huberarchaeia Huberarchaeales Huberarchaeaceae Huberarchaeum Huberarchaeum crystalense
2143941790 Archaea Thermoproteota Nitrososphaeria Nitrososphaerales UBA57 UBA57 UBA57 sp002495905
5 changes: 5 additions & 0 deletions bio/taxonkit/test/taxon_ids.txt
@@ -0,0 +1,5 @@
2026160944
2088315078
2090174402
2139920326
2143941790
3 changes: 3 additions & 0 deletions bio/taxonkit/test/taxon_name.txt
@@ -0,0 +1,3 @@
Archaeoglobi
001639295
CSSED10-239
Empty file.
Empty file.
