unite-train

A pipeline to build Qiime2 taxonomy classifiers for the UNITE database.

Download a pre-trained classifier here! 🎁

What is this?

If you are interested in Fungi 🍄🍄‍🟫 you can use their genomic fingerprint to identify them. Affordable PCR amplification and sequencing of the ITS region gives you these nucleic acid fingerprints, and the UNITE team provides a database to give these sequences a name.

We can predict the taxonomy of our fungal fingerprints using an old-school machine learning method: a supervised k-mer naive Bayes classifier. But first, we need to prepare our database in a process called 'training.'

This is a pipeline that trains the UNITE ITS taxonomy database for use with Qiime2. You can run this pipeline yourself, but you don't have to! I've provided ready-to-use pre-trained classifiers so you can simply run qiime feature-classifier classify-sklearn.
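As a rough sketch of that last step, the command below classifies a set of sequences with a downloaded classifier. The three file names are hypothetical placeholders, not files shipped with this repository; swap in your own artifacts.

```shell
# Hypothetical file names: substitute your own artifacts.
CLASSIFIER="unite_classifier.qza"   # a downloaded pre-trained classifier
READS="rep-seqs.qza"                # your representative sequences
OUTPUT="taxonomy.qza"               # where the predicted taxonomy is written

# Only run the classifier if the qiime command is on the PATH.
if command -v qiime >/dev/null 2>&1; then
  qiime feature-classifier classify-sklearn \
    --i-classifier "$CLASSIFIER" \
    --i-reads "$READS" \
    --o-classification "$OUTPUT"
else
  echo "qiime not found: activate your Qiime2 environment first" >&2
fi
```

The classifier generally needs to match the Qiime2 release you run it with, so pick the download built for your installed version.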

If you have questions about using Qiime2, ask on the Qiime2 forums.

If you have questions about the UNITE ITS database, contact the UNITE team.

If you have questions about this pipeline, please open a new issue!

Running Snakemake workflow

Set up:

Install Mambaforge and configure Bioconda.
Install the version of Qiime2 you want using the recommended environment name. (For a faster install, you can replace conda with mamba.)
Install Snakemake into an environment, then activate that environment.
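The Bioconda and Snakemake parts of the setup above might look something like this. It is a minimal sketch that assumes conda and mamba are already installed; the environment name `snakemake` and the helper function `setup_snakemake_env` are illustrative choices, not something this pipeline requires. The Qiime2 install is left out because the exact command depends on the release you choose.

```shell
# Assumed environment name; any name works as long as you activate the same one.
SNAKEMAKE_ENV="snakemake"

# Hypothetical helper wrapping the channel setup and the Snakemake install.
# Defining it has no side effects; nothing is installed until you call it.
setup_snakemake_env() {
  # Configure the Bioconda channels with strict priority.
  conda config --add channels bioconda &&
  conda config --add channels conda-forge &&
  conda config --set channel_priority strict &&
  # Install Snakemake into its own environment.
  mamba create -y -c conda-forge -c bioconda -n "$SNAKEMAKE_ENV" snakemake
}

# To use it from an interactive shell:
#   setup_snakemake_env
#   conda activate "$SNAKEMAKE_ENV"
```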

Configure:

Open up config/config.yaml and configure it to your liking. (For example, you may need to update the name of your Qiime2 environment.)

Run:

snakemake --cores 8 --use-conda --resources mem_mb=10000

Training one classifier takes 1-9 hours on an AMD EPYC 75F3 Milan, depending on the size and complexity of the data.
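Since a full run can take hours, it may be worth previewing the planned jobs first. The sketch below reuses the flags from the run command and adds Snakemake's `-n` (dry run) option, which lists the jobs without executing them. It assumes you are in the repository root with the Snakemake environment active.

```shell
CORES=8        # same core count as the run command
MEM_MB=10000   # same memory limit as the run command

# Preview the jobs Snakemake would schedule, without running anything.
if command -v snakemake >/dev/null 2>&1; then
  snakemake --cores "$CORES" --use-conda --resources mem_mb="$MEM_MB" -n
else
  echo "snakemake not found: activate the Snakemake environment first" >&2
fi
```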

Run on a slurm cluster:

More specifically, on the University of Florida HiPerGator supercomputer, with access generously provided by the Kawahara Lab!

screen    # We connect to a random login node, so we may not be able...
screen -r # to reconnect with this later on.

snakemake --jobs 24 --slurm \
  --rerun-incomplete --retries 3 \
  --use-envmodules --latency-wait 10 \
  --default-resources slurm_account=kawahara slurm_partition=hpg-milan

Run with Docker:

Say, in 'the cloud' using FlowDeploy.

snakemake --jobs 12 \
  --rerun-incomplete --retries 3 \
  --use-singularity \
  --default-resources

Reports:

snakemake --report results/report.html
snakemake --forceall --dag --dryrun | dot -Tpdf > results/dag.pdf
