OptiFit

an improved method for fitting amplicon sequences to existing OTUs

This repository contains the complete analysis workflow used to benchmark the OptiFit algorithm in mothur and produce the accompanying manuscript. Find details on how to use OptiFit and descriptions of the parameter options on the mothur wiki: https://mothur.org/wiki/cluster.fit/.

Citation

Sovacool KL, Westcott SL, Mumphrey MB, Dotson GA, Schloss PD. 2022. OptiFit: An Improved Method for Fitting Amplicon Sequences to Existing OTUs. mSphere. http://dx.doi.org/10.1128/msphere.00916-21

A bibtex entry for LaTeX users:

@article{sovacool_optifit_2022,
author = {Kelly L. Sovacool and Sarah L. Westcott and M. Brodie Mumphrey and Gabrielle A. Dotson and Patrick D. Schloss},
title = {OptiFit: an Improved Method for Fitting Amplicon Sequences to Existing OTUs},
journal = {mSphere},
year = {2022},
doi = {10.1128/msphere.00916-21},
URL = {https://journals.asm.org/doi/10.1128/msphere.00916-21}
}

The Workflow

The workflow is split into five subworkflows:

0_prep_db — download & preprocess reference databases.
1_prep_samples — download, preprocess, & de novo cluster the sample datasets.
2_fit_reference_db — fit datasets to reference databases.
3_fit_sample_split — split datasets; cluster one fraction de novo and fit the remaining sequences to the de novo OTUs.
4_vsearch — run vsearch clustering for comparison.

The main workflow (Snakefile) creates plots from the results of the subworkflows and renders the paper.
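
To preview what each subworkflow would do before committing to a full run, something like the sketch below may help. It only prints one dry-run command per subworkflow, in numeric order. It assumes that each subworkflow's Snakefile can be invoked on its own from inside its directory, which this README does not confirm; the function name `dry_run_commands` is purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch: print a dry-run command for each subworkflow, in numeric order.
# Assumption: each subworkflow's Snakefile can be run standalone from its own directory.
subworkflows=(0_prep_db 1_prep_samples 2_fit_reference_db 3_fit_sample_split 4_vsearch)

dry_run_commands() {
    local name
    for name in "${subworkflows[@]}"; do
        # --dry-run lists the jobs snakemake would schedule without executing them.
        echo "(cd subworkflows/${name} && snakemake --dry-run --cores 1)"
    done
}

dry_run_commands
```

Copy any printed line into a shell at the repository root to see the planned jobs for that subworkflow.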

Quickstart

Before cloning, configure git symlinks:
```
 git config --global core.symlinks true
```
Otherwise, git will create text files in place of symlinks.

Clone this repository.

 git clone https://github.com/SchlossLab/Sovacool_OptiFit_mSphere_2022
 cd Sovacool_OptiFit_mSphere_2022
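
After cloning, a quick check along these lines can suggest whether symlinks were checked out as real links rather than text files. This is a minimal sketch: the helper name `count_symlinks` is hypothetical, and a count of zero is only a hint that `core.symlinks` was not enabled at clone time.

```shell
#!/usr/bin/env bash
# count_symlinks DIR: print the number of symbolic links under DIR,
# skipping git's internal .git directory.
count_symlinks() {
    local dir="${1:-.}"
    find "$dir" -path "$dir/.git" -prune -o -type l -print | wc -l | tr -d ' '
}

# Show the current setting; prints "true" when symlinks are enabled.
# Prints nothing if the setting is unset or git is unavailable.
git config --get core.symlinks 2>/dev/null || true

# Count the symlinks in the current directory (run from the repository root).
count_symlinks .
```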

Install the dependencies.

Almost all dependencies, including everything needed to run the analysis workflow, are listed in the conda environment file.
```
conda env create -f config/env.simple.yaml
conda activate optifit
```
Additionally, I used a custom version of ggraph for the algorithm figure. You can install it with devtools from R:
```
devtools::install_github('kelly-sovacool/ggraph', ref = 'iss-297_ggtext')
```
If you do not have LaTeX already, you'll need to install a LaTeX distribution before rendering the manuscript as a PDF. You can use tinytex to do so:
```
tinytex::install_tinytex()
```
I also used latexdiffr to create a PDF with changes tracked prior to submitting revisions to the journal.
```
devtools::install_github("hughjonesd/latexdiffr")
```
Run the entire pipeline.

Locally:
```
snakemake --cores 4
```
Or on an HPC running slurm:
```
sbatch code/slurm/submit_all.sh
```
(You will first need to edit your email and slurm account info in the submission script and cluster config.)

Directory Structure

.
├── OptiFit.Rproj
├── README.md
├── Snakefile
├── code
│   ├── R
│   ├── bash
│   ├── py
│   ├── slurm
│   └── tests
├── config
│   ├── cluster.json
│   ├── config.yaml
│   ├── config_test.yaml
│   ├── env.export.yaml
│   ├── env.simple.yaml
│   └── slurm
│       └── config.yaml
├── docs
│   ├── paper.md
│   ├── paper.pdf
│   └── slides
├── exploratory
│   ├── 2018_fall_rotation
│   ├── 2019_winter_rotation
│   ├── 2020-05_May-Oct
│   ├── 2020-11_Nov-Dec
│   ├── 2021
│   │   ├── figures
│   │   ├── plots.Rmd
│   │   └── plots.md
│   ├── AnalysisRoadmap.md
│   └── DeveloperNotes.md
├── figures
├── log
├── paper
│   ├── figures.yaml
│   ├── head.tex
│   ├── msphere.csl
│   ├── paper.Rmd
│   ├── preamble.tex
│   └── references.bib
├── results
│   ├── aggregated.tsv
│   ├── stats.RData
│   └── summarized.tsv
└── subworkflows
    ├── 0_prep_db
    │   ├── README.md
    │   └── Snakefile
    ├── 1_prep_samples
    │   ├── README.md
    │   ├── Snakefile
    │   ├── data
    │   │   ├── human
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── marine
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── mouse
    │   │   │   └── SRR_Acc_List.txt
    │   │   └── soil
    │   │       └── SRR_Acc_List.txt
    │   └── results
    │       ├── dataset_sizes.tsv
    │       └── opticlust_results.tsv
    ├── 2_fit_reference_db
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── denovo_dbs.tsv
    │       ├── optifit_dbs_results.tsv
    │       └── ref_sizes.tsv
    ├── 3_fit_sample_split
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── optifit_crit_check.tsv
    │       └── optifit_split_results.tsv
    └── 4_vsearch
        ├── README.md
        ├── Snakefile
        └── results
            └── vsearch_results.tsv
