OBITools workflow

Table of Contents

About
Getting started
- Installation
- Directories and files structure
- Download your data
Usage
- Configuration

About

This Snakemake workflow, based on the obitools suite of programs, analyzes DNA metabarcoding data.

Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).

Getting started

Installation

Dependencies

In order to run the workflow, the following languages/programs are required:

Please note that the workflow currently runs only on Unix systems.

Install the workflow

Clone the repository:

git clone https://github.com/AnneSoBen/obitools_workflow.git

Directories and files structure

The repository contains five folders:

- config/: contains the configuration file of the Snakemake workflow (config.yaml), where the option values for the workflow's commands are set.
- log/: where the log files of each rule are written.
- resources/: where you should download or copy your raw data (see Download your data).
- results/: where all output files are written.
- workflow/: contains the Snakemake workflow (Snakefile), the configuration file for the cluster submission parameters (cluster.yaml) and the script that submits the workflow to the cluster (sub_smk.sh).

Download your data

Download or copy your data into the resources/ folder. Three files are required:

- the forward and reverse fastq files
- the corresponding ngsfilter file

They should be named prefix_R1.fastq, prefix_R2.fastq and prefix_ngsfilter.tab, and placed in a subfolder whose name is the prefix of the files (see Example).
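The naming rule above can be checked before launching anything. The sketch below is only an illustration: check_inputs is a hypothetical helper that is not part of the workflow, and it assumes the layout described in this section (resources/prefix/prefix_R1.fastq and so on).

```shell
# Hypothetical helper (not part of the workflow): check that the three
# input files of one library exist under <resources_dir>/<prefix>/.
check_inputs() {
    local resources_dir="$1"   # e.g. resources
    local prefix="$2"          # e.g. wolf_diet
    local missing=0
    local suffix file
    for suffix in R1.fastq R2.fastq ngsfilter.tab; do
        file="$resources_dir/$prefix/${prefix}_$suffix"
        # -e follows symbolic links, so a dangling link counts as missing
        if [ ! -e "$file" ]; then
            echo "missing: $file" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the repository root:
# check_inputs resources wolf_diet && echo "inputs OK"
```

The function returns a non-zero status as soon as one file is absent, so it can be chained with && in front of the command that starts the workflow.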

Usage

Configuration

Before running the workflow, the configuration file (config/config.yaml) has to be edited. The parameters that can be set are listed in the table below:

| parameter | description | concerned rule(s) | default value | comment |
|---|---|---|---|---|
| tomerge | whether to merge libraries before dereplication | merge_demultiplex | FALSE | should be set to TRUE if you analyse several libraries that you want to merge |
| resourcesfolder | relative path to the folder containing the resource files (fastq files and ngsfilter) | split_fastq, demultiplex | ../resources | should not be changed, unless you want to rename the folder |
| resultsfolder | relative path to the folder where output files will be written | all | ../results | should not be changed, unless you want to rename the folder |
| fastqfiles | prefix of the names of the resource fastq files and ngsfilter | all | wolf_diet | must be changed to match the prefix of your file names |
| mergedfile | prefix of the names of the output files if tomerge=TRUE | merge_demultiplex, split_fasta, derepl, merge_derepl, basicfilt, clustering, merge_clust, tab_format | wolf_diet | must be changed to the prefix you want for the merged files |
| split_fastq:nfiles | number of files to create when splitting fastq files for pairing | split_fastq | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial files (useful only on multi-threaded systems) |
| minscore | minimum alignment score required for pairing | alifilt | 40.00 | set according to Taberlet et al. 2018 |
| split_fasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial file(s) |
| minlength | minimum sequence length (in bp) | basicfilt | 80 | must be changed according to the minimum length expected for your barcode |
| mincount | minimum number of reads per unique sequence | basicfilt | 1 | it's up to you! |
| minsim | similarity threshold for clustering | clustering | 0.97 | it's up to you! |

If you run the workflow on a SLURM cluster, you must also check the workflow/cluster.yaml file, which sets the resources available to each rule.

Run the workflow

Then, run the workflow:

cd workflow
conda activate snakemake
snakemake -c1 --use-conda

Alternatively, you can run the workflow with a single command on a SLURM cluster by submitting the sub_smk.sh file:

cd workflow
sbatch sub_smk.sh

Example

Download toy data

If you want to test the workflow, download the toy dataset from the obitools tutorial (https://pythonhosted.org/OBITools/wolves.html) into the resources/ folder:

wget -O resources/wolf_tutorial.zip https://pythonhosted.org/OBITools/_downloads/wolf_tutorial.zip
unzip resources/wolf_tutorial.zip -d resources/
mv resources/wolf_tutorial resources/wolf_diet
rm resources/wolf_tutorial.zip

Rename the files to fit the template described above (or create symbolic links):

cd resources/wolf_diet
ln -s wolf_F.fastq wolf_diet_R1.fastq
ln -s wolf_R.fastq wolf_diet_R2.fastq
ln -s wolf_diet_ngsfilter.txt wolf_diet_ngsfilter.tab

You should get this directory and file structure:

tree

.
├── config
│   └── config.yaml
├── LICENSE
├── log
├── README.md
├── resources
│   └── wolf_diet
│       ├── db_v05_r117.fasta
│       ├── embl_r117.ndx
│       ├── embl_r117.rdx
│       ├── embl_r117.tdx
│       ├── wolf_diet_ngsfilter.tab -> wolf_diet_ngsfilter.txt
│       ├── wolf_diet_ngsfilter.txt
│       ├── wolf_diet_R1.fastq -> wolf_F.fastq
│       ├── wolf_diet_R2.fastq -> wolf_R.fastq
│       ├── wolf_F.fastq
│       └── wolf_R.fastq
├── results
└── workflow
    ├── cluster.yaml
    ├── Snakefile
    └── sub_smk.sh

Note that the name of the subfolder containing your source files (fastq and ngsfilter files) should be the prefix of the files.

The config.yaml file is already set up for this dataset.

Run the workflow

Now run the workflow:

cd ../../workflow/
conda activate snakemake
snakemake -c1 --use-conda

Option: merging libraries

You may want to merge libraries, for example if technical replicates are split across different libraries. To do so, set the value of "tomerge" to TRUE in the config/config.yaml file and list the prefixes of your library files under "fastqfiles", for example:

tomerge:
  TRUE
resourcesfolder:
  ../resources/
resultsfolder:
  ../results/
fastqfiles:
  - myfirstlibfileprefix
  - mysecondlibfileprefix
mergedfile:
  mymergedlibs

The source files of each library should be in separate subfolders. For example:

resources
├── myfirstlibfileprefix
│   ├── myfirstlibfileprefix_ngsfilter.tab
│   ├── myfirstlibfileprefix_R1.fastq
│   └── myfirstlibfileprefix_R2.fastq
└── mysecondlibfileprefix
    ├── mysecondlibfileprefix_ngsfilter.tab
    ├── mysecondlibfileprefix_R1.fastq
    └── mysecondlibfileprefix_R2.fastq

Two ngsfilter files are needed: resources/myfirstlibfileprefix/myfirstlibfileprefix_ngsfilter.tab and resources/mysecondlibfileprefix/mysecondlibfileprefix_ngsfilter.tab.

⚠️ If you want to be able to distinguish your technical replicates in the final output, don't forget to give your samples different names in the ngsfilter files, e.g. for a sample named "sample", you could change its name to "sample_a" in the first ngsfilter file and "sample_b" in the second ngsfilter file (if you have two technical replicates).
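Renaming the samples by hand is error-prone when a library holds many of them. The sketch below shows one possible way to do it; suffix_samples is a hypothetical helper, and it assumes that the ngsfilter file is tab-separated with the sample name in the second column and that header or comment lines start with "#". Check these assumptions against your own files before using it.

```shell
# Hypothetical helper (not part of the workflow): append a suffix to the
# sample name of every entry of an ngsfilter file.
# Assumptions: tab-separated columns, sample name in column 2,
# lines starting with '#' are headers or comments and are left untouched.
suffix_samples() {
    local suffix="$1"   # e.g. _a
    local infile="$2"   # e.g. resources/myfirstlibfileprefix/myfirstlibfileprefix_ngsfilter.tab
    awk -F '\t' -v OFS='\t' -v suffix="$suffix" '
        /^#/    { print; next }        # keep header/comment lines as they are
        NF >= 2 { $2 = $2 suffix }     # rename the sample
                { print }
    ' "$infile"
}

# Usage: write the result to a new file, inspect it, then replace the original.
# suffix_samples _a first_ngsfilter.tab > first_ngsfilter_renamed.tab
```

Writing to a new file rather than editing in place keeps the original ngsfilter file available for comparison.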

The value of "mergedfile" corresponds to the prefix of the merged files from the dereplication to the end of the workflow.

Going further

You may want to clean up potential molecular artifacts: have a look at the R package metabaR!

Acknowledgements

Thanks to Lucie Zinger, Frédéric Boyer, Céline Mercier and Clément Lionnet for their help with the obitools! Also thanks to the ECOFEED project for funding the development of the first version of this workflow.

How to cite this repository

Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.2. GitHub. https://doi.org/10.5281/zenodo.6676577.

🚩 Don't forget to cite this repository if you use it for your research 🙂

References

Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.

Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).

Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.

Zinger, L., Lionnet, C., Benoiston, A. S., Donald, J., Mercier, C., & Boyer, F. (2021). metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality. Methods in Ecology and Evolution, 12(4), 586-592.
