OBITools workflow

Table of Contents
  1. About
  2. Getting Started
  3. Usage

About

This is a Snakemake workflow, built on the obitools suite of programs, for analyzing DNA metabarcoding data.

Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).

Getting started

Installation

Dependencies

In order to run the workflow, the following languages/programs are required:

Please note that the workflow currently runs only on Unix systems.

Install the workflow

Clone the repository:

git clone https://github.com/AnneSoBen/obitools_workflow.git

Directories and files structure

The repository contains five folders:

  • config/: contains the configuration file of the Snakemake workflow (config.yaml). This is where the option values for the commands used by the workflow are defined.
  • log/: where log files of each rule are written.
  • resources/: where you should download/copy your raw data (cf. Download your data)
  • results/: where all output files are written.
  • workflow/: contains the Snakemake workflow (Snakefile), the configuration file of the submission parameters on the cluster (cluster.yaml) and the script to submit the workflow on the cluster (sub_smk.sh).

Download your data

Download or copy your data into the resources/ folder. Three files are required:

  • forward and reverse fastq files
  • the corresponding ngsfilter file

They should be named as follows: prefix_R1.fastq, prefix_R2.fastq, prefix_ngsfilter.tab

They must be placed in a subfolder whose name is the prefix of the files (see Example).
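
The naming convention above can be checked before launching the workflow. The following is a minimal sketch in Python and is not part of the repository: the function name `missing_inputs` and its default `resources` argument are hypothetical, and only the file name pattern is taken from the description above.

```python
from pathlib import Path


def missing_inputs(prefix, resources="resources"):
    """Return the expected input files that are absent for a given prefix.

    Hypothetical helper: it only encodes the naming rule described above
    (resources/<prefix>/<prefix>_R1.fastq, _R2.fastq and _ngsfilter.tab).
    """
    folder = Path(resources) / prefix
    expected = [
        folder / f"{prefix}_R1.fastq",
        folder / f"{prefix}_R2.fastq",
        folder / f"{prefix}_ngsfilter.tab",
    ]
    # Path.exists() follows symbolic links, so links to the raw files are accepted.
    return [str(path) for path in expected if not path.exists()]
```

Called from the repository root, `missing_inputs("wolf_diet")` should return an empty list once the three files (or symbolic links to them) are in place.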

Usage

Configuration

Before running the workflow, the configuration file (config/config.yaml) has to be edited. The parameters that can be set are listed in the table below:

| parameter | description | concerned rule(s) | default value | comment |
| --- | --- | --- | --- | --- |
| tomerge | whether to merge libraries before dereplication | merge_demultiplex | FALSE | should be set to 'TRUE' if you analyse several libraries that you want to merge |
| resourcesfolder | relative path to the folder containing resource files (fastq files and ngsfilter) | split_fastq, demultiplex | ../resources | should not be changed, unless you want to rename the folder |
| resultsfolder | relative path to the folder where output files will be written | all | ../results | should not be changed, unless you want to rename the folder |
| fastqfiles | prefix of the name of the resource fastq files and ngsfilter | all | wolf_diet | must be changed to match the prefix of your file names |
| mergedfile | prefix of the name of the output files if tomerge=TRUE | merge_demultiplex, split_fasta, derepl, merge_derepl, basicfilt, clustering, merge_clust, tab_format | wolf_diet | must be changed to the prefix you want for the merged files |
| split_fastq:nfiles | number of files to create when splitting fastq files for pairing | split_fastq | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial files (useful only on multi-threaded systems) |
| minscore | minimum alignment score required for pairing | alifilt | 40.00 | set according to Taberlet et al. 2018 |
| split_fasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial file(s) |
| minlength | minimum sequence length (in bp) | basicfilt | 80 | must be changed according to the minimum length expected for your barcode |
| mincount | minimum number of reads per unique sequence | basicfilt | 1 | it's up to you! |
| minsim | similarity threshold for clustering | clustering | 0.97 | it's up to you! |
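
To make the meaning of minlength and mincount concrete, here is a minimal Python sketch of the kind of filter these two thresholds describe. It is an illustration only and not the workflow's code, which relies on the obitools commands. The function name `basic_filter` and the (sequence, count) input layout are assumptions, as is the inclusive reading of "minimum".

```python
def basic_filter(records, minlength=80, mincount=1):
    """Keep (sequence, count) pairs that pass both thresholds.

    Illustration only: thresholds are treated as inclusive minimums,
    which is an assumption about how 'minimum' is meant in the table.
    """
    return [
        (seq, count)
        for seq, count in records
        if len(seq) >= minlength and count >= mincount
    ]


# Toy dereplicated data: each unique sequence with its number of reads.
toy = [("A" * 100, 12), ("C" * 60, 40), ("G" * 85, 1)]
kept = basic_filter(toy, minlength=80, mincount=2)
```

In this toy example, the second sequence is dropped for being shorter than minlength and the third for having fewer reads than mincount, so only the first one is kept.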

If you run the workflow on a SLURM cluster, you must also check the workflow/cluster.yaml file, which sets the resources available for each rule.

Run the workflow

Then, run the workflow:

cd workflow
conda activate snakemake
snakemake -c1 --use-conda

Alternatively, you can run the workflow with a single command on a SLURM cluster by submitting the sub_smk.sh file:

cd workflow
sbatch sub_smk.sh

Example

Download toy data

If you want to test the workflow, download the toy dataset from the obitools tutorial (https://pythonhosted.org/OBITools/wolves.html) into the resources/ folder:

wget -O resources/wolf_tutorial.zip https://pythonhosted.org/OBITools/_downloads/wolf_tutorial.zip
unzip resources/wolf_tutorial.zip -d resources/
mv resources/wolf_tutorial resources/wolf_diet
rm resources/wolf_tutorial.zip

Rename the files to fit the template described above (or create symbolic links):

cd resources/wolf_diet
ln -s wolf_F.fastq wolf_diet_R1.fastq
ln -s wolf_R.fastq wolf_diet_R2.fastq
ln -s wolf_diet_ngsfilter.txt wolf_diet_ngsfilter.tab

You should get this directory and file structure:

tree
.
├── config
│   └── config.yaml
├── LICENSE
├── log
├── README.md
├── resources
│   └── wolf_diet
│       ├── db_v05_r117.fasta
│       ├── embl_r117.ndx
│       ├── embl_r117.rdx
│       ├── embl_r117.tdx
│       ├── wolf_diet_ngsfilter.tab -> wolf_diet_ngsfilter.txt
│       ├── wolf_diet_ngsfilter.txt
│       ├── wolf_diet_R1.fastq -> wolf_F.fastq
│       ├── wolf_diet_R2.fastq -> wolf_R.fastq
│       ├── wolf_F.fastq
│       └── wolf_R.fastq
├── results
└── workflow
    ├── cluster.yaml
    ├── Snakefile
    └── sub_smk.sh

Note that the name of the subfolder containing your source files (fastq and ngsfilter files) should be the prefix of the files.

The config.yaml file is already configured for this dataset.

Run the workflow

Now run the workflow:

cd ../../workflow/
conda activate snakemake
snakemake -c1 --use-conda

Option: merging libraries

You may want to merge libraries, for example if technical replicates are split across different libraries. To allow this, set the value of "tomerge" in the config/config.yaml file to TRUE and list the prefixes of your library files under "fastqfiles", as follows:

tomerge:
  TRUE
resourcesfolder:
  ../resources/
resultsfolder:
  ../results/
fastqfiles:
  - myfirstlibfileprefix
  - mysecondlibfileprefix
mergedfile:
  mymergedlibs

The source files of each library should be in separate subfolders. For example:

└── resources
    ├── myfirstlibfileprefix
    │   ├── myfirstlibfileprefix_ngsfilter.tab
    │   ├── myfirstlibfileprefix_R1.fastq
    │   └── myfirstlibfileprefix_R2.fastq
    └── mysecondlibfileprefix
        ├── mysecondlibfileprefix_ngsfilter.tab
        ├── mysecondlibfileprefix_R1.fastq
        └── mysecondlibfileprefix_R2.fastq

Two ngsfilter files will be necessary: resources/myfirstlibfileprefix/myfirstlibfileprefix_ngsfilter.tab and resources/mysecondlibfileprefix/mysecondlibfileprefix_ngsfilter.tab.

⚠️ If you want to be able to distinguish your technical replicates in the final output, don't forget to give your samples different names in the ngsfilter files, e.g. for a sample named "sample", you could change its name to "sample_a" in the first ngsfilter file and "sample_b" in the second ngsfilter file (if you have two technical replicates).
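
The renaming suggested above can be scripted. The sketch below is hypothetical and not part of the repository: the function names `suffix_samples` and `suffix_ngsfilter` are made up for illustration, and the code assumes that the ngsfilter file is tab-separated, that the sample name is in the second column, and that lines starting with "#" are comments. Check these assumptions against your own files before using it.

```python
def suffix_samples(lines, suffix):
    """Append a suffix to the sample name of each ngsfilter line.

    Assumptions (check against your own files): columns are tab-separated,
    the sample name is the second column, and lines starting with '#'
    are comments that must be left untouched.
    """
    renamed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Leave blank lines, comments and lines without a second column as they are.
        if not line.strip() or line.startswith("#") or len(fields) < 2:
            renamed.append(line)
            continue
        fields[1] = fields[1] + suffix
        renamed.append("\t".join(fields) + "\n")
    return renamed


def suffix_ngsfilter(src, dst, suffix):
    """Read an ngsfilter file and write a renamed copy to a new path."""
    with open(src) as handle:
        lines = handle.readlines()
    with open(dst, "w") as handle:
        handle.writelines(suffix_samples(lines, suffix))
```

For two technical replicates, one would call `suffix_ngsfilter` once per library, with "_a" for the first ngsfilter file and "_b" for the second, writing each result to a new file so that the originals are preserved.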

The value of "mergedfile" is the prefix given to the merged output files, from the dereplication step to the end of the workflow.

Going further

You may want to clean up potential molecular artifacts: have a look at the R package metabaR!

Acknowledgements

Thanks to Lucie Zinger, Frédéric Boyer, Céline Mercier and Clément Lionnet for their help with the obitools! Also thanks to the ECOFEED project for funding the development of the first version of this workflow.

How to cite this repository

Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.2. GitHub. https://doi.org/10.5281/zenodo.6676577.

🚩 Don't forget to cite this repository if you use it for your research 🙂

References

Boyer, F., Mercier, C., Bonin, A., Le Bras, Y., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176-182.

Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).

Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.

Zinger, L., Lionnet, C., Benoiston, A. S., Donald, J., Mercier, C., & Boyer, F. (2021). metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality. Methods in Ecology and Evolution, 12(4), 586-592.