Skip to content

6. CD_HIT_OTU_MiSeq

Weizhong Li edited this page Mar 22, 2019 · 11 revisions

Table of Contents

Introduction

This use case is developed for clustering 16S rDNA sequences sequenced with MiSeq platform into OTUs for microbiome studies. In recent years, Illumina MiSeq sequencers became dominant in 16S rDNA sequencing. The Paired End (PE) reads need to be assembled first. However many reads can not be accurately assembled because the poor quality at the 3’ ends of both PE reads in the overlapping region. This causes that many sequences are discarded in the analysis. CD-HIT-OTU-MiSeq has unique features to cluster MiSeq 16S sequences.

  • The package can clustering PE reads without joining them into contigs.
  • Users can choose a high quality portion of the PE reads for analysis (e.g. first 200 / 150 bases from forward / reverse reads), according to base quality profile.
  • We implemented a tool that can splice out the target region (e.g. V3-V4) from a full-length 16S reference database into the PE sequences. CD-HIT-OTU-MiSeq can cluster the spliced PE reference database together with samples, so we can derive Operational Tax-onomic Units (OTUs) and annotate these OTUs concurrently.
  • Chimeric sequences are effectively identified through de novo approache.
The most important unique feature of CD-HIT-OTU-MiSeq is to only use high quality region at the 5’ ends of R1 and R2 reads. For example, the effective clustering read length can be 200 bases for R1 and 150 bases for R2. The effective portions of PE reads are clustered together with spliced PE sequences from the reference database to derive OTUs (Figure).

Installation

1. Install CD-HIT package

  • download current CD-HIT at https://github.com/weizhongli/cdhit/releases, for example cd-hit-v4.6.2-2015-0511.tar.gz
  • unpack the file with " tar xvf cd-hit-v4.6.2-2015-0511.tar.gz --gunzip"
  • change dir by "cd cd-hit-v4.6.2-2015-0511"
  • compile the programs by "make" with multi-threading (default), or by "make openmp=no" without multi-threading (on old systems without OpenMP)
  • cd cd-hit-auxtools
  • compile cd-hit-auxtools by "make"
  • CD-HIT-OTU-MiSeq scripts are inside a folder like cd-hit-v4.6.2-2015-0511/usecases/Miseq-16S
2. Install Trimmomatic
CD-HIT-OTU-MiSeq uses Trimmomatic for sequence quality control. It can be downloaded from http://www.usadellab.org/cms/?page=trimmomatic or https://github.com/timflutre/trimmomatic. We also have a copy at http://weizhongli-lab.org/download-data/cd-hit-otu-miseq/.

3. Modify NG-Omics-Miseq-16S.pl
Please edit usecases/Miseq-16S/NG-Omics-Miseq-16S.pl, in the top few lines:

  $CD_HIT_dir   = "PATH_to_cd-hit";
  $NGS_prog_trimmomatic = "PATH/trimmomatic-0.32.jar";  #### where you have installed Trimmomatic

For Python users, Modify NG-Omics-Miseq-16S.py
Some scripts in the package are also available in Python.
Please edit usecases/Miseq-16S/NG-Omics-Miseq-16S.py, in the top few lines:

  'CD_HIT_dir'           : 'PATH_to_cd-hit_directory',
  'NGS_prog_trimmomatic' : 'PATH_to_Trimmomatic/trimmomatic-0.32.jar',   #### where you have installed Trimmomatic

4. Download reference dataset
Reference database can be downloaded from http://weizhongli-lab.org/download-data/cd-hit-otu-miseq/. The reference database Greengene-13-5-99.fasta.gz was re-formatted from original Greengene database, so that sequences with more specific annotations are at the beginning of the file. Please gunzip after download.

You can also download Greengene directly. You should download Greengene from http://greengenes.secondgenome.com/downloads, or ftp://greengenes.microbio.me/. Please download file like greengenes_release/gg_13_5/gg_13_5_otus.tar.gz, unpack the tar file. You may find gg_13_5_otus/taxonomy/99_otu_taxonomy.txt and gg_13_5_otus/rep_set/99_otus.fasta.

There is a script: usecases/Miseq-16S/greengene-ann1.pl, please run this script to re-format greengene:

    PATH_to_cd-hit/usecases/Miseq-16S/greengene-ann1.pl -i gg_13_5_otus/taxonomy/99_otu_taxonomy.txt -j gg_13_5_otus/rep_set/99_otus.fasta -o Greengene-13-5-99.fasta

Silva database can also be used, the re-formatted Silva database can also be found from http://weizhongli-lab.org/download-data/cd-hit-otu-miseq/. Or you can download Silva from the source, e.g. https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva.fasta.gz. Then run silva-ana1.pl on SILVA_128_SSURef_Nr99_tax_silva.fasta.

    PATH_to_cd-hit/usecases/Miseq-16S/silva-ann1.pl -j SILVA_128_SSURef_Nr99_tax_silva.fasta -o SILVA_128_SSURef_processed.fasta

5. Download sample datasets
Sample datasets can be downloaded from http://weizhongli-lab.org/download-data/cd-hit-otu-miseq/. The Miseq-otu-example.tar.gz contains two Miseq 16S samples. You can download and unpack to test.

Usage of CD-HIT-OTU-MiSeq

1. Prepare fastq files and sample file
Most projects have multiple samples sequenced at the same variable regions. After your samples are sequenced, your sequencing center should give you two paired ended fastq files for each samples. Put them in a working directory in similar way as the testing datasets, where the R1.fq and R2.fq are placed in a folder for each sample. the folder name is the sample name. So in the working directory, you should have files:

  sample_name_1/R1.fq
  sample_name_1/R2.fq
  sample_name_2/R1.fq
  sample_name_2/R2.fq
  ...
  sample_name_N/R1.fq
  sample_name_N/R2.fq

2. Prepare sample file
Next is to prepare a SAMPLE_file, a text file, in the working directory. The file should look like:

  sample_name_1 R1.fq R2.fq
  sample_name_2 R1.fq R2.fq
  ...
  sample_name_N R1.fq R2.fq

3. Prepare reference database
We implemented a tool that can splice out the target amplicon region (e.g. V3-V4) from a full-length 16S rRNA reference sequence database, such as Greengene, RDP and Silva, into PE sequences. If there are multiple samples in a project sequenced with the same amplicon of same variable region, only one spliced reference database is needed. Please run:

    Path_to_cd-hit_dir/usecases/Miseq-16S/16S-ref-db-PE-splice.pl -i sample_name_1/R1.fq -j sample_name_2/R2.fq  -d Greengene-13-5-99.fasta -o gg_13_5-PE99.150-100 -p 150 -q 100 -c 0.99
    Where Greengene-13-5-99.fasta is our re-formatted Greengene sequence file. 
          -p 150 specify the effective clustering read length for R1 to be 150
          -q 100 specify the effective clustering read length for R2 to be 100
            -p and -q option need to be consistent with parameters in OTU clustering in step 4
            see next section for suggestions in choose effective clustering read length

This program will output spliced PE files gg_13_5-PE99.150-100-R1 and gg_13_5-PE99.150-100-R2.

4. Run sequence QC and OTU clustering for each sample
In the working directory, run

    PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-WF.pl -i PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-Miseq-16S.pl -s SAMPLE_file -j otu -T otu:150:100:0.97:0.0001:PATH_to-gg_13_5-PE99.150-100-R1:PATH_to-gg_13_5-PE99.150-100-R2:75 -J write-sh
    (Python user) PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-WF.py -i PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-Miseq-16S.py -s SAMPLE_file -j otu -T otu:150:100:0.97:0.0001:PATH_to-gg_13_5-PE99.150-100-R1:PATH_to-gg_13_5-PE99.150-100-R2:75 -J write-sh
    where: 150 and 100 are the effective length, 
           see next section for suggestions in choose effective clustering read length
           0.97 is the OTU clustering cutoff, 
           0.00001 is the abundance cutoff, 
           75 is the length for chimeric checking at each R1 and R2 read
           PATH_to-gg_13_5-PE99.150-100-R1 and PATH_to-gg_13_5-PE99.150-100-R2 need to be full path 
             e.g. /home/user/myproj/PATH_to-gg_13_5-PE99.150-100-R1

This command will generate shell scripts for QC and for OTU for each sample. The scripts will be in WF-sh folder. You can first run all the qc.sample_name.sh and after all these jobs finished you then run all otu.sample_name.sh

NG-Omics-WF.pl (NG-Omics-WF.py) https://github.com/weizhongli/ngomicswf is a very powerful workflow and pipeline tool developed in our group. It is not fully released yet, since we need more time to document this tool. However, you can try to use NG-Omics-WF.pl (NG-Omics-WF.py) to automatically run all your samples. First edit NG-Omics-Miseq-16S.pl (NG-Omics-Miseq-16S.py) and modify cores_per_node around line #36 to match the number of CPU cores of your computer, then run

    nohup PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-WF.pl -i PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-Miseq-16S.pl -s SAMPLE_file -j otu -T otu:150:100:0.97:0.0001:PATH_to-gg_13_5-PE99.150-100-R1:PATH_to-gg_13_5-PE99.150-100-R2:75 &
    (Python user) nohup PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-WF.py -i PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-Miseq-16S.py -s SAMPLE_file -j otu -T otu:150:100:0.97:0.0001:PATH_to-gg_13_5-PE99.150-100-R1:PATH_to-gg_13_5-PE99.150-100-R2:75 &

After the job finished, the OTU results will be in sample_name/otu folder, important files include

    OTU.clstr: file lists all clusters and sequences
    chimeric-small-clusters-list.txt: list of chimeric reads and low abundance reads not used

5. Pool all samples together
If you have multiple samples, you don't just want to stop here. It is important to pool all sample together and re-run OTU clustering so that all samples can be compared, run

    PATH_to_cd-hit-dir/usecases/Miseq-16S/pool_samples.pl -s SAMPLE_file -o pooled 

This will pool sequences from all samples. We can handle hundred and more sample without problem.

6. Cluster pooled samples, run

    PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-WF.pl -i PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-Miseq-16S.pl -S pooled -j otu-pooled -T otu:150:100:0.97:0.0001:PATH_to-gg_13_5-PE99.150-100-R1:PATH_to-gg_13_5-PE99.150-100-R2:75 -J write-sh
    (Python user) PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-WF.py -i PATH_to_cd-hit-dir/usecases/Miseq-16S/NG-Omics-Miseq-16S.py -S pooled -j otu-pooled -T otu:150:100:0.97:0.0001:PATH_to-gg_13_5-PE99.150-100-R1:PATH_to-gg_13_5-PE99.150-100-R2:75 -J write-sh

This command will generate a script WF-sh/otu-pooled.pooled.sh, you can run this sh script. When it is finished, OTUs will be in the pooled directory:

  • OTU.clstr: file list all clusters and sequences from all samples in CD-HIT format
  • OTU.txt: spread sheet list number of sequences in each OTU for each sample, it also show annotation for each OTU.
  • chimeric-small-clusters-list.txt: list of chimeric reads and low abundance reads not used

Choose effective clustering read length

The key of this method is to use the high quality portion of reads from both R1 and R2, so how to choose effective clustering read length depends on the actual quality of the PE reads. In our paper five pairs of effective clustering read lengths (225, 175), (200, 150), (175, 125), (150, 100) and (125, 75) were selected for samples sequenced at V34 or V45. Two pairs of effective clustering read lengths (150, 100) and (125, 75) were used for samples of V4 region. All these settings gave good results.

You can try some different settings and compare the results. Also, programs such as FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) can be used to scan the raw reads to help choose the effective clustering read length of R1 and R2.

Web server

A web server where users can upload fastq files to run cd-hit-otu-miseq is currently available at http://weizhong-lab.ucsd.edu:8088/. This server maybe temporary. A video guide is available at https://youtu.be/CKkYHCXBLaU.

Reference

  • Weizhong Li and Yuanyuan Chang. CD-HIT-OTU-MiSeq, an Improved Approach for Clustering and Analyzing Paired End MiSeq 16S rRNA Sequences. bioRXiv (2017) doi: https://doi.org/10.1101/153783. http://biorxiv.org/content/early/2017/06/22/153783