Skip to content

biomarv/16S-rRNA-gene-analysis-using-QIIME2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 

Repository files navigation

16S rRNA Pipeline

Introduction

Since we are using Qiime2 in the command line interface, basic knowledge about the shell is required. There are several tutorials out there to get into it (i.e. http://linuxcommand.org/lc3_learning_the_shell.php). When working with bioinformatical data, command line tools are very powerful and makes data processing a lot more effective than using a graphical interface.

Set up your working directory

Create project folder in your home directory. In your project folder create a folder named i.e. raw_data for the .fq files. Then change into the project folder and check with the ls command the content.

mkdir ~/PROJECT_NAME
mkdir ~/PROJECT_NAME/raw_data 
cd PROJECT_NAME/
pwd

Rawdata

In general, data from the sequencing facility comes already demultiplexed, barcodes are already removed. Beside our files of interest, .fastq.gz (compressed fasta files with Quality information), sometimes already fastqc-processed files a delivered, too. Thus, a fastqc-report stored as n .html file (with its corresponding .zip folders) is provided. For paired-end data, we should find two files differing only in the read no.:

J34142_S41_L001_R1_001.fastq.gz
J34142_S41_L001_R2_001.fastq.gz

The underscore separated filename is composed of:

Element Description
J34142 Samplename provided by the samplesheet. This is mostly given by the sequencing facility
S41 Sample number is based on the + order in the samplesheet
L001 Lane number; always the same +
R1 Read 1. If paired-end reads, th+ ere will be a read 2.
001 Last segment is set number: al+ ways 001 + + + + +

FastQC

When checking the quality reports of FastQC, one must be aware of the origin and processing of the data. The evaluation of the quality is somewhat biased to a purpose. Looking into an .html report, FastQC comes with 11 checkpoints, hightling if failed or passed not considering the input data. Starting with basic statistics Encoding: Illumina 1.9 tells us, quality format is encoded in Phred+33. Total Sequence count should be congruent in forward and reverse reads.

  • Sequence length distribution: Should show one peak.
  • Per sequence GC content content: Sharp shoulder may indicate existing adapters, primer or rRNA.
  • Per base sequence content: Nucleotide frequency for each base at each position. As each sequence, in 16S amplicon sequencing.
  • Sequence duplication level and Overrepresented sequences: For 16S amplicon sequencing those two criteria should not be considered as valuable quality parameters since we are sequencing one gene. Thus, observing similar sequences is expected.

MultiQC

Additionally, you can check all fastQC-reports with multiQC, which incorporates all reports into a single report. By moving into the report-folder multiqc only needs an input folder as an argument. For further options use multiqc -h

multiqc .

Set the environment

To activate the QIIME 2 environment use the following code:

conda activate qiime2-2021.2

To deactivate the environment use conda deactivate. Working in this conda environment makes sure, that the correct version of Python is used running QIIME2 commands.

QIIME2

To find out more about QIIME2 use the well documented source https://docs.qiime2.org/2021.8/ with several useful tutorials. Clearly, we are using QIIME2 in the command-line interface (CLI), although there is a graphical use available. Indispensible, is knowledge about the difference between .qza and .qzv, the qiime artifacts. The ladder describes any file content that can be visualized. The .qza ending is a file were data is stored, processed in and QIIME knows about type and format. Those are the files we are dealing with when processing our data. While in bioinformatics a lot of different file types exist which one needs to handle, this is one advantage of QIIME, so the user can focus on data processing. Another advantage of QIIME2 is, that every file keeps track of its provenance, so the workflow of the pipeline is always tracked. There are plenty of tools within qiime to manipulate data and the first to begin with is to import our data into a QIIME2 artifact. But first a metadatafile is needed.

Set up your metadata file

First, a metadata file has to be generated. Most of the times it already exist and just needs some modification or file transformation so qiime can handle it and integrate it in the pipeline. The first column in the metadata file has to be a unique identifier for the sample and should be one of the following values:

  • id
  • sampleid
  • sample-id

The rest of the metadata information are given by the experiment and can be of categorical or numerical type.

id environment diet sampling_data ...
sample_01 feaces X 2050-05-25 ...
sample_02 feaces Y 2050-05-25 ...
... ... ... ... ...
sample_40 water nA 2050-05-25 ...
... ... ... ... ...
sample_50 pos_control nA 2050-05-25 ...
... ... ... ... ...
sample_55 extr_control nA 2050-05-25 ...

More information about metadata in qiime2 here The file has to be saved as a tab-separated value format (.tsv) and can then be transformed into a qiime artifact file via the following command.

qiime metadata tabulate \
  --m-input-file metadata.tsv \
  --o-visualization tabulated-sample-metadata.qzv

Assign your metadata file to a variable. This file will be used more often, so using a variable comes in handy. The content of variable is accessed with a $ prefix.

METADATA="/home/user/PROJECT_NAME/metadata.tsv"
echo $METADATA
ls

Import data

mkdir qiime_files
qiime tools import \
    --type SampleData[PairedEndSequencesWithQuality] \
    --input-format CasavaOneEightSingleLanePerSampleDirFmt \
    --input-path $HOME/PROJECT_NAME/raw_data \
    --output-path /qiime_files/demux_paired_end.qza

The tools package has plenty functions to manipulate QIIME2 files. To find out more about its function you can use

qiime tools --help

The tool needs three required inputs: the --input-path to tell the program were the data is located, the --output-path were to output the QIIME2 artifact and the --type of data. In this case, we are dealing with paired-end sequencing data which also has qulity information. There are several data types which we can look up using:

qiime tools import \
    --show-importable-types

Additionally, the --input-format option is provided, which tells QIIME2 the files are in the format explained in the rawdata section.

Check out the file type which has been imported:

cd qiime_files
qiime tools peek \
    demux_paired_end.qza

Summarize data and visualization

The next command summarizes the data and generates a visualisation file which we can inspect now:

qiime demux summarize \
    --i-data demux_paired_end.qza \
    --o-visualization demux_paired_end.qzv

qiime tools view \
    demux_paired_end.qzv

This file gives you information about general sequence distribution and also an interactive quality plot.

Primer trimming

To remove adapter sequences, primers and any unwanted sequences from the reads, cutadapt is used.

qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux-paired-end.qza \
    --p-adapter-f ATTAGAWACCCBDGTAGTCC \
    --p-front-f CCTACGGGAGGCAGCAG \
    --p-adapter-r CTGCTGCCTCCCGTAGG \
    --p-front-r GGACTACHVGGGTWTCTAAT \
    --p-discard-untrimmed \
    --p-times 2 \
    --o-trimmed-sequences demux_cutadapt.qza \
    --verbose > trimming_report.log

To find out more about the primer options of this command have look at qiime cutadapt trim-primer --help. In this example data set the V3/V4 region of the 16S rRNA was used. The additional two primers are the reversed complements of the V3/V4 primer pair. The option --p-discard-untrimmed excludes reads in which no primers were found. The last option redirects the standard output (stdout) into a logfile. The output can also be visualized with the demux summarize command.

If another hypervariable region was sequences, the corresponding primer pairs needs to be adjusted!

Denoising

To denoise the reads qiime offers two packages, deblur and DADA2. Both pipelines end up with amplicon sequence variants (ASV). In this tutorial the DADA2 package will be used. You can find out more about the package in this tutorial or have a look at the publication. The general purpose of both is to correct for errors, filter phiX reads and chimeric sequences. To set the trunc option one has to inspect the quality plot from the demux_cutadapt.qza file.

qiime dada2 denoise-paired \
     --i-demultiplexed-seqs demux_cutadapt.qza \
     --p-trunc-len-f 270 \
     --p-trunc-len-r 247 \
     --p-trunc-q 5 \
     --o-table dada2_table.qza \
     --o-representative-sequences rep-seqs.qza \
     --o-denoising-stats denoising-stats.qza \
     --verbose &> dada2.log

Except for --p-trunc-q 5 all options are essential for DADA2. The truncation options for the reversed reads will be more a lower bases, because the quality drops mostly before the foreward reads. The command generates three output files.

First lets have a look at the denoising-stats.qza by running the following command:

qiime metadata tabulate \
    --m-input-file denoising-stats.qza \
    --o-visualization denoising-stats.qzv

There is also an export function in the tool package for exporting data in i.e. .tsv format.

qiime tools export \
    --input-path denoising-stats.qzv \
    --output-path denoising-stats

For each sample-id the following stats are available:

  • input: absoulute no. of input reads
  • filtered: absoulute reads filtered vie --p-options
  • percentage of input passed filter: in %
  • denoised: absoulute no. denoised reads
  • merged: absoulute no. reads left for merging
  • percentage of input merged: in %
  • non-chimeric: absoulute no. reads left from previous filter which are non-chimeric
  • percentage of input non-chimeric: in %

Based on certain criteria, DADA2 declares representative sequences for which a Feature ID, the Sequence length and the actual Sequence is denotetd. To visualize the representative sequences output:

qiime feature-table tabulate-seqs \
    --i-data rep-seqs.qza \
    --o-visualization rep-seqs.qzv

Beside some general statistics, by clicking on the sequence, the sequence will be aligned using BLAST algorithm vie NCBI.

The last output file is a feature count table, which is a very important table for further analysis. It is a two dimensional table with samples as rows and features as columns.

Sample Feature_1 Feature_2 [...]
Sample_1 120 50 [...]
Sample_2 20 200 [...]

The features are represented by the rep-seqs.qza were all sequences are stored in, which incorporate the unique bacteria (feature).

Taxonomy assignment

Now a part of the preprocessing is done. The dada2_table is somehow our key file to work with. One of the next steps is to taxonomacally classify the ASVs. By doing that, we use the SILVA databasa for classifying our ASVs.

The database is downloaded via the following command from QIIME2 website. This downloads the full length reference sequences for silva-138-99-seqs.qza to classify and the corresponding taxonomy information silva-138-99-tax.qza.

wget https://data.qiime2.org/2020.8/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2020.8/common/silva-138-99-tax.qza

If you work on our remote maschine you can find those files in the following directory:

ls /db/silva/

or using the $SILVA_DB variable which conveniently stores the path to the database.

ls $SILVA_DB

Extract reference sequences

The following function extract readsfrom the feature-classifierpackage trims the silva-seq around the givin primer sequences (note that these are the same oligos used in cutadapt) for 16S rRNA V3/V4 region. Extracting the region of interest and use those sequence with the Naive Bayes classifier has been found out to be an improvement. Additionally, max and min lenght are given to exclude sequences which are outside of the anticipated length. We output the trimmed reference sequences in our qiime_files directory.

qiime feature-classifier extract-reads \
    --i-sequences $SILVA_DB/silva-138-99-seqs.qza \
    --p-f-primer CCTACGGGAGGCAGCAG \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-max-length 505 \
    --p-min-length 250 \
    --o-reads ref-seqs.qza

Train classifier

Next, the Niave bayes classifier is trained on the extracted reference sequences using SILVAs reference taxonomy database. Alternatively, on the QIIME2 website there are already pre-trained classifier available.

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-seqs.qza \
    --i-reference-taxonomy $SILVA_DB/silva-138-99-tax.qza \
    --o-classifier classifier_naive_bayes.qza

Classify reads

This step will classify the representative sequences from the data using the taxonmic classifier to output a FeatureData object which hols taxonomical information to our rep-seqs.

qiime feature-classifier classify-sklearn \
    --i-classifier classifier_naive_bayes.qza \
    --i-reads rep-seqs.qza \
    --o-classification taxonomy.qza

The output of this command is a taxonomy.qza file. The taxonomical information was depostied for every Feature ID again is connected to the representative seqeunces in the rep-seqs file. To inspect the file, again the metadata tabulate command is used:

qiime metadata tabulate \
    --m-input-file taxonomy.qza \
    --o-visualization taxonomy.qzv

Notice, the "Download metadata TSV file" in the top left corner. There is a third column which provides information about the Confidence of the assignment.

About

QIIME2 worklflow. From raw data to a feature table.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published