Meta-Barcoding Dual-Indexing

About

This is a collection of the computational methods used for the publication Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach . Here we provide all custom scripts and commands to replicate our results. Furthermore you can download the pre-computed training sets for the RDPclassifier here and UTAX here. If you find this information useful please consider citing our article.

Dependencies

This is a list of software tools and versions used in the original analysis. We provide precomputed results for most steps so don’t worry if you can’t get a certain tool in a certain version. If you get substantially different results (e.g. using other versions/tools) feel free to contact me. You can also do so by opening an Issue here on GitHub.

Hardware

Most of the analyses were performed on a Laptop with

Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz Processor
16GB RAM
Ubuntu 14.04 operating system

Only the RDP training was performed on a machine with more RAM available.

Third party tools

ITS2 database (Version 4.0.0 and inofficial release (included))
UTAX (Version usearch v8.0.1477_i86linux32) ATTENTION: does not work with versions >=8.1 due to changes in the utax algorithm
RDPclassifier (Version master from GitHub (63c637b2c8b3b941cf021d1e549caec0041b4c3c, 2015-02-20))
NCBI::Taxonomy - perl module (Version v0.70.5, doi:10.5281/zenodo.17375)
R (Version 3.1.2 (2014-10-31) – “Pumpkin Helmet”, Platform: x86_64-pc-linux-gnu (64-bit))
perl (Version v5.20.2 built for x86_64-linux-gnu-thread-multi)
QIIME (Version 1.8.0+dfsg-4)
fastq-join (Version 1.01.759)

Quick Start

If you want to start to classify your own data right now, this is the easiest way to do it:

Make sure you have installed perl, usearch, fastq-join (and RDPclassifier if needed).
Put all your fastq read files in an empty directory
Execute the following commands:

# git clone the repository
git clone https://github.com/iimog/meta-barcoding-dual-indexing
# alternatively you can download and extract the release

cd meta-barcoding-dual-indexing
mkdir myanalyses
cd myanalyses
# extract the plant reference databases for classification
tar xzvf ../training/rdp/rdp_trained.tar.gz
tar xzvf ../training/utax/utax_trained.tar.gz

# this step is not required but recommended
# if you choose to skip it use .fa instead of .udb when calling classify_reads.pl
usearch8.0 -makeudb_usearch utax_trained/viridiplantae_all_2014.utax.fa\
 -output utax_trained/viridiplantae_all_2014.utax.udb

# call the wrapper script that performs:
# joining, filtering, classification and aggregation for all samples
perl ../code/classify_reads.pl --out results <path_to_reads>/*.fastq\
 --utax-db utax_trained/viridiplantae_all_2014.utax.udb\
 --utax-taxtree utax_trained/viridiplantae_all_2014.utax.tax\
 --rdp --rdp-jar <path_to_RDPTools>/classifier.jar\
 --rdp-train-propfile rdp_trained/its2.properties

Workflow

Preparation of Reference Database

Data

The ITS2 sequences can be retrieved from the ITS2 database. The previous reference db consisted of all “Direct folds & Homology modeled” sequences from “Viridiplantae” (db version v4.0.0, 2011). This file viridiplantae_folds_2011.fasta is included for comparison. The new reference database consists of all sequences from “Viridiplantae” of an inofficial ITS2 database release (2014). This file viridiplantae_all_2014.fasta is also included in this repository.

Taxonomic assignment

Assign NCBI taxonomy on kingdom, phylum, class, order, family, genus and species level by mapping gi to taxid.

ATTENTION: Results of the taxonomic assignment may vary depending on the date of calculation as NCBI Taxonomy is continuously revised. The viridiplantae_all_2014 dataset was processed on <2015-05-28 Do> and the viridiplantae_folds_2011 dataset on <2015-06-29 Mo>.

Create an analysis folder and execute the following commands there (you need the NCBI::Taxonomy module for this):

# First create a list of gi numbers from the fasta file
# This only works if the header is in the form
# '>123456 any description' with 123456 being the taxid without any prefix like 'gi|'
# If your fasta headers have a different format adjust the substitude expression accordingly.
grep "^>" ../data/viridiplantae_all_2014.fasta |
 perl -pe 's/^>(\d+).*/$1/' >viridiplantae_all_2014.gis

# Now find taxonomic lineages for the gis
perl ../code/gi2taxonomy.pl\
 --gis viridiplantae_all_2014.gis\
 --out viridiplantae_all_2014.tax\
 --species viridiplantae_all_2014.species.taxids\
 --genus viridiplantae_all_2014.genus.taxids

# This is only needed for comparison of the old reference db to the new one
grep "^>" ../data/viridiplantae_folds_2011.fasta |
 perl -pe 's/^>(\d+).*/$1/' >viridiplantae_folds_2011.gis
perl ../code/gi2taxonomy.pl\
 --gis viridiplantae_folds_2011.gis\
 --out viridiplantae_folds_2011.tax\
 --species viridiplantae_folds_2011.species.taxids\
 --genus viridiplantae_folds_2011.genus.taxids

This generates the following files:

and

All of those are also included in the precomputed folder.

ATTENTION If the gi2taxonomy.pl command throws the following error message:

20xx/xx/xx xx:xx:xx Unable to open taxonomic database at './t/data//gi_taxid.bin'
Unable to open taxonomic database at './t/data//gi_taxid.bin' at /xxx/xxx/NCBI-Taxonomy/lib//NCBI/Taxonomy.pm line 162

You have to download an NCBI Taxonomy dump by running:

<in NCBI::Taxonomy dir>: ./make_gi_taxid.pl --overwrite

And then adjust the $TAXDIR variable in NCBI-Taxonomy/lib/NCBI/Taxonomy.pm line 28.

UTAX and RDP training

The following commands executed in the analysis folder generate the required fasta and tax files for RDP and UTAX:

perl ../code/tax2rdp_utax.pl viridiplantae_all_2014.tax\
 ../data/viridiplantae_all_2014.fasta viridiplantae_all_2014

This generates the following files:

The first three are also included in the precomputed folder. And the last two are included in the training/utax folder. The required file format changed in the new versions of usearch. The compatible file is included in

viridiplantae_all_2014.sintax.fa

The utax files are ready to be used for classification. However to speed up the initial step a udb file can be created as follows (not needed for SINTAX):

usearch8.0 -makeudb_usearch viridiplantae_all_2014.utax.fa\
 -output viridiplantae_all_2014.utax.udb

This creates the file viridiplantae_all_2014.utax.udb which is not included as it is not required and its size is 225MB. To train the RDPclassifier execute the following commands (warning for the train command 16GB RAM did not suffice, but 32 did):

mkdir rdp_trained

java -jar classifier.jar rm-dupseq --infile viridiplantae_all_2014.rdp.fa\
 --outfile viridiplantae_all_2014.rdp.rm-dupseq.fa\
 --duplicates --min_seq_length 150

java -jar classifier.jar rm-partialseq viridiplantae_all_2014.rdp.fa\
 viridiplantae_all_2014.rdp.rm-dupseq.fa\
 viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa\
 --alignment-mode overlap --min_gaps 50 --knn 20

java -Xmx32g -jar classifier.jar train --out_dir rdp_trained\
 --seq viridiplantae_all.rdp.rm-dupseq.rm-partialseq.fa\
 --tax_file viridiplantae_all.rdp.tax

cp data/its2.properties rdp_trained/its2.properties

This generates the following files:

All of those are also included in the precomputed folder. And the folder rdp_trained including five files:

rdp_trained/bergeyTrainingTree.xml
rdp_trained/genus_wordConditionalProbList.txt
rdp_trained/its2.properties
rdp_trained/wordConditionalProbIndexArr.txt
rdp_trained/logWordPrior.txt

Those are the files required for RDP classification and are included as rdp_trained.tar.gz in training/rdp

Now you have everything you need to classify sequences with either RDP classifier or UTAX/SINTAX.

SINTAX classification

Sintax is not yet included in the classify_reads.pl script. To classify your reads run this on preprocessed (paired-end merged, primer-trimmed and filtered) fasta files:

for f in <path_to_reads>/*.fasta
do
out=$(basename $f .fasta)
usearch10 -sintax $f \
          -db precomputed/viridiplantae_all_2014.sintax.fa \
          -tabbedout $out.sintax.txt \
          -strand both \
          -sintax_cutoff 0.90
done

Comparison of new database to old

Sequence increase

The number of sequences 2011 and 2014 can be calculated by using grep on header lines in the fasta files:

old=$(grep -c "^>" data/viridiplantae_folds_2011.fasta)
new=$(grep -c "^>" data/viridiplantae_all_2014.fasta)
increase=$(printf %.0f $(echo "100*$new/$old - 100" | bc -l))
echo "Sequences_2011: $old"
echo "Sequences_2014: $new"
echo "Increase: $increase%"

Sequences_2011:	73879
Sequences_2014:	182505
Increase:	147%

ATTENTION: You may notice the discrepancy between 73,879 and the 73,853 reported in the publication. The difference of 26 sequences is due to the fact that no taxonomy could be assigned to those 26 sequences at the time of training (of the first reference database). Those sequences have therefore been excluded.

Just to be sure:

printf %.0f%% $(echo "100*182505/73853 - 100" | bc -l)

147%

Species increase

The number of species can be calculating by counting the lines in *.specis.taxids which is a uniq list.

old=$(cat precomputed/viridiplantae_folds_2011.species.taxids | wc -l)
new=$(cat precomputed/viridiplantae_all_2014.species.taxids | wc -l)
increase=$(printf %d $(echo "100*$new/$old - 100" | bc -l))
echo "Species_2011: $old"
echo "Species_2014: $new"
echo "Increase: $increase%"

Species_2011:	37403
Species_2014:	72325
Increase:	93%

Bavaria/USA coverage

Retrieval of checklists

To assess the completeness of species and genera in the reference database in respect to known plant species in Bavaria and the USA lists of taxa were obtained from bayernflora.de (<2015-01-30 Fr>) and BISON (<2015-02-13 Fr>). In the analysis folder execute the following commands:

mkdir flora_bavaria flora_usa
cd flora_bavaria
../../code/get_taxa_bayern.sh
cd ../flora_usa
../../code/get_taxa_bison.sh

This generates the following files in analysis/flora_bavaria

And for each state of the USA the following files in analysis/flora_usa

<fips>.checklist
<fips>.genus
<fips>.genus.taxids
<fips>.genus.tsv
<fips>.species
<fips>.species.taxids
<fips>.species.tsv

The results may vary depending on the date of data retrieval, therefore those files are included in the precomputed folder.

Comparisons of checklists to reference database

Bavaria

Attention: There was an issue with the precomputed checklists for bavaria, see: #5. After fixing those and using the most recent version of the bayern-flora checklist (July 2018) you end up with the following numbers:

SPECIES_BAVARIA=$(cat flora_bavaria/bayern.species.cleaned.taxids | wc -l)
COMMON_OLD=$(cat viridiplantae_folds_2011.species.taxids flora_bavaria/bayern.species.cleaned.taxids | sort | uniq -d | wc -l)
COMMON_NEW=$(cat viridiplantae_all_2014.species.taxids flora_bavaria/bayern.species.cleaned.taxids | sort | uniq -d | wc -l)
echo Bavaria Species 2014 $(printf %.1f $(echo "100 * $COMMON_NEW/$SPECIES_BAVARIA" | bc -l))%
echo Bavaria Species 2011 $(printf %.1f $(echo "100 * $COMMON_OLD/$SPECIES_BAVARIA" | bc -l))%
GENERA_BAVARIA=$(cat flora_bavaria/bayern.genus.taxids | wc -l)
COMMON_OLD=$(cat viridiplantae_folds_2011.genus.taxids flora_bavaria/bayern.genus.taxids | sort | uniq -d | wc -l)
COMMON_NEW=$(cat viridiplantae_all_2014.genus.taxids flora_bavaria/bayern.genus.taxids | sort | uniq -d | wc -l)
echo Bavaria Genus 2014 $(printf %.1f $(echo "100 * $COMMON_NEW/$GENERA_BAVARIA" | bc -l))%
echo Bavaria Genus 2011 $(printf %.1f $(echo "100 * $COMMON_OLD/$GENERA_BAVARIA" | bc -l))%

Bavaria	Species	2014	73.6%
Bavaria	Species	2011	48.7%
Bavaria	Genus	2014	86.6%
Bavaria	Genus	2011	71.3%

USA

To get a list of species and genus coverage for each state execute the following in the analysis folder:

(echo -e "Fips\tSpecState\tSpec2011\tSpec2014\tGenusState\tGenus2011\tGenus2014"
for i in $(seq 1 56)
do  
    # Excludes 3, 7, 14, 43 and 52.
    if [ "$i" -eq 3 ] || [ "$i" -eq 7 ] || [ "$i" -eq 14 ] || [ "$i" -eq 43 ] || [ "$i" -eq 52 ]
    then
        continue      # Those fips are not used
    fi
    i=$(printf "%02d" $i)
    STATE_SPEC=$(cat flora_usa/$i.species.taxids | wc -l)
    STATE_GENUS=$(cat flora_usa/$i.genus.taxids | wc -l)
    COMMON_SPEC_2011=$(cat viridiplantae_folds_2011.species.taxids flora_usa/$i.species.taxids | sort | uniq -d | wc -l)
    COMMON_GENUS_2011=$(cat viridiplantae_folds_2011.genus.taxids flora_usa/$i.genus.taxids | sort | uniq -d | wc -l)
    COMMON_SPEC_2014=$(cat viridiplantae_all_2014.species.taxids flora_usa/$i.species.taxids | sort | uniq -d | wc -l)
    COMMON_GENUS_2014=$(cat viridiplantae_all_2014.genus.taxids flora_usa/$i.genus.taxids | sort | uniq -d | wc -l)
    echo -e "$i\t$STATE_SPEC\t$COMMON_SPEC_2011\t$COMMON_SPEC_2014\t$STATE_GENUS\t$COMMON_GENUS_2011\t$COMMON_GENUS_2014"
done) >flora_usa/states.common.tsv

This creates the file

flora_usa/states.common.tsv

which is also included in the precomputed/flora_usa folder.

This file is further analysed with R:

data=read.table("states.common.tsv", header=T, sep="\t")
print(summary(data$Spec2014/data$SpecState))
print(summary(data$Genus2014/data$GenusState))

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Species coverage	0.665	0.750	0.761	0.756	0.766	0.791
Genera coverage	0.738	0.832	0.849	0.840	0.858	0.873

Number of genera per order (Supplement)

All orders

The number of genera per order in the old reference database and the new one were calculated with the following commands:

cat viridiplantae_folds_2011.tax | grep "Viridiplantae" | perl -pe 's/.*(o__[^;]+);.*(g__[^;]+);.*/$1\t$2/' | sort -u | grep -v undef | datamash -g 1 count 2 >2011_genera_per_order
cat viridiplantae_all_2014.tax | grep "Viridiplantae" | perl -pe 's/.*(o__[^;]+);.*(g__[^;]+);.*/$1\t$2/' | sort -u | grep -v undef | datamash -g 1 count 2 >2014_genera_per_order
echo -e "Order\ttaxid\told\tnew" >increase_genera_per_order.tsv
join -t$'\t' -a1 2014_genera_per_order 2011_genera_per_order | perl -pe 's/^([^\s]+\t\d+)$/$1\t0/' | perl -F"\t" -ane 'chomp $F[2];print "$F[0]\t$F[2]\t$F[1]\n"' | sed 's/o__//;s/_/\t/' >>increase_genera_per_order.tsv
join -t$'\t' -v2 2014_genera_per_order 2011_genera_per_order | perl -pe 's/\n/\t0\n/;s/o__//;s/_/\t/' >>increase_genera_per_order.tsv

The created files:

are included in the precomputed folder.

Creation of the latex table

cat <<EOF >additional_file2.tex
\documentclass{article}
\usepackage{tabu}
\usepackage{longtable}
\newcolumntype{R}{>{\raggedleft\arraybackslash}X}
\usepackage{booktabs}
\renewcommand{\thetable}{S\arabic{table}}%

\begin{document}

\begin{longtabu}{lXRR}
\caption{Comparison of the number of genera per order for all orders.}\\\\
\toprule
Order & TaxID & Genera old & Genera new \\\\
\midrule
\endhead
EOF

join -t$'\t' -a1 2014_genera_per_order 2011_genera_per_order | perl -pe 's/^([^\s]+\t\d+)$/$1\t0/' | perl -F"\t" -ane 'chomp $F[2];print "$F[0]\t$F[2]\t$F[1]\n"' | sed 's/o__//;s/_/\t/;' | perl -pe 's/\t/ & /g;s/\n/\\\\\n/' >>additional_file2.tex
join -t$'\t' -v2 2014_genera_per_order 2011_genera_per_order | perl -pe 's/\n/\t0\n/;s/o__//;s/_/\t/' | perl -pe 's/\t/ & /g;s/\n/\\\\\n/' >>additional_file2.tex

cat <<EOF >>additional_file2.tex
\bottomrule
\end{longtabu}
\end{document}
EOF

pdflatex additional_file2.tex
pdflatex additional_file2.tex

The created tex file is included in the precomputed folder

additional_file2.tex

Sequences for selected groups

The increase of sequences for a number of selected groups can simply be determined by:

cat <<EOF >additional_file3.tex
\documentclass{article}
\usepackage{tabu}
\usepackage{longtable}
\newcolumntype{R}{>{\raggedleft\arraybackslash}X}
\usepackage{booktabs}
\setcounter{table}{1}
\renewcommand{\thetable}{S\arabic{table}}%

\begin{document}

\begin{longtabu}{XRR}
\caption{Comparison of the number of sequences per group for selected taxonomic groups.}\\\\
\toprule
Group & old & new \\\\
\midrule
\endhead
EOF

echo "Vitaceae & "$(grep -c Vitaceae viridiplantae_folds_2011.tax)" & "$(grep -c Vitaceae viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Heracleum} & "$(grep -c Heracleum viridiplantae_folds_2011.tax)" & "$(grep -c Heracleum viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Carduus} & "$(grep -c Carduus viridiplantae_folds_2011.tax)" & "$(grep -c Carduus viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Phacelia} & "$(grep -c Phacelia viridiplantae_folds_2011.tax)" & "$(grep -c Phacelia viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Convolvulus} & "$(grep -c Convolvulus viridiplantae_folds_2011.tax)" & "$(grep -c Convolvulus viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Helianthus} & "$(grep -c Helianthus viridiplantae_folds_2011.tax)" & "$(grep -c Helianthus viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex

cat <<EOF >>additional_file3.tex
\bottomrule
\end{longtabu}
\end{document}
EOF

pdflatex additional_file3.tex
pdflatex additional_file3.tex

The created tex file is included in the precomputed folder

additional_file3.tex

Analysis of Pollen Samples

Data

Create a folder called raw and download data from EBI SRA repository project accession number PRJEB8640. Extract into separate .fastq files (two for each sample). I assume your directory contains all the samples in the following form: <SampleName>_S<SampleNr>_L001_R<1|2>_001.fastq e.g. PoJ1_S1_L001_R1_001.fastq Where R1 is the file containing forward reads and R2 the file containing reverse reads for each sample. This can be accomplished by loading the list of archives from EBI:

wget http://www.ebi.ac.uk/ena/data/warehouse/filereport\?accession\=PRJEB8640\&result\=read_run\&fields\=study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy\&download\=txt -O reads.tsv

This downloads the reads.tsv file which is also included in the precomputed folder. It contains a list of the 384 sequencing libraries of this project. This file can now be used to download all the reads with the following command executed in the raw folder:

for i in $(cut -f13 reads.tsv | grep fastq.gz | perl -pe 's/;/\n/')
do
    wget $i
done
gunzip *.gz
# Fix typo - lowercase j in some samples:
rename 's/Poj/PoJ/' *.fastq

Now your folder should contain 768 .fastq files in the format described above.

Preprocessing

joining

In the raw folder create a subfolder joined and execute the following commands

qiime
for i in ../*_R1_001.fastq
do
    BASE=$(basename $i _R1_001.fastq)
    join_paired_ends.py -f $i -r ../${BASE}_R2_001.fastq -o $BASE
done

This creates a folder for each sample in the form <SampleName>_S<SampleNr>_L001 containing three files:

fastqjoin.join.fastq
fastqjoin.un1.fastq
fastqjoin.un2.fastq

Q20 filtering

In the raw folder create a subfolder filtered and execute the following commands

for i in ../joined/*
do
    BASE=$(basename $i)
    usearch8.0 -fastq_filter $i/fastqjoin.join.fastq\
     -fastq_truncqual 19 -fastq_minlen 150 -fastqout $BASE.q20.fq
done

Now you have one .fq file for each sample in the following form <SampleName>_S<SampleNr>_L001.q20.fq with joined and quality filtered reads.

Classification

UTAX

In the raw folder create a subfolder utax and execute the following commands: You can use viridiplantae_all_2014.utax.udb instead of viridiplantae_all_2014.utax.fa if you generated the udb file in the previous steps.

for i in $(find ../filtered -name "*.fq")
do   
    BASE=$(basename $i .fq)
    usearch8.0 -utax $i -db ../../training/utax/viridiplantae_all_2014.utax.udb\
     -utax_rawscore -tt ../../training/utax/viridiplantae_all.utax.tax\
     -utaxout $BASE.utax
done

This way you end up with a .utax file for each sample containing the utax classification. Create a subfolder called counts and there execute this:

for i in ../*.utax
do
    BASE=$(basename $i .utax)
    perl ../../../code/count_taxa_utax.pl --in $i --cutoff 20 >$BASE.count
done

Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:

perl ../../../code/aggregate_counts.pl *.count >utax_aggregated_counts.tsv
perl -i -pe 's/(PoJ\d+)_S\d+_L001\.q20\.count/$1/g' utax_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' utax_aggregated_counts.tsv >utax_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' utax_aggregated_counts.tsv >utax_tax_table

The files

are included in the precomputed folder

RDP classifier

In the raw folder create a subfolder rdp and execute the following commands:

for i in $(find ../filtered -name "*.fq")
do
    BASE=$(basename $i .fq)
    java -jar classifier.jar classify\
     --train_propfile ../../training/rdp/rdp_trained/its2.properties\
     --outputFile $BASE.rdp $i
done

This way you end up with a .rdp file for each sample containing the RDP classification. Create a subfolder called counts and there execute this:

for i in ../*.rdp
do
    BASE=$(basename $i .rdp)
    perl ../../../code/count_taxa_rdp.pl --in $i --cutoff 0.85 >$BASE.count
done

Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:

perl ../../../code/aggregate_counts.pl *.count >rdp_aggregated_counts.tsv
perl -i -pe 's/(PoJ\d+)_S\d+_L001\.q20\.count/$1/g' rdp_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' rdp_aggregated_counts.tsv >rdp_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' rdp_aggregated_counts.tsv >rdp_tax_table

The files

are included in the precomputed folder

Read counts

The reads are directly counted on the fastq files with the following commands in the analysis folder:

grep -c "^+$" ../raw/*_R1_001.fastq | sed 's/..\/raw\///;s/_R1_001.fastq:/\t/' >read_count_raw.tsv
echo -e "Sum\tMean\tSD\tMedian"
cat read_count_raw.tsv | datamash sum 2 mean 2 sstdev 2 median 2

Sum	Mean	SD	Median
11624087	30271	11373	30900

The created file:

read_count_raw.tsv

is also included in the precomputed folder.

To get the counts for filtered reads (rare taxa removed) use this R code (in the analysis folder):

library(phyloseq)
data = read.table("utax_otu_table", sep="\t", header=T, row.names=1)
otu = otu_table(data, taxa_are_rows=T)
otu_rel = transform_sample_counts(otu, function(x) x/sum(x))
otu_table(otu)[otu_table(otu_rel)<0.001]<-0
summary(colSums(otu))
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      7   11000   15740   15580   19650   36940 
sum(colSums(otu))
# [1] 5984543
sd(colSums(otu))
# [1] 6597.562

Sum	Mean	SD	Median
5984543	15580	6598	15740

Species accumulation curves

The following R code can be used to create the species accumulation curves:

library(vegan)
library(phyloseq)
data = read.table("utax_otu_table", sep="\t", header=T, row.names=1)
map = import_qiime_sample_data("../data/mapFile.txt")
otu = otu_table(data, taxa_are_rows=T)
otu_rel = transform_sample_counts(otu, function(x) x/sum(x))
otu_table(otu)[otu_table(otu_rel)<0.001]<-0
phy = merge_phyloseq(otu, map)

trunc = subset_samples(phy, BeeSpecies == "H.truncorum")
rufa = subset_samples(phy, BeeSpecies == "O.rufa")

veganotu <- function(physeq) {
    require("vegan")
    OTU <- otu_table(physeq)
    if (taxa_are_rows(OTU)) {
        OTU <- t(OTU)
    }
    return(as(OTU, "matrix"))
}

trunc.v = veganotu(trunc)
rufa.v = veganotu(rufa)

pdf("Figure2.pdf")
par(mfrow =c(1,2))
par(mar=c(3,3,1,1)+0.1, pin= c(2.73, 2.73))
rarecurve(rufa.v, step = 1, xlab = "", ylab = "",label = FALSE, xlim =c(-0.2, 5000), ylim = c(-0.2, 90), lwd = 0.5)
title(ylab = "No. Taxa", line= 2)
title(xlab = "Sequencing Depth [reads]", line = 2)
text(x = 100, y = 87, "a", cex = 2)

rarecurve(trunc.v, step = 1, xlab = "", ylab = "", label = FALSE, xlim = c(-0.2, 5000), ylim = c(-0.2, 90), lwd = 0.5)
title(ylab = "No. Taxa", line= 2)
title(xlab = "Sequencing Depth [reads]", line = 2)
text(x=100, y = 87, "b", cex = 2)
dev.off()

Comparison of utax and RDP

This code executed in the analysis folder compares the assignment of RDP and UTAX on the genus level (ignoring confidence values):

pv rdp/*.rdp | cut -f1,21 | grep -v undef | perl -pe 's/g__//;s/_/\t/' | sort -k 1b,1 >PoJ.genus.rdp
pv utax/*.utax | perl -pe 's/,/\t/g' | cut -f1,7 | grep -v undef | perl -pe 's/g__//;s/\(.*\)//;s/_/\t/' | sort -k 1b,1 >PoJ.genus.utax

printf %.1f%% $(echo "100 * " $(join PoJ.genus.rdp PoJ.genus.utax | cut -f2,4 -d" " | perl -F"\s" -ane '$g++;chomp $F[1];$c++ if($F[0] eq $F[1]);END{$e=$c/$g;print "$c / ( $g + "}') $(join PoJ.genus.rdp PoJ.genus.utax -v1 | wc -l) " + " $(join PoJ.genus.rdp PoJ.genus.utax -v2 | wc -l) ")" | bc -l)

90.3%

The two files:

are also included in the precomputed folder (as gzipped archives).

Comparison to flowering data

The file data/genera_flowering contains a list of genera found near the sampling plots. The following R code calculates the fraction of reads in all samples (with rare taxa removed) that belong to genera listed in the genera_flowering file:

library(phyloseq)
otu = otu_table(read.table("utax_otu_table", sep="\t", header=T, row.names=1), taxa_are_rows=T)
otu_rel = transform_sample_counts(otu, function(x) x/sum(x))
# remove rare taxa from each sample in otu
otu_table(otu)[otu_table(otu_rel)<0.001]<-0
tax = tax_table(as.matrix(read.table("utax_tax_table", sep="\t", fill=T, row.names=1)))
otu = merge_phyloseq(otu, tax)
# remove taxa that have only 0 counts after rare filtering and restriction to PoJ
otu_pruned = prune_taxa(rowSums(otu_table(otu))>0, otu)
# accumulate at genus level (ignoring species names)
otu_pruned_glom = tax_glom(otu_pruned, taxrank="V7")
write.table(as.factor(tax_table(otu_pruned_glom)[,6]),file="utax_genera_pruned_glom",quote=F,row.names=F,col.names=F)
flowering = c(read.table("../data/genera_flowering", stringsAsFactors=F))
# Remove undefined genera from the total set
otu_pruned_glom_noundef = subset_taxa(otu_pruned_glom, V7 != "g__undef_")
flowering_genera = tax_table(otu_pruned_glom_noundef)[,6] %in% paste("g__",flowering$V1, sep="")
100 * sum(otu_table(otu_pruned_glom_noundef)[flowering_genera,]) / sum(otu_table(otu_pruned_glom_noundef))

73.7%

As a side product the file:

utax_genera_pruned_glom

was created (included in the precomputed folder).

To determine the fraction of documented flowering genera also found in at least one of the samples the following code can be executed (analysis folder):

TOTAL=$(cat ../data/genera_flowering | wc -l)
COMMON=$(cat ../data/genera_flowering <(perl -pe 's/^g__//' utax_genera_pruned_glom | grep -v undef_ | sort -u) | sort | uniq -d | wc -l)
echo $COMMON" / "$TOTAL" = "$(printf %.1f $(echo "100 * "$COMMON" / "$TOTAL | bc -l))"%"

98 / 201 = 48.8%

Name		Name	Last commit message	Last commit date
Latest commit History 134 Commits
analyses		analyses
code		code
data		data
precomputed		precomputed
training		training
.gitignore		.gitignore
LICENSE		LICENSE
README.org		README.org

License

molbiodiv/meta-barcoding-dual-indexing

Folders and files

Latest commit

History

Repository files navigation

Meta-Barcoding Dual-Indexing

About

Dependencies

Hardware

Third party tools

Quick Start

Workflow

Preparation of Reference Database

Data

Taxonomic assignment

UTAX and RDP training

SINTAX classification

Comparison of new database to old

Sequence increase

Species increase

Bavaria/USA coverage

Retrieval of checklists

Comparisons of checklists to reference database

Bavaria

USA

Number of genera per order (Supplement)

All orders

Sequences for selected groups

Analysis of Pollen Samples

Data

Preprocessing

joining

Q20 filtering

Classification

UTAX

RDP classifier

Read counts

Species accumulation curves

Comparison of utax and RDP

Comparison to flowering data

About

Topics

Resources

License

Stars

Watchers

Forks

Languages