Skip to content

Latest commit

 

History

History
359 lines (347 loc) · 17 KB

reproduce_results.org

File metadata and controls

359 lines (347 loc) · 17 KB

Introduction

Here we describe all steps required to reproduce the results from our publication in as much detail as possible. It should be possible to follow along this recipe, execute all the commands and arrive at the same results. However if you have difficulties reproducing the results or find any clitches in the description don’t hesitate to open an issue.

Environment

The original analyses have been performed on a machine with the following specs (except where otherwise noted):

  • Intel(R) Core(TM)2 Duo CPU E8500 @3.16GHz
  • 4GB RAM
  • Ubuntu 14.04.3 LTS 64bit

Versions

bcgTree

  • Version 1.0.4 CLI https://zenodo.org/badge/doi/10.5281/zenodo.47769.svg

external Programs

  • hmmsearch: HMMER version 3.1b1
  • muscle: v3.8.31
  • Gblocks: version 0.91b
  • RAxML: version 8.2.4 (raxmlHPC-PTHREADS-SSE3)

Data

Lactobacillales ezbiocloud

To download all Lactobacillales from ezbiocloud execute the following command in reproduce/data:

mkdir -p ezbiocloud_lactobacillales
# Search for Lactobacillales, extract project IDs, download zipped peptide fastas of the projects
curl "http://www.ezbiocloud.net/search?k=ezgenome&v=Lactobacillales" |
 grep case2 | cut -f5 -d"=" | cut -f1 -d \" | sort -u | tail -n+2 |
 xargs -n1 -I{} wget -O ezbiocloud_lactobacillales/{}.zip "http://www.ezbiocloud.net/mod_download_fasta_ezgenome.jsp?acc="{}"&mode=CDS_AA"
cd ezbiocloud_lactobacillales
for i in *.zip; do unzip $i; done
rm *.zip

# move files with errors into error subdir
mkdir -p error
for i in $(grep --color=none -lr error); do mv $i error; done

# replace offending characters in file names
rename "s/[,:\s)(;'\]\[=]/_/g" *

# create options file for bcgTree
for i in $(\ls *.fasta)
do
   # Filter empty sequences
   cat $i | SeqFilter --min-len 1 --out $i
   BASE=$(basename $i _1.0_CDS_AA.fasta)
   echo --proteome $BASE=$PWD/$i
done >ezbiocloud_lactobacillales.options

As this returns all the Lactobacillales genomes in ezbiocloud at the time you execute it the results will vary over time. In order to reproduce the tree from the publication you can use the ezbiocloud_lactobacillales.options file in reproduce/precomputed. But be aware, that this file contains the paths relative to that directory so either make them absolute or call bcgTree from there if you want to use the precomputed file.

Bacteria NCBI

Proteomes

To download the faa files from all bacteria in genomes via ftp do the following in reproduce/data:

mkdir -p ncbi
cd ncbi
# The download link has been invalidated in the meantime, it was originally:
# wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.faa.tar.gz
# now it is:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/all.faa.tar.gz
tar xvzf all.faa.tar.gz
rm all.faa.tar.gz
# Combine the peptide files of each genome
for i in *
do
    cat $i/*.faa > $i.fa
    rm -rf $i
done
rename "s/[,:\s)(;'\]\[=]/_/g" *
# Create the options file for bcgTree
for i in *.fa
do
     echo "--proteome "$(basename $i .fa)"="$PWD"/"$i
done >ncbi_all.options
cd ..

You can also use the precomputed options file in the reproduce/precomputed folder. But be aware, that this file contains the paths relative to that directory so either make them absolute or call bcgTree from there if you want to use the precomputed file.

16S rRNA

To download the 16S rRNA sequences from NCBI execute the following commands in reproduce/data:

mkdir -p ncbi_16S
cd ncbi_16S
# The download link has been invalidated in the meantime, it was originally:
# wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.frn.tar.gz
# now it is:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/all.frn.tar.gz
tar xvzf all.frn.tar.gz
rm all.frn.tar.gz
for i in *
do
    cat $i/*.frn | grep 16S | head -n1 | perl -pe 's/^>//;s/ .*//' >$i.ids
    cat $i/*.frn | ../../../SeqFilter/bin/SeqFilter - --ids $i.ids --out $i.16S.fa
    rm $i.ids
    rm -rf $i
done
rename "s/[,:\s)(;'\]\[=]/_/g" *
cd ..

This way there is exactly one 16S sequence for most of the genomes. However that of Lactobacillus_brevis_KB290_uid195560 is missing due to a lack of annotation. To get this the whole genome is downloaded and rnammer used for extraction. You need rnammer version 1.2 for that (be aware that rnammer relies on hmmsearch version 2).

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Lactobacillus_brevis_KB290_uid195560/NC_020819.fna -O Lactobacillus_brevis_KB290_uid195560.fna
rnammer -S bac -m lsu,ssu,tsu -f Lactobacillus_brevis_KB290_uid195560.rrna.fa Lactobacillus_brevis_KB290_uid195560.fna
grep 16s Lactobacillus_brevis_KB290_uid195560.rrna.fa | head -n1 | perl -pe 's/^>//;s/ .*//' >Lactobacillus_brevis_KB290_uid195560.16S.id
../../../SeqFilter Lactobacillus_brevis_KB290_uid195560.rrna.fa --ids Lactobacillus_brevis_KB290_uid195560.16S.id --out Lactobacillus_brevis_KB290_uid195560.16S.fa
perl -i -pe 's/^>/>Lactobacillus_brevis_KB290_uid195560 /' Lactobacillus_brevis_KB290_uid195560.16S.fa
rm Lactobacillus_brevis_KB290_uid195560.fna Lactobacillus_brevis_KB290_uid195560.rrna.fa Lactobacillus_brevis_KB290_uid195560.16S.id

If you have difficulties here I also provide the precomputed 16S rRNA file for Lactobacillus_brevis_KB290_uid195560. Preliminary analyses also showed that the sequence of Lactobacillus_casei_LC2W_uid162121 was not really 16S due to misannotation. So to get the real 16S sequence execute the following commands:

wget ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Lactobacillus_casei_LC2W_uid162121/NC_017473.frn
echo "ref|NC_017473|:258849-260429|16S" | SeqFilter --ids - NC_017473.frn --out Lactobacillus_casei_LC2W_uid162121.16S.fa
rm NC_017473.frn

INPG Code

Case study: Lactobacillus phylogeny

The full list of Lactobacillales genomes from ezbiocloud contains 2225 genomes. Of those 622 from the genus Enterococcus and 1188 from the genus Streptococcus. To reduce the size of the tree all genomes from ezbiocloud except those of Enterococcus and Streptococcus have been used and only 50 random genomes of each of those genera have been added: The commands to get this set were (in reproduce/data/ezbiocloud_lactobacillales)

grep -vP "Enterococcus|Streptococcus" ezbiocloud_lactobacillales.options >ezbiocloud_lactobacillales_reduced.options
grep "Enterococcus" ezbiocloud_lactobacillales.options | shuf | head -n50 >>ezbiocloud_lactobacillales_reduced.options
grep "Streptococcus" ezbiocloud_lactobacillales.options | shuf | head -n50 >>ezbiocloud_lactobacillales_reduced.options

This results in a list of 515 Lactobacillales genomes. As shuf returns a random list you will need to use the precomputed ezbiocloud_lactobacillales_reduced.options file in order to exactly reproduce the published tree. ATTENTION: This tree has been calculated on a 80 Core Interl Xeon CPU with 512GB of RAM running Ubuntu 12.04 LTS 64 bit. To start the calculation simply run in the reproduce analysis folder:

../../bin/bcgTree.pl @../data/ezbiocloud_lactobacillales/ezbiocloud_lactobacillales_reduced.options --outdir case_study_lactobacillus

The final tree has been refined with figtree (collapsing of nodes on Genus <G> or Species <S> level). Sorry, no idea how to automate this.

Comparison to 16S topology

Evaluation set 1 (Lactobacillus NCBI)

bcgTree

To calculate the Lactobacillus tree (bcgTree) run the following in reproduce/analysis

mkdir -p evaluation1/bcgTree
cd evaluation1/bcgTree
grep Lactobacillus ../../../data/ncbi/ncbi_all.options >lactobacillus_paenibacillus.options
grep Paenibacillus ../../../data/ncbi/ncbi_all.options >>lactobacillus_paenibacillus.options
perl ../../../../bin/bcgTree.pl @lactobacillus_paenibacillus.options --out lactobacillus_paenibacillus --bootstraps 100
python ../../../../bin/plot_matrix.py lactobacillus_paenibacillus/absence_presence.csv lactobacillus_paenibacillus_ap.svg
cd ../..

The final tree is in RAxML_bipartitions_final

16S

To calculate the corresponding 16S tree:

mkdir -p evaluation1/16S
cd evaluation1/16S
rm -f lactobacillus_paenibacillus.16S.fa
for i in $(cat ../bcgTree/lactobacillus_paenibacillus.options | grep proteome | rev | cut -f1 -d"/" | rev | perl -pe 's/\.fa$/.16S.fa/')
do
    BASE=$(basename $i .16S.fa)
    cat ../../../data/ncbi_16S/$i | perl -pe 's/^>/>'$BASE' /' >>lactobacillus_paenibacillus.16S.fa
done
muscle -in lactobacillus_paenibacillus.16S.fa -out lactobacillus_paenibacillus.16S.aln
perl -i -pe 's/ .*//' lactobacillus_paenibacillus.16S.aln
Gblocks lactobacillus_paenibacillus.16S.aln -t=p -b4=4 -b5=h
perl -i -pe 's/ //g;s/:/_/g' lactobacillus_paenibacillus.16S.aln-gb
raxmlHPC -f a -m GTRGAMMA -p 12345 -s lactobacillus_paenibacillus.16S.aln-gb -n lactobacillus_16S -T 4 -x 12345 -N 100
cd ../..

The final tree is in RAxML_bipartitions.lactobacillus_16S

Evaluation set 2 (High Level Taxonomy)

bcgTree

To calculate the high-level tree (bcgTree) run the following in reproduce/analysis

mkdir -p evaluation2/bcgTree
cd evaluation2/bcgTree
rm -f high_level.options
for i in Sulfolobus Thermococcus Bifidobacterium Rhodococcus Bacteroides Flavobacterium Chlamydia Parachlamydia Lactobacillus Paenibacillus Rhizobium Wolbachia Burkholderia Ralstonia Desulfovibrio Geobacter Campylobacter Helicobacter Erwinia Pseudomonas Borrelia Sphaerochaeta Kosmotoga Thermotoga
do
    grep $i ../../../data/ncbi/ncbi_all.options | head -n1 >>high_level.options
done
perl ../../../../bin/bcgTree.pl @high_level.options --out high_level --bootstraps 100
cd ../..

16S

To calculate the corresponding 16S tree:

mkdir -p evaluation2/16S
cd evaluation2/16S
rm -f high_level.16S.fa
for i in $(cat ../bcgTree/high_level.options | grep proteome | rev | cut -f1 -d"/" | rev | perl -pe 's/\.fa$/.16S.fa/')
do
    BASE=$(basename $i .16S.fa)
    cat ../../../data/ncbi_16S/$i | perl -pe 's/^>/>'$BASE' /' >>high_level.16S.fa
done
muscle -in high_level.16S.fa -out high_level.16S.aln
perl -i -pe 's/ .*//' high_level.16S.aln
Gblocks high_level.16S.aln -t=p -b4=4 -b5=h
perl -i -pe 's/ //g;s/:/_/g' high_level.16S.aln-gb
raxmlHPC -f a -m GTRGAMMA -p 12345 -s high_level.16S.aln-gb -n high_level_16S -T 4 -x 12345 -N 100
cd ../..

The final tree is in RAxML_bipartitions.high_level_16S

Bootstrap distribution

In order to get the distribution of bootstrap values for each tree you can execute:

perl -pe 's/\)/\n/g' evaluation1/16S/RAxML_bipartitions.lactobacillus_16S | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/16S\tevaluation1\t/' >bootstraps
perl -pe 's/\)/\n/g' evaluation2/16S/RAxML_bipartitions.high_level_16S | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/16S\tevaluation2\t/' >>bootstraps
perl -pe 's/\)/\n/g' evaluation1/bcgTree/lactobacillus_paenibacillus/RAxML_bipartitions.final | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/bcgTree\tevaluation1\t/' >>bootstraps
perl -pe 's/\)/\n/g' evaluation2/bcgTree/high_level/RAxML_bipartitions.final | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/bcgTree\tevaluation2\t/' >>bootstraps

Multi-marker benefits

Calculate trees with random subselections of the partitions on high-level and low-level use cases. In reproduce/analysis execute the following commands:

mkdir -p multimarker/evaluation1
cd multimarker/evaluation1
cp ../../evaluation1/bcgTree/lactobacillus_paenibacillus/full_alignment.concat.* .
for j in {0..9}
do
    for i in {1..108}
    do
        NEWDIR=$(printf "%02d" $j)_$(printf "%03d" $((109-i)))
        mkdir $NEWDIR
        cp full_alignment.concat.fa full_alignment.concat.partition $NEWDIR
        cd $NEWDIR
        cut -f4 -d" " full_alignment.concat.partition | shuf | head -n $i >exclude_file
        raxmlHPC -E exclude_file -s full_alignment.concat.fa -q full_alignment.concat.partition -n EXCLUDE -m GTRGAMMA
        raxmlHPC -f a -m GTRGAMMA -p 12345 -q full_alignment.concat.partition.exclude_file -s full_alignment.concat.fa.exclude_file -n partsub_$i -T 6 -x 12345 -N 1
        cd -
    done
done

# Create qdist table
echo -e "batch\tgene_count\tqdist\tqdist_rel" >lactobacillus_paenibacillus.qdist_by_genenumber.tsv
for j in {0..9}
do
   for i in {1..108}
   do
      echo -n $j$'\t'$i$'\t'
      FILE="$(printf "%02d" $j)_$(printf "%03d" $i)/RAxML_bestTree.partsub_"
      if [ -f $FILE ]
      then
         qdist ../../evaluation1/bcgTree/lactobacillus_paenibacillus/RAxML_bestTree.final $FILE | tail -n1 | cut -f7,8
      else
         echo -e "NA\tNA"
      fi
   done
done >>lactobacillus_paenibacillus.qdist_by_genenumber.tsv
echo -ne "16S\t1\t" >>lactobacillus_paenibacillus.qdist_by_genenumber.tsv
qdist ../../evaluation1/bcgTree/lactobacillus_paenibacillus/RAxML_bestTree.final ../../evaluation1/16S/RAxML_bestTree.lactobacillus_16S | tail -n1 | cut -f7,8 >>lactobacillus_paenibacillus.qdist_by_genenumber.tsv

cd ../..

Repeat this step for the evaluation2 set:

mkdir -p multimarker/evaluation2
cd multimarker/evaluation2
cp ../../evaluation2/bcgTree/high_level/full_alignment.concat.* .
for j in {0..9}
do
    for i in {1..108}
    do
        NEWDIR=$(printf "%02d" $j)_$(printf "%03d" $((109-i)))
        mkdir $NEWDIR
        cp full_alignment.concat.fa full_alignment.concat.partition $NEWDIR
        cd $NEWDIR
        cut -f4 -d" " full_alignment.concat.partition | shuf | head -n $i >exclude_file
        raxmlHPC -E exclude_file -s full_alignment.concat.fa -q full_alignment.concat.partition -n EXCLUDE -m GTRGAMMA
        raxmlHPC -f a -m GTRGAMMA -p 12345 -q full_alignment.concat.partition.exclude_file -s full_alignment.concat.fa.exclude_file -n partsub_$i -T 6 -x 12345 -N 1
        cd -
    done
done

# Create qdist table
echo -e "batch\tgene_count\tqdist\tqdist_rel" >high_level.qdist_by_genenumber.tsv
for j in {0..9}
do
   for i in {1..108}
   do
      echo -n $j$'\t'$i$'\t'
      FILE="$(printf "%02d" $j)_$(printf "%03d" $i)/RAxML_bestTree.partsub_"
      if [ -f $FILE ]
      then
         qdist ../../evaluation2/bcgTree/high_level/RAxML_bestTree.final $FILE | tail -n1 | cut -f7,8
      else
         echo -e "NA\tNA"
      fi
   done
done >>high_level.qdist_by_genenumber.tsv
echo -ne "16S\t1\t" >>high_level.qdist_by_genenumber.tsv
qdist ../../evaluation2/bcgTree/high_level/RAxML_bestTree.final ../../evaluation2/16S/RAxML_bestTree.high_level_16S | tail -n1 | cut -f7,8 >>high_level.qdist_by_genenumber.tsv

cd ../..

Computational performance

To get an impression on how the runtime scales with the number of genomes the following analysis has been performed. To reproduce go to reproduce/analysis:

mkdir -p performance
cd performance
for i in $(seq 5 1 15)
do
    for j in $(seq 1 1 5)
    do
        shuf ../../data/ncbi/ncbi_all.options | head -n ${i} >${i}_${j}.options
        ../../../bin/bcgTree.pl @${i}_${j}.options --out ${i}_${j}
    done
done
for i in $(seq 5 5 50)
do
    shuf ../../data/ncbi/ncbi_all.options | head -n ${i} >${i}_0.options
    ../../../bin/bcgTree.pl @${i}_0.options --out ${i}_0
done

#Gather timestamps:
echo "genomes"$'\t'"rep"$'\t'"start"$'\t'"startRaxml"$'\t'"end" >timestamps.tsv 
for i in *_*.options
do
    BASE=$(basename $i .options)
    echo -n $BASE$'\t' | perl -pe 's/_/\t/'
    LOG="$BASE/bcgTree.log"
    cat <(head -n1 $LOG) <(grep "Starting: raxml" $LOG) <(tail -n1 $LOG) | cut -f1,2 -d" " | perl -pe 's/\n/\t/' | perl -pe 's/\t$/\n/'
done >>timestamps.tsv
cd ..

Analysis of timestamps in R (go to folder reproduce/analysis/performance):

data <- read.table("timestamps.tsv", header=T, sep="\t")
#The date/time string can be converted to an appropriate object with strptime
#strptime(as.character(data$end), "[%m-%d %H:%M:%S]")
time_until_finished = difftime(strptime(as.character(data$end), "[%m-%d %H:%M:%S]"), strptime(as.character(data$start), "[%m-%d %H:%M:%S]"), unit="sec")
time_until_raxml = difftime(strptime(as.character(data$startRaxml), "[%m-%d %H:%M:%S]"), strptime(as.character(data$start), "[%m-%d %H:%M:%S]"), unit="sec")

cor(as.numeric(data$genomes), as.numeric(time_until_raxml))
cor(as.numeric(data$genomes), as.numeric(time_until_finished))
cor.test(as.numeric(data$genomes), as.numeric(time_until_finished))
# quite surprisingly the linear(!) correlation between number of genomes and runtime is very strong

pdf("runtime_vs_genome_number.pdf")
plot(data$genomes, 
 time_until_finished,
 xlab="Number of genomes",
 ylab="Time in s",
 main="Runtime of bcgTree depending on number of genomes",
 ylim=c(1,max(time_until_finished))
)
points(data$genomes,
 time_until_raxml,
 col="red"
)
# add regression line
abline(lm(time_until_finished ~ data$genomes))
legend("topleft", legend=c("runtime before raxML", "total runtime", "regression"), pch=c(1,1,NA), lty=c(0,0,1), col=c("red","black","black"))
dev.off()