Merge pull request #109 from rrohwer/master

Minor Edits to Reflect mSphere publication
McMahonLab · Sep 4, 2018 · bd08f9d · bd08f9d
2 parents a7b218b + 5bccb66
commit bd08f9d
Show file tree

Hide file tree

Showing 29 changed files with 153 additions and 54 deletions.
diff --git a/.gitignore b/.gitignore
@@ -9,12 +9,15 @@ tax-scripts/.RData
 FreshTrain-files/gg_13_5.fasta
 FreshTrain-files/gg_13_5.taxonomy
 
-# Expanded FreshTrain Folders
+# Expanded FreshTrain Folders and unreleased FreshTrain versions
 FreshTrain-files/FreshTrain18Aug2016/
 FreshTrain-files/FreshTrain11Feb2016/
+FreshTrain-files/FreshTrain11Feb2016.zip
 FreshTrain-files/FreshTrain25Jan2018Greengenes13_5/
 FreshTrain-files/FreshTrain25Mar2018SILVAv128/
+FreshTrain-files/FreshTrain25Mar2018SILVAv128.zip
 FreshTrain-files/FreshTrain25Mar2018SILVAv132/
+FreshTrain-files/FreshTrain25Mar2018SILVAv132.zip
 FreshTrain-files/FreshTrain30Apr2018SILVAv128/
 FreshTrain-files/FreshTrain30Apr2018SILVAv132/
 

diff --git a/README.md b/README.md
@@ -7,12 +7,12 @@ How do I TaxAss?
 
 **Step-by-step directions:** [tax-scripts/TaxAss_Directions.html](https://htmlpreview.github.io/?https://github.com/McMahonLab/TaxAss/blob/master/tax-scripts/TaxAss_Directions.html)
 
-For now, please cite our preprint:  
-TaxAss: Leveraging Custom Databases Achieves Fine-Scale Taxonomic Resolution
-Robin Rebecca Rohwer, Joshua J Hamilton, Ryan J Newton, Katherine D McMahon
-bioRxiv 214288; doi: https://doi.org/10.1101/214288
+Please cite our mSphere paper:  
+TaxAss: Leveraging a Custom Freshwater Database Achieves Fine-Scale Taxonomic Resolution
+Robin R Rohwer, Joshua J Hamilton, Ryan J Newton, Katherine D McMahon
+mSphere; doi: https://doi.org/10.1128/mSphere.00327-18
 
-TaxAss uses a series of python, R, and bash scripts in addition to using BLAST+ and mothur's classify.seqs() command.  The scripts are sourced from the terminal window (mac or linux). You'll need to download this repository (green "Clone or download" button, top right), and then just add the tax-scripts folder to your working diriectory.
+TaxAss uses a series of R, python, and bash scripts in addition to using BLAST+ and mothur's classify.seqs() command.  The scripts are sourced from the terminal window (mac or linux). You'll need to download this repository (green "Clone or download" button, top right), and then just add the tax-scripts folder to your working diriectory.
 
 Where's the stuff I need?
 ---
@@ -24,16 +24,17 @@ Where's the stuff I need?
 
 **The stuff you can ignore:**  
 1. Scripts to process & edit the FreshTrain arb files are in the `arb-scripts` folder  
-2. Scripts for making poster/paper figures are in `figure-scripts` folder.  
+2. Scripts for making the paper figures are in the `figure-scripts` folder.  
 3. A pdf of my ISME16 poster (that might be a good overview): `poster-ISME16.pdf`  
 
 What if it doesn't work?
 ---
 
-Please tell me!  If you find a bug (or if you get confused) I want to fix it/help you.  If you're all github savvy you can submit an issue, or you can just send me an e-mail with TaxAss in the title: robin.rohwer@gmail.com
+Please tell me!  If you find a bug (or if you get confused) I want to fix it/help you.  If you're all github savvy you can submit an issue, or you can just send me an e-mail with TaxAss in the subject: robin.rohwer@gmail.com
 
 Who made TaxAss?
 ---
 The McMahon lab studies lake microbial ecology (and some of it also studies wastewater). We are at the Univerisity of Wisconsin-Madison.  
 Lab Website: https://mcmahonlab.wisc.edu/  
-Lab Twitter: @mcmahonlab 
+Lab Twitter: @mcmahonlab  
+Robin Twitter: @RobinRohwer
diff --git a/arb-scripts/README.md b/arb-scripts/README.md
@@ -1,11 +1,5 @@
-These are scripts for reformatting the ARB output of the Freshwater Training Set
+# README arb-scripts  
 
-Robin's in the process of pulling them together into a repeatable workflow.
-That's gonna be in this repo soon.
-
-The .pl script is the one Trina uses already.
-The ones Robin's adding are to address bugs she's encountered while working on 16STaxAss.
-
-When Robin is not writing code, she enjoys referring to herself in the 3rd person.
-
-RRR 2-8-16
+These are scripts for reformatting the ARB output of the Freshwater Training Set.  
+
+This folder also includes the scripts and files used for the TaxAss manuscript's Marathonas Validation (fig 1 in paper). These are located inside the `Marathonas_test` folder.  
diff --git a/figure-scripts/README.md b/figure-scripts/README.md
@@ -0,0 +1,16 @@
+# README figure-scripts
+
+These scripts generate the final versions of figures and tables for the paper. 
+Many of the figures and tables are already generated in the workflow, but this is
+the version of the script where I tweak colors, dimensions, resolution, etc.
+For tables, a csv with the data is generated that msphere later formatted pretty.  
+
+Also included are figures generated for my ISME16 poster, figures that didn't make 
+the final cut, and a script for finding all the stats listed in the text.  
+
+The folder structure for reproducing the manuscript's data is `TaxAss-BatchFiles.zip`. 
+This folder pairs with the directions in `README-TaxAss-BatchFiles.html`.  
+
+The Marathonas validation batch files and directions are inside the `arb-scripts` folder. 
+The directions are in the file `arb-scripts/Marathonas_test/README_marathonas_validation.html` 
+and all necessary files are also in that folder.  
diff --git a/figure-scripts/README.txt b/figure-scripts/README.txt
diff --git a/figure-scripts/fig_marathonas.R → figure-scripts/fig_2_marathonas.R b/figure-scripts/fig_marathonas.R → figure-scripts/fig_2_marathonas.R
@@ -7,7 +7,7 @@ file.path.summary.table.v3v4 <- "../arb-scripts/Marathonas_test/v3v4_mara/v3v4_r
 file.path.summary.table.v4v5 <- "../arb-scripts/Marathonas_test/v4v5_mara/v4v5_results/summary_table.rds"
 
 
-# file.path.created.plot <- "~/Dropbox/PhD/Write\ It/draft\ 7/res-submission_figures/mara_validation.pdf"
+# file.path.created.plot <- "~/Dropbox/PhD/TaxAss_manuscript/draft 7/re-submission_figures/mara_validation-referencelabel.pdf"
 # file.path.created.table <- "~/Dropbox/PhD/Write\ It/draft\ 7/res-submission_figures/mara_validation.csv"
 
 # ---- functions ----
@@ -69,6 +69,7 @@ make.example.table <- function(){ # all lazy calls to global env
   col.wrong <- adjustcolor(col = "darkred", alpha.f = .3)
   col.vector <- c(col.correct, col.correct, col.correct, col.correct, col.under, col.wrong, col.wrong, col.wrong)
 
+  # these are actual examples I found in the results, not just made-up examples
   left.text.vect <- c("acI-A1", "bacI-unclassified", "n/a", "acI-A", "acI-B1", "n/a", "betI-A","acI-C")
   right.text.vect <- c("acI-A1","bacI-unclassified", "Microcystaceae","acI", "Actinobacteria", "LD19","betI-B","acI-C1")
 
@@ -85,7 +86,7 @@ make.example.table <- function(){ # all lazy calls to global env
 
   middle.left <- box.lefts + (box.middles - box.lefts) * (1/2)
   middle.right <- box.middles + (box.rights - box.middles) * (1/2)
-  mtext(text = c("Correct","TaxAss"), side = 1, line = .5, at = c(middle.left, middle.right), cex = label.cex)
+  mtext(text = c("Reference","TaxAss"), side = 1, line = .5, at = c(middle.left, middle.right), cex = label.cex)
   mtext(text = "Examples of Each Category:", side = 3, line = 1, at = box.middles, cex = label.cex)
 
 }

diff --git a/...orkflow_better_than_comprehensive_alone.R → ...re-scripts/fig_3_red-yellow_improvement.R b/...orkflow_better_than_comprehensive_alone.R → ...re-scripts/fig_3_red-yellow_improvement.R
diff --git a/...ow_better_than_ecosystem-specific_alone.R → figure-scripts/fig_4_blue-red_diversity.R b/...ow_better_than_ecosystem-specific_alone.R → figure-scripts/fig_4_blue-red_diversity.R
diff --git a/...toff_choice--sup_table_1_coarse_forcing.R → figure-scripts/fig_5_pident_choice.R b/...toff_choice--sup_table_1_coarse_forcing.R → figure-scripts/fig_5_pident_choice.R
diff --git a/...scripts/supp_fig_1_pident_recalc_needed.R → figure-scripts/fig_S1_cyano_pident_recalc.R b/...scripts/supp_fig_1_pident_recalc_needed.R → figure-scripts/fig_S1_cyano_pident_recalc.R
diff --git a/...ripts/supp_table_1_coarse_forcing_check.R → ...ripts/not_included_coarse_forcing_check.R b/...ripts/supp_table_1_coarse_forcing_check.R → ...ripts/not_included_coarse_forcing_check.R
diff --git a/figure-scripts/table_2_example_otu_recalc.R → ...scripts/not_included_example_otu_recalc.R b/figure-scripts/table_2_example_otu_recalc.R → ...scripts/not_included_example_otu_recalc.R
diff --git a/...ts/table_1_database_performance_summary.R → figure-scripts/table_1_best_of_both.R b/...ts/table_1_database_performance_summary.R → figure-scripts/table_1_best_of_both.R
diff --git a/figure-scripts/table_3_blast_hit_num.R → figure-scripts/table_2_blast_hit_number.R b/figure-scripts/table_3_blast_hit_num.R → figure-scripts/table_2_blast_hit_number.R
diff --git a/figure-scripts/text_S1_BLAST_example_calc.pdf b/figure-scripts/text_S1_BLAST_example_calc.pdf
diff --git a/tax-scripts/README.md b/tax-scripts/README.md
@@ -1,4 +1,6 @@
-Scripts for 16S Taxonomy Assignment Workflow
+README tax-scripts
 ===
 
-This folder contains the scripts used for OTU assignment. Detailed descriptions of them are in the `Clean Workflow.txt` document.
+This folder contains the TaxAss scripts. Detailed descriptions of them are in `TaxAss_Directions.html`. You can read the directions online here: [tax-scripts/TaxAss_Directions.html](https://htmlpreview.github.io/?https://github.com/McMahonLab/TaxAss/blob/master/tax-scripts/TaxAss_Directions.html)  
+
+To use TaxAss, download this folder, add your data and databases, and use it as your working directory.  
diff --git a/tax-scripts/TaxAss_Directions.Rmd b/tax-scripts/TaxAss_Directions.Rmd
@@ -407,7 +407,8 @@ _i.e._ you're choosing a pident from the start, and just making the taxonomy tab
 - **with headers (column names), but no row names**  
 - **no "totals" row/column**  
    (as might have been added by mother)
-- **numbers only for all abundance values**
+- **numbers only for all abundance values**  
+   Note that if you had a sample that failed sequencing, there might be no reads left in it after QC. This sample should be removed before TaxAss or normalizing it could cause a NaN non-number value.
 
 ```
 colname	colname		colname		colname		colname

diff --git a/tax-scripts/TaxAss_Directions.html b/tax-scripts/TaxAss_Directions.html
@@ -94,6 +94,9 @@
 .tabbed-pane {
   padding-top: 12px;
 }
+.html-widget {
+  margin-bottom: 20px;
+}
 button.code-folding-btn:focus {
   outline: none;
 }
@@ -725,7 +728,8 @@ <h3><code>.abund</code> file format</h3>
 </li>
 <li><strong>no “totals” row/column</strong><br />
 (as might have been added by mother)</li>
-<li><strong>numbers only for all abundance values</strong></li>
+<li><strong>numbers only for all abundance values</strong><br />
+Note that if you had a sample that failed sequencing, there might be no reads left in it after QC. This sample should be removed before TaxAss or normalizing it could cause a NaN non-number value.</li>
 </ul>
 <pre><code>colname colname     colname     colname     colname
 seqID   Abundance   Abundance   Abundance   Abundance

diff --git a/tax-scripts/calc_full_length_pident.R b/tax-scripts/calc_full_length_pident.R
diff --git a/tax-scripts/create_fastas_given_seqIDs.py b/tax-scripts/create_fastas_given_seqIDs.py
diff --git a/tax-scripts/filter_seqIDs_by_pident.R b/tax-scripts/filter_seqIDs_by_pident.R
diff --git a/tax-scripts/find_classification_disagreements.R b/tax-scripts/find_classification_disagreements.R
diff --git a/tax-scripts/find_seqIDs_blast_removed.py b/tax-scripts/find_seqIDs_blast_removed.py
diff --git a/tax-scripts/force_consistent_unclassifieds_on_database.R b/tax-scripts/force_consistent_unclassifieds_on_database.R
diff --git a/tax-scripts/plot_blast_hit_stats.R b/tax-scripts/plot_blast_hit_stats.R
diff --git a/tax-scripts/plot_classification_disagreements.R b/tax-scripts/plot_classification_disagreements.R
diff --git a/tax-scripts/plot_classification_improvement.R b/tax-scripts/plot_classification_improvement.R
diff --git a/tax-scripts/reformat_dada2_seqtabs.R b/tax-scripts/reformat_dada2_seqtabs.R
@@ -0,0 +1,106 @@
+# RRR 2018-7-20
+
+# dada2 has an internal implementation of the Wang classifier/ RDP classifier / mothur default / TaxAss choice
+# BUT, just keep using mothur with TaxAss because formatting is painful.
+# this script takes the dada2 output file (seqtab_nochim) and cretes otus.fasta and otus.abund
+
+# command line syntax:
+
+# Rscript reformat.dada2.seqtabs.R seqtab_nochim.rds otus.fasta otus.abund
+
+# ---- Accept Arguments from Terminal Command Line ----
+
+userprefs <- commandArgs(trailingOnly = TRUE)
+path.to.seqtab <- userprefs[1] 
+fasta.output <- userprefs[2] 
+abund.output <- userprefs[3]
+
+cat("fuck you forgot to comment out the file paths in reformat_dada2_seqtabs.R!")
+path.to.seqtab <- "/Users/athena/Desktop/dada2-meV45/dada2/seqtab_nochim.rds"
+fasta.output <- "~/Desktop/otus.fasta"
+abund.output <- "~/Desktop/otus.abund"
+count.output <- "/Users/athena/Desktop/dada2-meV45/taxass_gg/data/otus.count"
+
+# ---- define functions ----
+
+import.dada2.file <- function(path){
+  is.rds <- grepl(pattern = "*.rds", x = path)
+  if (is.rds){
+    seqtab_nochim <- readRDS(file = path)
+    return(seqtab_nochim)
+  }else{
+    cat("input must be rds file ending in \".rds\"\nSave it in this format at the end of dada2 pipeline using\nsaveRDS(object = seqtab_nochim, file = \"filename\")")
+  }  
+}
+
+make.fasta.file <- function(dadatable, fasta.path){
+  fasta.seqs <- colnames(dadatable)
+  otu.names <- paste("otu", 1:length(fasta.seqs), sep = "_")
+  fasta.names <- paste(">", otu.names, sep = "")
+  fasta.file <- paste(fasta.names, fasta.seqs, sep = "\n")
+  write.table(x = fasta.file, file = fasta.path, quote = F, row.names = F, sep = "\n", col.names = F)
+  cat("Made file: ", fasta.path, "\n")
+  return(otu.names)
+}
+
+find.zero.samples <- function(tot.reads, otu.table){
+  read.stats <- boxplot(x = tot.reads, plot = F)
+  cat("These samples have outlier read counts. Samples with zero reads are being removed.\n",paste(names(read.stats$out), " : ", read.stats$out, sep = "", "\n"))
+  index <- which(tot.reads == 0)
+  return(index)
+}
+
+convert.to.rel.abund <- function(OTUs){
+  sample.totals <- rowSums(OTUs)
+  # vectors are applied to matrices by stepping down rows in a column
+  norm.otus <- OTUs / sample.totals * 100
+  return(norm.otus)
+}
+
+reformat.for.taxass <- function(OTUs){
+  # rows = OTUs, cols = samples, col 1 = seqIDs
+  otu.table <- t(OTUs)
+  otu.table <- cbind(row.names(otu.table), otu.table)
+  colnames(otu.table)[1] <- "seqID"
+  return(otu.table)
+}
+
+make.abund.file <- function(OTUs, FilePath){
+  write.table(x = OTUs, file = FilePath, quote = FALSE, sep = "\t", row.names = FALSE)
+  cat("Made file: ", FilePath, "\n")
+}
+
+make.count.file <- function(Count, FilePath){
+  Count <- cbind(names(Count), Count)
+  colnames(Count) <- c("Sample.Name", "Total.Reads")
+  write.table(x = Count, file = FilePath, quote = F, sep = "\t", row.names = F)
+  cat("Made file: ", FilePath, "\n")
+}
+
+# ---- use functions ----
+
+seqtab_nochim <- import.dada2.file(path = path.to.seqtab)
+
+seqIDs <- make.fasta.file(dadatable = seqtab_nochim, fasta.path = fasta.output)
+
+colnames(seqtab_nochim) <- seqIDs
+
+read.count <- rowSums(seqtab_nochim)
+
+index <- find.zero.samples(tot.reads = read.count, otu.table = seqtab_nochim)
+if(length(index) > 0){
+  read.count <- read.count[-index]
+  seqtab_nochim <- seqtab_nochim[-index, ]
+}
+
+seqtab_nochim <- convert.to.rel.abund(OTUs = seqtab_nochim)
+
+seqtab_nochim <- reformat.for.taxass(OTUs = seqtab_nochim)
+
+make.abund.file(OTUs = seqtab_nochim, FilePath = abund.output)
+
+make.count.file(Count = read.count, FilePath = count.output)
+
+
+
+
diff --git a/tax-scripts/reformat_mothur_OTU_tables.R b/tax-scripts/reformat_mothur_OTU_tables.R