diff --git a/FreshTrain-files/README.md b/FreshTrain-files/README.md index 4961178..631ba40 100755 --- a/FreshTrain-files/README.md +++ b/FreshTrain-files/README.md @@ -14,10 +14,16 @@ zipped file name | description FreshTrain18Aug2016 | old version formatted for Greengenes (don't use) FreshTrain25Jan2018Greengenes13_5.zip | current version formatted for Greengenes FreshTrain30Apr2018SILVAv128.zip | current version formatted for SILVA v128 -FreshTrain30Apr2018SILVAv132.zip | current version formatted for SILVA v132 +**FreshTrain30Apr2018SILVAv132.zip** | **current version formatted for SILVA v132** -The different formats match the FreshTrain's coarse-level nomenclature to the nomenclature in the comprehensive database of choice. The FreshTrain defines lineage-clade-tribe (~family-genus-species) level phylogenies, so the phylum, class, and order names are changed in the different versions to be consistent with the chosen comprehensive database. +The different formats match the FreshTrain's coarse-level nomenclature to the nomenclature in the comprehensive database of choice. The FreshTrain defines lineage-clade-tribe (~family-genus-species) level phylogenies, so the phylum, class, and order names are changed in the different FreshTrain versions to be consistent with the paired comprehensive database.
-The citation for the FreshTrain database is: -[Newton, R. J., Jones, S. E., Eiler, A., McMahon, K. D. & Bertilsson, S. A guide to the natural history of freshwater lake bacteria. Microbiol. Mol. Biol. Rev. 75, 14–49 (2011).](http://mmbr.asm.org/content/75/1/14.full) +The citation for the original FreshTrain database and the arb version of it is: + +[Newton RJ, Jones SE, Eiler A, McMahon KD, Bertilsson S. 2011. A Guide to the Natural History of Freshwater Lake Bacteria. Microbiol Mol Biol Rev 75:14–49.](https://mmbr.asm.org/content/75/1/14.full) The arb files are available at [github.com/McMahonLab/FWMFG](https://github.com/McMahonLab/FWMFG). + +
+The citation for these taxonomy assignment-compatible formats of the FreshTrain and the TaxAss method is: + +[Rohwer RR, Hamilton JJ, Newton RJ, McMahon KD. 2018. TaxAss: Leveraging a Custom Freshwater Database Achieves Fine-Scale Taxonomic Resolution. mSphere 3:e00327-18.](https://msphere.asm.org/content/3/5/e00327-18) diff --git a/README.md b/README.md index ce30951..a701b46 100644 --- a/README.md +++ b/README.md @@ -7,12 +7,9 @@ How do I TaxAss? **Step-by-step directions:** [tax-scripts/TaxAss_Directions.html](https://htmlpreview.github.io/?https://github.com/McMahonLab/TaxAss/blob/master/tax-scripts/TaxAss_Directions.html) -Please cite our mSphere paper: -TaxAss: Leveraging a Custom Freshwater Database Achieves Fine-Scale Taxonomic Resolution -Robin R Rohwer, Joshua J Hamilton, Ryan J Newton, Katherine D McMahon -mSphere; doi: https://doi.org/10.1128/mSphere.00327-18 +**Please cite TaxAss:** [Rohwer RR, Hamilton JJ, Newton RJ, McMahon KD. 2018. TaxAss: Leveraging a Custom Freshwater Database Achieves Fine-Scale Taxonomic Resolution. mSphere 3:e00327-18.](https://msphere.asm.org/content/3/5/e00327-18) -TaxAss uses a series of R, python, and bash scripts in addition to using BLAST+ and mothur's classify.seqs() command. The scripts are sourced from the terminal window (mac or linux). You'll need to download this repository (green "Clone or download" button, top right), and then just add the tax-scripts folder to your working diriectory. +TaxAss only assigns taxonomy, so you can use TaxAss after using mothur, dada2, vsearch, or whatever QC pipeline you prefer. TaxAss uses a series of R, python, and bash scripts in addition to using BLAST+ and mothur's classify.seqs() command. The scripts are sourced from the terminal window (mac or linux). You'll need to download this repository (green "Clone or download" button, top right), and add the tax-scripts folder to your working directory. Where's the stuff I need? 
--- diff --git a/tax-scripts/RunSteps_quickie.sh b/tax-scripts/RunSteps_quickie.sh index 65eac99..084d6b9 100755 --- a/tax-scripts/RunSteps_quickie.sh +++ b/tax-scripts/RunSteps_quickie.sh @@ -4,17 +4,19 @@ # That means that you do not try different percent identity cutoffs to choose the best one. # That might make sense for you if you have already made a similarity choice, for example by # choosing a cutoff to cluster OTUs. Then just have pident match that cutoff. +# In almost all of our test datasets we found a pident of 98 was best. # Note: this also skips the BLAST check (step 6). You could go back and just do that one. # Note: still run step 16 to tidy up. +# Note: still gotta do the reformatting manually (step 0) -# Choose pident. +# USER CAN CHANGE THIS INPUT --------------------------------- pident=("98") fwbootstrap=("80") ggbootstrap=("80") processors=("2") -# Note: still gotta do the reformatting manually (step 0) +# ------------------------------------------------------------- # 1 makeblastdb -dbtype nucl -in custom.fasta -input_type fasta -parse_seqids -out custom.db && diff --git a/tax-scripts/TaxAss_Directions.Rmd b/tax-scripts/TaxAss_Directions.Rmd index 20afe82..866ab06 100644 --- a/tax-scripts/TaxAss_Directions.Rmd +++ b/tax-scripts/TaxAss_Directions.Rmd @@ -1,7 +1,7 @@ --- title: "TaxAss Workflow" author: "Robin Rohwer" -date: "last updated: 11/5/2017" +date: "last updated: June 7, 2019" output: html_document: toc: true @@ -25,13 +25,33 @@ pre code, pre, code { TaxAss assigns taxonomy to a fasta file of OTU sequences using both a small, custom taxonomy database and a large general database. -Download the scripts from the github repo: https://github.com/McMahonLab/TaxAss +Download the scripts from the github repo: https://github.com/McMahonLab/TaxAss + +Please cite TaxAss: [Rohwer RR, Hamilton JJ, Newton RJ, McMahon KD. 2018. TaxAss: Leveraging a Custom Freshwater Database Achieves Fine-Scale Taxonomic Resolution. 
mSphere 3:e00327-18.](https://msphere.asm.org/content/3/5/e00327-18) + # List of Steps -This is a list of all the commands in each step. There is a more detailed section for each step. + +These directions include extra validation steps used in the [paper](https://msphere.asm.org/content/3/5/e00327-18) that you don't need to include to just assign taxonomy to your data. + +**To assign FreshTrain taxonomy to your own data, we recommend:** + +* add your data files, database files, and the `tax-scripts` files to an otherwise empty folder (your working directory) +* format your files and file names according to `step 0` +* run a batch script from within your working directory: `./RunSteps_quickie.sh` +* run a batch script that deletes all the intermediate files: `./RunStep_16.sh` + +The RunSteps_quickie.sh batch script uses a taxass percent identity of 98 to assign the database, a mothur bootstrap confidence of 80 to assign taxonomy, and 2 processors for mothur commands. You can change these settings at the top of the batch script file. +
+If you encounter error messages or if your data is very large, we recommend troubleshooting by running the steps included in RunSteps_quickie.sh individually with a small subset of your data by following the directions in +**steps 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 15** +Navigate to detailed instructions for each step using the table of contents. +

-**You can run most of the steps in one go by using the batch scripts listed at the bottom of this section.** +___________ + +Below is a list of all the commands in each step, including the optional ones: ```` -1. Download/Install BLAST+ @@ -40,19 +60,19 @@ This is a list of all the commands in each step. There is a more detailed sectio (your computer already has python) taxonomy databases -0. format files (textwrangler or bash) +0. format files (textwrangler/bash/R) depends on your starting file formats - for Green Genes database as general.taxonomy: - sed 's/ //g' NoSpaces - sed 's/$/;/' EndLineSemicolons - mv EndLineSemicolons general.taxonomy - rm NoSpaces - for aligned fasta files: + for Green Genes as general database: + ./reformat_greengenes.sh gg_13_5_taxonomy.txt + for Silva as general database: + ./un-align_silva.sh silva.nr_v132.align general.fasta silva.nr_v132.tax general.taxonomy + Rscript force_consistent_unclassifieds_on_database.R silva.nr_v132.tax general.taxonomy + for mothur count_table files: sed 's/-//g' otus.fasta - for mothur .count_table as OTU table: + sed 's/-//g' otus.count_table Rscript reformat_mothur_OTU_tables.R StupidLongMothurName.count_table count_table otus.abund - - + for dada2 seqtab_nochim as OTU table: + Rscript reformat_dada2_seqtabs.R seqtab_nochim.rds otus.fasta otus.abund otus.count 1. make BLAST database file (blast) makeblastdb -dbtype nucl -in custom.fasta -input_type fasta -parse_seqids -out custom.db @@ -88,35 +108,40 @@ This is a list of all the commands in each step. There is a more detailed sectio 10. combine taxonomy files (bash) cat otus.above.98.custom.wang.taxonomy otus.below.98.general.wang.taxonomy > otus.98.taxonomy -11. assign taxonomy with general database only (mothur, bash) +11. 
OPTIONAL- assign taxonomy with general database only (mothur, bash) + (needed for validation steps 13, 14, 15.5) mothur "#classify.seqs(fasta=otus.fasta, template=general.fasta, taxonomy=general.taxonomy, method=wang, probs=T, processors=2, cutoff=0)" cat otus.general.wang.taxonomy > otus.general.taxonomy -11.5 OPTIONAL- feeds into Database_Improvement_Workflow - assign taxonomy to custom database with general database (mothur, bash) +11.5 OPTIONAL- assign taxonomy to custom database with general database (mothur, bash) + (feeds into Database_Improvement_Workflow) mothur "#classify.seqs(fasta=custom.fasta, template=general.fasta, taxonomy=general.taxonomy, method=wang, probs=T, processors=2, cutoff=0)" cat custom.general.wang.taxonomy custom.general.taxonomy 12. reformat taxonomy files (bash) sed 's/[[:blank:]]/\;/' otus.98.taxonomy.reformatted mv otus.98.taxonomy.reformatted otus.98.taxonomy + OPTIONAL- need if did step 11 sed 's/[[:blank:]]/\;/' otus.general.taxonomy.reformatted mv otus.general.taxonomy.reformatted otus.general.taxonomy -13. compare taxonomy files (R) +13. OPTIONAL- compare taxonomy files (R) + (needed for validation steps 11, 14, 15.5) mkdir conflicts_98 Rscript find_classification_disagreements.R otus.98.taxonomy otus.general.taxonomy ids.above.98 conflicts_98 98 80 80 14. OPTIONAL: choose appropriate pident cutoff (R) + (needed for validation steps 11, 13, 15.5) note: you have to repeat steps 5, 7-10, & 12-13 with multiple pident cutoffs to do this step Rscript plot_classification_disagreements.R otus.abund plots regular NA NA conflicts_94 ids.above.94 94 conflicts_96 ids.above.96 96 conflicts_98 ids.above.98 98 15. 
generate final taxonomy file (R) + OPTIONAL- if did all the validation steps 11, 13, 14, 15.5a Rscript find_classification_disagreements.R otus.98.taxonomy otus.general.taxonomy ids.above.98 conflicts_98 98 85 70 final - If skipping optional steps 11-14 or 15.5.a.: - Rscript find_classification_disagreements.R otus.98.taxonomy quickie ids.above.98 conflicts_98 98 85 70 final + If doing only "quickie" steps.: + Rscript find_classification_disagreements.R otus.98.taxonomy quickie ids.above.98 conflicts_98 98 85 70 final -15.5 OPTIONAL: plot benefits of using this workflow (R, mothur, bash) +15.5 OPTIONAL- plot benefits of using this workflow (R, mothur, bash) a. Improvement over general database only: Rscript plot_classification_improvement.R final.taxonomy.pvalues final.general.pvalues total.reads.per.seqID.csv plots final.taxonomy.names final.general.names b. Improvement over custom database only: @@ -135,12 +160,8 @@ This is a list of all the commands in each step. There is a more detailed sectio mkdir data ; mv otus* data mkdir databases ; mv *.taxonomy *.fasta databases -```` -
-
-**OR: Run steps as a block using these batch scripts:** -(you have to open the batch script and enter your chosen pident and confidence values at the beginning) -```` +Batch Scripts: + Full TaxAss: a. Run a bunch of pidents ./RunSteps_1-14.sh @@ -149,15 +170,11 @@ Full TaxAss: c. clean up after everything worked ./RunStep_16.sh -As few steps as possible (choose pident at start): +As few steps as possible (choose pident without comparing): ./RunSteps_quickie.sh ```` - - -Detailed explanations of commands and inputs are below: - __________________________________________________________________________________________ # -1. download/install @@ -166,13 +183,13 @@ ________________________________________________________________________________ Program | Version | Source | Specific Link --------------|---------------------|-------------------------|------------------------------------------ -mothur | v.1.39.5 | www.mothur.org | https://github.com/mothur/mothur/releases/tag/v1.39.5 -BLAST | 2.2.31+ | ncbi.nlm.nih.gov | https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download +mothur | v.1.42.1 | www.mothur.org | https://github.com/mothur/mothur/releases +BLAST | 2.9.0 | ncbi.nlm.nih.gov | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ TaxAss | master branch | `tax-scripts` folder | https://github.com/McMahonLab/TaxAss
You don't need your own github account; you can click the green download button on the top right of the main page (do have to download all folders unfortunately) -python | 2.7.11 | your computer already has this | don't need anything extra +python | 2.7 or 3.7 | your computer already has this | don't need anything extra You want mothur and BLAST on your "path" (that means if you type them, your computer recognizes it as a program and runs it). -BLAST does that automatically when you install it I think, but mothur does not. +BLAST does that automatically if you use the `.dmg` file to install it, but mothur does not. First make sure mothur is downloaded to your home directory- should open if you type `~/mothur/mothur`. **To add mothur to your path:** @@ -184,7 +201,7 @@ First make sure mothur is downloaded to your home directory- should open if you ~ $ touch .bash_profile ~ $ open .bash_profile -# add this line to the .bash_profile doc +# add this line to the .bash_profile document export PATH=~/mothur:$PATH ``` Now you can open the mothur program just by typing `mothur` from any location. (after restarting terminal) @@ -193,43 +210,44 @@ Now you can open the mothur program just by typing `mothur` from any location. ( Database | Version | File Name | Download From --------------|---------|--------------------|------------------------- -FreshTrain | 18Aug2016 | FreshTrain18Aug2016.zip | https://github.com/McMahonLab/TaxAss/FreshTrain-files
(You can download the single zip file without the whole TaxAss repository by clicking on it) +FreshTrain | 30Apr2018
or
25Jan2018 | FreshTrain30Apr2018SILVAv132.zip
FreshTrain30Apr2018SILVAv128.zip
FreshTrain25Jan2018Greengenes13_5.zip | https://github.com/McMahonLab/TaxAss/FreshTrain-files Greengenes | Aug 2013 | "greengenes reference taxonomy"
"greengenes reference alignment" | https://www.mothur.org/wiki/Greengenes-formatted_databases Silva | v.132
v.128 | "Full length sequences and taxonomy references" | https://www.mothur.org/wiki/Silva_reference_files (Note that if you're a non-academic or commercial user you have to pay to use silva.) + __________________________________________________________________________________________ # 0. format files {.tabset} ## Files Needed -#### (Use this section's tabs for details on formatting each file type.) +#### Use the tabs (above) for details on formatting each file type. These are the files you supply as input into the workflow: File | Description --------------------|------------------------------------------------------------------ - custom.fasta | fasta sequences in your small, ecosystem-specific taxonomy database - custom.taxonomy | taxonomy names in your small, ecosystem-specific taxonomy database + otus.fasta | fasta sequences for each of your OTUs (OTUs can be clustered or unique sequences) + otus.abund | relative abundance of each OTU (i.e. the OTU table)
(Optional- don't need if following "quickie" procedure) general.fasta | fasta sequences in your large, general comrpehensive taxonomy database general.taxonomy | taxonomy names in your large, general comprehensive taxonomy database - otus.fasta | fasta sequences for each of your OTUs (OTUs can be clustered or unique sequences) - otus.abund | relative abundance of each OTU (i.e. the OTU table) + custom.fasta | fasta sequences in your small, ecosystem-specific taxonomy database + custom.taxonomy | taxonomy names in your small, ecosystem-specific taxonomy database **Move everything into the same folder, and make that your working directory.** -In other words, create a folder that all the tax-scripts and database and data files are inside of, -and then use that as your present working directory (`pwd`) to source all the scripts from. -
-It might also help to rename your files to match the above names so that you can copy and paste commands from this workflow. +To make your life easier, create a new folder that contains the tax-scripts, the database and data files listed above, and nothing else. Then navigate to this folder in the terminal to make it your present working directory (`pwd`). Also, rename your files to match the above names so that you can copy and paste commands from this workflow. _____________________________________________ -## seqID formats +## OTU files + +How to format `otus.fasta` and `otus.abund`, including specific directions for converting mothur and dada2 file formats. +

-General notes on the seqIDs in all files: +### Format your sequence ID's -OTU seqID's: +The seqIDs in your otus.fasta and otus.abund files must follow these requirements: - **cannot contain any whitespace** BLAST will call some parts the seqID and some parts comments if they're separated. @@ -245,106 +263,45 @@ OTU seqID's: - **must match between the otus.fasta file and the otus.abund file** (though consistent ordering is not necessary.) -_____________________________________________ - -## .taxonomy files - -### `.taxonomy` file format - -Must be compatible with mothur, example format: - -- no whitespace except for the tab between seqID and taxonomy -- taxonomy level names separated by semicolons -- must have a semicolon at the end of each line, too! - -``` -seqID kingdom;phylum;class;order;family;genus;species; -``` - -___________ - -#### Format the Silva Taxonomy Database: - -Downloading from mothur, you shouldn't need to worry about delimiters etc, -this will be in the correct format. -BUT, SILVA uses inconsistent nomenclature for sequences that don't have names. -We **strongly recommend re-formatting them into a more consistent nomenclature** using -the script `force_consistent_unclassifieds_on_database.R`. - -Full Command (type in terminal): - -``` -Rscript force_consistent_unclassifieds_on_database.R silva.nr_v132.tax general.taxonomy -``` - -command argument | description --------------------|---------------------------------------------- -silva.nr_v132.tax | input file. the mothur-formatted silva taxonomy with inconsistent names. If you are unaligning the fasta, you can do this step before or after removing the periods from the seqID names. -general.taxonomy | silva taxonomy with consistent unclassified names that is ready-to-taxass! - -Comments in the script are very detailed as to examples of the changes. 
-The new missing names are easier to deal with because they follow the -consistent format where they begin with either: - -* unnamed -* unknown -* uncultured -* unclassified - -And are followed by a period and the closest available parent-name. All the names that get changed will be spit out into the terminal output. Note there will be a lot due to the format "knownphylum_cl" for an unknown class name. These all get changed to "unnamed.knownphylum" to follow a consistent and easy-to-grep nomenclature. - ___________ -#### Format the Greengenes Taxonomy Database - -Full 4 Commands (type in terminal): - -``` -reformat_greengenes.sh general.taxonomy -``` - -command argument | description --------------------|---------------------------------------------- -`general.taxonomy` | the downloaded greengenes file: `gg_13_5_taxonomy.txt` - -Output is a file called `general.taxonomy` that is formatted correctly. - - -______________________ - -## .fasta files: +
-### `.fasta` file format +### Format `otus.fasta` These should be in fasta format: -- carrot before the seqID (formatted according to seqID formats tab) +- carrot before the seqID - a new line separating seqID from sequence - sequence can be one line or multiple lines - OTU sequences should not be aligned (taxonomy reference can be aligned) -example correct format: +Example of correct format: ``` >seqID TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAG >seqID TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAG ``` +___________ + +
-#### How to un-align: +### Un-align `otus.fasta`: -The dashes in aligned sequences don't work with BLAST. -An Aligned file would look something like this: +If you are using mothur to process your data, the fasta files will be aligned, but the dashes in aligned sequences don't work with BLAST. +An Aligned file looks something like this: ``` >SRR1531609.6 GAG-G-A-A--TA-TT--GG-T-C----AA-T-G-G--GC-----GC-A--A---G-C-C-T-G-A-A-C-C-A---GC-C--A--T-GCC-G-A-G-T------G-C-A--G--GA-----------------------------T-G--A--C--G-G-TC----C---TA-TG-----G-A-T-----T-G-T-A---AA-C-T-GC--------------------TT-TT-G-T--A-CAG----G--A-A--G---AA-ACAC-T---C-C-C--T---------------------C----------------------------GT---------------------------G------------------------A-GGG-A-GC-T-T-G-A-C-G-----G-T---A-C-TG--------T-A-A-G---- -``` - +``` +
You can remove hyphens using this terminal command: -Full Command (Type in Terminal): +Full 2 Commands (Type in Terminal): ``` sed 's/-//g' otus.fasta +sed 's/-//g' otus.count_table ``` command | description @@ -356,41 +313,14 @@ command | description `aligned.fasta` | the name of your aligned fasta file `otus.fasta` | the name of the reformatted file you're creating (note it must have a different name than input) -**NOTE: this removes hyphens from everywhere!** If you had hyphens in your seqID names you have to also remove -hyphens from the names in your abund file. SeqIDs must match exactly btwn the two files (tho order doesn't matter) - -So may also want to do: -``` -sed 's/-//g' otus.count_table -``` +**NOTE: this removes hyphens from everywhere!** In case there are hyphens in any seqID names, you have to also remove +hyphens from the names in your abund file (the second terminal command). SeqIDs must match exactly btwn the two files. -
- -#### Optionally you can un-align the silva taxonomy reference to make it smaller: - -The silva reference downloaded from mothur is aligned. Using this file directly for taxonomy assignment works just fine. BUT, if you want to save 10GB of space you don't need it aligned for taxonomy assignment so you can un-align it using the shell script `un-align_silva.sh`. This reduces the file size (for v132) from **10.68 GB to 318 MB.** _dang!_ It takes about 7 minutes to run this script to remove hyphens and semicolons, and it saves about 7 minutes later on during taxonomy assignment. None of the intermediate files change sizes, the processing time breaks even, so this is optional, just saves you 10GB of space if you're keeping the reference around. - -Example syntax: -``` -./un-align_silva.sh silva.nr_v132.align general.fasta silva.nr_v132.tax general.taxonomy -``` - -command | description -----------|---------------------------------------------- -`un-align_silva.sh` | script located in tax-scripts folder -`silva.nr_v132.align` | the silva database fasta file downloaded from mothur -`general.fasta` | a new name for the output unaligned database. -`silva.nr_v132.tax` | the silva database taxonomy file downloaded from mothur
Removing the `.`'s to un-align the fasta also removes dots from the sequence ID name. So need to remove the same dots from the sequence IDs in the taxonomy file. -`general.taxonomy` | a new name for the output unaligned database. +___________
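The `sed` command above strips hyphens from every line, seqID headers included, which is why the count table needs the same treatment. As a hedged alternative (not part of the TaxAss scripts), the sketch below removes gap characters from sequence lines only, so seqIDs stay untouched. The function name `unalign_fasta_lines` is hypothetical, and treating both `-` and `.` as gap characters is an assumption based on how mothur alignments are usually written.

```python
def unalign_fasta_lines(lines):
    """Yield fasta lines with alignment gaps removed from sequence lines only.

    Assumption: gaps are written as '-' or '.'; header lines start with '>'.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: keep the seqID exactly as-is so it still
            # matches the seqIDs in the abundance table.
            yield line
        else:
            # Sequence line: drop the gap characters.
            yield line.replace("-", "").replace(".", "")


# Small illustration with a made-up aligned record.
aligned = [">SRR1531609.6", "GAG-G-A-A--TA-TT"]
for out_line in unalign_fasta_lines(aligned):
    print(out_line)
```

With this approach the seqIDs in the fasta file are not modified, so the abundance table would not need a matching edit. If you use the `sed` commands instead, keep both of them.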
-(It is possible I made up the word "un-align") -_____________________________________________ - -## .abund file - -### `.abund` file format +### Format `otus.abund` NOTE: you don't need this file if you are taking the "quickie" route and not running the validation steps. _i.e._ you're choosing a pident from the start, and just making the taxonomy table with no other checks. @@ -408,7 +338,7 @@ _i.e._ you're choosing a pident from the start, and just making the taxonomy tab - **no "totals" row/column** (as might have been added by mother) - **numbers only for all abundance values** - Note that if you had a sample that failed sequencing, there might be no reads left in it after QC. This sample should be removed before TaxAss or normalizing it could cause a NaN non-number value. + Note that if you had a sample that failed sequencing, there might be no reads left in it after QC. This sample should be removed before TaxAss or normalizing it could cause a `NaN` not-a-number value. ``` colname colname colname colname colname @@ -418,7 +348,9 @@ seqID Abundance Abundance Abundance Abundance ``` ___________ -### Format a mothur `.count_table` file +
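The `NaN` warning above can be illustrated with a minimal sketch. It assumes abundances are normalized to percent within each sample (values in a sample sum to 100) and that samples with zero reads should be set aside before dividing. The function name `normalize_samples` and the nested-dictionary layout are hypothetical, chosen only for illustration; this is not how the TaxAss R scripts store the table.

```python
def normalize_samples(table):
    """Convert read counts to percent abundance within each sample.

    table maps sample name -> {seqID: read count}.
    Returns (normalized, dropped): the percent table, plus the names of
    samples with zero total reads, which are skipped because dividing by
    their total would give not-a-number values.
    """
    normalized = {}
    dropped = []
    for sample, counts in table.items():
        total = sum(counts.values())
        if total == 0:
            # A failed sample: nothing to normalize, so set it aside.
            dropped.append(sample)
            continue
        normalized[sample] = {
            seq_id: 100.0 * n / total for seq_id, n in counts.items()
        }
    return normalized, dropped


# Made-up counts: sample "s2" lost all reads during QC.
counts = {"s1": {"otu1": 1, "otu2": 3}, "s2": {"otu1": 0, "otu2": 0}}
percent, failed = normalize_samples(counts)
print(percent)
print(failed)
```

Reporting the dropped samples, rather than silently removing them, makes it easier to notice a failed sequencing run before running TaxAss.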
+ +### Get `otus.abund` from a mothur `.count_table` If you used mothur for QC and are proceeding with unique sequences, as in the TaxAss paper, you can reformat the outs.count_table file this way: @@ -443,6 +375,154 @@ More notes on the mothur file types: Note that the .abund mothur file is a non-table-y space-saving format that you can't use here. Sorry for the confusing file extension choice. + +___________ + +
+ +### Get `otus.fasta` and `otus.abund` from a dada2 `seqtab_nochim` + +If you are using dada2 within RStudio, save the `seqtab_nochim` object as an R data structure: +``` +saveRDS(object = seqtab_nochim, file = "seqtab_nochim.rds") +``` + +You can convert this into both the otus.fasta and the otus.abund files this way: + +Full Command (type in terminal): +``` +Rscript reformat_dada2_seqtabs.R seqtab_nochim.rds otus.fasta otus.abund otus.count +``` + +command argument | description +-----------------------------------|---------------------------------------------- +`Rscript` | way to source an R script with arguments +`reformat_dada2_seqtabs.R` | Name of R script to run +`seqtab_nochim.rds` | saved dada2 object file +`otus.fasta` | name of created fasta file to feed into TaxAss +`otus.abund` | name of created relative abundance table to feed into TaxAss
note: this has been converted to relative abundance (normalized by sample so that abundance values in each sample sum to 100) +`otus.count` | name of created file that saves the total reads per sample. This is not used in TaxAss, but saves the information you would need to un-normalize your abundance data so that it's not lost. + + +_____________________________________________ + +## Silva files + +How to format the `general.taxonomy` and `general.fasta` files when using the silva database as your taxonomy reference. +

+ +### Format delimiters in `general.taxonomy` + +Downloading from mothur, you shouldn't need to worry about reformatting. +But if you are exporting the file from Arb yourself, it must be compatible with mothur: + +- no whitespace except for the tab between seqID and taxonomy +- taxonomy level names separated by semicolons +- must have a semicolon at the end of each line, too! + +Example of correct format: +``` +seqID kingdom;phylum;class;order;family;genus;species; +``` + +___________ + +
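The three formatting rules above can be checked with a short script before handing a file to mothur. This is a minimal sketch, not part of the TaxAss scripts: the function name `taxonomy_line_problems` and the wording of its messages are hypothetical, and it checks only the rules listed in this section.

```python
def taxonomy_line_problems(line):
    """Return a list of format problems found in one .taxonomy line.

    Checks only the rules listed above: one tab between seqID and
    taxonomy, no other whitespace, and a semicolon at the end of the line.
    """
    problems = []
    line = line.rstrip("\n")
    parts = line.split("\t")
    if len(parts) != 2:
        problems.append("expected exactly one tab between seqID and taxonomy")
        return problems
    seq_id, taxonomy = parts
    if any(ch.isspace() for ch in seq_id + taxonomy):
        problems.append("whitespace other than the separating tab")
    if not taxonomy.endswith(";"):
        problems.append("missing semicolon at end of line")
    return problems


# Made-up lines: the first is well formed, the second lacks the final semicolon.
print(taxonomy_line_problems("seqID\tkingdom;phylum;class;order;family;genus;species;"))
print(taxonomy_line_problems("seqID\tkingdom;phylum;class"))
```

To check a whole file, loop over its lines and report the line number of any line that returns a non-empty list.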
+ +### Format unclassifieds in `general.taxonomy` + +SILVA uses inconsistent nomenclature for sequences that don't have names. +We strongly recommend re-formatting them into a more consistent nomenclature using +the script `force_consistent_unclassifieds_on_database.R`. + +Full Command (type in terminal): + +``` +Rscript force_consistent_unclassifieds_on_database.R silva.nr_v132.tax general.taxonomy +``` + +command argument | description +-------------------|---------------------------------------------- +silva.nr_v132.tax | input file. the mothur-formatted silva taxonomy with inconsistent names. If you are unaligning the fasta, you can do this step before or after removing the periods from the seqID names. +general.taxonomy | silva taxonomy with consistent unclassified names that is ready-to-taxass! + +Comments in the script are very detailed as to examples of the changes. +The new missing names are easier to deal with because they follow the +consistent format where they begin with either: + +* unnamed +* unknown +* uncultured +* unclassified + +And are followed by a period and the closest available parent-name. All the names that get changed will be spit out into the terminal output. Note there will be a lot due to the format "knownphylum_cl" for an unknown class name. These all get changed to "unnamed.knownphylum" to follow a consistent and easy-to-grep nomenclature. + +___________ + +
+ +### Unalign `general.fasta` (optional): + +The silva reference downloaded from mothur is aligned. Using this file directly for taxonomy assignment works just fine and takes the same amount of time computationally. BUT, if you want to save 10GB of space you don't need it aligned for taxonomy assignment so you can un-align it using the shell script `un-align_silva.sh`. This reduces the file size (for v132) from **10.68 GB to 318 MB.** _dang!_ It takes about 10 minutes to run this script to remove hyphens and semicolons, and it saves about 10 minutes later on during taxonomy assignment. None of the intermediate files change sizes, the processing time breaks even, so this is optional, just saves you 10GB of space if you're keeping the reference around. + +Example syntax: +``` +./un-align_silva.sh silva.nr_v132.align general.fasta silva.nr_v132.tax general.taxonomy +``` + +command | description +----------|---------------------------------------------- +`un-align_silva.sh` | script located in tax-scripts folder +`silva.nr_v132.align` | the silva database fasta file downloaded from mothur +`general.fasta` | a new name for the output unaligned database. +`silva.nr_v132.tax` | the silva database taxonomy file downloaded from mothur
Removing the `.`'s to un-align the fasta also removes dots from the sequence ID name. So need to remove the same dots from the sequence IDs in the taxonomy file. +`general.taxonomy` | a new name for the output unaligned database. + +
+(It is possible I made up the word "un-align") + +_____________________________________________ + +## Greengenes files + +How to format the `general.taxonomy` and `general.fasta` files when using Greengenes as your general database. +

+ +### Format `general.taxonomy` + +The Greengenes file needs to be in a format compatible with mothur: + +- no whitespace except for the tab between seqID and taxonomy +- taxonomy level names separated by semicolons +- must have a semicolon at the end of each line, too! + +Example of correct format: +``` +seqID kingdom;phylum;class;order;family;genus;species; +``` + +To convert Greengenes to a mothur-compatible format: + +Full Command (type in terminal): + +``` +./reformat_greengenes.sh gg_13_5_taxonomy.txt +``` + +command argument | description +-------------------|---------------------------------------------- +`gg_13_5_taxonomy.txt` | the downloaded greengenes file: `gg_13_5_taxonomy.txt` + +Output is a file called `general.taxonomy` that is formatted correctly. + +___________ + +
+ +### Format `general.fasta` + +This file should not require additional formatting. If for some reason you get errors, check that it meets the fasta requirements explained in the OTU files tab. + __________________________________________________________________________________________ diff --git a/tax-scripts/TaxAss_Directions.html b/tax-scripts/TaxAss_Directions.html index ad543ab..5af69c5 100644 --- a/tax-scripts/TaxAss_Directions.html +++ b/tax-scripts/TaxAss_Directions.html @@ -14,18 +14,1274 @@ TaxAss Workflow