colinaverill/NEFI_microbe

 
 

NEFI_microbe

This Git repository contains code to replicate the analyses and findings presented in Averill C., Werbin Z.R., Atherton K.F., Bhatnagar J.M. and Dietze M.C. 202x. Soil microbiome predictability increases with spatial and taxonomic scale.

All paths to data products are defined in the paths.r file. To replicate these analyses on your own machine, change the location that the master data directory points to in that file. The file is also set up to work across multiple machines by selecting paths based on each machine's hostname.
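As a rough illustration of the hostname-based approach, here is a minimal sketch in Python. The repository's own paths.r is written in R and is not reproduced here. The hostnames, directory names, and function names below are hypothetical placeholders, not values taken from paths.r.

```python
import socket

# Hypothetical hostname -> master data directory table.
# None of these names come from paths.r; replace them with your own.
DATA_DIRS = {
    "my-laptop": "/Users/me/NEFI_data/",
    "cluster-login1": "/scratch/example/NEFI_data/",
}
DEFAULT_DATA_DIR = "/tmp/NEFI_data/"


def data_dir_for(hostname, table=DATA_DIRS, default=DEFAULT_DATA_DIR):
    """Return the master data directory configured for a given hostname."""
    return table.get(hostname, default)


def data_path(relative, hostname=None):
    """Build the full path to a data product under this machine's data directory."""
    host = socket.gethostname() if hostname is None else hostname
    return data_dir_for(host) + relative
```

With a table like this, every downstream script can ask for a data product by its relative name, and only the table needs editing when the code moves to a new machine.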

This project is one component of a larger project, the Near Term Ecological Forecasting Initiative (NEFI). The project repository is divided into four main subdirectories:

data_construction contains code to build analysis-ready data sets from raw data products.

data_analysis contains code to perform all analyses described in the text of the manuscript.

figure_scripts contains code to generate figures in the manuscript.

NEFI_functions contains several custom functions that are called throughout the above scripts, as well as the bbmap set of tools provided by the Department of Energy Joint Genome Institute.

In addition to these main directories, there is a directory called transfer_scripts which contains code to sync data directories, as this project was developed on multiple computers. This directory can be ignored.

Below we detail how each component of this repository works and how it can be used to replicate our analyses. The code contained in data_construction has the most dependencies, requires substantial amounts of computing time, and will be the most challenging to replicate, as many files will need to be configured to work with your particular computing resources. We are happy to provide analysis-ready data files if you would prefer to skip these steps and jump straight to data_analysis. The data analysis still takes a substantial amount of computing time, as Bayesian statistical models are slow to fit.

data_construction

This subdirectory is further divided into code used to work up data for the bacterial analysis, code used to work up data for the fungal analysis, and code for querying and aggregating environmental data from the U.S. National Ecological Observatory Network (NEON).

  • 1._NEON_env/: this directory contains code to query the raw NEON data products that contain the environmental covariates needed to generate bacterial and fungal forecasts. It also contains code to hierarchically aggregate these data from the core scale to the plot and site scales as appropriate, and to propagate the associated uncertainties. Scripts within 1._covariate_data_acquisition/ should be run before scripts within 2._covariate_aggregation/.

  • 2._ITS/: this directory contains code to process fungal sequence datasets and associated environmental covariates.

    • tedersoo_ITS_sequence_processing/ Tedersoo et al. 2014, Science, is used as our fungal calibration data set. The code downloads DNA sequences from the SRA database, processes sequences into unique sequence variants using the dada2 pipeline, assigns taxonomy using the UNITE database, and functional groups using the FUNGuild database. There is also code to standardize the mapping file that contains associated environmental covariates so it works with downstream scripts.
    • NEON_ITS_sequence_processing/ processes fungal sequences generated by NEON. Raw sequence data were provided directly by NEON, and the files can be provided upon request. The code processes sequences into unique sequence variants using the dada2 pipeline, assigns taxonomy using the UNITE database, and assigns functional groups using the FUNGuild database. NEON environmental covariates are processed separately within the NEFI_microbe/data_construction/NEON_env/ subdirectory.
  • 3._16S/: this directory contains code to process bacterial sequence datasets and associated environmental covariates. Before running code in the subdirectories, the user should run the script 1._create_tax_to_function_reference.r, which creates a necessary file used to classify bacterial taxonomic groups into functional ones.

    • NEON_16S/ Downloads bacterial DNA sequences from the NEON API, processes these sequences through the dada2 pipeline, assigns taxonomy using Greengenes, and assigns function using in-house scripts. Some further data processing and aggregation follows. Scripts should be run in the order they are numbered.
    • prior_synthesis/ This subdirectory contains code to build the bacterial calibration dataset. To do so, it processes data from two studies and then merges them. Users should run the subdirectories in their numbered order; within each subdirectory, the scripts are also numbered and should be run in order. We realize it's complicated, and we're sorry. Furthermore, the scripts within 1._delgado/ require data files provided by those authors. We can share these data files for purposes of replicating our analysis.
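To make the core-to-plot-to-site aggregation described for 1._NEON_env/ more concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the method used by the repository's R scripts. It assumes each core observation is a (site, plot, value) tuple, summarizes uncertainty as a standard error of the mean, and combines plot-level standard errors in quadrature as if plots were independent. All function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def mean_and_se(values):
    """Mean and standard error of the mean; the SE is 0.0 for a single value."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return m, se


def aggregate(cores):
    """Aggregate core-level observations to plot and site scales.

    cores: iterable of (site, plot, value) tuples.
    Returns (plot_stats, site_stats); each maps an ID to (mean, se).
    """
    # Core -> plot: group cores by (site, plot) and summarize each group.
    by_plot = defaultdict(list)
    for site, plot, value in cores:
        by_plot[(site, plot)].append(value)
    plot_stats = {key: mean_and_se(vals) for key, vals in by_plot.items()}

    # Plot -> site: average the plot means and carry the plot SEs forward.
    by_site = defaultdict(list)
    for (site, _plot), stats in plot_stats.items():
        by_site[site].append(stats)

    site_stats = {}
    for site, stats in by_site.items():
        n_plots = len(stats)
        site_mean = mean(m for m, _ in stats)
        # Assumption: plots are independent, so SEs add in quadrature.
        site_se = sqrt(sum(se ** 2 for _, se in stats)) / n_plots
        site_stats[site] = (site_mean, site_se)
    return plot_stats, site_stats
```

This sketch carries forward only the within-plot sampling error. A fuller treatment would also account for between-plot variation within a site.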

data_analysis

This directory is subdivided into two sub-directories, ITS/ and 16S/, used for performing fungal and bacterial analyses, respectively. Within each sub-directory there is a roughly identical set of 11 scripts to perform the analyses reported in the manuscript.

  • 01._global_calibration_fit This script fits a set of Bayesian statistical models to taxonomic and functional groups of soil microbes, using the calibration datasets (i.e., not the NEON data).
  • 02._global_validation_fcast This script uses the statistical models generated from the global calibration fit to forecast the relative abundances of soil microbes at the NEON sites. These forecasts are made at the core, plot and site scales.
  • 03._NEON_CV_core_fit This script divides NEON core-level observations into training and validation sets, then fits statistical models to taxonomic and functional groups of soil microbes as done in the first script, but using the randomly selected subset of NEON core-level training observations.
  • 04._NEON_CV_plot_fit This script divides NEON plot-level observations into training and validation sets, then fits statistical models to taxonomic and functional groups of soil microbes as done in the first script, but using the randomly selected subset of NEON plot-level training observations.
  • 05._NEON_CV_core_fcast This script uses the core-level models fit in script 3, and validates them against the validation data set at the core level.
  • 06._NEON_CV_plot_fcast This script uses the plot-level models fit in script 4, and validates them against the validation data set at the plot level.
  • 07._Morans_I_NEON This script calculates Moran's I statistics for all taxonomic and functional groups across the NEON network.
  • 08._predictor_importance This script performs a principal component analysis of the model parameters fit across taxonomic and functional groups to understand predictor importance. Within the principal component analysis, all model parameters are mean centered and variance scaled to facilitate comparisons among them.
  • 09._variance_decomp This script decomposes sources of model uncertainty into observation uncertainty (how accurate your observations are), parameter uncertainty (fundamentally linked to sampling effort), and process uncertainty (residual uncertainty that cannot be explained by parameter or observation uncertainty).
  • 10._summarize_results This script generates r-squared and RMSE statistics for all taxonomic and functional groups modeled.
  • 11._parameter_vs_rsq This script performs regressions of model parameter magnitudes against r-squared values across taxonomic and functional groups, based on the calibration model. The goal is to understand whether there are correlations between the sensitivity of certain taxonomic or functional groups to environmental drivers and their associated predictability.
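Several of the statistics named in the scripts above have short textbook definitions. The sketch below writes them out in Python for orientation. It covers RMSE and r-squared (script 10), mean centering and variance scaling (script 08), and Moran's I (script 07). These are the standard formulas and are assumed here for illustration. The repository's R scripts may compute these quantities differently, for example with a different r-squared definition or a different spatial weighting scheme. All function names are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def standardize(values):
    """Mean-center values and scale them to unit sample variance."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]


def morans_i(values, weights):
    """Moran's I for values x_i and a spatial weights matrix w_ij.

    weights is a square list of lists with zeros on the diagonal.
    """
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)
```

In the Moran's I function, the weights matrix encodes which observations count as spatial neighbors. A simple choice for illustration is 1 for adjacent locations and 0 otherwise.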

figure_scripts

This directory contains code to reproduce all figures, supplementary figures and supplementary data files in the manuscript.

NEFI_functions

This directory contains custom functions necessary to reproduce the findings of this manuscript.
