
Global distribution of earthworm diversity

Phillips et al., 2019 Science

This work amalgamates data from various sources, i.e., data published in the literature or data sent directly by 'data providers'.

I have tried to write this code so that it works on anyone's machine, but at various points it will not work very well, or at all. Sorry. The main reason is that I cannot provide the global data layers (of climate, soil, etc.), so these variables are instead included in the dataframe that we have made open access. To recreate the maps, however, the data layers themselves would be needed. Also, when making the maps I relied on a pipeline from a co-author, Carlos, which split the data layers into chunks for speed. This pipeline has not been made open access.

In theory, the input for one script is the output of the previous (or an earlier) script. Each script creates an output folder for its data, and places all figures in a separate folder, as in the sketch below.
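
A minimal sketch of that convention (folder and file names here are illustrative, not the repository's actual ones):

```r
# Illustrative only: each script writes data to one folder and figures to another.
out_dir <- "output"
fig_dir <- "figures"
for (d in c(out_dir, fig_dir)) if (!dir.exists(d)) dir.create(d)

sites <- data.frame(SiteID = 1:3, SpeciesRichness = c(2, 5, 1))  # placeholder data
write.csv(sites, file.path(out_dir, "sites_cleaned.csv"), row.names = FALSE)

pdf(file.path(fig_dir, "richness_hist.pdf"))
hist(sites$SpeciesRichness, main = "Site-level species richness")
dev.off()
```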

Please feel free to raise any issues.

Erratum

The Phillips et al., 2019 Science paper has now had an Erratum issued. Unfortunately, a bug was introduced in the code during the original data preparation. Please see the Erratum README, which provides additional detail.

Apart from changes related to the bug, the majority of the code remains the same, with one exception: the mixed effects models (Script 8 onwards). For these, the lme4 package was no longer appropriate, so zero-inflated models from the glmmTMB package were used instead.
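
For anyone unfamiliar with glmmTMB, here is a minimal, hypothetical sketch of a zero-inflated mixed model of that general form; the variable names and simulated data are assumptions, not the erratum's actual model formula:

```r
library(glmmTMB)

# A zero-inflated Poisson GLMM with a random intercept per study.
set.seed(1)
sites <- data.frame(
  SpeciesRichness = rpois(200, 2) * rbinom(200, 1, 0.7),  # zero-inflated counts
  AnnualMeanTemp  = rnorm(200),
  Study           = factor(sample(letters[1:10], 200, replace = TRUE))
)

mod <- glmmTMB(
  SpeciesRichness ~ AnnualMeanTemp + (1 | Study),
  ziformula = ~ 1,     # constant zero-inflation probability
  family    = poisson,
  data      = sites
)
summary(mod)
```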

Two additional files have been added to this repository:

0.1_DataCheck.R

Having identified and fixed the bug, this script was used for checking the underlying data to ensure that the bug had been fixed as intended, and that everything else was correct.

erratumFigures.R

This script firstly calculates the number of datasets that were affected by the bug. Then creates Figure 1 from the erratum.

Files

0_GetData.R

This script accesses my personal Google Drive to download all the files that contain the data. This will not work for anyone else.

(To replicate this analysis, the data needs to be downloaded from the iDiv Data portal (see published erratum for DOI), and steps 1-5 are no longer necessary.)

Once all the data has been downloaded, the first stage of cleaning is undertaken (i.e., ensuring all columns are the correct class). Three files are created: a bibliography, a site-level dataframe and a species-level dataframe.

For the site-level dataframe, the site-level metrics (species richness, abundance and biomass) are calculated.
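
As a rough illustration of that step (the column names and data are invented, not the repository's actual schema), site-level metrics can be derived from the species-level records like so:

```r
# Hypothetical species-level records for two sites.
species <- data.frame(
  Site      = rep(c("A", "B"), each = 3),
  Species   = c("L.terrestris", "A.caliginosa", "A.rosea",
                "L.terrestris", "L.rubellus", "A.caliginosa"),
  Abundance = c(10, 4, 2, 7, 1, 5),
  Biomass_g = c(3.1, 0.8, 0.3, 2.2, 0.2, 1.0)
)

# One row per site: richness, total abundance, total biomass.
site_metrics <- data.frame(
  SpeciesRichness = tapply(species$Species, species$Site,
                           function(x) length(unique(x))),
  Abundance       = tapply(species$Abundance, species$Site, sum),
  Biomass_g       = tapply(species$Biomass_g, species$Site, sum)
)
site_metrics
```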

1_AddCHELSA.R

Based on the coordinates of each site in the site-level metrics, the relevant CHELSA data (http://chelsa-climate.org/bioclim/) are appended. The CHELSA data had been downloaded previously.
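
The appending step is, in essence, a point-extraction from raster layers. A hedged sketch, assuming the CHELSA GeoTIFFs are already on disk (paths and site coordinates here are placeholders); 2_AddSoilGrids.R follows the same pattern:

```r
library(raster)

# Stack all previously downloaded CHELSA layers from a local folder.
chelsa <- stack(list.files("CHELSA", pattern = "\\.tif$", full.names = TRUE))

# Placeholder site coordinates; extract() wants longitude first.
sites   <- data.frame(Longitude = c(12.5, -1.9), Latitude = c(51.3, 47.0))
bioclim <- extract(chelsa, sites[, c("Longitude", "Latitude")])  # one row per site

sites <- cbind(sites, bioclim)
```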

2_AddSoilGrids.R

Based on the coordinates of each site in the site-level metrics, the relevant SoilGrids data (https://soilgrids.org/) are appended. The SoilGrids data had been downloaded previously.

3_AddOtherVariables.R

Based on the coordinates of each site in the site-level metrics, data from other global data layers are appended. For now, please see the manuscript for details of the other data layers.

4_SpeciesNames.R

This creates a dataframe of every species with an ID, along with some additional data (such as who provided the data), so that our earthworm experts could harmonise the names.

5_DataCheck.R

Most importantly, this script converts the biomass and abundance values to common units. It also renames factor levels and performs some basic checks.
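
Purely as an illustration of the unit-conversion idea (the unit labels, column names and conversion factors below are assumptions, not the script's actual logic):

```r
# Hypothetical abundance records reported per different sampled areas.
sites <- data.frame(
  Abundance     = c(30, 120, 8),
  AbundanceUnit = c("per 0.25 m2", "per m2", "per 0.0625 m2")
)

# Lookup of sampled area per unit label, then harmonise to individuals per m^2.
area_m2 <- c("per 0.25 m2" = 0.25, "per m2" = 1, "per 0.0625 m2" = 0.0625)
sites$Abundance_per_m2 <- sites$Abundance / area_m2[sites$AbundanceUnit]
sites
```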

6_DataExploration.R

Basic exploration of the data. Also converts pH values to a common unit, converts snow cover to a categorical variable, and log-transforms the response variables.

7_MeasuredversusSoilGrids.R

This script investigates whether it is appropriate to use a model that contains a mixture of site-level sampled soil properties and the soil properties from SoilGrids.
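
One hypothetical way to make that check, assuming paired measured and SoilGrids values per site (the values here are invented), is to look at their agreement directly:

```r
# Invented example values: if sampled and SoilGrids pH track each other
# closely, mixing the two sources within one model is more defensible.
sites <- data.frame(
  pH_measured  = c(5.1, 6.3, 7.0, 4.8, 6.1),
  pH_soilgrids = c(5.4, 6.0, 6.8, 5.2, 6.3)
)

cor(sites$pH_measured, sites$pH_soilgrids)
summary(lm(pH_measured ~ pH_soilgrids, data = sites))
```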

8.1_ModellingRichness.R & 8.1_submitScript_ModellingRichness.sh

Script for the site-level model of species richness. As this was run on the HPC, the submit script is also present.

8_Modelling.R

Script for all the site-level models (species richness, abundance and biomass).

9_MainModelsFigures.R

Script that produces some figures from the main site-level models.

10_DataForGlobalLayers.R

Creates a dataframe of just the predictor data that was used in the models.

10.x_MapCoefficients_xxx.R & 10.x_submitScript_MapCoefficients_XX.sh

Scripts that take the relevant site-level model and create a raster of the predicted values globally (run on the HPC, so the submit scripts are included). For this, Carlos had split the underlying global data layers into regions, so each script/model runs for a single region.
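
In outline, each of these scripts does something like the following fully synthetic sketch, in which a toy raster stack and a toy model stand in for the real global layers and the real site-level model:

```r
library(raster)

# Toy "region" stack with one predictor layer (names must match the model's).
r <- raster(nrows = 10, ncols = 10)
values(r) <- rnorm(ncell(r))
region_stack <- stack(r)
names(region_stack) <- "AnnualMeanTemp"

# Toy model in place of the real fitted site-level model.
toy_data <- data.frame(y = rnorm(50), AnnualMeanTemp = rnorm(50))
mod <- lm(y ~ AnnualMeanTemp, data = toy_data)

# Predict over the whole region and save the resulting raster.
pred <- raster::predict(region_stack, model = mod)
writeRaster(pred, "predicted_region_01.tif", overwrite = TRUE)
```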

11_CreateMap.R

Using the predicted regions of the globe, the final maps are assembled and saved.

11.1_ValuesFromMap.R

From the predicted map values, calculating the means, SDs, etc.
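
A hedged sketch of both steps, assuming the per-region predictions were saved as individual rasters (the file names are placeholders):

```r
library(raster)

# Placeholder file names; assumes two or more per-region prediction rasters.
files   <- list.files("output", pattern = "^predicted_region_.*\\.tif$",
                      full.names = TRUE)
regions <- lapply(files, raster)
global  <- do.call(merge, regions)        # mosaic the regions into one raster

cellStats(global, mean)                   # global mean of the predicted values
cellStats(global, sd)
writeRaster(global, "output/global_prediction.tif", overwrite = TRUE)
```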

12_MainModelsVariableImportance.R

Using randomForest models, the three main models are reconstructed and the most important variables are identified.
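
A minimal sketch with simulated data (the variables are invented, not the models' actual predictors):

```r
library(randomForest)

# Refit a response with randomForest and rank predictors by importance.
set.seed(1)
d <- data.frame(
  richness = rpois(300, 3),
  temp     = rnorm(300),
  pH       = rnorm(300),
  clay     = rnorm(300)
)

rf <- randomForest(richness ~ ., data = d, importance = TRUE)
importance(rf)    # e.g. %IncMSE per predictor
varImpPlot(rf)
```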

13_CrossValidationMainModels.R

For each of the three main models, 10-fold cross-validation is performed.
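
A generic 10-fold cross-validation sketch (with a toy linear model and simulated data, not the repository's exact implementation): refit on nine folds, predict the held-out fold.

```r
set.seed(1)
d <- data.frame(y = rnorm(100), x = rnorm(100))

# Randomly assign each row to one of 10 folds.
folds <- sample(rep(1:10, length.out = nrow(d)))

rmse <- sapply(1:10, function(k) {
  fit  <- lm(y ~ x, data = d[folds != k, ])        # train on 9 folds
  pred <- predict(fit, newdata = d[folds == k, ])  # predict held-out fold
  sqrt(mean((d$y[folds == k] - pred)^2))
})
mean(rmse)
```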

14_CrossValidationSoilGrids.R

The three site-level models (species richness, abundance and biomass) are reconstructed, but using only the SoilGrids soil properties data. 10-fold cross-validation is then performed on these models.

14.1_SoilversusSoilGrids_richness.R & 14.1_submitScript_SoilversusSoilGrids_richness.sh

The species richness SoilGrids model and its cross-validation are performed on the HPC.

15_CreateFunctionalDataset.R

Using the species-level dataset, the harmonised species names and functional groups are appended. This is a messy script! It also uses a harmonised-names dataframe, which has not been made publicly available (the data that have been made open access already include the harmonised names).

16_FunctionalGroupsDataset.R

After some initial cleaning, the richness, abundance and biomass of each functional group present is calculated for each site.

17_LatitudinalDiversityGradient.R

Based on the species names, calculating the number of species within each equally sized latitudinal band across the globe (a sketch of both banding schemes follows 17.1 below).

17.1_LDG_fixedNumber.R

Based on the species names, calculating the number of species within each zone of the globe, where each zone contains the same number of sites.
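
A hypothetical illustration of the two banding schemes (the latitudes, band width and zone count are placeholders):

```r
set.seed(1)
lat <- runif(500, -60, 70)   # placeholder site latitudes

# Script 17: equal-width latitudinal bands, e.g. 10 degrees wide.
band_fixed <- cut(lat, breaks = seq(-90, 90, by = 10))

# Script 17.1: zones holding (approximately) equal numbers of sites.
n_zones    <- 10
band_equal <- cut(lat,
                  breaks = quantile(lat, probs = seq(0, 1, length.out = n_zones + 1)),
                  include.lowest = TRUE)

table(band_fixed)
table(band_equal)
```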

18_FunctionalGroupAnalysis.R

Models based on the functional groups at each site.

19_FunctionalGroupFigures.R

Figures for the models based on the functional groups.

20_Authorships.R

Determining which data we used in this analysis, in order to contact the relevant people for authorship.

21_ReEmailingAuthors.R

We made initial contact (the 6 'sent' dataframes), then wanted to re-email those who had not responded.

22_NematodePipelineData.R

Getting the data together to send to JvdH, DR and TGC, so they can create the uncertainty maps.

23_OpenData.R

Script that cleans the data ready to make it open access.

AuthorsandInstitutes.R

Creates a CSV of all the co-authors and their institutes, ready for the manuscript.

AuthorsandInstitutes_part2.R

After I have manually corrected the names and addresses of the authors, this script assigns the relevant numbers to them, so the output can be copied into the Word document (where the final formatting is done).

DownloadSoilGrids.R

Script to automatically download the SoilGrids data.
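
In spirit, such a script loops over layer files with download.file; the base URL and layer names below are hypothetical placeholders, not the real SoilGrids endpoints (see DownloadSoilGrids.R for the layers actually used):

```r
# Hypothetical URL and layer names, for illustration only.
base_url <- "https://example.org/soilgrids"
layers   <- c("PHIHOX_M_sl1_5km.tif", "CLYPPT_M_sl1_5km.tif")

dir.create("SoilGrids", showWarnings = FALSE)
for (f in layers) {
  download.file(file.path(base_url, f),
                destfile = file.path("SoilGrids", f),
                mode = "wb")  # binary mode for GeoTIFFs
}
```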

FundingAndAcknowledgements.R

For all the funding and acknowledgements that were given by the co-authors, this script formats the information and creates a text string that can be copied into the Word document.

Meta-data.R

This script calculates all the numbers reported in the manuscript (e.g., how many sites, how many countries).

Folders

Functions

Contains all the functions needed for the analysis

PreparingGlobalLayers

This contains a whole load of submit scripts, because I ran out of time one weekend, needed a whole load of jobs, and didn't know how to write a nicer submit script.

Credits

Thanks to Carlos, who also wrote some Python code for some of the analysis (not currently in this git repo).

Also, thanks to Johan van der Hoogen, Devin Roth and Thomas Crowther, who used their own pipeline to create some maps for our data.

License

Apache License.

If you use this code, it would be great if you could link back to my account.