Taxonomic backbone for sPlot 2.1 and TRY 3.0

Oliver Purschke 12 August, 2017

This document describes the workflow (with contributions from Jürgen Dengler and Florian Jansen) that was used to generate the taxonomic backbone that standardizes taxon names across the (i) global vegetation plot database sPlot version 2.1 and (ii) the global plant trait data base TRY version 3.

To cite the backbone use: and the respective sources listed below.

Combine taxon names from sPlot 2.0 and TRY 3.0

Load required packages

library(stringr)
library(knitr)
library(vegdata)
library(doParallel)
library(dplyr)
library(plyr)
library(Taxonstand)
library(rgbif)

Read in taxon names from sPlot and TRY

Reading sPlot Species

path.sPlot <- "/))
home/oliver/Dokumente/PhD/PostPhD/IDiv/sDiv/sPlot/Analyses/Data/Species/
sPlot/sPlot_14_04_2015/"
sPlot.version <- "sPlot_14_4_2015_species"

splot.species <- read.csv(paste0(path.sPlot, sPlot.version, ".csv"), sep = "\t")
gc()

head(splot.species)

##   PlotObservationID Taxonomy Taxon.group Taxon.group.ID
## 1                 1 Pasl2012     Unknown              0
## 2                 1 Pasl2012     Unknown              0
## 3                 1 Pasl2012     Unknown              0
## 4                 1 Pasl2012     Unknown              0
## 5                 1 Pasl2012     Unknown              0
## 6                 1 Pasl2012     Unknown              0
##                Turboveg2.concept                Matched.concept Match
## 1     Allantoparmelia almquistii     Allantoparmelia almquistii     0
## 2             Andreaea rupestris             Andreaea rupestris     0
## 3       Arctoparmelia centrifuga       Arctoparmelia centrifuga     0
## 4             Cetraria nigricans             Cetraria nigricans     0
## 5           Cladonia amaurocraea           Cladonia amaurocraea     0
## 6 Cladonia arbuscula subsp. lat. Cladonia arbuscula subsp. lat.     0
##   Original.taxon.concept Layer Cover.. Cover.code x_
## 1                            0       1          1 NA
## 2                            0       2          2 NA
## 3                            0       5          5 NA
## 4                            0       2          2 NA
## 5                            0       1          1 NA
## 6                            0       2          2 NA

Use the Matched.concept column (Federhen 2010), as it already contains some standardization by Stephan Hennekkens according to synbiosys[1].

splot.species$Matched.concept<- as.character(splot.species$Matched.concept)

Read in TRY 3.0 species list

try3.species <- read.csv("/home/oliver/Dokumente/PhD/PostPhD/IDiv/sDiv/sPlot/Analyses/Data/
Species/TRY/Species/TRY30_Gapfilling2015_Species.csv")

head(try3.species)

##   AccSpeciesID               Species
## 1            1                 Aa sp
## 2            2    Abarema adenophora
## 3            3   Abarema barbouriana
## 4            4 Abarema brachystachya
## 5            6       Abarema jupunba
## 6            7         Abarema laeta

Combine species lists

spec.list.TRY.sPlot <- sort(unique(c(as.character(splot.species$Matched.concept),
                                     as.character(try3.species$Species))))
length(spec.list.TRY.sPlot)

Create dataframe that will later hold the original (uncleaned) names in sPlot and TRY:

spec.list.TRY.sPlot.2 <- cbind(spec.list.TRY.sPlot, spec.list.TRY.sPlot,
                               spec.list.TRY.sPlot)
str(spec.list.TRY.sPlot.2)
dim(spec.list.TRY.sPlot.2)
spec.list.TRY.sPlot.2 <- as.data.frame(spec.list.TRY.sPlot.2)

Convert factors into characters:

spec.list.TRY.sPlot.2[,1] <- as.character(spec.list.TRY.sPlot.2[,1])
spec.list.TRY.sPlot.2[,2] <- as.character(spec.list.TRY.sPlot.2[,2])
spec.list.TRY.sPlot.2[,3] <- as.character(spec.list.TRY.sPlot.2[,3])

Give column names[2]:

colnames(spec.list.TRY.sPlot.2) <- c("names.sPlot.TRY", "names.corr.string", "sPlot.TRY")
write.csv(spec.list.TRY.sPlot.2, file = "spec.list.TRY3.sPlot2.csv")

Add species' database affiliation

Species in sPlot 2.0:

spec.list.TRY.sPlot.2$sPlot.TRY[which(spec.list.TRY.sPlot.2$names.sPlot.TRY %in%
                                      unique(splot.species$Matched.concept))] <- "S"

Species in TRY 3.0:

spec.list.TRY.sPlot.2$sPlot.TRY[which(spec.list.TRY.sPlot.2$names.sPlot.TRY %in%
                                      unique(try3.species$Species))] <- "T"

Species that are both in sPlot 2.0 and TRY 3.0

spec.list.TRY.sPlot.2$sPlot.TRY[which(spec.list.TRY.sPlot.2$names.sPlot.TRY %in%
                                      unique(try3.species$Species) &
                                      spec.list.TRY.sPlot.2$names.sPlot.TRY %in%
                                      unique(splot.species$Matched.concept))] <- "ST"

A-priori cleaning of names

Manual cleaning

Based one the column Matched.concept in sPlot, a list of 4,093 "weird" species names (consisting mainly of trivial names) was generated, and corrected manually by Jürgen Dengler (thereafter JD). Some examples:

Gras silbrig Bl haarig 133106 → Poaceae sp. [silbrig Bl haarig 133106]
Ãƒâ€”Achnella species → Achnella sp.
[KHH Composite (breite Bl.)] → Asteraceae sp. [breite Bl.]
LICH Xanthomaculina hottentotta → Xanthomaculina hottentotta
cf. Silberknospe 134249 → Spermatophyta sp. [Silberknospe 134249]

String manipulation routines

Stripping unwanted characters as well as abbreviation (such as hybrid markers) which would prevent name matching:

OriginalNames <- gsub('*', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('cf. ', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('Cf. ', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('[', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(']', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(' x ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('Ã—', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('aff ', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('(', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(')', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(' cf ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(' aff. ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('c‚e', 'ceae', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('    ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('   ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('  ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('x-', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('X-', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('like ', '', OriginalNames, fixed=TRUE)  
OriginalNames <- gsub(',', '', OriginalNames, fixed=TRUE)

For all names, that have a number in their first word, and consist of > 1 words, remove that word:

library(stringr)
firstWordWithNumbers <- grepl('[0-9]', word(OriginalNames, 1))
numberOfWords <- sapply(gregexpr("\\W+", OriginalNames), length) + 1
OriginalNames[firstWordWithNumbers & numberOfWords > 1] <-
    sapply(OriginalNames[firstWordWithNumbers & numberOfWords > 1],
           function(x) substr(x, start=regexpr(pattern =' ', text=x)+1, stop=nchar(x)))

Correct some name abbreviations using taxname.abbr in vegdata:

CleanedNames <- taxname.abbr(OriginalNames)
save(CleanedNames, file = "CleanedNames.Rdata")
write.csv(CleanedNames, file = "CleanedNames.csv")
length(CleanedNames)

Create a data frame from the string cleaned names:

CleanedNames.df <- as.data.frame(CleanedNames, stringsAsFactors=FALSE)
spec.list.TRY.sPlot.3 <- cbind(spec.list.TRY.sPlot.2[,c(1,2,4)], CleanedNames.df,
                               CleanedNames.df)

Skip column names.corr.string, as the cleanednames.df column as well as another column that will include the names corrected by JD.

colnames(spec.list.TRY.sPlot.3) <- c("index","original.names.sPlot.TRY","sPlot.TRY",
                                     "CleanedNames","CleanedNames.Juergen")
Tax_Back_sPlot2_TRY3 <- spec.list.TRY.sPlot.3
save(Tax_Back_sPlot2_TRY3, file = "Tax_Back_sPlot2_TRY3.Rdata")
load("Tax_Back_sPlot2_TRY3.Rdata")
write.csv(Tax_Back_sPlot2_TRY3, file = "Tax_Back_sPlot2_TRY3.csv")
head(Tax_Back_sPlot2_TRY3)

Substitute a-priorily cleaned names with corrected names

names.correct.JD <- read.csv("weird.names_JD_string_correct.csv")
str(names.correct.JD)
head(names.correct.JD)
index <- match(names.correct.JD$weird.names, Tax_Back_sPlot2_TRY3$original.names.sPlot.TRY)
Tax_Back_sPlot2_TRY3$CleanedNames.JD[index] <- names.correct.JD$Name.corrected

Do some more cleaning on `CleanedNames.JD`

This was done manually in a csv-file.

CleanedNames.JD <- Tax_Back_sPlot2_TRY3$CleanedNames.JD
write.csv(CleanedNames.JD, file = "CleanedNames.JD.csv")

Substitute old `CleanedNames.JD` with new string-manipulated one:

Tax_Back_sPlot2_TRY3$CleanedNames.JD <- CleanedNames.JD
save(Tax_Back_sPlot2_TRY3, file = "Tax_Back_sPlot2_TRY3.Rdata")

Match names against Taxonomic Name Resolution Service (TNRS)

Slice `CleanedNames` into chunks

... of 5000 species:

seq1 <- seq(from =1, to = 120001, 5000)
seq2 <- seq(from =5000, to = 125000, 5000)

for(i in 1:length(seq1)) {
    write.csv(Tax_Back_sPlot2_TRY3$CleanedNames.JD[seq1[i]:seq2[i]],
              file = paste(paste("tnrs_submit", seq1[i], sep = "_"), "csv", sep = "."))
}

The single csv-files, containing 5,000 names each, were submitted to Taxonomic Name Resolution Service web application (Boyle et al. 2013, iPlant Collaborative (2015)). TNRS version 4.0 was used, which became available in August 2015 (this version also included The Plant List version 1.1).

TNRS settings

The following settings were used for resolving names on TNRS.

Sources for name resolution

The initial TNRS name resolution run was based on the five standard sources that were ranked according to preference in the following order (default of TNRS):

The Plant List (TPL)(The Plant List 2013)
The Global Compositae Checklist (GCC)(Flann 2009)
The International Legume Database and Information Service (ILDIS)(International legume database and information service 2006)
Tropicos (Missouri Botanical Garden 2013)
PLANTS Database (USDA)(USDA, NRCS 2012)

Because it is possible that the best match is found in lower ranked sources, see section TNRS settings, two additional name resolution runs were realized in which the highest ranking was given to (1) Tropicos, or (2) the sixth source available in TNRS, NCBI (The National Center for Biotechnology Information's Taxonomy database; (Federhen 2010)), respectively, see section TNRS settings.

Family Classification

Resolved names were assigned to families based on the APGIII classification (Chase & Reveal 2009), the same classification system used by Tropicos.

Retrieve results

Once the matching process was finished, results were retrieved from TNRS using the Detailed Download option that included the full name information (parsed components, warnings, links to sources, etc.). We retrived the single best match for each species, constrained by source (TNRS default), where the name in the first source was selected as best match, unless there was no suitable match found in that source, the match from the next lower-ranked source was selected, until all resources where exhausted.

Manually inspecting name matching results

Manually inspect the TNRS-results table in a spreadsheat application (i.e. LibreOffice or Excel). Starting with the highest taxonomic rank considered (i.e. Family). For instance, if manual checking of the TRNS output reveals that all accepted names or synonyms that have accuracy scores >0.9 are correct taxon names, use the following selection procedure:

Name_matched_rank (==Family)
Taxonomic_status (==Accepted, Synomyn)
Family_score (>0.9)

Continue this selection procedure for entries that were matched at lower taxonomic ranks, i.e. genus, species, etc..

Read and combine TNRS result files

Set up `doParallel` cluster to speed up the reading process

Find out how many cores are available:

detectCores()

## [1] 4

Create cluster with desired number of cores:

cl <- makeCluster(3)

Register cluster:

registerDoParallel(cl)

Find out how many cores are being used:

getDoParWorkers()

Read and combine the single files

Read the downloaded TNRS files, including 5000 names each[3] into R.

setwd("/home/oliver/Downloads/")
seq1 <- seq(from =1, to = 120001, 5000)
system.time(
    x <- foreach(i = 1:length(seq1), .combine = rbind) %dopar% {
        read.csv(paste(paste("/home/oliver/Downloads/tnrs.tpl", seq1[i], sep = "."),
                       "csv", sep = "."), sep = ",", stringsAsFactors = FALSE, skip = 0,
                 head = T)[-1, ]
    }
)
tnrs.tpl <- x

Assign index to first column:

tnrs.tpl$Name_number <- 1:length(tnrs.tpl$Name_number)

Select correctly resolved names

General procedure

Open tnrs.tpl in a spread sheat program and sort according to Name_matched_rank, Taxonomic_status and Family_score[4].
Repeat selection for entries matched at lower taxonomic ranks, such Name_matched_rank ==:
- forma
- genus
- infraspecies
- ...
Adjust accuracy score threshold values, e.g. use higher or lower values for infraspec., variety, ...

Family level

Manually inspect sorted table and select all entries at the highest hierarchical level (family). Manually identify the family accuracy score threshold value above which a name can be considered a correct name. In the following case, this corresponds to a score $>$0.88.

Create index for correct names

index.family <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "family" &
                               (tnrs.tpl$Taxonomic_status == "Accepted" |
                                tnrs.tpl$Taxonomic_status == "Synonym") &
                                tnrs.tpl$Family_score > 0.88), 1]
length(index.family)

Forma level

index.forma <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "forma"), 1]
length(index.forma)

Genus level

index.genus <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "genus" &
                              tnrs.tpl$Taxonomic_status == "Accepted" &
                              tnrs.tpl$Genus_score > 0.83), 1]
length(index.genus)

Infraspecies level

index.infraspec <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "infraspecies"), 1]
length(index.infraspec)

Species level

index.species <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "species" &
                                 (tnrs.tpl$Taxonomic_status == "Accepted" |
                                  tnrs.tpl$Taxonomic_status == "Synonym") &
                                  tnrs.tpl$Genus_score > 0.78 &
                                  tnrs.tpl$Name_score > 0.93), 1]
length(index.species)

Subspecies level

index.subspec <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "subspecies" &
                                (tnrs.tpl$Taxonomic_status == "Accepted" |
                                 tnrs.tpl$Taxonomic_status == "Synonym")), 1]
length(index.subspec)

Variety level

index.variety <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "variety" &
                                (tnrs.tpl$Taxonomic_status == "Accepted" |
                                 tnrs.tpl$Taxonomic_status == "Synonym")), 1]
length(index.variety)

Identifying "non-matched" species that are spermatophyta

index.spermatophyt <- tnrs.tpl[which(tnrs.tpl$Name_matched == "No suitable matches found."
                                     & word(tnrs.tpl$Name_submitted, 1) == "Spermatophyta")
                             , 1]
length(index.spermatophyt)

Select `certain` or `uncertain` names

Select names that do not fulfill the search criteria, i.e. that were not selected as certain species, for further name matching.

index.tpl <- c(index.family, index.forma, index.genus, index.species, index.subspec,
               index.variety, index.spermatophyt)
length((index.tpl))

tnrs.tpl.certain <- tnrs.tpl[index.tpl,]
dim(tnrs.tpl.certain)
save(tnrs.tpl.certain, file = "tnrs.tpl.certain.Rdata")
write.csv(tnrs.tpl.certain, file = "tnrs.tpl.certain.csv")

tnrs.tpl.uncertain <- tnrs.tpl[tnrs.tpl$Name_number %in% index.tpl == F, ]
dim(tnrs.tpl.uncertain)
save(tnrs.tpl.uncertain, file = "tnrs.tpl.uncertain.Rdata")
write.csv(tnrs.tpl.uncertain, file = "tnrs.tpl.uncertain.csv")

Generate list of uncertain species that are still to be resolved on TNRS:

write.csv(tnrs.tpl.uncertain[,2], file = "tnrs.tpl.uncertain.upload.csv")

Resolve 'uncertain' names using `Tropicos`

Because the matching procedure above, giving the highest ranking to the sources TPL, GCC and ILDIS (which seem to be the most reliable and up-to-date), will not always result in the match, in cases where better matching scores are achieved in lower ranked sources. Therefore the TNRS matching procedure was continued for the uncertain species from the previous step, but assigning the highest rank to the source Tropicos, while repeating the steps that were used to generate tnrs.tpl.[5]

Read in 'TNRS-Tropicos' results

setwd("/home/oliver/Downloads/")

system.time(
    x <- foreach(i = 1:length(seq1), .combine = rbind) %dopar% {
        read.csv(paste(paste("/home/oliver/Downloads/tnrs.trop", seq1[i], sep = "."),
                       "csv", sep = "."), sep = ",", stringsAsFactors = FALSE, skip = 0,
                 head = T)[-1, ]
    }
)

tnrs.trop <- x

str(tnrs.trop)
tnrs.trop$Name_number <- 1:length(tnrs.trop$Name_number)

Reduce to set of uncertain species `tnrs.tpl`

tnrs.trop.small <- tnrs.trop[tnrs.all.small.uncertain$Name_number, ]
str(tnrs.trop.small)

save(tnrs.trop.small, file = "tnrs.trop.small.Rdata")
write.csv(tnrs.trop.small, file = "tnrs.trop.small.csv")

Selecting correctly resolved names

Using the same procedure as above.

Family level

index.family <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "family" &
                                      (tnrs.trop.small$Taxonomic_status == "Accepted"|
                                       tnrs.trop.small$Taxonomic_status == "Synonym")), 1]
length(index.family)

Forma level

index.forma <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "forma" &
                                     (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                      tnrs.trop.small$Taxonomic_status == "Synonym")), 1]
length(index.forma)

Genus level

index.genus <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "genus" &
                                     (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                      tnrs.trop.small$Taxonomic_status == "Synonym") &
                                      tnrs.trop.small$Genus_score > 0.83 &
                                      tnrs.trop.small$Name_score > 0.5), 1]
length(index.genus)

Species level

index.species1 <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "species" &
                                        (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                         tnrs.trop.small$Taxonomic_status == "Synonym") &
                                         tnrs.trop.small$Genus_score > 0.88 &
                                         tnrs.trop.small$Name_score > 0.9), 1]
length(index.species1)

index.species2 <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "species" &
                                        (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                         tnrs.trop.small$Taxonomic_status == "Synonym") &
                                         tnrs.trop.small$Genus_score > 0.78 &
                                         tnrs.trop.small$Name_score > 0.94), 1]
length(index.species1)

index.species3 <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "species" &
                                        (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                         tnrs.trop.small$Taxonomic_status == "Synonym") &
                                        tnrs.trop.small$Genus_score > 0.88 &
                                        tnrs.trop.small$Name_score > 0.49), 1]
length(index.species3)

index.species <- unique(c(index.species1, index.species2, index.species3))
length(index.species)

Subspecies level

index.subspec <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "subspecies" &
                                       (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                        tnrs.trop.small$Taxonomic_status == "Synonym")), 1]
length(index.subspec)

Variety level

index.variety <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "variety" &
                                       (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                        tnrs.trop.small$Taxonomic_status == "Synonym")), 1]
length(index.variety)

Select `certain` and `uncertain` names

index.trop <- c(index.family, index.forma, index.genus, index.species, index.subspec,
                index.variety)
length((index.trop))
tnrs.trop.small.certain <- tnrs.trop.small[tnrs.trop.small$Name_number %in% index.trop
                                           == T,]
dim(tnrs.trop.small.certain)
save(tnrs.trop.small.certain, file = "tnrs.trop.small.certain.Rdata")
write.csv(tnrs.trop.small.certain, file = "tnrs.trop.small.certain.csv")

tnrs.trop.small.uncertain <- tnrs.trop.small[tnrs.trop.small$Name_number %in% index.trop
                                             == F, ]
dim(tnrs.trop.small.uncertain)
save(tnrs.trop.small.uncertain, file = "tnrs.trop.small.uncertain.Rdata")
write.csv(tnrs.trop.small.uncertain, file = "tnrs.trop.small.uncertain.csv")

Resolve ‘uncertain’ names using `NCBI`

Cut the list of the remaining 9,641 unresolved species into chunks of ∼ 5,000 species.

write.csv(tnrs.trop.small.uncertain$Name_submitted[1:5000], file = "trop.uncert.1.csv")
write.csv(tnrs.trop.small.uncertain$Name_submitted[5001:9641], file = "trop.uncert.5001.csv")

Match these lists against TNRS using the settings decribed above, but rank the additional (sixth) sources in TNRS, The National Center for Biotechnology Information's Taxonomy database NCBI, first.

Read in ‘TNRS-NCBI’ results

tnrs.ncbi.1 <- read.csv("/home/oliver/Downloads/tnrs.trop.uncert.1.csv",
                        stringsAsFactors = FALSE)[-1,]
tnrs.ncbi.2 <- read.csv("/home/oliver/Downloads/tnrs.trop.uncert.5001.csv",
                        stringsAsFactors = FALSE)[-1,]

tnrs.ncbi <- rbind(tnrs.ncbi.1, tnrs.ncbi.2)
str(tnrs.ncbi)
tnrs.ncbi$Name_number <- tnrs.trop.small.uncertain$Name_number
range(tnrs.ncbi$Name_number)

save(tnrs.ncbi, file = "tnrs.ncbi.Rdata")
write.csv(tnrs.ncbi, file = "tnrs.ncbi.csv")

Selecting correctly resolved names in `TNRS_NCBI`

Family level

index.family <- tnrs.ncbi[which(tnrs.ncbi$Name_matched_rank == "family" &
                                (tnrs.ncbi$Taxonomic_status == "Accepted"|
                                 tnrs.ncbi$Taxonomic_status == "Synonym") &
                                tnrs.ncbi$Family_score > 0.85), 1]
length(index.family)

Genus level

index.genus.1 <- tnrs.ncbi[which(tnrs.ncbi$Name_matched_rank == "genus" &
                                 (tnrs.ncbi$Taxonomic_status == "Accepted" |
                                  tnrs.ncbi$Taxonomic_status == "Synonym") &
                                 tnrs.ncbi$Genus_score > 0.89 &
                                 tnrs.ncbi$Name_score > 0.49), 1]
length(index.genus.1)

index.genus.2 <- tnrs.ncbi[which(tnrs.ncbi$Name_matched_rank == "genus" &
                                 (tnrs.ncbi$Taxonomic_status == "Accepted" |
                                  tnrs.ncbi$Taxonomic_status == "Synonym") &
                                 tnrs.ncbi$Genus_score > 0.99 &
                                 tnrs.ncbi$Name_score > 0.2), 1]
length(index.genus.2)

index.genus.3 <- tnrs.ncbi[which(tnrs.ncbi$Name_matched_rank == "genus" &
                                 tnrs.ncbi$Taxonomic_status == "No opinion" &
                                 tnrs.ncbi$Genus_score > 0.88 &
                                 tnrs.ncbi$Name_score > 0.49), 1]
length(index.genus.3)

index.genus <- unique(c(index.genus.1, index.genus.2, index.genus.3))
length(index.genus)

Species level

index.species.1 <- tnrs.ncbi[which(tnrs.ncbi$Name_matched_rank == "species" &
                                   (tnrs.ncbi$Taxonomic_status == "Accepted" |
                                    tnrs.ncbi$Taxonomic_status == "Synonym") &
                                   tnrs.ncbi$Name_score > 0.94), 1]
length(index.species.1)

index.species.2 <- tnrs.ncbi[which(tnrs.ncbi$Name_matched_rank == "species" &
                                   (tnrs.ncbi$Taxonomic_status == "Accepted" |
                                    tnrs.ncbi$Taxonomic_status == "Synonym") &
                                   tnrs.ncbi$Genus_score > 0.81 &
                                   tnrs.ncbi$Name_score > 0.51), 1]
length(index.species.2)

index.species.3 <- tnrs.ncbi[which(tnrs.ncbi$Name_matched_rank == "species" &
                                   tnrs.ncbi$Taxonomic_status == "No opinion"  &
                                   tnrs.ncbi$Genus_score > 0.7 &
                                   tnrs.ncbi$Specific_epithet_score > 0.75), 1]
length(index.species.3)

index.species <- unique(c(index.species.1, index.species.2, index.species.3))
length(index.species)

Variety level

index.var <- tnrs.ncbi[which((tnrs.ncbi$Name_matched_rank == "subspecies" |
                              tnrs.ncbi$Name_matched_rank == "unknown" |
                              tnrs.ncbi$Name_matched_rank == "variety") &
                             (tnrs.ncbi$Taxonomic_status == "Accepted" |
                              tnrs.ncbi$Taxonomic_status == "No opinion" |
                              tnrs.ncbi$Taxonomic_status == "Synonym")), 1]
length(index.var)

index.ncbi <- c(index.family, index.genus, index.species, index.var)
length((index.ncbi))

Select `certain` or `uncertain` names

tnrs.ncbi.certain <- tnrs.ncbi[tnrs.ncbi$Name_number %in% index.ncbi == T,]
dim(tnrs.ncbi.certain)
save(tnrs.ncbi.certain, file = "tnrs.ncbi.certain.Rdata")
write.csv(tnrs.ncbi.certain, file = "tnrs.ncbi.certain.csv")

tnrs.ncbi.uncertain <- tnrs.ncbi[tnrs.ncbi$Name_number %in% index.ncbi == F, ]
dim(tnrs.ncbi.uncertain)
save(tnrs.ncbi.uncertain, file = "tnrs.ncbi.uncertain.Rdata")
write.csv(tnrs.ncbi.uncertain, file = "tnrs.ncbi.uncertain.csv")

Resolve remaining `uncertain` names

tnrs.ncbi still contained 1,464 uncertain names that were resolved in the following way:

679 names (mainly trivial names) were selected manually and corrected by JD.
of those 679 names, 62 were corrected using the matching tools on the TPL webpage (e.g. Dicra vagin var. clathrata → Dicranella vaginata).
the remaining 785 names were checked manually.

JD.679 <- read.csv("JD.679.csv")
head(JD.679)

Manually correct, in a spread sheat program, some further mispellings (e.g. ABIESNORD., xboris). Select non-JD.corrected and correct species in tnrs.ncbi.uncertain.corrected.csv[6].

ncbi.uncertain.corr <- read.csv("/home/oliver/Downloads/tnrs.ncbi.uncertain.corrected.csv")

str(ncbi.uncertain.corr)

Select the 679 JD-species, incl. the 62 species that were corrected manually.

index.JD.corrected <- ncbi.uncertain.corr[which((ncbi.uncertain.corr$for_JD ==
                                                      "x" |
                                                      ncbi.uncertain.corr$corrected !=
                                                      "")), 2] 
length(index.JD.corrected)

Further select some correct species within genus

Some species were cut down to the genus level by TNRS, although they might be resolvable using the name matching tools on the TPL webpage. Select those names for further name matching.

index.correct.genus <- ncbi.uncertain.corr[which((ncbi.uncertain.corr$Name_matched_rank ==
                                                  "genus" &
                                                  ncbi.uncertain.corr$Taxonomic_status ==
                                                  "Accepted" &
                                                  ncbi.uncertain.corr$Overall_score > 0.6))
                                         , 2] 
length(index.correct.genus)

Further select some correct names that were matched at the rank of species

index.correct.species <- ncbi.uncertain.corr[which((ncbi.uncertain.corr$Name_matched_rank ==
                                                    "species" &
                                                    ncbi.uncertain.corr$Taxonomic_status ==
                                                    "Accepted" &
                                                    ncbi.uncertain.corr$Overall_score > 0.89))
                                           , 2] 
length(index.correct.species)

Create an index for those names that could be further corrected:

index.ncbi <- unique(c(index.JD.corrected, index.correct.genus, index.correct.species))
length(index.ncbi)

Identify `certain` and `uncertain` species

ncbi.uncertain.corr.certain <- ncbi.uncertain.corr[ncbi.uncertain.corr$Name_number %in%
                                                   index.ncbi == T,]
dim(ncbi.uncertain.corr.certain)
save(ncbi.uncertain.corr.certain, file = "ncbi.uncertain.corr.certain.Rdata") 
write.csv(ncbi.uncertain.corr.certain, file = "ncbi.uncertain.corr.certain.csv")

ncbi.uncertain.corr.uncertain <- ncbi.uncertain.corr[ncbi.uncertain.corr$Name_number %in%
                                                     index.ncbi == F, ]
dim(ncbi.uncertain.corr.uncertain)
save(ncbi.uncertain.corr.uncertain, file = "ncbi.uncertain.corr.uncertain.Rdata")
write.csv(ncbi.uncertain.corr.uncertain, file = "ncbi.uncertain.corr.uncertain.csv")

Using `The Plant List` matching tools for unresolved names

Generate names list from ncbi.uncertain.corr.uncertain to be matched against The Plant List, using Taxonstand::TPL.

ncbi.uncertain <- as.character(ncbi.uncertain.corr.uncertain$Name_submitted)

Lists of names for `TPL`

Run "raw" list against TPL:

Set large edit distance to allow for fuzzy matching, and strip some characters that would prevend matching:

tpl.ncbi.1 <- TPL(ncbi.uncertain, corr=T, diffchar = 9, max.distance = 9)
tpl.ncbi.1 <- gsub("[", "", tpl.ncbi.1)
tpl.ncbi.1 <- gsub("]", "", tpl.ncbi.1)
tpl.ncbi.1 <- gsub("|", "", tpl.ncbi.1)
tpl.ncbi.1 <- gsub("?", "", tpl.ncbi.1)
write.csv(tpl.ncbi.1, file = "tpl.ncbi.1.csv")

Extent each word by *, to allow for even 'fuzzier' matching:

ncbi.uncertain.2 <- paste(gsub(" ", "* ", ncbi.uncertain), "*", sep = "")
tpl.ncbi.2 <- TPL(ncbi.uncertain.2, corr=T, diffchar = 9, max.distance = 9)
write.csv(tpl.ncbi.2, file = "tpl.ncbi.2.csv")

Truncate each word to its first 3 (5 or 7) letters and extent by *, to do more fuzzy matching

simpleCap3 <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(paste(substring(s, 1,1), substring(s, 2,3), sep="", collapse="* "), "*", sep = "")
}

simpleCap5 <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(paste(substring(s, 1,1), substring(s, 2,5), sep="", collapse="* "), "*", sep = "")
}

simpleCap7 <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(paste(substring(s, 1,1), substring(s, 2,7), sep="", collapse="* "), "*", sep = "")
}

Apply the string truncation functions above:

ncbi.uncertain.3 <- sapply(ncbi.uncertain, simpleCap3)

Repeat, but instead using simpleCap5 or simpleCap7. Further, remove some strings to improve name matching:

ncbi.uncertain.3 <- gsub("[", "", ncbi.uncertain.3, fixed = T)
ncbi.uncertain.3 <- gsub("]", "", ncbi.uncertain.3, fixed = T)
ncbi.uncertain.3 <- gsub("|", "", ncbi.uncertain.3, fixed = T)
ncbi.uncertain.3 <- gsub("?", "", ncbi.uncertain.3, fixed = T)
ncbi.uncertain.3 <- gsub("+", "", ncbi.uncertain.3, fixed = T)
ncbi.uncertain.3 <- gsub(".", "", ncbi.uncertain.3, fixed = T)
ncbi.uncertain.3 <- gsub("<", "", ncbi.uncertain.3, fixed = T)
ncbi.uncertain.3 <- gsub("/", "", ncbi.uncertain.3, fixed = T)
str(ncbi.uncertain.3)

tpl.ncbi.3 <- TPL(ncbi.uncertain.3, corr=T, diffchar = 9, max.distance = 9)
write.csv(tpl.ncbi.3, file = "tpl.ncbi.3.csv")

tpl.ncbi.5 <- TPL(ncbi.uncertain.5, corr=T, diffchar = 9, max.distance = 9)
write.csv(tpl.ncbi.5, file = "tpl.ncbi.5.csv")

tpl.ncbi.7 <- TPL(ncbi.uncertain.7, corr=T, diffchar = 9, max.distance = 9)
write.csv(tpl.ncbi.7, file = "tpl.ncbi.7.csv")

Combine tpl.ncbi tables:

tpl.ncbi <- cbind(tpl.ncbi.1[,c(1,2,6,8,10,12)], tpl.ncbi.2[,c(6,8,10,12)],
                  tpl.ncbi.3[,c(6,8,10,12)], tpl.ncbi.5[,c(6,8,10,12)],
                  tpl.ncbi.7[,c(6,8,10,12)])

rownames(tpl.ncbi) <- rownames(tpl.ncbi.7)
tpl.ncbi <- cbind(ncbi.uncertain.corr.uncertain[,1:2], tpl.ncbi)
str(tpl.ncbi)
tpl.ncbi <- tpl.ncbi[,-1]
write.csv(tpl.ncbi, file = "tpl.ncbi.csv")

Manually select correct genera and species in tpl.ncbi csv-file and add columns Genus.correct and Species.correct.

Read the manually corrected tpl.ncbi table:

tpl.ncbi.2 <- read.csv("tpl.ncbi.csv")
str(tpl.ncbi.2)
names(tpl.ncbi.2)

Select corrected species and concatenate genus and epithet columns:

tpl.ncbi.2$name.correct <- paste(tpl.ncbi.2$Genus.correct, tpl.ncbi.2$Species.correct)
index.corr <- tpl.ncbi.2[which(tpl.ncbi.2$name.correct != " "), 2]

Merge ncbi.uncertain correspond and tpl.ncbi.2

ncbi.uncertain.corr.uncertain.2 <- join(ncbi.uncertain.corr.uncertain,
                                        tpl.ncbi.2[,c(1,2,6,26:29)], by = "Name_number")

str(ncbi.uncertain.corr.uncertain.2)
names(ncbi.uncertain.corr.uncertain.2)

write.csv(ncbi.uncertain.corr.uncertain.2, file = "ncbi.uncertain.corr.uncertain.2.csv")
ncbi.uncertain.corr.uncertain.2 <- read.csv("ncbi.uncertain.corr.uncertain.2.csv")

Tag names that could not be resolved

If names were not corrected, set Taxonomic.status == ""

ncbi.uncertain.corr.uncertain.2$Status.correct[
                                    ncbi.uncertain.corr.uncertain.2$Status.correct==""] <-
    ncbi.uncertain.corr.uncertain.2$Taxonomic.status[
                                        ncbi.uncertain.corr.uncertain.2$Status.correct ==""]

summary(ncbi.uncertain.corr.uncertain.2$Status.correct)
str(ncbi.uncertain.corr.uncertain.2$Status.correct)

... and assign No suitable matches found. to the remaining species:

ncbi.uncertain.corr.uncertain.2$Status.correct <-
    as.character(ncbi.uncertain.corr.uncertain.2$Status.correct)
ncbi.uncertain.corr.uncertain.2$Status.correct
[is.na(ncbi.uncertain.corr.uncertain.2$Status.correct)] <- "No suitable matches found."

Add uncorrected names in column X to name.correct:

ncbi.uncertain.corr.uncertain.2$name.correct[
                                    ncbi.uncertain.corr.uncertain.2$Genus.correct==""] <-
    as.character(ncbi.uncertain.corr.uncertain.2[,41])[
        ncbi.uncertain.corr.uncertain.2$Genus.correct==""]

Assign No suitable matches found. to remaining species in name.correct according to Status.correct.

ncbi.uncertain.corr.uncertain.2$name.correct[ncbi.uncertain.corr.uncertain.2$Status.correct==
                                             "No suitable matches found."] <-
    "No suitable matches found."

write.csv(ncbi.uncertain.corr.uncertain.2, file = "ncbi.uncertain.corr.uncertain.2.csv")

Done! Use ncbi.uncertain.corr.uncertain.2 for later merging with the other data sets.

Check `ncbi.certain`

Create extra column species.correct and add the 679 species from status=accepted in JD.corrected. Assign status No suitable matches found. to the remaining non-resolved species.

JD.correct <- read.csv("/home/oliver/Dokumente/PhD/PostPhD/IDiv/sDiv/sPlot/Analyses/
Code/Unresolved_sPlot2.0_TRY3.0_2_JD.csv", stringsAsFactors=FALSE)

Assign `No suitable matches found.` to non-correctable species

JD.correct$Taxon[JD.correct$Taxon==""] <- "No suitable matches found."
str(JD.correct)
str(ncbi.uncertain.corr.certain)

Join to ncbi.uncertain.corr.certain based on Name_number:

ncbi.certain.JD.corr <- join(ncbi.uncertain.corr.certain, JD.correct[,c(2,3,8)],
                                  by = "Name_number")
str(ncbi.certain.JD.corr)
write.csv(ncbi.certain.JD.corr, file = "ncbi.certain.JD.corr.csv")
ncbi.certain.JD.corr.2 <- read.csv("ncbi.certain.JD.corr.csv",
                                        stringsAsFactors=FALSE)
str(ncbi.certain.JD.corr.2)

Add the corrected names from ncbi.certain.JD.corr.2$corrected to name.correct:

ncbi.certain.JD.corr.2$name.correct[is.na(ncbi.certain.JD.corr.2$name.correct)] <-
    ncbi.certain.JD.corr.2$corrected[is.na(ncbi.certain.JD.corr.2$name.correct)]
write.csv(ncbi.certain.JD.corr.2, file = "ncbi.certain.JD.corr.2.csv")

Fill the missing names in name.correct, because they were correctly resolved.

ncbi.certain.JD.corr.2$name.correct[ncbi.certain.JD.corr.2$name.correct==""] <-
    ncbi.certain.JD.corr.2$Name_matched[ncbi.certain.JD.corr.2$name.correct==""]

All entries in ncbi.certain.JD.corr.2 are assigned to a name. Now merge it with ncbi.uncertain.corr.uncertain.2 (~1,400 names). Add tag manual matching. Means the scores from TNRS matching are useless here, but nevertheless keep them.

ncbi.certain.JD.corr.3 <- read.csv("ncbi.certain.JD.corr.3.csv",
                                   stringsAsFactors=FALSE)
str(ncbi.certain.JD.corr.3)
ncbi.uncertain.corr.uncertain.3 <- read.csv("ncbi.uncertain.corr.uncertain.3.csv",
                                          stringsAsFactors=FALSE)
str(ncbi.uncertain.corr.uncertain.3)

Add status No suitable matches found. or Accepted in ncbi.certain.JD.corr.3

ncbi.certain.JD.corr.3$Status.correct[ncbi.certain.JD.corr.3$name.correct ==
                                           "No suitable matches found."] <-
    "No suitable matches found."
ncbi.certain.JD.corr.3$Status.correct[
                                is.na(ncbi.certain.JD.corr.3$Status.correct)] <-
    "Accepted"
write.csv(ncbi.certain.JD.corr.3, file = "ncbi.certain.JD.corr.3.csv")

In ncbi.certain.JD.corr.3, assign all entries an x for Manual matching:

ncbi.certain.JD.corr.3$Manual.matching <- "x"
ncbi.uncertain.corr.uncertain.3$Manual.matching <- "x"
write.csv(ncbi.certain.JD.corr.3, file = "ncbi.certain.JD.corr.3.csv")
write.csv(ncbi.uncertain.corr.uncertain.3, file = "ncbi.uncertain.corr.uncertain.3.csv")

Combine the `uncertain` and `certain` species lists

names(ncbi.certain.JD.corr.3)
names(ncbi.uncertain.corr.uncertain.3)

ncbi.certain.JD.corr.3 <- ncbi.certain.JD.corr.3[,-c(1,2,3)]
ncbi.uncertain.corr.uncertain.3 <- ncbi.uncertain.corr.uncertain.3[,-c(1,2)]

match(names(ncbi.uncertain.corr.uncertain.3), names(ncbi.certain.JD.corr.3))

ncbi.uncertain.comb <- rbind(ncbi.uncertain.corr.uncertain.3, ncbi.certain.JD.corr.3)
dim(ncbi.uncertain.comb)
write.csv(ncbi.uncertain.comb, file = "ncbi.uncertain.comb.csv")

Read in tnrs.ncbi.certain:

tnrs.ncbi.certain <- read.csv("/home/oliver/Downloads/tnrs.ncbi.certain.csv",
                              stringsAsFactors=FALSE)
str(tnrs.ncbi.certain)

Resolve names that were reduced to genus-level by TNRS

Read in tnrs.ncbi.certain":

tnrs.ncbi.certain <- read.csv("/home/oliver/Downloads/tnrs.ncbi.certain.csv",
                              stringsAsFactors=FALSE)
str(tnrs.ncbi.certain)

To resolve names that were reduced to genus-level, identify Name_submitted where Overall_score == 0.5:

index0.5 <- tnrs.ncbi.certain[which(tnrs.ncbi.certain$Overall_score == 0.5), 2]
tpl0.5 <- tnrs.ncbi.certain$Name_submitted[tnrs.ncbi.certain$Overall_score == 0.5]
names(tpl0.5) <- index0.5
str(tpl0.5)
length(tpl0.5)
tpl0.5[1:100]
save(tpl0.5, file = "tpl0.5.Rdata")

Do some string cleaning to improve matching:

tpl0.5 <- gsub("[", "", tpl0.5, fixed = T)
tpl0.5 <- gsub("]", "", tpl0.5, fixed = T)
tpl0.5 <- gsub("|", "", tpl0.5, fixed = T)
tpl0.5 <- gsub("?", "", tpl0.5, fixed = T)
tpl0.5 <- gsub("+", "", tpl0.5, fixed = T)
tpl0.5 <- gsub(".", "", tpl0.5, fixed = T)
tpl0.5 <- gsub("<", "", tpl0.5, fixed = T)
tpl0.5 <- gsub("/", "", tpl0.5, fixed = T)

system.time(
tpl0.5.res <- TPL(tpl0.5, corr=T, diffchar = 2, max.distance = 1)
)

Extent each word by *, to allow for even 'fuzzier' matching:

write.csv(tpl0.5.res, file = "tpl0.5.res.csv")
tpl0.5.2 <- paste(gsub(" ", "* ", tpl0.5), "*", sep = "")
tpl0.5.2.res <- TPL(tpl0.5.2, corr=T, diffchar = 2, max.distance = 1)
write.csv(tpl0.5.2.res, file = "tpl0.5.2.res.csv")

Truncate each word to its first 5 (or 7) letters and extent words by *, to do more fuzzy matching:

tpl0.5.5 <- sapply(tpl0.5, simpleCap5)
tpl0.5.5.res <- TPL(tpl0.5.5, corr=T, diffchar = 2, max.distance = 1)
write.csv(tpl0.5.5.res, file = "tpl0.5.5.res.csv")

tpl0.5.7 <- sapply(tpl0.5, simpleCap7)
tpl0.5.7.res <- TPL(tpl0.5.7, corr=T, diffchar = 2, max.distance = 1)
write.csv(tpl0.5.7.res, file = "tpl0.5.7.res.csv")

tpl0.5.res <- read.csv("tpl0.5.res.csv", stringsAsFactors=FALSE)
tpl0.5.2.res <- read.csv("tpl0.5.2.res.csv", stringsAsFactors=FALSE)
tpl0.5.5.res <- read.csv("tpl0.5.5.res.csv", stringsAsFactors=FALSE)
tpl0.5.7.res <- read.csv("tpl0.5.7.res.csv", stringsAsFactors=FALSE)

Combine tpl.ncbi tables:

tpl.res.comb <- cbind(tpl0.5.res[,c(1,2,3,7,9,11,13)],
tpl0.5.2.res[,c(7,9,11,13)], tpl0.5.5.res[,c(7,9,11,13)],
tpl0.5.7.res[,c(7,9,11,13)])
                                                                                   
head(tpl.res.comb)
write.csv(tpl.res.comb, file = "tpl.res.comb.csv")
tpl.res.comb <- read.csv("tpl.res.comb.csv", stringsAsFactors=FALSE)

Manually add and fill in new columns to `tpl.res.comb.csv`:

Status.correct
Genus.correct
Species.correct

Combine `species.correct` and `genus.correct`

tpl.res.comb <- read.csv("tpl.res.comb.csv", stringsAsFactors=FALSE)
str(tpl.res.comb)
names(tpl.res.comb)
tpl.res.comb$name.correct <- paste(tpl.res.comb$Genus.correct,
tpl.res.comb$Species.correct)
tpl.res.comb.2 <- tpl.res.comb

Join `tnrs.ncbi.certain` and `tpl.res.comb.2`

names(tpl.res.comb.2)
names(tpl.res.comb.2)[2] <- "Name_number"
names(tnrs.ncbi.certain)

Get match correct

tnrs.ncbi.certain.0.5 <- join(tnrs.ncbi.certain[match(index0.5,
                                                      tnrs.ncbi.certain$Name_number),],
                              tpl.res.comb.2[,c(3,23:26)], by =
"Name_number")
write.csv(tnrs.ncbi.certain.0.5, file = "tnrs.ncbi.certain.0.5.csv")

Fill in extra columns in tnrs.ncbi.certain.0.5:

tnrs.ncbi.certain.0.5 <- read.csv("tnrs.ncbi.certain.0.5.csv", stringsAsFactors=FALSE)
str(tnrs.ncbi.certain.0.5)
names(tnrs.ncbi.certain.0.5)
str(tnrs.ncbi.certain.0.5$Genus.correct)

Fill Manual.matching:

tnrs.ncbi.certain.0.5$Manual.matching[tnrs.ncbi.certain.0.5$Genus.correct != ""] <- "x"

Fill Status.correct:

tnrs.ncbi.certain.0.5$Status.correct[tnrs.ncbi.certain.0.5$Status.correct == ""] <-
    tnrs.ncbi.certain.0.5$Taxonomic_status[tnrs.ncbi.certain.0.5$Status.correct == ""]

Fill name.correct:

tnrs.ncbi.certain.0.5$name.correct[tnrs.ncbi.certain.0.5$name.correct == " "] <-
    tnrs.ncbi.certain.0.5$Name_matched[tnrs.ncbi.certain.0.5$name.correct == " "]

write.csv(tnrs.ncbi.certain.0.5, file = "tnrs.ncbi.certain.0.5.csv")

Combine tnrs.ncbi.certain.0.5 with the remaining tnrs.ncbi.certain:

str(tnrs.ncbi.certain.0.5)
names(tnrs.ncbi.certain.0.5)

cert.0.5 <- tnrs.ncbi.certain[tnrs.ncbi.certain$Overall_score == 0.5,]
dim(cert.0.5)
str(cert.0.5)
cert.non.0.5 <- tnrs.ncbi.certain[tnrs.ncbi.certain$Overall_score != 0.5,]
dim(cert.non.0.5)
str(cert.non.0.5)
names(cert.non.0.5)

Add three more columns to cert.non.0.5 so that the columns to those in tnrs.ncbi.certain.0.5:

cert.non.0.5$Manual.matching <- NA
cert.non.0.5$Status.correct <- NA
cert.non.0.5$name.correct <- NA

Combine the two tables:

tnrs.ncbi.certain.comb <- rbind(tnrs.ncbi.certain.0.5[,c(3:41,44)], cert.non.0.5[,c(2:41)])
dim(tnrs.ncbi.certain.comb)
write.csv(tnrs.ncbi.certain.comb, file = "tnrs.ncbi.certain.comb.csv")

Merge the resolved species lists

Read files

`TPL.small.certain`

Contains TNRS results based on the five sources: TPL, GCC, ILDIS, Tropicos and USDA.

load("tnrs.tpl.certain.Rdata")
dim(tnrs.tpl.certain)

`Trop.small.certain`

Contains TNRS results based on the five sources Tropicos (ranked first) TPL, GCC, ILDIS & USDA

load("tnrs.trop.small.certain.Rdata")
dim(tnrs.trop.small.certain)

Combine the certain data sets:

tnrs.tpl.all.trop.certain <- rbind(tnrs.tpl.certain, tnrs.trop.small.certain)
dim(tnrs.tpl.all.trop.certain)

... and add the four additional columns:

names(tnrs.tpl.all.trop.certain)

tnrs.tpl.all.trop.certain$Manual.matching <- NA
tnrs.tpl.all.trop.certain$Status.correct <- NA
tnrs.tpl.all.trop.certain$name.correct <- NA
tnrs.tpl.all.trop.certain$rank.correct <- NA

Pick the respective `NCBI` data sets

... for the 8,177 certain species:

names(tnrs.ncbi.certain.comb)
tnrs.ncbi.certain.comb$rank.correct <- NA

Combine the with the big list above:

tnrs.tpl.all.trop.certain.2 <- rbind(tnrs.tpl.all.trop.certain, tnrs.ncbi.certain.comb)
dim(tnrs.tpl.all.trop.certain.2)
names(tnrs.tpl.all.trop.certain.2)

Pick the list containing the uncertain species:

names(ncbi.uncertain.comb)

Exclude columns JD and corrected

ncbi.uncertain.comb.2 <- ncbi.uncertain.comb[,-c(5,6)]
names(ncbi.uncertain.comb.2)
ncbi.uncertain.comb.2$rank.correct <- NA

Combine them big list containing the certain species with the uncertain species

tnrs.tpl.all.trop.tnrs.certain <- rbind(tnrs.tpl.all.trop.certain.2, ncbi.uncertain.comb.2)
write.csv(tnrs.tpl.all.trop.tnrs.certain, file = "tnrs.tpl.all.trop.tnrs.certain.csv")
save(tnrs.tpl.all.trop.tnrs.certain, file = "tnrs.tpl.all.trop.tnrs.certain.Rdata")

Correct one more species name:

tnrs.tpl.all.trop.tnrs.certain$name.correct[which(tnrs.tpl.all.trop.tnrs.certain$name.correct
                                                  == "ABIES NORDMANNIANA")] <-
    "Abies nordmanniana"

Complement the main columns in the backbone

dim(tnrs.tpl.all.trop.tnrs.certain)
names(tnrs.tpl.all.trop.tnrs.certain)

Fill in `rank.correct`

tnrs.tpl.all.trop.tnrs.certain.filled <- tnrs.tpl.all.trop.tnrs.certain

tnrs.tpl.all.trop.tnrs.certain.filled$rank.correct <- as.character(
    tnrs.tpl.all.trop.tnrs.certain.filled$rank.correct)

tnrs.tpl.all.trop.tnrs.certain.filled$rank.correct[
                                          is.na(tnrs.tpl.all.trop.tnrs.certain.filled$
                                                Manual.matching)] <-
    tnrs.tpl.all.trop.tnrs.certain.filled$Name_matched_rank[
                                              is.na(tnrs.tpl.all.trop.tnrs.certain.filled$
                                                    Manual.matching)]

Fill `status.correct`

tnrs.tpl.all.trop.tnrs.certain.filled$Status.correct[
                                          is.na(tnrs.tpl.all.trop.tnrs.certain.filled$
                                                Manual.matching)] <-
    tnrs.tpl.all.trop.tnrs.certain.filled$Taxonomic_status[
                                              is.na(tnrs.tpl.all.trop.tnrs.certain.filled$
                                                    Manual.matching)]

Fill `name.correct`

tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[is.na(
                                          tnrs.tpl.all.trop.tnrs.certain.filled$
                                          Manual.matching)] <-
    tnrs.tpl.all.trop.tnrs.certain.filled$Accepted_name[is.na(
                                              tnrs.tpl.all.trop.tnrs.certain.filled$
                                              Manual.matching)]

Fill name.correct where status.correct == No opinion:

tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[which(
                                          tnrs.tpl.all.trop.tnrs.certain.filled$
                                          Status.correct == "No opinion")] <-
    tnrs.tpl.all.trop.tnrs.certain.filled$Name_matched[which(
                                              tnrs.tpl.all.trop.tnrs.certain.filled$
                                              Status.correct == "No opinion")]

write.csv(tnrs.tpl.all.trop.tnrs.certain.filled,
          file = "tnrs.tpl.all.trop.tnrs.certain.filled.csv")

In tnrs.tpl.all.trop.tnrs.certain.filled resolve some species manually on the TPL webpage.

Fill in `names.short correct`

tnrs.tpl.all.trop.tnrs.certain.filled <- read.csv("tnrs.tpl.all.trop.tnrs.certain.filled.csv",
                                                  stringsAsFactors=FALSE)
names(tnrs.tpl.all.trop.tnrs.certain.filled)
str(tnrs.tpl.all.trop.tnrs.certain.filled)
tnrs.tpl.all.trop.tnrs.certain.filled$name.short.correct <- as.character(
    tnrs.tpl.all.trop.tnrs.certain.filled$name.short.correct)

Shorten names that have more than two words and where the second word is a `x`

Because some names consisted of > 2 words, the number of words in each name was counted:

wordcount <- sapply(gregexpr("\\S+", tnrs.tpl.all.trop.tnrs.certain.filled$name.correct),
                    length)

If a name has > 1 word, just keep the first two words:

tnrs.tpl.all.trop.tnrs.certain.filled$name.short.correct[wordcount>1] <-
    word(tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[wordcount>1], 1, 2)

Fill in one-word names name.short.correct:

tnrs.tpl.all.trop.tnrs.certain.filled$name.short.correct[wordcount==1] <-
    tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[wordcount==1]

Select names where the second word is a x (hybrids):

length(word(tnrs.tpl.all.trop.tnrs.certain.filled$name.short.correct[wordcount>1], 2)=="x")

write.csv(tnrs.tpl.all.trop.tnrs.certain.filled,
          file = "tnrs.tpl.all.trop.tnrs.certain.filled.csv")

There are 7,900 species with more than 2 words and where 2nd word is "x". For those names in name.short.correct (i) where the second word is a x and that (ii) had more than two words, pick the first and third word:

index <- word(tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[wordcount>2], 2) == "x"
wo.x <- paste(word(tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[wordcount>2][index],
1), word(tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[wordcount>2][index], 3))

tnrs.tpl.all.trop.tnrs.certain.filled$name.short.correct[wordcount>2][index] <- wo.x

In `name.short correct` insert `NA` where `"No suitable matches found."`

tnrs.tpl.all.trop.tnrs.certain.filled <- read.csv("tnrs.tpl.all.trop.tnrs.certain.filled.csv",
                                                  stringsAsFactors=FALSE)
tnrs.tpl.all.trop.tnrs.certain.filled$name.short.correct[tnrs.tpl.all.trop.tnrs.certain.filled$
                                                         name.correct ==
                                                         "No suitable matches found."] <- NA

Fill in `rank.short.correct`

tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct <-
    tnrs.tpl.all.trop.tnrs.certain.filled$rank.correct
table(tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct)

Assign rank = species to the shortened names of subspecies, subvariety, variety and forma.

tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct[tnrs.tpl.all.trop.tnrs.certain.filled$
                                                         rank.correct == "infraspecies"] <-
    "species"

tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct[
                                          tnrs.tpl.all.trop.tnrs.certain.filled$
                                          rank.correct == "subspecies"] <- "species"

tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct[
                                          tnrs.tpl.all.trop.tnrs.certain.filled$
                                          rank.correct == "subvariety"] <- "species"

tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct[
                                          tnrs.tpl.all.trop.tnrs.certain.filled$
                                          rank.correct == "variety"] <- "species"

tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct[
                                          tnrs.tpl.all.trop.tnrs.certain.filled$
                                          rank.correct == "forma"] <- "species"

Assign rank.short.correct = family to rank.correct == unknown since all of those names were at the family level.

tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct[
                                          tnrs.tpl.all.trop.tnrs.certain.filled$
                                          rank.correct == "unknown"] <- "family"

write.csv(tnrs.tpl.all.trop.tnrs.certain.filled,
          file = "tnrs.tpl.all.trop.tnrs.certain.filled.csv")

Fix families

tnrs.tpl.all.trop.tnrs.certain.filled <- read.csv("../tnrs.tpl.all.trop.tnrs.certain.filled.csv",
                                                  stringsAsFactors=FALSE)

names(tnrs.tpl.all.trop.tnrs.certain.filled)

##  [1] "X.3"                             "X.2"                            
##  [3] "X.1"                             "X"                              
##  [5] "Name_number"                     "Name_submitted"                 
##  [7] "Overall_score"                   "Name_matched"                   
##  [9] "Name_matched_rank"               "Name_score"                     
## [11] "Name_matched_author"             "Name_matched_url"               
## [13] "Author_matched"                  "Author_score"                   
## [15] "Family_matched"                  "Family_score"                   
## [17] "Name_matched_accepted_family"    "Genus_matched"                  
## [19] "Genus_score"                     "Specific_epithet_matched"       
## [21] "Specific_epithet_score"          "Infraspecific_rank"             
## [23] "Infraspecific_epithet_matched"   "Infraspecific_epithet_score"    
## [25] "Infraspecific_rank_2"            "Infraspecific_epithet_2_matched"
## [27] "Infraspecific_epithet_2_score"   "Annotations"                    
## [29] "Unmatched_terms"                 "Taxonomic_status"               
## [31] "Accepted_name"                   "Accepted_name_author"           
## [33] "Accepted_name_rank"              "Accepted_name_url"              
## [35] "Accepted_name_species"           "Accepted_name_family"           
## [37] "Selected"                        "Source"                         
## [39] "Warnings"                        "Accepted_name_lsid"             
## [41] "user_id"                         "Manual.matching"                
## [43] "Status.correct"                  "name.correct"                   
## [45] "rank.correct"                    "family.correct"                 
## [47] "name.short.correct"              "rank.short.correct"

Fill in family names in family.correct for entries where the matched rank was family and that were not matched manually:

tnrs.tpl.all.trop.tnrs.certain.filled$family.correct <- as.character(
    tnrs.tpl.all.trop.tnrs.certain.filled$family.correct)

tnrs.tpl.all.trop.tnrs.certain.filled$family.correct[(
    is.na(tnrs.tpl.all.trop.tnrs.certain.filled$Manual.matching) &
    tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct == "family")] <-
    tnrs.tpl.all.trop.tnrs.certain.filled$name.correct[(
        is.na(tnrs.tpl.all.trop.tnrs.certain.filled$Manual.matching) &
        tnrs.tpl.all.trop.tnrs.certain.filled$rank.short.correct == "family")]

tnrs.tpl.all.trop.tnrs.certain.filled$family.correct[(
    is.na(tnrs.tpl.all.trop.tnrs.certain.filled$Manual.matching))] <-
    tnrs.tpl.all.trop.tnrs.certain.filled$Name_matched_accepted_family[(
    is.na(tnrs.tpl.all.trop.tnrs.certain.filled$Manual.matching))]

write.csv(tnrs.tpl.all.trop.tnrs.certain.filled,
          file = "tnrs.tpl.all.trop.tnrs.certain.filled.csv")

Link backbone to original taxa names and `sPlot/TRY` code

splot.try3.code <- read.csv("spec.list.TRY3.sPlot2.csv", stringsAsFactors=FALSE)
str(splot.try3.code)

backbone.splot.try3 <- read.csv("../tnrs.tpl.all.trop.tnrs.certain.filled.small.csv",
                                stringsAsFactors=FALSE)

str(backbone.splot.try3)

## 'data.frame':    122901 obs. of  30 variables:
##  $ Name_number                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name_submitted              : chr  "Spermatophyta sp." "Spermatophyta sp." "Chlorophytum sp. [1269]" "Echinochloa sp." ...
##  $ Overall_score               : num  0 0 0.9 0.9 0.9 0.9 0.9 0 0.9 0 ...
##  $ Name_matched                : chr  "No suitable matches found." "No suitable matches found." "Chlorophytum" "Echinochloa" ...
##  $ Name_matched_rank           : chr  "" "" "genus" "genus" ...
##  $ Name_score                  : num  0 0 1 1 1 1 1 0 1 0 ...
##  $ Family_score                : num  0 0 NA NA NA NA 1 0 1 0 ...
##  $ Name_matched_accepted_family: chr  "" "" "Asparagaceae" "Poaceae" ...
##  $ Genus_matched               : chr  "" "" "Chlorophytum" "Echinochloa" ...
##  $ Genus_score                 : num  0 0 1 1 1 1 NA 0 NA 0 ...
##  $ Specific_epithet_matched    : chr  "" "" "" "" ...
##  $ Specific_epithet_score      : num  0 0 NA NA NA NA NA 0 NA 0 ...
##  $ Unmatched_terms             : chr  "" "" "\"\"sp. [1269]" "\"\"sp." ...
##  $ Taxonomic_status            : chr  "" "" "Accepted" "Accepted" ...
##  $ Accepted_name               : chr  "" "" "Chlorophytum" "Echinochloa" ...
##  $ Accepted_name_author        : chr  "" "" "" "" ...
##  $ Accepted_name_rank          : chr  "" "" "genus" "genus" ...
##  $ Accepted_name_url           : chr  "" "" "http://www.theplantlist.org/tpl1.1/search?q=Chlorophytum" "http://www.theplantlist.org/tpl1.1/search?q=Echinochloa" ...
##  $ Accepted_name_species       : chr  "" "" "" "" ...
##  $ Accepted_name_family        : chr  "" "" "Asparagaceae" "Poaceae" ...
##  $ Selected                    : chr  "true" "true" "true" "true" ...
##  $ Source                      : chr  "" "" "tpl" "tpl" ...
##  $ Warnings                    : chr  " " " " " " " " ...
##  $ Manual.matching             : chr  NA NA NA NA ...
##  $ Status.correct              : chr  "No suitable matches found." "No suitable matches found." "Accepted" "Accepted" ...
##  $ name.correct                : chr  "No suitable matches found." "No suitable matches found." "Chlorophytum" "Echinochloa" ...
##  $ rank.correct                : chr  "higher" "higher" "genus" "genus" ...
##  $ family.correct              : chr  "" "" "Asparagaceae" "Poaceae" ...
##  $ name.short.correct          : chr  NA NA "Chlorophytum" "Echinochloa" ...
##  $ rank.short.correct          : chr  "higher" "higher" "genus" "genus" ...

backbone.splot.try3 <- join(splot.try3.code, backbone.splot.try3, by = "Name_number")
str(backbone.splot.try3)

table(backbone.splot.try3$Status.correct)

## 
##                   Accepted                 No opinion 
##                      95485                       4591 
## No suitable matches found.                    Synonym 
##                       1692                      20952 
##                 Unresolved 
##                        181

For constistency, assign Unresolved to status = Unresolved:

backbone.splot.try3$Status.correct[backbone.splot.try3$Status.correct == "No opinion"] <-
    "Unresolved"

write.csv(backbone.splot.try3, file = "backbone.splot.try3.csv")
save(backbone.splot.try3, file = "backbone.splot.try3.Rdata")

Some statistics

table(backbone.splot.try3$sPlot.TRY)

## < table of extent 0 >

table(backbone.splot.try3$Manual.matching)

## 
##    x 
## 1675

table(backbone.splot.try3$Status.correct)

## 
##                   Accepted                 No opinion 
##                      95485                       4591 
## No suitable matches found.                    Synonym 
##                       1692                      20952 
##                 Unresolved 
##                        181

length(unique(backbone.splot.try3$name.correct))-1

## [1] 90696

length(unique(backbone.splot.try3$family.correct))-1

## [1] 665

length(unique(backbone.splot.try3$name.short.correct))-1

## [1] 86528

table(backbone.splot.try3$rank.short.correct)

## 
##  family   genus  higher species 
##    1880   13383    1211  105818

Resolve the 7,701 additional species from the ad-hoc version sPlot2.1

Get the species in the sPlot-July2015-version (sPlot_2015_07_29_species) that do not match with the backbone in the April2015_Version (sPlot 2.0, sPlot_14_4_2015_species) that was use in matching procedure above. Apply the cleaning procedure, similar to the one used for sPlot 2.0.

miss.new <- read.csv("/home/oliver/Dokumente/PhD/PostPhD/IDiv/sDiv/sPlot/Analyses/Code/
Mismatches_29_07_2015_new.csv")

dim(miss.new)

## [1] 7701    1

There are 7,701 names in the sPlot version 29_07_2015 that were not in splot 2.0 (April 2015).

Name cleaning (spelling of ranks, name additions etc.)

OriginalNames <- as.character(miss.new$x)

OriginalNames <- gsub('*', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('cf. ', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('Cf. ', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('[', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(']', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(' x ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('Ã—', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('aff ', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('(', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(')', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(' cf ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub(' aff. ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('c‚e', 'ceae', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('    ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('   ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('  ', ' ', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('x-', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('X-', '', OriginalNames, fixed=TRUE)
OriginalNames <- gsub('like ', '', OriginalNames, fixed=TRUE)  
OriginalNames <- gsub(',', '', OriginalNames, fixed=TRUE)  

library(stringr)
firstWordWithNumbers <- grepl('[0-9]', word(OriginalNames, 1))
numberOfWords <- sapply(gregexpr("\\W+", OriginalNames), length) + 1
OriginalNames[firstWordWithNumbers & numberOfWords > 1] <- sapply(OriginalNames[
    firstWordWithNumbers & numberOfWords > 1], function(x) substr(x, start=regexpr(
                                                                         pattern =' ',
                                                                         text=x)+1,
                                                                  stop=nchar(x)))

CleanedNames <- taxname.abbr(OriginalNames)
write.csv(CleanedNames, file = "CleanedNames.csv")
save(CleanedNames, file = "CleanedNames.Rdata")

Upload cleaned names to TRNS

On the TNRS application webpage, use all six sources (TPL first, see procedure above). Subsequently, use Tropicos first.

tnrs.tpl.new <- read.csv("/home/oliver/Downloads/missing_tpl.txt", sep = "\t",
                         stringsAsFactors = FALSE, skip = 0, head = T)[-1, ]
str(tnrs.tpl.new)
tnrs.tpl <- read.csv("tnrs.tpl.csv")

Select correctly resolved names (TPL first)

Assign index to first column:

tnrs.tpl$Name_number <- 1:length(tnrs.tpl$Name_number)

tnrs.tpl <- tnrs.tpl[,-1]

Family level

index.family <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "family" &
                               (tnrs.tpl$Taxonomic_status == "Accepted" |
                                tnrs.tpl$Taxonomic_status == "Synonym") &
                               tnrs.tpl$Family_score > 0.88), 1]
length(index.family)

Repeat selection procedure for Name_matched_rank ==:

forma
genus
infraspecies
species
etc.

Create respective index:

Forma level

index.forma <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "forma"), 1]
length(index.forma)

### Genus level

index.genus <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "genus" &
                              tnrs.tpl$Taxonomic_status == "Accepted" &
                              tnrs.tpl$Genus_score > 0.83), 1]
length(index.genus)

Infraspecific level

index.infraspec <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "infraspecies"), 1]
length(index.infraspec)

Species level

index.species <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "species" &
                                (tnrs.tpl$Taxonomic_status == "Accepted" |
                                 tnrs.tpl$Taxonomic_status == "Synonym") &
                                tnrs.tpl$Genus_score > 0.78 & tnrs.tpl$Name_score > 0.93),1]
length(index.species1)

index.species <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "species" &
                                (tnrs.tpl$Taxonomic_status == "Accepted" |
                                 tnrs.tpl$Taxonomic_status == "Synonym") &
                                tnrs.tpl$Specific_epithet_score > 0.78), 1]
length(index.species2)

index.species <- unique(c(index.species1, index.species2))
length(index.species)

Subspecies level

index.subspec <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "subspecies" &
                                (tnrs.tpl$Taxonomic_status == "Accepted" |
                                 tnrs.tpl$Taxonomic_status == "Synonym")), 1]
length(index.subspec)

Variety level

index.variety <- tnrs.tpl[which(tnrs.tpl$Name_matched_rank == "variety" &
                                (tnrs.tpl$Taxonomic_status == "Accepted" |
                                 tnrs.tpl$Taxonomic_status == "Synonym")), 1]
length(index.variety)

Identifying `non-matched` species that are spermatophyta

index.spermatophyt <- tnrs.tpl[which(tnrs.tpl$Name_matched == "No suitable matches found." &
                                     word(tnrs.tpl$Name_submitted, 1) == "Spermatophyta"), 1]
length(index.spermatophyt)

Identify species that do not fulfill the search criteria

index.tpl <- c(index.family, index.forma, index.genus, index.species, index.subspec,
               index.variety, index.spermatophyt)
length((index.tpl))

tnrs.tpl.certain <- tnrs.tpl[index.tpl,]
dim(tnrs.tpl.certain)
save(tnrs.tpl.certain, file = "tnrs.tpl.certain.Rdata")
write.csv(tnrs.tpl.certain, file = "tnrs.tpl.certain.csv")

tnrs.tpl.uncertain <- tnrs.tpl[tnrs.tpl$Name_number %in% index.tpl == F, ]
dim(tnrs.tpl.uncertain)
save(tnrs.tpl.uncertain, file = "tnrs.tpl.uncertain.Rdata")
write.csv(tnrs.tpl.uncertain, file = "tnrs.tpl.uncertain.csv")

write.csv(tnrs.tpl.uncertain[,2], file = "tnrs.tpl.uncertain.upload.csv")

Select correctly resolved names (`Tropicos` first)

tnrs.trop <- read.csv("/home/oliver/Downloads/tnrs.trop.txt", sep = "\t")
str(tnrs.trop)

str(tnrs.trop)
tnrs.trop$Name_number <- 1:length(tnrs.trop$Name_number)

Reduce to the uncertain species from tnrs.tpl

tnrs.trop.small <- tnrs.trop[tnrs.trop$Name_number %in% index.tpl == F, ]
str(tnrs.trop.small)

save(tnrs.trop.small, file = "tnrs.trop.small.Rdata")
write.csv(tnrs.trop.small, file = "tnrs.trop.small.csv")

Family level

index.family <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "family" &
                                      (tnrs.trop.small$Taxonomic_status == "Accepted"|
                                       tnrs.trop.small$Taxonomic_status == "No opinion"))
                              , 1]
length(index.family)

Forma level

index.forma <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "forma" &
                                     (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                      tnrs.trop.small$Taxonomic_status == "Synonym")),
                               1]
length(index.forma)

Genus level

index.genus <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "genus" &
                                     (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                      tnrs.trop.small$Taxonomic_status == "Synonym" |
                                      tnrs.trop.small$Taxonomic_status == "No opinion") &
                                     tnrs.trop.small$Genus_score > 0.83), 1]
length(index.genus)

Species level

index.species <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "species" &
                                       (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                        tnrs.trop.small$Taxonomic_status == "Synonym" |
                                        tnrs.trop.small$Taxonomic_status == "No opinion") &
                                       tnrs.trop.small$Specific_epithet_score > 0.77), 1]
length(index.species)

index.species2 <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "species" &
                                        (tnrs.trop.small$Taxonomic_status == "Accepted" |
                                         tnrs.trop.small$Taxonomic_status == "Synonym") &
                                        tnrs.trop.small$Genus_score > 0.88 &
                                        tnrs.trop.small$Name_score > 0.49), 1]
length(index.species2)

index.species <- unique(c(index.species1, index.species2))
length(index.species)

Subspecies level

index.subspec <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "subspecies"),
                                 1]
length(index.subspec)

Subvariety level

index.subvariety <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "subvariety"),
                                    1]
length(index.subvariety)

Variety level

index.variety <- tnrs.trop.small[which(tnrs.trop.small$Name_matched_rank == "variety" &
                                       tnrs.trop.small$Taxonomic_status == "No opinion"),
                                 1]
length(index.variety)

index.all <- c(index.family, index.genus, index.species, index.subspec, index.variety,
               index.subvariety)
length((index.all))

tnrs.trop.small.certain <- tnrs.trop.small[tnrs.trop.small$Name_number %in%
                                           index.all == T,]
dim(tnrs.trop.small.certain)
save(tnrs.trop.small.certain, file = "tnrs.trop.small.certain.Rdata")
write.csv(tnrs.trop.small.certain, file = "tnrs.trop.small.certain.csv")

tnrs.trop.small.uncertain <- tnrs.trop.small[tnrs.trop.small$Name_number %in%
                                             index.all == F, ]
dim(tnrs.trop.small.uncertain)
save(tnrs.trop.small.uncertain, file = "tnrs.trop.small.uncertain.Rdata")
write.csv(tnrs.trop.small.uncertain, file = "tnrs.trop.small.uncertain.csv")

backbone.tpl.trop.certain <- rbind(tnrs.tpl.certain, tnrs.trop.small.certain)
str(backbone.tpl.trop.certain)

Kick out some columns that should not be in the final backbone

backbone.tpl.trop.certain.2 <- backbone.tpl.trop.certain[ ,c(1:6,12:17,25:35)]
colnames(backbone.tpl.trop.certain.2)

Add extra columns to match the old backbone

backbone.tpl.trop.certain.2$Manual.matching <- NA
backbone.tpl.trop.certain.2$Status.correct <- NA
backbone.tpl.trop.certain.2$name.correct <- NA
backbone.tpl.trop.certain.2$rank.correct <- NA
backbone.tpl.trop.certain.2$family.correct <- NA
backbone.tpl.trop.certain.2$name.short.correct <- NA
backbone.tpl.trop.certain.2$rank.short.correct <- NA
backbone.tpl.trop.certain.2$names.sPlot.TRY <- NA
backbone.tpl.trop.certain.2$names.corr.string <- NA
backbone.tpl.trop.certain.2$sPlot.TRY <- NA

Bring names in the same order as in the old backbone

backbone.tpl.trop.certain.3 <- backbone.tpl.trop.certain.2[,c(1,31:33,2:30)]
match(colnames(backbone.tpl.trop.certain.3), colnames(backbone.splot.try3))
identical(colnames(backbone.tpl.trop.certain.3), colnames(backbone.splot.try3)) # TRUE

Fill in empty columns

`Status.correct`

backbone.tpl.trop.certain.3$Status.correct <- backbone.tpl.trop.certain.3$Taxonomic_status
index <- which(backbone.tpl.trop.certain.3$Name_matched == "No suitable matches found.")
backbone.tpl.trop.certain.3$Status.correct[index] <- "No suitable matches found."

Rename no.opinion to unresolved

index <- which(backbone.tpl.trop.certain.3$Status.correct == "No opinion")
backbone.tpl.trop.certain.3$Status.correct[index] <- "Unresolved"

`Name.correct`

Use "Accepted name", if unresolved take "Name_matched")

backbone.tpl.trop.certain.3$name.correct <- backbone.tpl.trop.certain.3$Accepted_name
index <- which(backbone.tpl.trop.certain.3$Taxonomic_status != "Accepted" |
               backbone.tpl.trop.certain.3$Taxonomic_status != "Synonym")
backbone.tpl.trop.certain.3$name.correct[index] <- backbone.tpl.trop.certain.3$
    Name_matched[index]
index <- which(backbone.tpl.trop.certain.3$Name_matched == "No suitable matches found.")
backbone.tpl.trop.certain.3$name.correct[index] <- "No suitable matches found."

`Rank.correct`

backbone.tpl.trop.certain.3$rank.correct <- backbone.tpl.trop.certain.3$Accepted_name_rank
index <- which(backbone.tpl.trop.certain.3$Taxonomic_status != "Accepted" |
               backbone.tpl.trop.certain.3$Taxonomic_status != "Synonym")
backbone.tpl.trop.certain.3$rank.correct[index] <-
    backbone.tpl.trop.certain.3$Name_matched_rank[index]
index <- which(backbone.tpl.trop.certain.3$Name_matched == "No suitable matches found.")
backbone.tpl.trop.certain.3$rank.correct[index] <- NA

`Family.correct`

backbone.tpl.trop.certain.3$family.correct <- backbone.tpl.trop.certain.3$
    Accepted_name_family
index <- which(backbone.tpl.trop.certain.3$Taxonomic_status != "Accepted" |
               backbone.tpl.trop.certain.3$Taxonomic_status != "Synonym")
backbone.tpl.trop.certain.3$family.correct[index] <-
    backbone.tpl.trop.certain.3$Name_matched_accepted_family[index]
index <- which(backbone.tpl.trop.certain.3$Name_matched == "No suitable matches found.")
backbone.tpl.trop.certain.3$family.correct[index] <- NA

`Name.short.correct`

backbone.tpl.trop.certain.3$name.short.correct <- backbone.tpl.trop.certain.3$name.correct

index <- which(backbone.tpl.trop.certain.3$rank.correct == "subspecies" |
               backbone.tpl.trop.certain.3$rank.correct == "variety" |
               backbone.tpl.trop.certain.3$rank.correct == "forma"|
               backbone.tpl.trop.certain.3$rank.correct == "subvariety" |
               backbone.tpl.trop.certain.3$Name_matched_rank == "subspecies" |
               backbone.tpl.trop.certain.3$Name_matched_rank == "variety")

library(stringr)
backbone.tpl.trop.certain.3$name.short.correct[index] <-
    word(string = backbone.tpl.trop.certain.3$name.short.correct[index], start = 1, end = 2)

`Rank.short.correct`

backbone.tpl.trop.certain.3$rank.short.correct <- backbone.tpl.trop.certain.3$rank.correct

index <- which(backbone.tpl.trop.certain.3$rank.correct == "subspecies" |
               backbone.tpl.trop.certain.3$rank.correct == "variety" |
               backbone.tpl.trop.certain.3$rank.correct == "forma"|
               backbone.tpl.trop.certain.3$rank.correct == "subvariety")

backbone.tpl.trop.certain.3$rank.short.correct[index] <- "species"

Kick out some columns that should not be in the final backbone

tnrs.trop.small.uncertain.2 <- tnrs.trop.small.uncertain[ ,c(1:6,12:17,25:35)]
colnames(tnrs.trop.small.uncertain.2)

Add extra columns to match the old backbone

tnrs.trop.small.uncertain.2$Manual.matching <- NA
tnrs.trop.small.uncertain.2$Status.correct <- NA
tnrs.trop.small.uncertain.2$name.correct <- NA
tnrs.trop.small.uncertain.2$rank.correct <- NA
tnrs.trop.small.uncertain.2$family.correct <- NA
tnrs.trop.small.uncertain.2$name.short.correct <- NA
tnrs.trop.small.uncertain.2$rank.short.correct <- NA

tnrs.trop.small.uncertain.2$names.sPlot.TRY <- NA
tnrs.trop.small.uncertain.2$names.corr.string <- NA
tnrs.trop.small.uncertain.2$sPlot.TRY <- NA

Bring names in the same order as in the old backbone

tnrs.trop.small.uncertain.3 <- tnrs.trop.small.uncertain.2[,c(1,31:33,2:30)]
match(colnames(tnrs.trop.small.uncertain.3), colnames(backbone.splot.try3))
identical(colnames(tnrs.trop.small.uncertain.3), colnames(backbone.splot.try3)) # TRUE

tnrs.trop.small.uncertain.3[,c(28:29)] <- "No suitable matches found."

Merge with backbone for certain species

backbone.tpl.trop <- rbind(backbone.tpl.trop.certain.3, tnrs.trop.small.uncertain.3)
head(backbone.tpl.trop)
dim(backbone.tpl.trop)
write.csv(backbone.tpl.trop, file = "backbone.tpl.trop.csv")
backbone.tpl.trop <- read.csv("backbone.tpl.trop.csv")

Fill in remaining first three columns

Assing all entries to sPlot ("S")

backbone.tpl.trop$sPlot.TRY <- "S"

Assing all entries to sPlot-version 2.1 ("sPlot2b")

backbone.tpl.trop$Manual.matching <- "sPlot2b"
miss.clean.2 <- cbind(1:length(miss.clean[,1]), as.data.frame(miss.clean))
colnames(miss.clean.2)[1] <- "Name_number"

Join the TRNS results

backbone.tpl.trop <- left_join(miss.clean.2, backbone.tpl.trop, by = "Name_number")
head(backbone.tpl.trop)

colnames(backbone.tpl.trop)

backbone.tpl.trop.2 <- backbone.tpl.trop[,-c(4:6)]
colnames(backbone.tpl.trop.2)[2:3] <- c("names.sPlot.TRY", "names.corr.string")
match(colnames(backbone.tpl.trop.2), colnames(backbone.splot.try3))
identical(colnames(backbone.tpl.trop.2), colnames(backbone.splot.try3))

write.csv(backbone.tpl.trop.2, file = "backbone.tpl.trop.2.csv")
save(backbone.tpl.trop.2, file = "backbone.tpl.trop.2.Rdata")

Merge additional species with the big backbone for `sPlot 2.0` and `TRY 3.0`

backbone.splot2b.try3 <- rbind(backbone.splot.try3, backbone.tpl.trop.2)
dim(backbone.splot2b.try3)

Check whether the new backbone matches the sPlot 2.1 species data:

str(splot.species)
length(unique(splot.species$Matched_concept))
which(unique(splot.species$Matched_concept) %in% backbone.splot2b.try3$names.sPlot.TRY == F)
backbone.splot2.1.try3 <- backbone.splot2b.try3
write.csv(backbone.splot2.1.try3, file = "backbone.splot2.1.try3.csv")
save(backbone.splot2.1.try3, file = "backbone.splot2.1.try3.Rdata")

Tag vascular species in `backbone.splot2.1.try3`

Because sPlot, and to some extent TRY, contain a bunch of names belonging to non-vascular, those need to be tagged.

load("../backbone.splot2.1.try3.is.vascular.Rdata")

colnames(backbone.splot2.1.try3)

##  [1] "Name_number"                  "names.sPlot.TRY"             
##  [3] "names.corr.string"            "sPlot.TRY"                   
##  [5] "Name_submitted"               "Overall_score"               
##  [7] "Name_matched"                 "Name_matched_rank"           
##  [9] "Name_score"                   "Family_score"                
## [11] "Name_matched_accepted_family" "Genus_matched"               
## [13] "Genus_score"                  "Specific_epithet_matched"    
## [15] "Specific_epithet_score"       "Unmatched_terms"             
## [17] "Taxonomic_status"             "Accepted_name"               
## [19] "Accepted_name_author"         "Accepted_name_rank"          
## [21] "Accepted_name_url"            "Accepted_name_species"       
## [23] "Accepted_name_family"         "Selected"                    
## [25] "Source"                       "Warnings"                    
## [27] "Manual.matching"              "Status.correct"              
## [29] "name.correct"                 "rank.correct"                
## [31] "family.correct"               "name.short.correct"          
## [33] "rank.short.correct"           "is.vascular.species"         
## [35] "sPlot2.1.TRY"

Get families

fam <- unique(backbone.splot2.1.try3$family.correct)
head(fam)

## [1] ""             "Asparagaceae" "Poaceae"      "Fabaceae"    
## [5] "Polygalaceae" "Orchidaceae"

gbiffam <- sapply(fam[2:5], function(x) name_usage(name=x,
                                                   rank = 'FAMILY,', limit = 1)$data$phylum)
gbiffam[which(sapply(gbiffam, function(x) is.null(x)))] <- 'unknown'
fam$phylum <- unlist(gbiffam, use.names = TRUE)
table(fam$phylum)
write.csv(fam, file='family_affiliation_gbif.csv')
family_affiliation_gbif <- read.csv('../Florian_TaxStand/family_affiliation_gbif.csv')

table(family_affiliation_gbif$phylum)

## 
## Anthocerotophyta       Ascomycota    Basidiomycota        Bryophyta 
##                1               35               15               57 
##       Charophyta      Chlorophyta    Cyanobacteria      Glaucophyta 
##                3                6                2                1 
##  Marchantiophyta       Ochrophyta       Rhodophyta     Tracheophyta 
##               45               17                8              474 
##          unknown 
##                5

Add column `"is.vascular.species"` and set all families that correspond to `Tracheophyta` (`vascular plants == TRUE`)

table(is.na(backbone.splot2.1.try3$is.vascular.species))

## 
##  FALSE   TRUE 
## 123589   7013

Select families that belong to Tracheophyta:

fam.trach <- family_affiliation_gbif$Var1[family_affiliation_gbif$phylum == "Tracheophyta"]
dim(fam.trach)
ind.vasc <- backbone.splot2.1.try3$family.correct %in% fam.trach
backbone.splot2.1.try3$is.vascular.species <- ind.vasc
backbone.splot2.1.try3$is.vascular.species[backbone.splot2.1.try3$is.vascular.species ==
                                           FALSE] <- NA
backbone.splot2.1.try3.is.vascular <- backbone.splot2.1.try3

To obtain proper stats for sPlot2.1, kick out (or tag) the species that were only found in sPlot 2.0 but not in sPlot 2.1:

splot.species.july2015 <- read.csv("/home/oliver/Dokumente/PhD/PostPhD/IDiv/sDiv/sPlot/
Analyses/Data/Species/sPlot/sPlot_2015_07_29/sPlot_2015_07_29_species.csv", sep = "\t")
gc()

splot.species.july2015.spec <- as.character(unique(splot.species.july2015$Matched.concept))
str(splot.species.july2015.spec)
write.csv(splot.species.july2015.spec, file = "splot.species.july2015.spec.csv")

There are 86,432 unique, uncorrected names in sPlot 2.1. Check whether all of these names are in the new backbone?

load("backbone.splot2.1.try3.is.vascular.Rdata")

Export small backbone:

backbone.splot2.1.try3.small <- backbone.splot2.1.try3[,c(2,32)]
write.csv(backbone.splot2.1.try3.small, file = "backbone.splot2.1.try3.small.csv")

load("backbone.splot2.1.try3.Rdata")
ind <- (splot.species.july2015.spec %in% backbone.splot2.1.try3$names.sPlot.TRY)
splot.species.july2015.spec[ind==F]
table(ind)

But there are only 86,427 names here. Five species are missing! Why?

The following species are in sPlot2.1 (splot.species.july2015.spec) but not in the backbone (backbone.splot2.1.try):

"Strauch like Oraniquelocarpus"
"Buchenavia [GUNDINGA]"
"[ms554 Triumfetta Ahorn]"
"Dolichos holosericea"
"[Aeschynomene latifolia]"

Actually, they are in the backbone but just spelled differently, e.g. Strauch like ""Oraniquelocarpus"" or [ms554 Triumfetta "Ahorn"]. Probably those names got messed up, generate a vector of that 5 species that is consistent with the spelling used in the missing-species-list and in the final backbone.

miss7701 <- read.csv("/home/oliver/Dokumente/PhD/PostPhD/IDiv/sDiv/sPlot/Analyses/Code/
Mismatches_29_07_2015_new.csv", stringsAsFactors = F)
vec5 <- miss7701[c(49, 3165, 5793, 5814, 5823), 1]

Add these five species to the sPlot 2.1 species list:

splot.species.july2015.spec.5 <- c(splot.species.july2015.spec, vec5)

In the big backbone, identify species that are in "S" and "ST" and that are in our sPlot 2.1 species list (should be 86,432).

ind <- (splot.species.july2015.spec.5 %in% backbone.splot2.1.try3$names.sPlot.TRY)
splot.species.july2015 <- splot.species.july2015.spec.5[ind]
ind <- (splot.species.july2015 %in% backbone.splot2.1.try3$names.sPlot.TRY)
table(ind)

Great, these are the species in sPlot 2.1.

Select species in backbone that are in `sPlot2.1` (`splot.species.july2015`)

ind.splot2.1 <- which(backbone.splot2.1.try3$names.sPlot.TRY %in% splot.species.july2015)

There are 86,432 species in TRY, that are both only in TRY as well as in sPlot and TRY.

Select species that are in TRY

ind.try <- which(backbone.splot2.1.try3$sPlot.TRY=="T" | backbone.splot2.1.try3$sPlot.TRY=="ST")

60,273 species that are in TRY.

Species that are both in sPlot and TRY

intersect_all <- function(a,b,...){
  Reduce(intersect, list(a,b,...))
}

inter <- intersect_all(ind.splot2.1, ind.try)

24,844 names shared between sPlot and TRY.

Check whether the three numbers above are correct, and add new column sPlot2.1.TRY to backbone, that tags the entries that belong to sPlot 2.1 and TRY 3.0, respectively.

backbone.splot2.1.try3$ST <- NA

backbone.splot2.1.try3$ST[ind.splot2.1] <- "S"
backbone.splot2.1.try3$ST[ind.try] <- "T"
backbone.splot2.1.try3$ST[inter] <- "ST"
colnames(backbone.splot2.1.try3)[35] <- "sPlot2.1.TRY"
save(backbone.splot2.1.try3, file = "backbone.splot2.1.try3.is.vascular.Rdata")

Statistics for backbone combining names in `sPlot2.1` and `TRY3.0`

All taxon name entries

load("../backbone.splot2.1.try3.is.vascular.Rdata")

How many entries in the backbone were only found in the old sPlot-version 2.0 (sPlot_14_4_2015) but not in sPlot 2.1 sPlot_2015_07_19?

table(!is.na(backbone.splot2.1.try3$sPlot2.1.TRY))

## 
##  FALSE   TRUE 
##   8741 121861

There are 8,741 entries that do not belong to sPlot 2.1. In total 121,861 (unstandardized) name entries in sPlot 2.1 and TRY 3.0 combined. Index those species that were exclusiely found in the sPlot 2.0 but not in version 2.1, and remove them from the backbone (back2.1) to update the match statistics.

ind2.1 <- !is.na(backbone.splot2.1.try3$sPlot2.1.TRY)
back2.1 <- backbone.splot2.1.try3[ind2.1, ]

Database affiliations (sPlot 2.1 and TRY 3.0).

kable(t(table(back2.1$sPlot2.1.TRY)), caption = "Number of (standardized) name entries
unique to, or shared between TRY (S) and sPlot (T).")

S	ST	T
61588	24844	35429

60,273 of the total number of entries belong to TRY (incl. names that are only in TRY as well as in sPlot and TRY). 86,432 name entries belong to sPlot (incl. names that are only in TRY as well as in sPlot and TRY).

table(back2.1$Manual.matching)

## 
## sPlot2b       x 
##    7694    1432

1,432 name entries were matched manually (see above).

Taxonomic ranks:

kable(t(table(back2.1$rank.short.correct)), caption = "Number of (standardized) name entries
per taxonomic rank.")

family	genus	higher	species
1745	12373	1020	105777

Taxonomic status:

kable(t(table(back2.1$Status.correct)), caption = "Number of (standardized) name entries
that correspond to `Accepted`, `Synonyms` or Unresolved species, respecively.")

Accepted	No suitable matches found.	Synonym	Unresolved
94068	1902	21028	4863

Total number of unique standardized taxon names and families:

length(unique(back2.1$name.short.correct))-1 # minus 1 for NA

## [1] 86760

length(unique(back2.1$family.correct))-2 # minus 2 for "" and NA

## [1] 663

Number of entries corresponding to vascular plant species:

table(back2.1$is.vascular.species)

## 
##   TRUE 
## 115678

Number of duplicated entries after taxonomic standardization: Frequency of original (non-standardized) species names per resolved (non-standardized) name (excluding non-vascular and non-matched species).

df.count <- back2.1 %>%
    dplyr::filter(is.vascular.species == TRUE, !is.na(name.short.correct)) %>%
    dplyr::group_by(name.short.correct) %>%
    dplyr::summarise(n = n()) %>%
    dplyr::arrange(desc(n))

kable(df.count[c(1:3, 21:25, 101:110), ], , caption = "Number of unresolved, original name
enties per resolved name.")

Based on `unique` standardized names

Generate version of the backbone that only includes the unique resolved names in name.short.correct, and for the non-unique names, the first rows of duplicated name:

back2.1.uni <- back2.1[!duplicated(back2.1$name.short.correct), ]

Remove the first entry, which is NA:

back2.1.uni <- back2.1.uni[-1, ]

length(unique(back2.1.uni$name.short.correct))

## [1] 86761

There are 86,760 unique taxon names the in backbone. Exclude the non-vascular plant and non-matching taxon names:

df.uni <- back2.1.uni %>%
    dplyr::filter(is.vascular.species == TRUE, !is.na(name.short.correct))

df.uni had two names less than df.count, as they were accidentially tagged as non-vascular species names. Resolve that issue:

miss <- df.count$name.short.correct[(df.count$name.short.correct %in% df.uni$name.short.correct) == F]
which(df.count$name.short.correct %in% c("Arabis stellulata", "Coptidium pallasii"))
which(df.uni$name.short.correct %in% c("Arabis stellulata", "Coptidium pallasii"))
df.count[c(20319, 31478), ]
ind <- which(back2.1$name.short.correct %in% c("Arabis stellulata", "Coptidium pallasii"))

back2.1[ind, 31:34]

##       family.correct name.short.correct rank.short.correct
## 8853    Brassicaceae  Arabis stellulata            species
## 8932    Brassicaceae  Arabis stellulata            species
## 30191  Ranunculaceae Coptidium pallasii            species
## 96184  Ranunculaceae Coptidium pallasii            species
##       is.vascular.species
## 8853                 TRUE
## 8932                 TRUE
## 30191                TRUE
## 96184                TRUE

back2.1[ind[1], 31:34] <- back2.1[ind[2], 31:34]
back2.1[ind[3], 31:34] <- back2.1[ind[4], 31:34]

df.uni <- back2.1.uni %>%
    dplyr::filter(is.vascular.species == TRUE, !is.na(name.short.correct))

Now, run the stats for unique resolved names (excluding non-vascular and non-matching taxa):

length(df.uni$name.short.correct)

## [1] 83677

There are 83,679 unique (non-vascular plant) taxon names:

kable(t(table(df.uni$sPlot2.1.TRY)), caption = "Number of (standardized) vascular plant
taxon names per unique to, and shared between TRY (S) and sPlot (T).")

S	ST	T
34105	20414	29158

table(df.uni$Manual.matching)

## 
## sPlot2b       x 
##    1637     490

490 of those names were matched manually (see above).

Taxonomic ranks:

kable(t(table(df.uni$rank.short.correct)), caption = "Number of (standardized) name entries
per taxonomic rank.")

family	genus	higher	species
152	4343	13	79169

Taxonomic status:

kable(t(table(df.uni$Status.correct)), caption = "Number of (standardized) name entries that
correspond to `Accepted`, `Synonyms` or Unresolved species, respecively.")

Accepted	Synonym	Unresolved
69173	9998	4506

Total number of unique standardized taxon names and families:

length(unique(back2.1$name.short.correct))-1 # minus 1 for NA

## [1] 86760

length(unique(back2.1$family.correct))-2 # minus 2 for "" and NA

## [1] 663

Number of entries corresponding to vascular plant species:

table(back2.1$is.vascular.species)

## 
##   TRUE 
## 115678

Number of duplicated entries after taxonomic standardization: Frequency of original (non-standardized) species names per resolved (non-standardized) name (excluding non-vascular and non-matched species).

df.count <- back2.1 %>%
    dplyr::filter(is.vascular.species == TRUE, !is.na(name.short.correct)) %>%
    dplyr::group_by(name.short.correct) %>%
    dplyr::summarise(n = n()) %>%
    dplyr::arrange(desc(n))

kable(df.count[c(1:3, 21:25, 101:110), ], , caption = "Number of unresolved, original
name enties per resolved name.")

name.short.correct	n
Poaceae	254
Lauraceae	175
Asteraceae	146
Myrcia	47
Taraxacum	42
Pouteria	41
Malpighiaceae	39
Guarea	37
Annonaceae	17
Antimima	17
Brosimum	17
Centaurium erythraea	17
Chrysophyllum	17
Cinnamomum	17
Elymus hispidus	17
Euphorbia	17
Lamium galeobdolon	17
Mollinedia	17

Based on `unique` standardized names

Generate version of the backbone that only includes the unique resolved names in name.short.correct, and for the non-unique names, the first rows of duplicated name:

back2.1.uni <- back2.1[!duplicated(back2.1$name.short.correct), ]

Remove the first entry, which is NA:

back2.1.uni <- back2.1.uni[-1, ]
length(unique(back2.1.uni$name.short.correct))

There are 86,760 unique taxon names the in backbone. Exclude the non-vascular plant and non-matching taxon names:

df.uni <- back2.1.uni %>%
    dplyr::filter(is.vascular.species == TRUE, !is.na(name.short.correct))

df.uni had two names less than df.count, as they were accidentially tagged as non-vascular species names. Correct that:

miss <- df.count$name.short.correct[(df.count$name.short.correct %in% df.uni$name.short.correct) == F]
which(df.count$name.short.correct %in% c("Arabis stellulata", "Coptidium pallasii"))
which(df.uni$name.short.correct %in% c("Arabis stellulata", "Coptidium pallasii"))
df.count[c(20319, 31478), ]
ind <- which(back2.1$name.short.correct %in% c("Arabis stellulata", "Coptidium pallasii"))

back2.1[ind, 31:34]

##       family.correct name.short.correct rank.short.correct
## 8853    Brassicaceae  Arabis stellulata            species
## 8932    Brassicaceae  Arabis stellulata            species
## 30191  Ranunculaceae Coptidium pallasii            species
## 96184  Ranunculaceae Coptidium pallasii            species
##       is.vascular.species
## 8853                 TRUE
## 8932                 TRUE
## 30191                TRUE
## 96184                TRUE

back2.1[ind[1], 31:34] <- back2.1[ind[2], 31:34]
back2.1[ind[3], 31:34] <- back2.1[ind[4], 31:34]

df.uni <- back2.1.uni %>%
    dplyr::filter(is.vascular.species == TRUE, !is.na(name.short.correct))

Now, run the stats for unique resolved names (excluding non-vascular and non-matching taxa):

length(df.uni$name.short.correct)

There are 83,679 unique (non-vascular plant) taxon names:

Database affiliations (sPlot 2.1 and TRY 3.0).

kable(t(table(df.uni$sPlot2.1.TRY)), caption = "Number of (standardized) vascular
plant taxon names per unique to, and shared between TRY (S) and sPlot (T).")

S	ST	T
34105	20414	29158

table(df.uni$Manual.matching)

## 
## sPlot2b       x 
##    1637     490

1,432 of those names were matched manually (see above). Taxonomic ranks:

kable(t(table(df.uni$rank.short.correct)), caption = "Number of (standardized) vascular
plant taxon names per taxonomic rank.")

family	genus	higher	species
152	4343	13	79169

Taxonomic status:

kable(t(table(df.uni$Status.correct)), caption = "Number of (standardized) vascular plant
taxon names that correspond to `Accepted`, `Synonyms` or Unresolved species, respecively.")

Accepted	Synonym	Unresolved
69173	9998	4506

Stats for the corrected names in `sPlot` only:

df.uni.splot <- df.uni %>%
    dplyr::filter(is.vascular.species == TRUE, !is.na(name.short.correct), df.uni$sPlot2.1.TRY!= "T")

length((df.uni.splot$name.short.correct))

## [1] 54519

Database affiliations

kable(t(table(df.uni.splot$sPlot2.1.TRY)), caption = "Number of (standardized) vascular
plant taxon names per unique to sPlot (S), and shared between TRY and sPlot (ST).")

S	ST
34105	20414

table(df.uni.splot$Manual.matching)

## 
## sPlot2b       x 
##    1637     266

266 uniquenames in sPlot were matched manually (see above).

Taxonomic ranks:

kable(t(table(df.uni.splot$rank.short.correct)), caption = "Number of (standardized)
vascular plant taxon names per taxonomic rank.")

family	genus	higher	species
133	2447	9	51930

Taxonomic status:

kable(t(table(df.uni.splot$Status.correct)), caption = "Number of (standardized) vascular
plant taxon names that correspond to `Accepted`, `Synonyms` or Unresolved species, respecively.")

Accepted	Synonym	Unresolved
46009	5967	2543

Number of families in sPlot:

length(unique(df.uni.splot$family.correct))

## [1] 439

Done!

`R`-settings

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libopenblas.so.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bindrcpp_0.2      rmarkdown_1.6     rgbif_0.9.8      
##  [4] Taxonstand_2.0    pbapply_1.3-3     plyr_1.8.4       
##  [7] dplyr_0.7.2.9000  doParallel_1.0.10 iterators_1.0.8  
## [10] foreach_1.4.3     vegdata_0.9       foreign_0.8-69   
## [13] knitr_1.16        stringr_1.2.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12      highr_0.6         compiler_3.4.1   
##  [4] bindr_0.1         tools_3.4.1       digest_0.6.12    
##  [7] evaluate_0.10.1   jsonlite_1.5      tibble_1.3.3     
## [10] gtable_0.2.0      lattice_0.20-35   pkgconfig_2.0.1  
## [13] rlang_0.1.1.9000  yaml_2.1.14       xml2_1.1.1       
## [16] httr_1.2.1.9000   rgeos_0.3-23      rprojroot_1.2    
## [19] grid_3.4.1        glue_1.1.1        data.table_1.10.4
## [22] geoaxe_0.1.0      R6_2.2.2          oai_0.2.2        
## [25] XML_3.98-1.9      sp_1.2-5          whisker_0.3-2    
## [28] ggplot2_2.2.1     magrittr_1.5      backports_1.1.0  
## [31] htmltools_0.3.6   scales_0.4.1      codetools_0.2-15 
## [34] assertthat_0.2.0  colorspace_1.3-2  stringi_1.1.5    
## [37] lazyeval_0.2.0    munsell_0.4.3

References

Boyle, B., Hopkins, N., Lu, Z., Garay, J.A.R., Mozzherin, D., Rees, T., Matasci, N., Narro, M.L., Piel, W.H., Mckay, S.J. & others. (2013) The taxonomic name resolution service: An online tool for automated standardization of plant names. BMC Bioinformatics, 14, 16.

Chase, M. & Reveal, J. (2009) A phylogenetic classification of the land plants to accompany apgiii. Botanical Journal of the Linnean Society, 161, 122–127.

Federhen, S. (2010) The NCBI Handbook [Internet]. (eds J. McEntyre & J. Ostell) Available from: http://www.ncbi.nlm.nih.gov/guide/taxonomy/; National Center for Biotechnology Information, Bethesda, MD, USA, [Accessed: 25 Oct 2011].

Flann, C. (2009) Global Compositae Checklist. [Accessed 2 Apr 2013]. Available from: http://compositae.landcareresearch.co.nz/.

International legume database and information service. (2006) [Internet, Accessed 21 Aug 2015]. Available from: http://www.ildis.org/LegumeWeb.

iPlant Collaborative. (2015) The Taxonomic Name Resolution Service [Internet]. Version 4.0 [Accessed: 20 Sep 2015]. Available from: http://tnrs.iplantcollaborative.org/; National Center for Biotechnology Information.

Missouri Botanical Garden. (2013) Tropicos.org [Internet, Accessed 19 Dec 2014]. Available from: http://www.tropicos.org.

The Plant List. (2013) [Internet]. Version 1.1. [Accessed: 19 Aug 2015]. Available from: http://www.theplantlist.org/.

USDA, NRCS. (2012) The Plants Database [Internet]. [Accessed: 17 Jan 2015]. Available from: http://plants.usda.gov; National Plant Data Team, Greensboro, NC, USA.

[1] sPlot_14_4_2015_species no longer exists, but will be used for documentation purposes here. This is the version of splot to generate the backbone for to match splot 2.0 to try 3.0. The following workflow describes the steps needed for generating a backbone for sPlot_14_4_2015 and TRY 3.0. The backbone used in the third sPlot workshop contains the 7,701 additional (and/or different) species names in sPlot 2.1 (sPlot_2015_07_19_species) that had to be standardized in a ad-hoc fashion prior to the workshop.

[2] This species list only contains 122,901 species. The final backbone contains 7,701 additional (and/or different) species from from the ad-hoc version sPlot_2015_07_19.

[3] These were temporary files that were deleted after the name matching procedure.

[4] The full original tnrs.tpl table, including all 123,000+ species, has been overwritten, and now instead only contains the 121,000+ species from sPlot 2.1 and TRY 3.0.

[5] Note, the single temporary result files from TNRS are no longer available.

[6] Temporary file that no longer exists.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
LICENSE		LICENSE
README.md		README.md
Taxback.Rmd		Taxback.Rmd
Taxback.html		Taxback.html
Taxback.md		Taxback.md
Taxback.pdf		Taxback.pdf
bibliography.bib		bibliography.bib
journal-of-ecology.csl		journal-of-ecology.csl

License

oliverpurschke/Taxonomic_Backbone

Folders and files

Latest commit

History

Repository files navigation

Taxonomic backbone for sPlot 2.1 and TRY 3.0

Combine taxon names from sPlot 2.0 and TRY 3.0

Load required packages

Read in taxon names from sPlot and TRY

Reading sPlot Species

Read in TRY 3.0 species list

Combine species lists

Add species' database affiliation

A-priori cleaning of names

Manual cleaning

String manipulation routines

Substitute a-priorily cleaned names with corrected names

Do some more cleaning on CleanedNames.JD

Substitute old CleanedNames.JD with new string-manipulated one:

Match names against Taxonomic Name Resolution Service (TNRS)

Slice CleanedNames into chunks

TNRS settings

Sources for name resolution

Family Classification

Retrieve results

Manually inspecting name matching results

Read and combine TNRS result files

Set up doParallel cluster to speed up the reading process

Read and combine the single files

Select correctly resolved names

General procedure

Family level

Create index for correct names

Forma level

Genus level

Infraspecies level

Species level

Subspecies level

Variety level

Identifying "non-matched" species that are spermatophyta

Select certain or uncertain names

Resolve 'uncertain' names using Tropicos

Read in 'TNRS-Tropicos' results

Reduce to set of uncertain species tnrs.tpl

Selecting correctly resolved names

Family level

Forma level

Genus level

Species level

Subspecies level

Variety level

Select certain and uncertain names

Resolve ‘uncertain’ names using NCBI

Read in ‘TNRS-NCBI’ results

Selecting correctly resolved names in TNRS_NCBI

Family level

Genus level

Species level

Variety level

Select certain or uncertain names

Resolve remaining uncertain names

Identify certain and uncertain species

Using The Plant List matching tools for unresolved names

Lists of names for TPL

Tag names that could not be resolved

Check ncbi.certain

Assign No suitable matches found. to non-correctable species

Combine the uncertain and certain species lists

Resolve names that were reduced to genus-level by TNRS

Manually add and fill in new columns to tpl.res.comb.csv:

Combine species.correct and genus.correct

Join tnrs.ncbi.certain and tpl.res.comb.2

Get match correct

Merge the resolved species lists

Read files

TPL.small.certain

Trop.small.certain

Pick the respective NCBI data sets

Complement the main columns in the backbone

Do some more cleaning on `CleanedNames.JD`

Substitute old `CleanedNames.JD` with new string-manipulated one:

Slice `CleanedNames` into chunks

Set up `doParallel` cluster to speed up the reading process

Select `certain` or `uncertain` names

Resolve 'uncertain' names using `Tropicos`

Reduce to set of uncertain species `tnrs.tpl`

Select `certain` and `uncertain` names

Resolve ‘uncertain’ names using `NCBI`

Selecting correctly resolved names in `TNRS_NCBI`

Select `certain` or `uncertain` names

Resolve remaining `uncertain` names

Identify `certain` and `uncertain` species

Using `The Plant List` matching tools for unresolved names

Lists of names for `TPL`

Check `ncbi.certain`

Assign `No suitable matches found.` to non-correctable species

Combine the `uncertain` and `certain` species lists

Manually add and fill in new columns to `tpl.res.comb.csv`:

Combine `species.correct` and `genus.correct`

Join `tnrs.ncbi.certain` and `tpl.res.comb.2`

`TPL.small.certain`

`Trop.small.certain`

Pick the respective `NCBI` data sets

Fill in `rank.correct`

Fill `status.correct`

Fill `name.correct`

Fill in `names.short correct`

Shorten names that have more than two words and where the second word is a `x`

In `name.short correct` insert `NA` where `"No suitable matches found."`

Fill in `rank.short.correct`

Link backbone to original taxa names and `sPlot/TRY` code

Identifying `non-matched` species that are spermatophyta

Select correctly resolved names (`Tropicos` first)

`Status.correct`

`Name.correct`

`Rank.correct`

`Family.correct`

`Name.short.correct`

`Rank.short.correct`

Merge additional species with the big backbone for `sPlot 2.0` and `TRY 3.0`

Tag vascular species in `backbone.splot2.1.try3`