rBiodiversidata/Plants Data Paper Scripts at master · bienflorencia/rBiodiversidata

Name	Name	Last commit message	Last commit date
parent directory ..
README_files/figure-markdown_strict	README_files/figure-markdown_strict
DataPaper_Scripts.Rmd	DataPaper_Scripts.Rmd
README.html	README.html
README.md	README.md

The following scripts were used to generate the analyses and figures of the Data Paper “Biodiversidata: A novel dataset for the vascular plant species diversity in Uruguay”, published in the Biodiversity Data Journal.

The dataset Biodiversidata_Plants_1.0.0 provides primary biodiversity data on extant vascular plant (native and introduced) species recorded within the country area of Uruguay. This is Biodiversidata's second open-access set of data on the biodiversity of the country.

You can find the data here

We will see the data's:

Taxonomic coverage
Temporal coverage
Geographic coverage

To run this code you will need the following R packages:

library(extrafont)
library(lubridate)
library(gridExtra)
library(ggpubr)
library(sf)
library(tidyverse)

First, let's explore the data!

Let's start by reading the data from the database. Once you have downloaded the dataset, the data will look like this (a tibble with 12,470 rows and 61 columns):

Biodiversidata_Plants

## # A tibble: 12,470 x 61
##    license occurrenceID basisOfRecord recordedBy establishmentMe~
##    <chr>   <chr>        <chr>         <chr>      <chr>           
##  1 https:~ Biodiversid~ HumanObserva~ Andrés Go~ introducida     
##  2 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
##  3 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
##  4 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
##  5 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
##  6 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
##  7 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
##  8 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
##  9 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
## 10 https:~ Biodiversid~ HumanObserva~ Andrés Go~ nativa          
## # ... with 12,460 more rows, and 56 more variables:
## #   previousIdentifications <chr>, eventDate <chr>, year <dbl>, month <dbl>,
## #   day <dbl>, higherGeography <chr>, continent <chr>, country <chr>,
## #   countryCode <chr>, stateProvince <chr>, locality <chr>,
## #   decimalLatitude <dbl>, decimalLongitude <dbl>, geodeticDatum <chr>,
## #   coordinateUncertaintyInMeters <dbl>, coordinatePrecision <dbl>,
## #   georeferencedBy <chr>, identifiedBy <chr>, taxonID <chr>,
## #   scientificName <chr>, nameAccordingTo <chr>, higherClassification <chr>,
## #   kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
## #   species <chr>, genus <chr>, specificEpithet <chr>,
## #   infraspecificEpithet <chr>, taxonRank <chr>,
## #   scientificNameAuthorship <chr>, institutionCode <chr>,
## #   collectionCode <chr>, catalogNumber <chr>, recordNumber <chr>,
## #   associatedReferences <chr>, verbatimLocality <chr>,
## #   georeferenceRemarks <chr>, vernacularName <chr>, locationAccordingTo <chr>,
## #   georeferencedDate <date>, georeferenceSources <chr>,
## #   georeferenceVerificationStatus <chr>, organismRemarks <chr>,
## #   verbatimLatitude <chr>, verbatimLongitude <chr>,
## #   verbatimCoordinateSystem <chr>, locationRemarks <chr>,
## #   measurementType <chr>, measurementValue <chr>,
## #   measurementDeterminedBy <chr>, measurementRemarks <chr>, language <chr>,
## #   otherCatalogNumbers <chr>

What's the total amount of records and species in the database?

To get sumarised data out of our database, such as the total number of species or the total number of records with date of collection, we use the function summarise together with n_distinct. We can group_by any column we want, for example, by the taxonomic class of each record or by any specific group. To get the Total values, we can also summarise without grouping the data.

Biodiversidata_Plants %>%
  mutate(group=ifelse(class=='Magnoliopsida'|class=='Liliopsida', 'Angiosperms',
                            ifelse(class=='Gnetopsida'|class=='Pinopsida', 'Gymnosperms',
                                   ifelse(class=='Polypodiopsida', 'Ferns', 
                                          'Lycophytes')))) %>% 
  group_by(group) %>% # use this or comment depending on the output you want
  summarise(N= n(), # total number of records
            S= n_distinct(species), # number of species
            wDate= sum(!(is.na(year))), # records that have a year of collection
            last30y = sum(year>1990, na.rm = T), # records that were collected in the last 30 years
            wCoordinates= sum(!(is.na(decimalLatitude))))  # records that have a decimal latitude

Here's a table that summarises all the data, binding the information extracted for the groupped data and for the total data. To produce the table we are using knitr::kable.

Group	Number of Occurrence Records	Number of Species	Records with Date (%)	Records from the last 30 years (%)	Records with Coordinates (%)
Lycophytes	13	6	13 (100)	11 (84.6)	13 (100)
Ferns	540	78	540 (100)	508 (94.1)	540 (100)
Gymnosperms	48	5	41 (85.4)	39 (81.2)	48 (100)
Angiosperms	11869	1559	10527 (88.7)	9585 (80.8)	11869 (100)
Total	12470	1648	11121 (89.2)	10143 (81.3)	12470 (100)

General overview

To get other useful information to fill the metadata section, we will also summarise fields such as:

knitr::kable(Biodiversidata_Plants %>% 
  unite('species', genus, specificEpithet, na.rm = TRUE, sep = ' ', remove = FALSE) %>% 
  mutate(places=str_c(decimalLatitude, decimalLongitude)) %>% # unique point locations
  summarise(recordsT= n(),
            speciesN= n_distinct(species),
            familyN= n_distinct(family),
            genusN= n_distinct(genus),
            placesN= n_distinct(places),
            StartDate=min(eventDate, na.rm = T),
            EndDate=max(eventDate, na.rm = T),
            North=max(decimalLatitude),
            South=min(decimalLatitude),
            East=max(decimalLongitude),
            West=min(decimalLongitude),
            departamentos=n_distinct(stateProvince)) %>% 
  rename('Occurrence Records'=recordsT, 'Number of Species'=speciesN, 
         'Families'=familyN, 'Genera'=genusN,
         'Localities (lat/lon)'=placesN, 'Start Date'=StartDate,
         'End Date'=EndDate, 'States (Departamentos)'=departamentos),  align = 'r')

Occurrence Records	Number of Species	Families	Genera	Localities (lat/lon)	Start Date	End Date	North	South	East	West	States (Departamentos)
12470	1648	160	760	4567	1877-02	2020-05-21	-30.17587	-34.9751	-53.2636	-58.43	19

1) Taxonomic coverage

The database includes 12,470 records from 160 families and 760 genera, representing 1,648 species: with 6 species of the Lycopodiopsida class (Lycophytes), 48 of Gnetopsida and Pinopsida classes (Gymnosperms), 78 of the Polypodiopsida class (Ferns), and 1,145 of the Magnoliopsida and Liliopsida classes (Angiosperms). Also, the database includes 1,379 native species and 270 introduced, you can see that in the column establishmentMeans.

What's the number of records/species/families per group?

Group	Records	Species	Families
Lycophytes	13	6	2
Ferns	540	78	15
Gymnosperms	48	5	4
Angiosperms	11869	1559	139

To simplify the plots we are going to keep only the families of Angiosperms and Ferns with more than 50 a 20 records respectively. We will use this variable familiesToPlot to filter families in the following plots:

familiesToPlot <- Biodiversidata_Plants %>% 
  group_by(class, family) %>% count() %>% 
  mutate(classLevels=ifelse(class=='Magnoliopsida'|class=='Liliopsida', 'Angiosperms',
                            ifelse(class=='Gnetopsida'|class=='Pinopsida', 'Gymnosperms',
                                   ifelse(class=='Polypodiopsida', 'Ferns', 
                                          'Lycophytes')))) %>% 
  filter(classLevels!='Angiosperms' | classLevels=='Angiosperms'& n>=50) %>% 
  filter(classLevels!='Ferns' | classLevels=='Ferns'& n>=20) %>% pull(family)

Now we plot the number of records by family, faceting the plots by group.

grid.arrange(Biodiversidata_Plants %>% 
  group_by(class, family) %>% count() %>% 
  mutate(classLevels=ifelse(class=='Magnoliopsida'|class=='Liliopsida', 'Angiosperms',
                            ifelse(class=='Gnetopsida'|class=='Pinopsida', 'Gymnosperms',
                                   ifelse(class=='Polypodiopsida', 'Ferns', 
                                          'Lycophytes')))) %>% 
  filter(classLevels!='Angiosperms' & family %in% familiesToPlot) %>% 
  mutate(classLevels=factor(classLevels, levels=c('Lycophytes',
                                                  'Ferns', 
                                                  'Gymnosperms',
                                                  'Angiosperms'))) %>% 
  ggbarplot(x = "family", y = "n",
            fill = "classLevels",             
            color = "white",      
            drop= T,
            palette = c("#D7191C", "#FDAE61", "#ABDDA4"),   
            sort.val = "desc",         
            x.text.angle = 90,          
            legend = "none",
            ylab = 'Number of Records', xlab = '') + 
  scale_y_continuous(breaks=scales::breaks_extended(8))+
  theme(text=element_text(family='Calibri', size = 10)) +
  facet_wrap(~classLevels, nrow=2, ncol=3, scales = "free", drop=TRUE),
  Biodiversidata_Plants %>% 
    group_by(class, family) %>% count() %>% 
    mutate(classLevels=ifelse(class=='Magnoliopsida'|class=='Liliopsida', 'Angiosperms',
                              ifelse(class=='Gnetopsida'|class=='Pinopsida', 'Gymnosperms',
                                     ifelse(class=='Polypodiopsida', 'Ferns', 
                                            'Lycophytes')))) %>% 
    filter(classLevels=='Angiosperms' & family %in% familiesToPlot) %>% 
    mutate(classLevels=factor(classLevels, levels=c('Lycophytes',
                                                    'Ferns', 
                                                    'Gymnosperms',
                                                    'Angiosperms'))) %>% 
    ggbarplot(x = "family", y = "n",
              fill = "classLevels",             
              color = "white",      
              drop= T,
              palette = "#2B83BA",   
              sort.val = "desc",         
              x.text.angle = 90,          
              legend = "none",
              ylab = 'Number of Records', xlab = 'Family') + 
    scale_y_continuous(breaks=scales::breaks_extended(8))+
    theme(text=element_text(family='Calibri', size = 10)) +
    facet_wrap(~classLevels, nrow=2, ncol=3, scales = "free", drop=TRUE))

2) Temporal coverage

The occurrence records included in Biodiversidata's plants database cover samples reported in Uruguay during the period of 1877–2020. How are these records distributed over time within the groups?

Biodiversidata_Plants %>% 
  mutate(classLevels=ifelse(class=='Magnoliopsida'|class=='Liliopsida', 'Angiosperms',
                            ifelse(class=='Gnetopsida'|class=='Pinopsida', 'Gymnosperms',
                                   ifelse(class=='Polypodiopsida', 'Ferns', 
                                          'Lycophytes')))) %>% 
  mutate(classLevels=factor(classLevels, levels=c('Lycophytes',
                                                  'Ferns', 
                                                  'Gymnosperms',
                                                  'Angiosperms'))) %>% 
  filter(family %in% familiesToPlot) %>% 
  ggplot(aes(x=year, y=family, color=classLevels)) +
  geom_point(show.legend = FALSE) +
  scale_colour_brewer(palette ='Spectral')+
  facet_grid(classLevels~., scales = "free", space= 'free_y', switch='x' ,drop=TRUE) +
  theme_bw() +
  theme(text=element_text(family='Calibri', size=10)) +
  theme(strip.text.y = element_text(size=10, angle=0)) +
  labs(x='Year', y='Family', color = '')

3) Geographic coverage

To finish let's map the records.

First, we need to download Uruguay's shapefile data with the level 1 for administrative boundaries. We can use the package geouy to download this data or GADM.

Departamentos <- load_geouy('Departamentos') # We will use this one
Uruguay <- getData("GADM", country="UY", level=1)

Also, we will need a grid of 25x25km unit-size to plot sampling effort (you can create one using the st_make_grid function).

Grid_UY25 <- st_read('Grid_UY25.shp')

Finally, we are going to convert the plants data into an sf object.

Biodiversidata_Plants_GIS <- Biodiversidata_Plants %>% 
  as.data.frame %>% 
  sf::st_as_sf(coords = c("decimalLongitude", "decimalLatitude")) %>% 
  st_set_crs(4326) %>% st_transform(32721) # this CRS and transformation is specifically for Uruguay

Sampling effort analysis

To measure sampling effort we perform a spatial filter using st_join, with the gridded map as a base to count the occurence records that are contained within each grid-cell.

Grid_UY25_Plants <- st_join(x=Grid_UY25, 
                                  y=(Biodiversidata_Plants_GIS %>% 
                                       select(species)), # to simplify we just keep the species field
                                  left=TRUE, join = st_contains) %>% 
  group_by(Id) %>% 
  summarise(samplingEffort=ifelse(n_distinct(species, na.rm = TRUE)==0, 0, n())) # records per grid

Now we map the occurrence records and the distribution of the sampling effort.

occurrenceRecords <- ggplot() +
  geom_sf(data = Departamentos, color='black', fill='white') +
  geom_sf(data = Biodiversidata_Plants_GIS, size=1, alpha=0.5) +
  theme_bw() +
  theme(text=element_text(size=10))

samplingEffort <- ggplot() + 
  geom_sf(data = Grid_UY25_Plants, aes(fill=samplingEffort)) +
  scale_fill_binned(n.breaks=5, type = "viridis", show.limits = TRUE) +
  geom_sf(data = Grid_UY25_Plants %>% filter(samplingEffort==0), aes(fill=N), fill='white', alpha=0.8)+
  geom_sf(data = Departamentos, color='black', fill=NA) +
  labs(fill = 'Sampling Effort') +
  theme_bw() +
  theme(text=element_text(size=10))

gridExtra::grid.arrange(occurrenceRecords, samplingEffort, nrow=1)

For the Data Paper we used Inkscape and ArcGis 10.5 to improve this maps.

And that's all !

If you use the data please cite as: Florencia Grattarola, Andrés González, Patricia Mai, Laura Cappuccio, César Fagúndez-Pachón, Florencia Rossi, Franco Teixeira de Mello, Lucía Urtado & Daniel Pincheira-Donoso. (2020). Biodiversidata: A novel dataset for the vascular plant species diversity in Uruguay. Biodiversity Data Journal 8: e56850. https://doi.org/10.3897/BDJ.8.e56850

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Plants Data Paper Scripts

Plants Data Paper Scripts

README_files/figure-markdown_strict

README_files/figure-markdown_strict

DataPaper_Scripts.Rmd

DataPaper_Scripts.Rmd

README.html

README.html

README.md

README.md

README.md

First, let's explore the data!

What's the total amount of records and species in the database?

General overview

1) Taxonomic coverage

What's the number of records/species/families per group?

2) Temporal coverage

3) Geographic coverage

Sampling effort analysis

And that's all !

Files

Plants Data Paper Scripts

Directory actions

More options

Directory actions

More options

Latest commit

History

Plants Data Paper Scripts

Folders and files

parent directory

First, let's explore the data!

What's the total amount of records and species in the database?

General overview

1) Taxonomic coverage

What's the number of records/species/families per group?

2) Temporal coverage

3) Geographic coverage

Sampling effort analysis

And that's all !