vignettes/rerddap.Rmd

---
title: rerddap introduction
author: Scott Chamberlain
date: "2022-09-30"
output: rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{rerddap introduction}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---


`rerddap` is a general purpose R client for working with ERDDAP servers. ERDDAP is a server built on top of OPenDAP, which serves some NOAA data. You can get gridded data ([griddap](https://upwell.pfeg.noaa.gov/erddap/griddap/documentation.html)), which lets you query from gridded datasets, or table data ([tabledap](https://upwell.pfeg.noaa.gov/erddap/tabledap/documentation.html)) which lets you query from tabular datasets. In terms of how we interface with them, there are similarties, but some differences too. We try to make a similar interface to both data types in `rerddap`.

## NetCDF

`rerddap` supports NetCDF format, and is the default when using the `griddap()` function. NetCDF is a binary file format, and will have a much smaller footprint on your disk than csv. The binary file format means it's harder to inspect, but the `ncdf4` package makes it easy to pull data out and write data back into a NetCDF file. Note the the file extension for NetCDF files is `.nc`. Whether you choose NetCDF or csv for small files won't make much of a difference, but will with large files.

## Caching

Data files downloaded are cached in a single hidden directory `~/.rerddap` on your machine. It's hidden so that you don't accidentally delete the data, but you can still easily delete the data if you like.

When you use `griddap()` or `tabledap()` functions, we construct a MD5 hash from the base URL, and any query parameters - this way each query is separately cached. Once we have the hash, we look in `~/.rerddap` for a matching hash. If there's a match we use that file on disk - if no match, we make a http request for the data to the ERDDAP server you specify.

## ERDDAP servers

You can get a data.frame of ERDDAP servers using the function `servers()`. The list of ERDDAP servers is drawn from the *Awesome ERDDAP* page maintained by the Irish Marine Institute .  If you know of more ERDDAP servers, follow the instructions on that page to add the server.

## Install

Stable version from CRAN


```r
install.packages("rerddap")
```

Or, the development version from GitHub


```r
remotes::install_github("ropensci/rerddap")
```


```r
library("rerddap")
```

## Search

First, you likely want to search for data, specify either `griddadp` or `tabledap`


```r
ed_search(query = 'size', which = "table")
#> # A tibble: 41 × 2
#>    title                                                                 datas…¹
#>    <chr>                                                                 <chr>  
#>  1 CCE Prey Size and Hard Part Size Regressions                          mmtdPr…
#>  2 CCE Teleost Prey Size and Hard Part Size Measurements                 mmtdTe…
#>  3 CalCOFI Larvae Sizes                                                  erdCal…
#>  4 Seabird Prey Size                                                     cciea_…
#>  5 CCE Non-Teleost Prey Size and Hard Part Size Measurements             mmtdNo…
#>  6 Channel Islands, Kelp Forest Monitoring, Size and Frequency, Natural… erdCin…
#>  7 File Names from the AWS S3 noaa-goes16 Bucket                         awsS3N…
#>  8 File Names from the AWS S3 noaa-goes17 Bucket                         awsS3N…
#>  9 PacIOOS Beach Camera 001: Waikiki, Oahu, Hawaii                       BEACHC…
#> 10 PacIOOS Beach Camera 003: Waimea Bay, Oahu, Hawaii                    BEACHC…
#> # … with 31 more rows, and abbreviated variable name ¹​dataset_id
```


```r
ed_search(query = 'size', which = "grid")
#> # A tibble: 54 × 2
#>    title                                                                 datas…¹
#>    <chr>                                                                 <chr>  
#>  1 Audio data from a local source.                                       testGr…
#>  2 Main Hawaiian Islands Multibeam Bathymetry Synthesis: 50-m Bathymetry hmrg_b…
#>  3 Main Hawaiian Islands Multibeam Bathymetry Synthesis: 50-m Bathymetr… hmrg_b…
#>  4 Coastal Upwelling Transport Index (CUTI), Daily                       erdCUT…
#>  5 SST smoothed frontal gradients                                        FRD_SS…
#>  6 Coastal Upwelling Transport Index (CUTI), Monthly                     erdCUT…
#>  7 SST smoothed frontal gradients, Lon0360                               FRD_SS…
#>  8 Biologically Effective Upwelling Transport Index (BEUTI), Daily       erdBEU…
#>  9 Biologically Effective Upwelling Transport Index (BEUTI), Monthly     erdBEU…
#> 10 monthly mean psi from the NCEP Reanalysis (psi.mon.ltm), 0001         noaa_p…
#> # … with 44 more rows, and abbreviated variable name ¹​dataset_id
```

There is now a convenience function to search over a list of ERDDAP servers,  designed to work with the function `servers()`


```r
global_search(query, server_list, which_service)
#> Error in check_arg(query, "character"): object 'query' not found
```

## Information

Then you can get information on a single dataset


```r
info('erdCalCOFIlrvsiz')
#> <ERDDAP info> erdCalCOFIlrvsiz 
#>  Base URL: https://upwell.pfeg.noaa.gov/erddap 
#>  Dataset Type: tabledap 
#>  Variables:  
#>      calcofi_species_code: 
#>          Range: 19, 946 
#>      common_name: 
#>      cruise: 
#>      itis_tsn: 
#>      larvae_10m2: 
...
```

## griddap (gridded) data

First, get information on a dataset to see time range, lat/long range, and variables.


```r
(out <- info('erdMBchla1day'))
#> <ERDDAP info> erdMBchla1day 
#>  Base URL: https://upwell.pfeg.noaa.gov/erddap 
#>  Dataset Type: griddap 
#>  Dimensions (range):  
#>      time: (2006-01-01T12:00:00Z, 2022-09-28T12:00:00Z) 
#>      altitude: (0.0, 0.0) 
#>      latitude: (-45.0, 65.0) 
#>      longitude: (120.0, 320.0) 
#>  Variables:  
#>      chlorophyll: 
#>          Units: mg m-3
```

Then query for gridded data using the `griddap()` function


```r
(res <- griddap(out,
  time = c('2015-01-01','2015-01-03'),
  latitude = c(14, 15),
  longitude = c(125, 126)
))
#> <ERDDAP griddap> erdMBchla1day
#>    Path: [~/Library/Caches/R/rerddap/4d844aa48552049c3717ac94ced5f9b8.nc]
#>    Last updated: [2022-09-30 09:34:02]
#>    File size:    [0.03 mb]
#>    Dimensions (dims/vars):   [4 X 1]
#>    Dim names: time, altitude, latitude, longitude
#>    Variable names: Chlorophyll Concentration in Sea Water
#>    data.frame (rows/columns):   [5043 X 5]
#> # A tibble: 5,043 × 5
#>    longitude latitude altitude time                 chlorophyll
#>        <dbl>    <dbl>    <dbl> <chr>                      <dbl>
#>  1      125        14        0 2015-01-01T12:00:00Z          NA
#>  2      125.       14        0 2015-01-01T12:00:00Z          NA
#>  3      125.       14        0 2015-01-01T12:00:00Z          NA
#>  4      125.       14        0 2015-01-01T12:00:00Z          NA
#>  5      125.       14        0 2015-01-01T12:00:00Z          NA
#>  6      125.       14        0 2015-01-01T12:00:00Z          NA
#>  7      125.       14        0 2015-01-01T12:00:00Z          NA
#>  8      125.       14        0 2015-01-01T12:00:00Z          NA
#>  9      125.       14        0 2015-01-01T12:00:00Z          NA
#> 10      125.       14        0 2015-01-01T12:00:00Z          NA
#> # … with 5,033 more rows
```

The output of `griddap()` is a list that you can explore further. Get the summary


```r
res$summary
#> $filename
#> [1] "~/Library/Caches/R/rerddap/4d844aa48552049c3717ac94ced5f9b8.nc"
#> 
#> $writable
#> [1] FALSE
#> 
#> $id
#> [1] 65536
#> 
#> $error
#> [1] FALSE
#> 
#> $safemode
#> [1] FALSE
#> 
...
```

Get the dimension variables


```r
names(res$summary$dim)
#> [1] "time"      "altitude"  "latitude"  "longitude"
```

Get the data.frame (beware: you may want to just look at the `head` of the data.frame if large)


```r
head(res$data)
#>   longitude latitude altitude                 time chlorophyll
#> 1   125.000       14        0 2015-01-01T12:00:00Z          NA
#> 2   125.025       14        0 2015-01-01T12:00:00Z          NA
#> 3   125.050       14        0 2015-01-01T12:00:00Z          NA
#> 4   125.075       14        0 2015-01-01T12:00:00Z          NA
#> 5   125.100       14        0 2015-01-01T12:00:00Z          NA
#> 6   125.125       14        0 2015-01-01T12:00:00Z          NA
```

## tabledap (tabular) data


```r
(out <- info('erdCalCOFIlrvsiz'))
#> <ERDDAP info> erdCalCOFIlrvsiz 
#>  Base URL: https://upwell.pfeg.noaa.gov/erddap 
#>  Dataset Type: tabledap 
#>  Variables:  
#>      calcofi_species_code: 
#>          Range: 19, 946 
#>      common_name: 
#>      cruise: 
#>      itis_tsn: 
#>      larvae_10m2: 
...
```


```r
(dat <- tabledap('erdCalCOFIlrvsiz', fields=c('latitude','longitude','larvae_size',
  'scientific_name'), 'time>=2011-01-01', 'time<=2011-12-31'))
#> <ERDDAP tabledap> erdCalCOFIlrvsiz
#>    Path: [~/Library/Caches/R/rerddap/db7389db5b5b0ed9c426d5c13bc43d18.csv]
#>    Last updated: [2022-09-30 09:34:05]
#>    File size:    [0.05 mb]
#> # A tibble: 1,304 × 4
#>    latitude  longitude  larvae_size scientific_name     
#>    <chr>     <chr>      <chr>       <chr>               
#>  1 32.956665 -117.305   4.5         Engraulis mordax    
#>  2 32.91     -117.4     5.0         Merluccius productus
#>  3 32.511665 -118.21167 2.0         Merluccius productus
#>  4 32.511665 -118.21167 3.0         Merluccius productus
#>  5 32.511665 -118.21167 5.5         Merluccius productus
#>  6 32.511665 -118.21167 6.0         Merluccius productus
#>  7 32.511665 -118.21167 2.8         Merluccius productus
#>  8 32.511665 -118.21167 3.0         Sardinops sagax     
#>  9 32.511665 -118.21167 5.0         Sardinops sagax     
#> 10 32.511665 -118.21167 2.5         Engraulis mordax    
#> # … with 1,294 more rows
```

Since both `griddap()` and `tabledap()` give back data.frame's, it's easy to do downstream manipulation. For example, we can use `dplyr` to filter, summarize, group, and sort:


```r
library("dplyr")
dat$larvae_size <- as.numeric(dat$larvae_size)
dat %>%
  group_by(scientific_name) %>%
  summarise(mean_size = mean(larvae_size)) %>%
  arrange(desc(mean_size))
#> # A tibble: 7 × 2
#>   scientific_name       mean_size
#>   <chr>                     <dbl>
#> 1 Anoplopoma fimbria        23.3 
#> 2 Engraulis mordax           9.26
#> 3 Sardinops sagax            7.28
#> 4 Merluccius productus       5.48
#> 5 Tactostoma macropus        5   
#> 6 Scomber japonicus          3.4 
#> 7 Trachurus symmetricus      3.29
```