---
title: rerddap introduction
author: Scott Chamberlain
date: "2022-09-30"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{rerddap introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
`rerddap` is a general purpose R client for working with ERDDAP servers. ERDDAP is a server built on top of OPeNDAP, which serves some NOAA data. You can get gridded data ([griddap](https://upwell.pfeg.noaa.gov/erddap/griddap/documentation.html)), which lets you query gridded datasets, or table data ([tabledap](https://upwell.pfeg.noaa.gov/erddap/tabledap/documentation.html)), which lets you query tabular datasets. The two interfaces have similarities, but also some differences; `rerddap` tries to provide a similar interface to both data types.
## NetCDF
`rerddap` supports the NetCDF format, which is the default when using the `griddap()` function. NetCDF is a binary file format, and will have a much smaller footprint on your disk than csv. The binary file format means it's harder to inspect, but the `ncdf4` package makes it easy to pull data out of and write data back into a NetCDF file. Note that the file extension for NetCDF files is `.nc`. Whether you choose NetCDF or csv won't make much of a difference for small files, but will for large files.
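As a rough illustration of inspecting a NetCDF file with `ncdf4`, the sketch below writes a tiny file and reads it back. It assumes the `ncdf4` package is installed (the code is skipped otherwise); the variable name, units, and values are made up for the example and do not come from any ERDDAP dataset.

```r
if (requireNamespace("ncdf4", quietly = TRUE)) {
  library(ncdf4)
  path <- tempfile(fileext = ".nc")

  # Define one dimension and one variable, then write made-up values
  lon <- ncdim_def("longitude", "degrees_east", c(125, 125.025, 125.05))
  chl <- ncvar_def("chlorophyll", "mg m-3", list(lon), missval = -999)
  nc <- nc_create(path, list(chl))
  ncvar_put(nc, chl, c(0.1, 0.2, 0.3))
  nc_close(nc)

  # Reopen the file and pull the data back out
  nc <- nc_open(path)
  print(names(nc$var))                    # variables stored in the file
  values <- ncvar_get(nc, "chlorophyll")  # data values
  lons <- ncvar_get(nc, "longitude")      # coordinate values
  nc_close(nc)
}
```

The same `nc_open()` and `ncvar_get()` calls should work on a `.nc` file downloaded by `griddap()`, given its path.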
## Caching
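Downloaded data files are cached in a single hidden directory `~/.rerddap` on your machine. It's hidden so that you don't accidentally delete the data, but you can still easily delete the data if you like. The exact location may differ by platform and package version; the example output later in this document shows files cached under `~/Library/Caches/R/rerddap`.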
When you use the `griddap()` or `tabledap()` functions, we construct an MD5 hash from the base URL and any query parameters, so that each query is cached separately. Once we have the hash, we look in the cache directory for a file with a matching hash. If there's a match, we use that file on disk; if not, we make an HTTP request for the data to the ERDDAP server you specify.
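The lookup described above can be sketched in base R. This is an illustration under stated assumptions, not `rerddap`'s internal code: `cache_key()` and `cached_path()` are hypothetical helpers, the query parameters are made up, and a temporary directory stands in for the real cache.

```r
# Hypothetical helper: derive an MD5 cache key from a base URL and query parameters
cache_key <- function(base_url, params) {
  # Sort parameters by name so argument order does not change the key
  params <- params[order(names(params))]
  query <- paste0(names(params), "=", unlist(params), collapse = "&")
  request <- paste0(base_url, "?", query)
  # tools::md5sum() hashes files, so write the request string to a temp file
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(request, tmp, sep = "")
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: where a request would live inside the cache directory
cached_path <- function(base_url, params, cache_dir, ext = "nc") {
  file.path(cache_dir, paste0(cache_key(base_url, params), ".", ext))
}

cache_dir <- file.path(tempdir(), "rerddap-demo")  # stand-in for the real cache
dir.create(cache_dir, showWarnings = FALSE)

path <- cached_path(
  "https://upwell.pfeg.noaa.gov/erddap/griddap/erdMBchla1day",
  list(latitude = "14:15", longitude = "125:126"),
  cache_dir
)

if (file.exists(path)) {
  message("cache hit: ", path)
} else {
  message("cache miss: would download to ", path)
}
```

Because the key depends on both the URL and the parameters, changing any part of the query produces a different file name, so each request gets its own cached file.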
## ERDDAP servers
You can get a data.frame of ERDDAP servers using the function `servers()`. The list of ERDDAP servers is drawn from the *Awesome ERDDAP* page maintained by the Irish Marine Institute. If you know of more ERDDAP servers, follow the instructions on that page to add them.
## Install
Stable version from CRAN
```r
install.packages("rerddap")
```
Or, the development version from GitHub
```r
remotes::install_github("ropensci/rerddap")
```
```r
library("rerddap")
```
## Search
First, you likely want to search for data, specifying either griddap (`which = "grid"`) or tabledap (`which = "table"`) datasets
```r
ed_search(query = 'size', which = "table")
#> # A tibble: 41 × 2
#> title datas…¹
#> <chr> <chr>
#> 1 CCE Prey Size and Hard Part Size Regressions mmtdPr…
#> 2 CCE Teleost Prey Size and Hard Part Size Measurements mmtdTe…
#> 3 CalCOFI Larvae Sizes erdCal…
#> 4 Seabird Prey Size cciea_…
#> 5 CCE Non-Teleost Prey Size and Hard Part Size Measurements mmtdNo…
#> 6 Channel Islands, Kelp Forest Monitoring, Size and Frequency, Natural… erdCin…
#> 7 File Names from the AWS S3 noaa-goes16 Bucket awsS3N…
#> 8 File Names from the AWS S3 noaa-goes17 Bucket awsS3N…
#> 9 PacIOOS Beach Camera 001: Waikiki, Oahu, Hawaii BEACHC…
#> 10 PacIOOS Beach Camera 003: Waimea Bay, Oahu, Hawaii BEACHC…
#> # … with 31 more rows, and abbreviated variable name ¹dataset_id
```
```r
ed_search(query = 'size', which = "grid")
#> # A tibble: 54 × 2
#> title datas…¹
#> <chr> <chr>
#> 1 Audio data from a local source. testGr…
#> 2 Main Hawaiian Islands Multibeam Bathymetry Synthesis: 50-m Bathymetry hmrg_b…
#> 3 Main Hawaiian Islands Multibeam Bathymetry Synthesis: 50-m Bathymetr… hmrg_b…
#> 4 Coastal Upwelling Transport Index (CUTI), Daily erdCUT…
#> 5 SST smoothed frontal gradients FRD_SS…
#> 6 Coastal Upwelling Transport Index (CUTI), Monthly erdCUT…
#> 7 SST smoothed frontal gradients, Lon0360 FRD_SS…
#> 8 Biologically Effective Upwelling Transport Index (BEUTI), Daily erdBEU…
#> 9 Biologically Effective Upwelling Transport Index (BEUTI), Monthly erdBEU…
#> 10 monthly mean psi from the NCEP Reanalysis (psi.mon.ltm), 0001 noaa_p…
#> # … with 44 more rows, and abbreviated variable name ¹dataset_id
```
There is now a convenience function to search over a list of ERDDAP servers, designed to work with the function `servers()`
```r
# Placeholder inputs so the call runs; any query string and vector of server URLs will do
query <- "size"
server_list <- servers()$url[1:2]
global_search(query, server_list, which_service = "tabledap")
```
## Information
Then you can get information on a single dataset
```r
info('erdCalCOFIlrvsiz')
#> <ERDDAP info> erdCalCOFIlrvsiz
#> Base URL: https://upwell.pfeg.noaa.gov/erddap
#> Dataset Type: tabledap
#> Variables:
#> calcofi_species_code:
#> Range: 19, 946
#> common_name:
#> cruise:
#> itis_tsn:
#> larvae_10m2:
...
```
## griddap (gridded) data
First, get information on a dataset to see time range, lat/long range, and variables.
```r
(out <- info('erdMBchla1day'))
#> <ERDDAP info> erdMBchla1day
#> Base URL: https://upwell.pfeg.noaa.gov/erddap
#> Dataset Type: griddap
#> Dimensions (range):
#> time: (2006-01-01T12:00:00Z, 2022-09-28T12:00:00Z)
#> altitude: (0.0, 0.0)
#> latitude: (-45.0, 65.0)
#> longitude: (120.0, 320.0)
#> Variables:
#> chlorophyll:
#> Units: mg m-3
```
Then query for gridded data using the `griddap()` function
```r
(res <- griddap(out,
  time = c('2015-01-01','2015-01-03'),
  latitude = c(14, 15),
  longitude = c(125, 126)
))
#> <ERDDAP griddap> erdMBchla1day
#> Path: [~/Library/Caches/R/rerddap/4d844aa48552049c3717ac94ced5f9b8.nc]
#> Last updated: [2022-09-30 09:34:02]
#> File size: [0.03 mb]
#> Dimensions (dims/vars): [4 X 1]
#> Dim names: time, altitude, latitude, longitude
#> Variable names: Chlorophyll Concentration in Sea Water
#> data.frame (rows/columns): [5043 X 5]
#> # A tibble: 5,043 × 5
#> longitude latitude altitude time chlorophyll
#> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 125 14 0 2015-01-01T12:00:00Z NA
#> 2 125. 14 0 2015-01-01T12:00:00Z NA
#> 3 125. 14 0 2015-01-01T12:00:00Z NA
#> 4 125. 14 0 2015-01-01T12:00:00Z NA
#> 5 125. 14 0 2015-01-01T12:00:00Z NA
#> 6 125. 14 0 2015-01-01T12:00:00Z NA
#> 7 125. 14 0 2015-01-01T12:00:00Z NA
#> 8 125. 14 0 2015-01-01T12:00:00Z NA
#> 9 125. 14 0 2015-01-01T12:00:00Z NA
#> 10 125. 14 0 2015-01-01T12:00:00Z NA
#> # … with 5,033 more rows
```
The output of `griddap()` is a list that you can explore further. Get the summary
```r
res$summary
#> $filename
#> [1] "~/Library/Caches/R/rerddap/4d844aa48552049c3717ac94ced5f9b8.nc"
#>
#> $writable
#> [1] FALSE
#>
#> $id
#> [1] 65536
#>
#> $error
#> [1] FALSE
#>
#> $safemode
#> [1] FALSE
#>
...
```
Get the dimension variables
```r
names(res$summary$dim)
#> [1] "time" "altitude" "latitude" "longitude"
```
Get the data.frame (beware: you may want to just look at the `head` of the data.frame if large)
```r
head(res$data)
#> longitude latitude altitude time chlorophyll
#> 1 125.000 14 0 2015-01-01T12:00:00Z NA
#> 2 125.025 14 0 2015-01-01T12:00:00Z NA
#> 3 125.050 14 0 2015-01-01T12:00:00Z NA
#> 4 125.075 14 0 2015-01-01T12:00:00Z NA
#> 5 125.100 14 0 2015-01-01T12:00:00Z NA
#> 6 125.125 14 0 2015-01-01T12:00:00Z NA
```
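Because the result is an ordinary data.frame, typical first steps are parsing the time strings and checking how many cells are missing. The sketch below uses a small stand-in data.frame with the same column names as `res$data`; the chlorophyll values are made up for illustration and are not real query results.

```r
# Toy stand-in for res$data: same column names, made-up values
toy <- data.frame(
  longitude = c(125, 125.025, 125, 125.025),
  latitude = c(14, 14, 14, 14),
  altitude = 0,
  time = c("2015-01-01T12:00:00Z", "2015-01-01T12:00:00Z",
           "2015-01-02T12:00:00Z", "2015-01-02T12:00:00Z"),
  chlorophyll = c(NA, 0.10, 0.20, 0.40),
  stringsAsFactors = FALSE
)

# Parse the ISO 8601 time strings into POSIXct (UTC) and derive a day label
toy$time <- as.POSIXct(toy$time, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
toy$day <- format(toy$time, "%Y-%m-%d")

# Count missing cells, then average per day (the formula interface drops NA rows)
n_missing <- sum(is.na(toy$chlorophyll))
daily <- aggregate(chlorophyll ~ day, data = toy, FUN = mean)
print(daily)
```

Satellite-derived grids often contain many `NA` cells (for example under cloud cover), so it is worth checking the missing count before summarising.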
## tabledap (tabular) data
```r
(out <- info('erdCalCOFIlrvsiz'))
#> <ERDDAP info> erdCalCOFIlrvsiz
#> Base URL: https://upwell.pfeg.noaa.gov/erddap
#> Dataset Type: tabledap
#> Variables:
#> calcofi_species_code:
#> Range: 19, 946
#> common_name:
#> cruise:
#> itis_tsn:
#> larvae_10m2:
...
```
```r
(dat <- tabledap('erdCalCOFIlrvsiz', fields=c('latitude','longitude','larvae_size',
   'scientific_name'), 'time>=2011-01-01', 'time<=2011-12-31'))
#> <ERDDAP tabledap> erdCalCOFIlrvsiz
#> Path: [~/Library/Caches/R/rerddap/db7389db5b5b0ed9c426d5c13bc43d18.csv]
#> Last updated: [2022-09-30 09:34:05]
#> File size: [0.05 mb]
#> # A tibble: 1,304 × 4
#> latitude longitude larvae_size scientific_name
#> <chr> <chr> <chr> <chr>
#> 1 32.956665 -117.305 4.5 Engraulis mordax
#> 2 32.91 -117.4 5.0 Merluccius productus
#> 3 32.511665 -118.21167 2.0 Merluccius productus
#> 4 32.511665 -118.21167 3.0 Merluccius productus
#> 5 32.511665 -118.21167 5.5 Merluccius productus
#> 6 32.511665 -118.21167 6.0 Merluccius productus
#> 7 32.511665 -118.21167 2.8 Merluccius productus
#> 8 32.511665 -118.21167 3.0 Sardinops sagax
#> 9 32.511665 -118.21167 5.0 Sardinops sagax
#> 10 32.511665 -118.21167 2.5 Engraulis mordax
#> # … with 1,294 more rows
```
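It can help to see how the `fields` and the constraint strings in the call above map onto an ERDDAP tabledap request: the requested variables are comma-separated, and each constraint is appended with `&`. The helper below is hypothetical and only sketches the general shape of such a URL; it is not how `rerddap` builds requests internally, and a real request would also need special characters such as `>` and `<` percent-encoded.

```r
# Hypothetical helper, not part of rerddap: assemble a tabledap request URL
tabledap_url <- function(base_url, dataset_id, fields, constraints, fmt = "csv") {
  # Variables are comma-separated; constraints are joined with "&"
  query <- paste(c(paste(fields, collapse = ","), constraints), collapse = "&")
  paste0(base_url, "/tabledap/", dataset_id, ".", fmt, "?", query)
}

u <- tabledap_url(
  base_url = "https://upwell.pfeg.noaa.gov/erddap",
  dataset_id = "erdCalCOFIlrvsiz",
  fields = c("latitude", "longitude", "larvae_size", "scientific_name"),
  constraints = c("time>=2011-01-01", "time<=2011-12-31")
)
print(u)
```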
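Since both `griddap()` and `tabledap()` give back data.frames, it's easy to do downstream manipulation. Note that the `tabledap()` columns above come back as character, so we convert `larvae_size` to numeric first. For example, we can use `dplyr` to group, summarise, and sort: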
```r
library("dplyr")
dat$larvae_size <- as.numeric(dat$larvae_size)
dat %>%
  group_by(scientific_name) %>%
  summarise(mean_size = mean(larvae_size)) %>%
  arrange(desc(mean_size))
#> # A tibble: 7 × 2
#> scientific_name mean_size
#> <chr> <dbl>
#> 1 Anoplopoma fimbria 23.3
#> 2 Engraulis mordax 9.26
#> 3 Sardinops sagax 7.28
#> 4 Merluccius productus 5.48
#> 5 Tactostoma macropus 5
#> 6 Scomber japonicus 3.4
#> 7 Trachurus symmetricus 3.29
```