Faster R package vignettes #266

Draft: wants to merge 12 commits into main
32 changes: 31 additions & 1 deletion R-packages/covidcast/DEVELOP.md
@@ -6,7 +6,7 @@ A short checklist for submitting pull requests:
2. If you have added any new features (new functions, new options, etc.), add a
brief description to `NEWS.md` to the next listed version number. Also ensure
that new functions or datasets are listed in the reference in `_pkgdown.yml`
so they appear in a good place in the documentation website.
so they appear in a good place in the function reference page.
3. If you changed any documentation, rebuild the documentation with
`devtools::document()` and then `pkgdown::build_site()`. (This can be slow,
because our vignettes take a long time to build.)
@@ -25,6 +25,17 @@ when you open a pull request, as part of running `R CMD check`. Failed tests and
check errors will both result in the build failing and errors being visible in
the pull request.

### Testing API access

Unit tests should *not* require Internet access. This poses a problem for
testing functions like `covidcast_signal()`. It is also helpful to be able to
test against custom responses from the API server.

The [httptest package](https://enpiar.com/r/httptest/index.html) solves these
problems by hooking into the `httr` package and replacing API requests with the
contents of specially named files. See `tests/testthat/test-covidcast.R` for
details and examples.
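
As a rough illustration (not taken from our test suite; the signal, dates, and
expectations here are made up), an httptest-backed test looks roughly like
this:

```r
library(httptest)
library(testthat)
# covidcast_signal() comes from this package, loaded by devtools::test()

# with_mock_api() intercepts httr requests and serves responses from recorded
# fixture files instead of contacting the live API.
with_mock_api({
  test_that("covidcast_signal() parses a recorded API response", {
    df <- covidcast_signal("fb-survey", "smoothed_cli",
                           start_day = "2020-05-01", end_day = "2020-05-01")
    expect_s3_class(df, "data.frame")
    expect_true(all(c("geo_value", "time_value", "value") %in% names(df)))
  })
})
```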

### Testing plots

We test our plots and maps with the [vdiffr](https://github.com/r-lib/vdiffr)
@@ -78,3 +89,22 @@ After changing a vignette or documentation, you'll need to rebuild the
documentation. Use `devtools::document()` to do this. Then
`pkgdown::build_site(".")` (from within the package directory) will rebuild the
HTML documentation site.
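
Concretely, from within `R-packages/covidcast/`, the rebuild is:

```r
devtools::document()      # regenerate man/ pages and NAMESPACE from roxygen comments
pkgdown::build_site(".")  # rebuild the HTML documentation site (slow; builds vignettes)
```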

### Vignettes

Our vignettes can take quite a while to render; much of this time is spent
downloading data from the COVIDcast API. This makes the package inconvenient to
install, takes up time on our CI servers, and may cause issues when we
eventually submit to CRAN.

We therefore use [httptest's vignette
features](https://enpiar.com/r/httptest/articles/vignettes.html) to record API
requests and serve them from static JSON files in the repository, instead of
making live API queries. This cuts rendering time dramatically. See the
httptest documentation for more details on how this works, and ensure that any
new vignettes use this feature.
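
The httptest article linked above has the full details; as a rough sketch (the
chunk placement and fixture directory name below are illustrative, not copied
from our vignettes), a vignette using this feature wraps its body like so:

```r
library(httptest)

# First R chunk of the vignette: replay recorded responses from this directory.
# If the directory does not exist yet, httptest records live responses into it.
start_vignette("plotting-signals")

# ... the vignette's usual code, e.g. calls to covidcast_signal(), goes here ...

# Last R chunk of the vignette:
end_vignette()
```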

To re-download all data for the vignettes, simply delete the corresponding
folder in `vignettes/` and re-build the vignette. For example, delete
`vignettes/plotting-signals/` and then rebuild `plotting-signals.Rmd` to have
httptest re-download and save the data.
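
For instance (a sketch; `pkgdown::build_article()` is just one way to rebuild a
single vignette, and the names below come from the example above):

```r
# Remove the recorded responses so httptest falls back to live API queries,
# then rebuild the article so the fresh responses are saved again.
unlink("vignettes/plotting-signals", recursive = TRUE)
pkgdown::build_article("plotting-signals")
```
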
115 changes: 66 additions & 49 deletions R-packages/covidcast/R/covidcast.R
@@ -437,18 +437,21 @@ covidcast_signals <- function(data_source, signal,
#'
#' @export
covidcast_meta <- function() {
meta <- .request(list(source='covidcast_meta', cached="true"))
meta <- .request(
list(source = "covidcast_meta",
format = "csv"))

if (meta$message != "success") {
abort(paste0("Failed to obtain metadata: ", meta$message, "."),
err_msg = meta$message,
class = "covidcast_meta_fetch_failed")
if (nchar(meta) == 0) {
abort("Failed to obtain metadata", class = "covidcast_meta_fetch_failed")
}

meta <- meta$epidata %>%
dplyr::mutate(min_time = as.Date(as.character(.data$min_time), format = "%Y%m%d"),
max_time = as.Date(as.character(.data$max_time), format = "%Y%m%d"),
max_issue = as.Date(as.character(.data$max_issue), format = "%Y%m%d"))
meta <- read.csv(textConnection(meta), stringsAsFactors = FALSE) %>%
dplyr::mutate(min_time = as.Date(as.character(.data$min_time),
format = "%Y%m%d"),
max_time = as.Date(as.character(.data$max_time),
format = "%Y%m%d"),
max_issue = as.Date(as.character(.data$max_issue),
format = "%Y%m%d"))

class(meta) <- c("covidcast_meta", "data.frame")
return(meta)
@@ -529,7 +532,7 @@ covidcast_days <- function(data_source, signal, start_day, end_day, geo_type,
for (i in seq(ndays + 1)) {
query_day <- start_day + i - 1
day_str <- date_to_string(query_day)
dat[[i]] <- covidcast(data_source = data_source,
response <- covidcast(data_source = data_source,
signal = signal,
time_type = "day",
geo_type = geo_type,
@@ -538,49 +541,50 @@ covidcast_days <- function(data_source, signal, start_day, end_day, geo_type,
as_of = as_of,
issues = issues,
lag = lag)

if (is.null(response)) {
warn(paste0("Fetching ", signal, " from ", data_source, " for ",
query_day, " in geography '", geo_value, "': no results"),
data_source = data_source,
signal = signal,
day = query_day,
geo_value = geo_value,
class = "covidcast_fetch_failed")

next
}

dat[[i]] <- response

summary <- sprintf(
"Fetched day %s: %s, %s, num_entries = %s",
"Fetched day %s: num_entries = %s",
query_day,
dat[[i]]$result,
dat[[i]]$message,
nrow(dat[[i]]$epidata)
nrow(dat[[i]])
)
if (length(summary) != 0) {
message(summary)
}
if (dat[[i]]$message == "success") {
returned_geo_values <- dat[[i]]$epidata$geo_value
if (!identical("*", geo_value)) {
missed_geos <- setdiff(tolower(geo_value),
tolower(returned_geo_values))
if (length(missed_geos) > 0) {
missed_geos_str <- paste0(missed_geos, collapse = ", ")
warn(sprintf("Data not fetched for some geographies on %s: %s",
query_day, missed_geos_str),
data_source = data_source,
signal = signal,
day = query_day,
geo_value = geo_value,
api_msg = dat[[i]]$message,
class = "covidcast_missing_geo_values"
)
}

if (nrow(dat[[i]]) > 0 && !identical("*", geo_value)) {
returned_geo_values <- dat[[i]]$geo_value
missed_geos <- setdiff(tolower(geo_value),
tolower(returned_geo_values))

if (length(missed_geos) > 0) {
missed_geos_str <- paste0(missed_geos, collapse = ", ")
warn(sprintf("Data not fetched for some geographies on %s: %s",
query_day, missed_geos_str),
data_source = data_source,
signal = signal,
day = query_day,
geo_value = geo_value,
class = "covidcast_missing_geo_values"
)
}
} else {
warn(paste0("Fetching ", signal, " from ", data_source, " for ",
query_day, " in geography '", geo_value, "': ",
dat[[i]]$message),
data_source = data_source,
signal = signal,
day = query_day,
geo_value = geo_value,
api_msg = dat[[i]]$message,
class = "covidcast_fetch_failed")
}
}

df <- dat %>%
purrr::map("epidata") %>% # just want $epidata part
purrr::map(purrr::compact) %>% # remove the list elements that are NULL
dplyr::bind_rows() # make this into a data frame

@@ -604,22 +608,24 @@ covidcast_days <- function(data_source, signal, start_day, end_day, geo_type,
covidcast <- function(data_source, signal, time_type, geo_type, time_values,
geo_value, as_of, issues, lag) {
# Check parameters
if(missing(data_source) || missing(signal) || missing(time_type) ||
if (missing(data_source) || missing(signal) || missing(time_type) ||
missing(geo_type) || missing(time_values) || missing(geo_value)) {
stop("`data_source`, `signal`, `time_type`, `geo_type`, `time_values`, ",
"and `geo_value` are all required.")
}

# Set up request
params <- list(
source = 'covidcast',
source = "covidcast",
data_source = data_source,
signal = signal,
time_type = time_type,
geo_type = geo_type,
time_values = .list(time_values),
geo_value = geo_value
geo_value = geo_value,
format = "csv"
)

if (length(params$geo_value) > 1) {
params$geo_values <- paste0(params$geo_value, collapse = ",") #convert to string
params$geo_value <- NULL
@@ -644,8 +650,19 @@ covidcast <- function(data_source, signal, time_type, geo_type, time_values,
params$lag <- lag
}

# Make the API call
return(.request(params))
# Make the API call. If the API returns a non-200 status code, indicating e.g.
# a database error, .request() raises an error. It returns an empty string if
# there are no results for our query.
response <- .request(params)
if (nchar(response) == 0) {
# empty if no results
return(NULL)
}

# geo_value must be read as character so FIPS codes are returned as character,
# not numbers (with leading 0s potentially removed)
return(read.csv(textConnection(response), stringsAsFactors = FALSE,
colClasses = c("geo_value" = "character")))
}

# Helper function to cast values and/or ranges to strings
@@ -673,8 +690,8 @@ covidcast <- function(data_source, signal, time_type, geo_type, time_values,

httr::stop_for_status(response, task = "fetch data from API")

return(jsonlite::fromJSON(httr::content(response, as = "text",
encoding = "utf-8")))
return(httr::content(response, as = "text",
encoding = "utf-8"))
}

# This is the date format expected by the API
12 changes: 6 additions & 6 deletions R-packages/covidcast/R/utils.R
@@ -15,9 +15,9 @@ latest_issue <- function(df) {
attrs <- attrs[!(names(attrs) %in% c("row.names", "names"))]

df <- df %>%
dplyr::group_by(.data$geo_value, .data$time_value) %>%
dplyr::filter(.data$issue == max(.data$issue)) %>%
dplyr::ungroup()
dplyr::arrange(dplyr::desc(.data$issue)) %>%
dplyr::distinct(.data$geo_value, .data$time_value,
.keep_all = TRUE)

attributes(df) <- c(attributes(df), attrs)

@@ -41,9 +41,9 @@ earliest_issue <- function(df) {
attrs <- attrs[!(names(attrs) %in% c("row.names", "names"))]

df <- df %>%
dplyr::group_by(.data$geo_value, .data$time_value) %>%
dplyr::filter(.data$issue == min(.data$issue)) %>%
dplyr::ungroup()
dplyr::arrange(.data$issue) %>%
dplyr::distinct(.data$geo_value, .data$time_value,
.keep_all = TRUE)

attributes(df) <- c(attributes(df), attrs)

7 changes: 7 additions & 0 deletions R-packages/covidcast/_pkgdown.yml
@@ -9,6 +9,13 @@ home:
- text: View the COVIDcast map
href: https://covidcast.cmu.edu/

repo:
url:
home: https://github.com/cmu-delphi/covidcast/tree/main/R-packages/covidcast
source: https://github.com/cmu-delphi/covidcast/blob/main/R-packages/covidcast/
issue: https://github.com/cmu-delphi/covidcast/issues
user: https://github.com/

reference:
- title: Fetch data
desc: Fetch signals and metadata from the COVIDcast API
@@ -0,0 +1,2 @@
geo_value,signal,time_value,issue,lag,value,stderr,sample_size
01000,bar-not-found,20200101,20200102,1,1.0,0.1,2.0
@@ -0,0 +1,3 @@
geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size
01001,bar,20200110,,20200111,1,91.2,0.8,114.2
01002,bar,20200110,,20200111,1,99.1,0.2,217.8
@@ -0,0 +1,3 @@
signal,geo_value,value,time_value,issue,lag,sample_size,stderr
bar,pa,1,20200101,20200101,0,1,1
bar,tx,1,20200101,20200101,0,1,1

Six files were deleted.

@@ -0,0 +1,3 @@
data_source,signal,time_type,geo_type,min_time,max_time,min_value,max_value,max_issue
foo,bar,day,county,20200101,20200102,0,10,20200404
foo,bar,day,state,20201002,20201003,0,10,20201101