
Use CSV format for API access in R package #271

Merged · 8 commits · Nov 30, 2020
Conversation

capnrefsmmat (Contributor)

These changes make the R package download data from the API using its CSV format, which is much more compact than JSON. The PR also improves the tests of the data-fetching behavior, which gives me more confidence in its correctness.

There would be nominal bandwidth improvements, but the main benefit is for #266: I plan to use httptest's ability to make vignettes use static data instead of requesting data from the API. The JSON files for big API downloads are huge, while the CSVs will be much more compact, preventing the repo from bloating.

This depends on cmu-delphi/delphi-epidata#281; the servers should support compression for CSVs before we start directing clients to the CSV downloads.

@sgsmob (Contributor) left a comment

Many of my comments stem from not fully understanding what is going on here, but this is in pretty good shape.

R-packages/covidcast/R/covidcast.R
as_of = as_of,
issues = issues,
lag = lag)
res <- covidcast(data_source = data_source,
Contributor:

Not really keen on the variable name "res" but I don't have a better, more descriptive suggestion off the top of my head.

Contributor Author:

How about response?

summary <- sprintf(
"Fetched day %s: %s, %s, num_entries = %s",
"Fetched day %s: num_entries = %s",
Contributor:

Does the change to this line represent a change to the semantics of the output or is it just a change in compatibility between the two object types? That is to say, the old dat[[i]] has attributes result, message, and epidata--does the new dat[[i]] have something different?

Contributor Author:

Good question, and perhaps we should document this more clearly in a comment. The old format would return error codes in the JSON. The CSV format does not have a field for error messages; instead it uses HTTP status codes, so the

 httr::stop_for_status(response, task = "fetch data from API")

line in .request will do the main error reporting if we don't get HTTP 200 back.
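
As a minimal sketch of this status-code-based error handling (the URL is a placeholder, not the real API endpoint):

```r
library(httr)

# Illustrative request; the URL here is hypothetical.
response <- GET("https://example.com/api/covidcast.csv")

# On any non-2xx status this raises an R error such as
# "fetch data from API failed [404]"; on success it returns invisibly.
stop_for_status(response, task = "fetch data from API")
```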

)
}

if (nrow(dat[[i]]) > 0 && !identical("*", geo_value)) {
Contributor:

Why are we using identical() and not just ==?

Contributor Author:

If geo_value is a vector, "*" == geo_value will be a vector of logicals instead of one boolean value.
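
A quick illustration of the difference (not from the PR itself):

```r
geo_value <- c("pa", "ny")

"*" == geo_value           # FALSE FALSE: a logical vector, one element per entry
identical("*", geo_value)  # FALSE: always a single TRUE/FALSE

identical("*", "*")        # TRUE
```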

}

if (nrow(dat[[i]]) > 0 && !identical("*", geo_value)) {
returned_geo_values <- dat[[i]]$geo_value
Contributor:

Are there multiple geovalues here? Should this just be renamed returned_geo_value?

Contributor Author:

Yes, there are multiple: dat[[i]]$geo_value is a column of a data frame and may contain multiple entries.

@@ -645,7 +651,16 @@ covidcast <- function(data_source, signal, time_type, geo_type, time_values,
}

# Make the API call
return(.request(params))
res <- .request(params)
Contributor:

Still not thrilled with res.

@@ -0,0 +1,3 @@
geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size
01001,bar,20200110,,20200111,1,91.2,0.8,114.2
01002,bar,20200110,,20200111,1,99.1,0.2,217.8
Contributor:

Add a newline here.

@@ -0,0 +1,3 @@
geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size
31001,bar,20200112,,20200113,1,81.2,0.8,314.2
31002,bar,20200112,,20200113,1,89.1,0.2,417.8
Contributor:

Add a trailing newline here too.

test_that("covidcast_signal stops when end_day < start_day", {
# reusing api.php-da6974.json
# reusing api.php-d2e163.json for metadata
Contributor:

Is this still json?

Contributor Author:

Good catch; the filename was wrong too (the tooling for httptest filename tracking is pretty clunky).

@capnrefsmmat capnrefsmmat removed the request for review from ryantibs November 12, 2020 16:33
@capnrefsmmat (Contributor Author)

@sgsmob Planning question. This PR, #266, and #216 together represent a decent number of improvements and new features. It may be worth designating these as a new release.

Since we have people use devtools::install_github(), merging these PRs makes them instantly available to new users. Do you think we should merge these PRs in a coordinated way? Just dribble them out? Get formal and have a development branch that we merge to main when we want to "release"?

@sgsmob (Contributor) commented Nov 12, 2020

I don't have a good intuition around releases. @benjaminysmith and @tildechris might have a better sense for it or at least some insight from the practices we have adopted in engineering.

@capnrefsmmat (Contributor Author)

I propose something like

  1. I create a release branch
  2. I merge these PRs into the release branch
  3. We make a PR for it so CI can test
  4. Merge in one big batch when it's done and we're ready (including changelog, etc.)

@chinandrew (Collaborator)

I propose something like

  1. I create a release branch
  2. I merge these PRs into the release branch
  3. We make a PR for it so CI can test
  4. Merge in one big batch when it's done and we're ready (including changelog, etc.)

We were discussing branching strategies for the indicator repos and landed on something like gitflow, where everyone merges continuously into one branch, and releases are cut and merged into the production branch whenever we want. Could see a similar thing working here too.

@benjaminysmith (Contributor) left a comment

Conceptually this makes sense, but I am a bit nervous that this is a pretty substantial change to the way the API client works -- how can we be confident that it works as a drop-in replacement? Testing in our own code? (That might be fine -- just calling it out.)

Another option would be to have this as an alternative method in parallel at first (maybe that can be enabled through a flag or parameter) and then switch over to using it by default once it has been tested in our code. What do you think?


# geo_value must be read as character so FIPS codes are returned as character,
# not numbers (with leading 0s potentially removed)
return(read.csv(textConnection(response), stringsAsFactors = FALSE,
Contributor:

Are there any possible issues with large data sizes here? (which might work differently with CSV vs json)

Contributor Author:

textConnection does copy the response, and I don't know how jsonlite's performance compares to R's native CSV reader. Not sure how we'd test this, though; maybe benchmarking a call that returns a particularly large dataset?

My guess is that the smaller size of CSV responses outweighs any change in performance, but that's just a guess.
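
One way to test this could be a microbenchmark on equivalent payloads; a sketch under assumed inputs (`csv_text` and `json_text` are placeholders for real API responses, and the `colClasses` argument follows the geo_value-as-character comment above):

```r
library(microbenchmark)
library(jsonlite)

# Compare base R's CSV parser against jsonlite on the same data.
microbenchmark(
  csv  = read.csv(textConnection(csv_text), stringsAsFactors = FALSE,
                  colClasses = c(geo_value = "character")),
  json = fromJSON(json_text),
  times = 20
)
```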

@capnrefsmmat (Contributor Author)

Conceptually this makes sense, but I am a bit nervous that this is a pretty substantial change to the way the API client works -- how can we be confident that it works as a drop-in replacement? Testing in our own code? (that might be fine -- just calling it out)

Yeah, I'm nervous too. That's why I've added more tests here, and also verified the vignettes run. But since we're brand-new to testing for this package, it's entirely possible there's something I've missed.

I think I'd be most comfortable if we prepare a release branch and have some Delphi members use it for their work for a week and see if any bugs arise. Any objections to the idea of preparing a release branch for that purpose? We could move to that model for future work, where we always merge to a dev branch and beta-test before release. I think that's in line with @chinandrew's suggestion.

@benjaminysmith (Contributor) left a comment

Looks good to me, as it sounds like you are considering safe rollout options. I think the main concerns are 1) detecting problems when this goes out (I suspect this would rely on user feedback); 2) having a way for users to roll back. Since this is installed through devtools, is there an easy way to revert to the previous version? If not, it might be worth putting this behind a parameter or flag.

@capnrefsmmat (Contributor Author)

devtools::install_github takes a ref argument, which can be a branch, commit, or tag, so we can always point back to a prior commit or to a different branch if needed.
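
For example (the repository path is assumed from the file paths in this PR; the commit hash is the merge commit of this PR):

```r
# Install from a specific branch...
devtools::install_github("cmu-delphi/covidcast",
                         subdir = "R-packages/covidcast",
                         ref = "r-pkg-devel")

# ...or pin to a commit to roll back to a known-good version.
devtools::install_github("cmu-delphi/covidcast",
                         subdir = "R-packages/covidcast",
                         ref = "5890c7f")
```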

I'll need to point this branch at r-pkg-devel and then fix some tests after it goes in, probably after #275.

The CSV format is much more compact (does not repeat field names for
every row), and more naturally fits with R anyway.

Alter the relevant tests to serve CSVs. I've verified all vignettes
build with these changes.

It should not be possible to have two signals with the same source,
signal, time_type, and geo_type. This will cause a query for that signal
to have two metadata rows attached to the covidcast_signal data frame,
which will confuse everything.

Fetching multiple days is important.
@capnrefsmmat capnrefsmmat changed the base branch from main to r-pkg-devel November 25, 2020 17:02
@capnrefsmmat capnrefsmmat removed the request for review from krivard November 25, 2020 17:02
@capnrefsmmat (Contributor Author)

I believe this is ready for re-review. Changes since last time:

cc @JedGrabman, since this touches your batching code

@capnrefsmmat capnrefsmmat merged commit 5890c7f into r-pkg-devel Nov 30, 2020
@statsmaths statsmaths deleted the r-csv-api branch March 19, 2021 20:23