
Use CSV format for API access in R package #271

Merged · 8 commits · Nov 30, 2020
Conversation

capnrefsmmat (Contributor)

These changes make the R package download data from the API using its CSV format, which is much more compact than JSON. The PR also improves the tests of the data-fetching behavior, which gives me more confidence in its correctness.

There would be nominal bandwidth improvements, but the main benefit is for #266: I plan to use httptest's ability to make vignettes use static data instead of requesting data from the API. The JSON files for big API downloads are huge, while the CSVs will be much more compact, preventing the repo from bloating.

This depends on cmu-delphi/delphi-epidata#281; the servers should support compression for CSVs before we start directing clients to the CSV downloads.

@sgsmob (Contributor) left a comment

Many of my comments stem from not fully understanding what is going on here, but this is in pretty good shape.

R-packages/covidcast/R/covidcast.R
as_of = as_of,
issues = issues,
lag = lag)
res <- covidcast(data_source = data_source,
Contributor:

Not really keen on the variable name "res" but I don't have a better, more descriptive suggestion off the top of my head.

Contributor Author:

How about response?

summary <- sprintf(
"Fetched day %s: %s, %s, num_entries = %s",
"Fetched day %s: num_entries = %s",
Contributor:

Does the change to this line represent a change to the semantics of the output or is it just a change in compatibility between the two object types? That is to say, the old dat[[i]] has attributes result, message, and epidata--does the new dat[[i]] have something different?

Contributor Author:

Good question, and perhaps we should document this more clearly in a comment. The old format would return error codes in the JSON. The CSV format does not have a field for error messages; instead it uses HTTP status codes, so the

 httr::stop_for_status(response, task = "fetch data from API")

line in .request will do the main error reporting if we don't get HTTP 200 back.
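
As a minimal sketch of this status-code-based error handling (the URL is a placeholder, not the real API endpoint):

```r
library(httr)

# Illustrative request; the URL here is hypothetical.
response <- GET("https://example.com/api/covidcast.csv")

# On any non-2xx status this raises an R error such as
# "fetch data from API failed [404]"; on success it returns invisibly.
stop_for_status(response, task = "fetch data from API")
```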

)
}

if (nrow(dat[[i]]) > 0 && !identical("*", geo_value)) {
Contributor:

Why are we using identical() and not just ==?

Contributor Author:

If geo_value is a vector, "*" == geo_value will be a vector of logicals instead of one boolean value.
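
A quick illustration of the difference (not from the PR itself):

```r
geo_value <- c("pa", "ny")

"*" == geo_value           # FALSE FALSE: a logical vector, one element per entry
identical("*", geo_value)  # FALSE: always a single TRUE/FALSE

identical("*", "*")        # TRUE
```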

}

if (nrow(dat[[i]]) > 0 && !identical("*", geo_value)) {
returned_geo_values <- dat[[i]]$geo_value
Contributor:

Are there multiple geovalues here? Should this just be renamed returned_geo_value?

Contributor Author:

Yes, there are multiple: dat[[i]]$geo_value is a column of a data frame and may contain multiple entries.

@@ -645,7 +651,16 @@ covidcast <- function(data_source, signal, time_type, geo_type, time_values,
}

# Make the API call
return(.request(params))
res <- .request(params)
Contributor:

Still not thrilled with res.

@@ -0,0 +1,3 @@
geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size
01001,bar,20200110,,20200111,1,91.2,0.8,114.2
01002,bar,20200110,,20200111,1,99.1,0.2,217.8
Contributor:

Add a newline here.

@@ -0,0 +1,3 @@
geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size
31001,bar,20200112,,20200113,1,81.2,0.8,314.2
31002,bar,20200112,,20200113,1,89.1,0.2,417.8
Contributor:

Add a trailing newline here too.

test_that("covidcast_signal stops when end_day < start_day", {
# reusing api.php-da6974.json
# reusing api.php-d2e163.json for metadata
Contributor:

Is this still json?

Contributor Author:

Good catch; the filename was wrong too (the tooling for httptest filename tracking is pretty clunky).

@capnrefsmmat capnrefsmmat removed the request for review from ryantibs November 12, 2020 16:33
@capnrefsmmat (Contributor Author)

@sgsmob Planning question. This PR, #266, and #216 together represent a decent number of improvements and new features. It may be worth designating these as a new release.

Since we have people use devtools::install_github(), merging these PRs makes them instantly available to new users. Do you think we should merge these PRs in a coordinated way? Just dribble them out? Get formal and have a development branch that we merge to main when we want to "release"?

@sgsmob (Contributor) commented Nov 12, 2020

I don't have a good intuition around releases. @benjaminysmith and @tildechris might have a better sense for it or at least some insight from the practices we have adopted in engineering.

@capnrefsmmat (Contributor Author)

I propose something like

  1. I create a release branch
  2. I merge these PRs into the release branch
  3. We make a PR for it so CI can test
  4. Merge in one big batch when it's done and we're ready (including changelog, etc.)

@chinandrew (Collaborator)

I propose something like

  1. I create a release branch
  2. I merge these PRs into the release branch
  3. We make a PR for it so CI can test
  4. Merge in one big batch when it's done and we're ready (including changelog, etc.)

We were discussing branching strategies for the indicator repos and landed on something like gitflow, where everyone merges continuously into one branch, and releases are cut and merged into the production branch whenever we want. Could see a similar thing working here too.

@benjaminysmith (Contributor) left a comment

Conceptually this makes sense, but I am a bit nervous that this is a pretty substantial change to the way the API client works -- how can we be confident that it works as a drop-in replacement? Testing in our own code? (That might be fine -- just calling it out.)

Another option would be to have this as an alternative method in parallel at first (maybe that can be enabled through a flag or parameter) and then switch over to using it by default once it has been tested in our code. What do you think?


# geo_value must be read as character so FIPS codes are returned as character,
# not numbers (with leading 0s potentially removed)
return(read.csv(textConnection(response), stringsAsFactors = FALSE,
Contributor:

Are there any possible issues with large data sizes here? (which might work differently with CSV vs json)

Contributor Author:

textConnection does copy the response, and I don't know how jsonlite's performance compares to R's native CSV reader. Not sure how we'd test this, though; maybe benchmarking a call that returns a particularly large dataset?

My guess is that the smaller size of CSV responses outweighs any change in performance, but that's just a guess.
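
One way to test this could be a microbenchmark on equivalent payloads; a sketch under assumed inputs (`csv_text` and `json_text` are placeholders for real API responses, and the `colClasses` argument follows the geo_value-as-character comment above):

```r
library(microbenchmark)
library(jsonlite)

# Compare base R's CSV parser against jsonlite on the same data.
microbenchmark(
  csv  = read.csv(textConnection(csv_text), stringsAsFactors = FALSE,
                  colClasses = c(geo_value = "character")),
  json = fromJSON(json_text),
  times = 20
)
```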

@capnrefsmmat (Contributor Author)

Conceptually this makes sense, but I am a bit nervous that this is a pretty substantial change to the way the API client works -- how can we be confident that it works as a drop-in replacement? Testing in our own code? (that might be fine -- just calling it out)

Yeah, I'm nervous too. That's why I've added more tests here, and also verified the vignettes run. But since we're brand-new to testing for this package, it's entirely possible there's something I've missed.

I think I'd be most comfortable if we prepare a release branch and have some Delphi members use it for their work for a week and see if any bugs arise. Any objections to the idea of preparing a release branch for that purpose? We could move to that model for future work, where we always merge to a dev branch and beta-test before release. I think that's in line with @chinandrew's suggestion.

@benjaminysmith (Contributor) left a comment

Looks good to me, as it sounds like you are considering safe rollout options. I think the main concerns are 1) detecting problems when this goes out (I suspect this would rely on user feedback); 2) having a way for users to roll back. Since this is installed through devtools, is there an easy way to revert to the previous version? If not, it might be worth putting this behind a parameter or flag.

@capnrefsmmat (Contributor Author)

devtools::install_github takes a ref argument, which can be a branch, commit, or tag, so we can always point back to a prior commit or to a different branch if needed.
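
For example (the repository path is assumed from the file paths in this PR; the commit hash is the merge commit of this PR):

```r
# Install from a specific branch...
devtools::install_github("cmu-delphi/covidcast",
                         subdir = "R-packages/covidcast",
                         ref = "r-pkg-devel")

# ...or pin to a commit to roll back to a known-good version.
devtools::install_github("cmu-delphi/covidcast",
                         subdir = "R-packages/covidcast",
                         ref = "5890c7f")
```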

I'll need to point this branch at r-pkg-devel and then fix some tests after it goes in, probably after #275.

The CSV format is much more compact (does not repeat field names for
every row), and more naturally fits with R anyway.

Alter the relevant tests to serve CSVs. I've verified all vignettes
build with these changes.

It should not be possible to have two signals with the same source,
signal, time_type, and geo_type. This will cause a query for that signal
to have two metadata rows attached to the covidcast_signal data frame,
which will confuse everything.

Fetching multiple days is important.
@capnrefsmmat capnrefsmmat changed the base branch from main to r-pkg-devel November 25, 2020 17:02
@capnrefsmmat capnrefsmmat removed the request for review from krivard November 25, 2020 17:02
@capnrefsmmat (Contributor Author)

I believe this is ready for re-review. Changes since last time:

cc @JedGrabman, since this touches your batching code

@capnrefsmmat capnrefsmmat merged commit 5890c7f into r-pkg-devel Nov 30, 2020
@statsmaths statsmaths deleted the r-csv-api branch March 19, 2021 20:23