-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use CSV format for API access in R package #271
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of comments here based on me not really knowing what is going on here. But this is in pretty good shape.
R-packages/covidcast/R/covidcast.R
Outdated
as_of = as_of, | ||
issues = issues, | ||
lag = lag) | ||
res <- covidcast(data_source = data_source, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really keen on the variable name "res" but I don't have a better, more descriptive suggestion off the top of my head.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about response
?
R-packages/covidcast/R/covidcast.R
Outdated
summary <- sprintf( | ||
"Fetched day %s: %s, %s, num_entries = %s", | ||
"Fetched day %s: num_entries = %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the change to this line represent a change to the semantics of the output or is it just a change in compatibility between the two object types? That is to say, the old dat[[i]]
has attributes result
, message
, and epidata
--does the new dat[[i]]
have something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, and perhaps we should document this more clearly in a comment. The old format would return error codes in the JSON. The CSV format does not have a field for error messages; instead it uses HTTP status codes, so the
httr::stop_for_status(response, task = "fetch data from API")
line in .request
will do the main error reporting if we don't get HTTP 200 back.
R-packages/covidcast/R/covidcast.R
Outdated
) | ||
} | ||
|
||
if (nrow(dat[[i]]) > 0 && !identical("*", geo_value)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using identical()
and not just ==
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If geo_value
is a vector, "*" == geo_value
will be a vector of logicals instead of one boolean value.
R-packages/covidcast/R/covidcast.R
Outdated
} | ||
|
||
if (nrow(dat[[i]]) > 0 && !identical("*", geo_value)) { | ||
returned_geo_values <- dat[[i]]$geo_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there multiple geovalues here? Should this just be renamed returned_geo_value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are multiple, since dat[[i]]$geo_value
is a column of a data frame and may have multiple entries
R-packages/covidcast/R/covidcast.R
Outdated
@@ -645,7 +651,16 @@ covidcast <- function(data_source, signal, time_type, geo_type, time_values, | |||
} | |||
|
|||
# Make the API call | |||
return(.request(params)) | |||
res <- .request(params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not thrilled with res
.
@@ -0,0 +1,3 @@ | |||
geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size | |||
01001,bar,20200110,,20200111,1,91.2,0.8,114.2 | |||
01002,bar,20200110,,20200111,1,99.1,0.2,217.8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a newline here.
@@ -0,0 +1,3 @@ | |||
geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size | |||
31001,bar,20200112,,20200113,1,81.2,0.8,314.2 | |||
31002,bar,20200112,,20200113,1,89.1,0.2,417.8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline
test_that("covidcast_signal stops when end_day < start_day", { | ||
# reusing api.php-da6974.json | ||
# reusing api.php-d2e163.json for metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still json?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, and the filename was wrong too (the tooling for httptest filename tracking is pretty clunky)
@sgsmob Planning question. This PR, #266, and #216 together represent a decent amount of improvements and new features. It may be worth designating these as a new release. Since we have people use |
I don't have a good intuition around releases. @benjaminysmith and @tildechris might have a better sense for it or at least some insight from the practices we have adopted in engineering. |
I propose something like
|
We were discussing branching strategies for the indicator repos and landed on something like gitflow, where everyone merges continuously into one branch, and releases are cut and merged into the production branch whenever we want. Could see a similar thing working here too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually this makes sense, but I am a but nervous that this is a pretty substantial change to the way the API client works -- how can we be confident that it works as a drop-in replacement? Testing in our own code? (that might be fine -- just calling it out)
Another option would be to have this as an alternative method in parallel at first (maybe that can be enabled through a flag or parameter) and then switch over to using it by default once it has been tested in our code. What do you think?
|
||
# geo_value must be read as character so FIPS codes are returned as character, | ||
# not numbers (with leading 0s potentially removed) | ||
return(read.csv(textConnection(response), stringsAsFactors = FALSE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any possible issues with large data sizes here? (which might work differently with CSV vs json)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
textConnection
does copy the response, and I don't know how jsonlite's performance compares to R's native CSV reader. Not sure how we'd test this, though; maybe benchmarking a call that returns a particularly large dataset?
My guess is that the smaller size of CSV responses outweighs any change in performance, but that's just a guess
Yeah, I'm nervous too. That's why I've added more tests here, and also verified the vignettes run. But since we're brand-new to testing for this package, it's entirely possible there's something I've missed. I think I'd be most comfortable if we prepare a release branch and have some Delphi members use it for their work for a week and see if any bugs arise. Any objections to the idea of preparing a release branch for that purpose? We could move to that model for future work, where we always merge to a dev branch and beta-test before release. I think that's in line with @chinandrew's suggestion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me as sounds like you are considering safe rollout options. i think the main concerns are 1) detecting if there are problems when this goes out (I suspect this would rely on user feedback); 2) having a way for users to roll back. Since this is installed through devtools, is there an easy way to revert to the previous version? if not might be worth putting this behind a parameter or ffalg.
I'll need to point this branch at |
The CSV format is much more compact (does not repeat field names for every row), and more naturally fits with R anyway. Alter the relevant tests to serve CSVs. I've verified all vignettes build with these changes.
It should not be possible to have two signals with the same source, signal, time_type, and geo_type. This will cause a query for that signal to have two metadata rows attached to the covidcast_signal data frame, which will confuse everything.
Fetching multiple days is important.
677f4cd
to
0e56d9a
Compare
I believe this is ready for re-review. Changes since last time:
cc @JedGrabman, since this touches your batching code |
These changes make the R package download data from the API using its CSV download format. This format is much more compact than JSON. This also improves tests related to the data fetch behavior, since these improvements make me more confident in this PR's correctness.
There would be nominal bandwidth improvements, but the main benefit is for #266: I plan to use httptest's ability to make vignettes use static data instead of requesting data from the API. The JSON files for big API downloads are huge, while the CSVs will be much more compact, preventing the repo from bloating.
This depends on cmu-delphi/delphi-epidata#281; the servers should support compression for CSVs before we start directing clients to to the CSV downloads.