
Faster R package vignettes #266

Draft · wants to merge 12 commits into main
Conversation

capnrefsmmat (Contributor) commented Nov 11, 2020

With some judicious optimization of latest_issue, we can make the correlation vignette several hundred times faster to build. I'm also switching it to be at the state level to save on data download time.

See the commit message in 4e9875c for an explanation of the speed improvements.

This is a work in progress; my goal is to make the vignettes fast enough that we can change the install instructions to build them. On my machine they take about 2 minutes to build, about two-thirds of which is waiting for data from the API; we can probably cut that down by using fewer datasets. Failing that, we can at least build the vignettes as part of the CI, so we don't push changes that completely break the vignettes.
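
For context, a rough sketch of what a state-level fetch for the correlation vignette might look like (the signal names, dates, and covidcast_cor() arguments here are illustrative, not necessarily what the vignette actually uses):

library(covidcast)
# Fetch two state-level signals; geo_type = "state" downloads far fewer rows
# than county-level data.
dv <- covidcast_signal("doctor-visits", "smoothed_adj_cli",
                       start_day = "2020-06-01", end_day = "2020-10-01",
                       geo_type = "state")
cases <- covidcast_signal("jhu-csse", "confirmed_7dav_incidence_prop",
                          start_day = "2020-06-01", end_day = "2020-10-01",
                          geo_type = "state")
# Correlate the two signals within each state over time.
covidcast_cor(dv, cases, by = "geo_value")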

capnrefsmmat (Contributor, Author):
@ryantibs This isn't done, but you may like it just because it makes your correlation analyses ludicrously faster (along with anything using latest_issue). Maybe that will help some forecasting or modeling work that uses covidcast_cor for exploration.

capnrefsmmat marked this pull request as draft on November 11, 2020, 17:02
The CSV format is much more compact (does not repeat field names for
every row), and more naturally fits with R anyway.

Alter the relevant tests to serve CSVs. I've verified all vignettes
build with these changes.
It should not be possible to have two signals with the same source, signal, time_type, and geo_type. If that were possible, a query for such a signal would attach two metadata rows to the covidcast_signal data frame, which would confuse everything downstream.
Fetching multiple days is important.
Profiling revealed that latest_issue was responsible for a large portion
of the time taken in building correlation-utils.Rmd (apart from
downloading the data). Much of this time was spent in dplyr::filter.

Rather than grouping by geography and time, we can use dplyr::distinct,
knowing that each geo_value and time_value should appear only once per
issue date. By taking the first or last (after sorting by issue date),
we get the desired result.

dplyr does not document algorithmic details, so I can't give a precise big-O comparison here. Algorithmic details notwithstanding, the results are extraordinary:

> nrow(d)
[1] 203360
> system.time(latest_issue_old(d))
   user  system elapsed
  6.395   0.037   6.465
> system.time(latest_issue(d))
   user  system elapsed
  0.025   0.003   0.027
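
A minimal sketch of the distinct-based approach (not the package's exact code; it assumes the covidcast_signal columns geo_value, time_value, and issue):

# Sort so the newest issue comes first within each (geo_value, time_value)
# pair, then keep only the first occurrence of each pair.
latest_issue_sketch <- function(df) {
  df <- dplyr::arrange(df, dplyr::desc(issue))
  dplyr::distinct(df, geo_value, time_value, .keep_all = TRUE)
}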
Fetching the county data took a large portion of the time required to
build the vignette, particularly after the fixes to latest_issue in
b0f7e7b.
Always run your tests before pushing...
Providing the `repo` block with a link pointing to R-packages/covidcast/ lets pkgdown build the correct URLs.
Also modify several vignettes to download only the necessary amount of
data, thus reducing the file size of these CSV files.
ryantibs (Member):
@capnrefsmmat Nice job. Generally lgtm but just a few thoughts for you to consider.

  • This will return a covidcast_signal data frame with scrambled rows. Would it be worth reordering? This doesn't affect any of our downstream uses like plotting or correlations, but may confuse a user if they call latest_issue() directly.

  • As far as I understand, latest_issue() doesn't check the class? So it could in principle be applied to a covidcast_signal_long data frame? And in this case it wouldn't return the right result, since you're masking by geo_value and time_value in the call to dplyr::distinct(). Thus I would either check the class in latest_issue() ... or include data_source and signal as masking variables.
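
One way to make the second point concrete, sketched under the assumption that the long format carries data_source and signal columns:

# Include data_source and signal among the masking variables so a
# covidcast_signal_long data frame with several signals keeps the latest
# issue of each signal separately (the same arrange-by-issue step would
# still come first).
dplyr::distinct(df_long, data_source, signal, geo_value, time_value,
                .keep_all = TRUE)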
