Faster R package vignettes #266
Conversation
@ryantibs This isn't done, but you may like it just because it makes your correlation analyses ludicrously faster (along with anything else using `latest_issue`).
The CSV format is much more compact (does not repeat field names for every row), and more naturally fits with R anyway. Alter the relevant tests to serve CSVs. I've verified all vignettes build with these changes.
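A tiny base-R illustration of the size difference (toy data, not the package's actual serialization code): the field names appear once in the CSV header, but once per row in record-style JSON.

```r
# Two rows of toy data, in the same shape as a covidcast response.
df <- data.frame(geo_value = c("pa", "ny"), value = c(1.5, 2.5))

# CSV: field names appear once, in the header.
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)

# Row-wise JSON built by hand (jsonlite::toJSON produces the same shape):
# every record repeats every field name.
json <- paste0("[", paste0(
  '{"geo_value":"', df$geo_value, '","value":', df$value, "}",
  collapse = ","), "]")

file.size(csv_file) < nchar(json)  # TRUE, and the gap grows with row count
```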
It should not be possible to have two signals with the same source, signal, time_type, and geo_type. This will cause a query for that signal to have two metadata rows attached to the covidcast_signal data frame, which will confuse everything.
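A sanity check for this invariant might look like the sketch below (hypothetical helper; the column names `data_source`, `signal`, `time_type`, and `geo_type` are assumed to match the metadata data frame):

```r
# Hypothetical check: no two metadata rows may share the same
# source, signal, time_type, and geo_type.
check_unique_signals <- function(meta) {
  key <- meta[, c("data_source", "signal", "time_type", "geo_type")]
  dups <- duplicated(key)
  if (any(dups)) {
    stop("duplicate metadata rows for signal(s): ",
         paste(unique(meta$signal[dups]), collapse = ", "))
  }
  invisible(meta)
}
```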
Fetching multiple days is important.
Profiling revealed that `latest_issue` was responsible for a large portion of the time taken to build correlation-utils.Rmd (apart from downloading the data). Much of this time was spent in `dplyr::filter`. Rather than grouping by geography and time, we can use `dplyr::distinct`, knowing that each `geo_value` and `time_value` pair should appear only once per issue date. By taking the first or last row per pair (after sorting by issue date), we get the desired result. dplyr does not document its algorithmic details, so I can't give precise big-O bounds here. Algorithmic details notwithstanding, the results are extraordinary:

    > nrow(d)
    [1] 203360
    > system.time(latest_issue_old(d))
       user  system elapsed
      6.395   0.037   6.465
    > system.time(latest_issue(d))
       user  system elapsed
      0.025   0.003   0.027
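The two implementations can be sketched roughly like this (hypothetical reconstructions; the real functions live in the covidcast R package and handle more columns and edge cases):

```r
library(dplyr)

# Slow version: group by geo/time, then filter each group down to the
# rows whose issue date is the group maximum.
latest_issue_old <- function(df) {
  df %>%
    group_by(geo_value, time_value) %>%
    filter(issue == max(issue)) %>%
    ungroup()
}

# Fast version: sort so the latest issue comes first, then keep only the
# first row for each (geo_value, time_value) pair. Since each pair occurs
# at most once per issue date, distinct() returns the same rows without
# any per-group work.
latest_issue <- function(df) {
  df %>%
    arrange(desc(issue)) %>%
    distinct(geo_value, time_value, .keep_all = TRUE)
}
```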
Fetching the county data took a large portion of the time required to build the vignette, particularly after the fixes to latest_issue in b0f7e7b.
Always run your tests before pushing...
If we give pkgdown a `repo` block with links pointing to R-packages/covidcast/, it can build the correct source URLs.
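The relevant `_pkgdown.yml` fragment would look roughly like this (URLs are illustrative; pkgdown documents the `repo.url` keys):

```yaml
repo:
  url:
    home: https://github.com/cmu-delphi/covidcast/
    source: https://github.com/cmu-delphi/covidcast/blob/main/R-packages/covidcast/
```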
Also modify several vignettes to download only the necessary amount of data, thus reducing the file size of these CSV files.
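One way to keep the cached fixtures small is to trim each fetched data frame to just the date range a vignette actually uses before writing the CSV. A minimal sketch (hypothetical helper, assuming a `time_value` column of class `Date` as in covidcast data frames):

```r
# Keep only rows whose time_value falls inside [start_day, end_day],
# so the CSV fixture written for the vignette stays small.
trim_to_range <- function(df, start_day, end_day) {
  df[df$time_value >= as.Date(start_day) &
     df$time_value <= as.Date(end_day), , drop = FALSE]
}
```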
Force-pushed from c03fc3f to 30a5c2f.
@capnrefsmmat Nice job. Generally LGTM, but just a few thoughts for you to consider.
With some judicious optimization of `latest_issue`, we can make the correlation vignette several hundred times faster to build. I'm also switching it to the state level to save on data download time. See the commit message in 4e9875c for an explanation of the speed improvements.
This is a work in progress; my goal is to make the vignettes fast enough that we can change the install instructions to build them. On my machine they take about 2 minutes to build, about two-thirds of which is waiting for data from the API; we can probably cut that down by using fewer datasets. Failing that, we can at least build the vignettes as part of the CI, so we don't push changes that completely break the vignettes.