
Faster R package vignettes #266

Draft · wants to merge 12 commits into main
Conversation

capnrefsmmat (Contributor) commented Nov 11, 2020

With some judicious optimization of latest_issue, we can make the correlation vignette several hundred times faster to build. I'm also switching it to be at the state level to save on data download time.

See the commit message in 4e9875c for an explanation of the speed improvements.

This is a work in progress; my goal is to make the vignettes fast enough that we can change the install instructions to build them. On my machine they take about 2 minutes to build, about two-thirds of which is waiting for data from the API; we can probably cut that down by using fewer datasets. Failing that, we can at least build the vignettes as part of the CI, so we don't push changes that completely break the vignettes.
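
For context, a rough sketch of what a state-level fetch for the correlation vignette might look like (the signal names, dates, and covidcast_cor() arguments here are illustrative, not necessarily what the vignette actually uses):

library(covidcast)
# Fetch two state-level signals; geo_type = "state" downloads far fewer rows
# than county-level data.
dv <- covidcast_signal("doctor-visits", "smoothed_adj_cli",
                       start_day = "2020-06-01", end_day = "2020-10-01",
                       geo_type = "state")
cases <- covidcast_signal("jhu-csse", "confirmed_7dav_incidence_prop",
                          start_day = "2020-06-01", end_day = "2020-10-01",
                          geo_type = "state")
# Correlate the two signals within each state over time.
covidcast_cor(dv, cases, by = "geo_value")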

capnrefsmmat (Contributor, Author):
@ryantibs This isn't done, but you may like it just because it makes your correlation analyses ludicrously faster (along with anything using latest_issue). Maybe that will help some forecasting or modeling work that uses covidcast_cor for exploration.

capnrefsmmat marked this pull request as draft on November 11, 2020, 17:02
The CSV format is much more compact (does not repeat field names for
every row), and more naturally fits with R anyway.

Alter the relevant tests to serve CSVs. I've verified all vignettes
build with these changes.
It should not be possible to have two signals with the same source, signal, time_type, and geo_type. If that were possible, a query for such a signal would attach two metadata rows to the covidcast_signal data frame, which would confuse everything downstream.
Fetching multiple days is important.
Profiling revealed that latest_issue was responsible for a large portion
of the time taken in building correlation-utils.Rmd (apart from
downloading the data). Much of this time was spent in dplyr::filter.

Rather than grouping by geography and time, we can use dplyr::distinct,
knowing that each geo_value and time_value should appear only once per
issue date. By taking the first or last (after sorting by issue date),
we get the desired result.

dplyr does not document algorithmic details, so I can't give a precise big-O comparison here. Algorithmic details notwithstanding, the results are extraordinary:

> nrow(d)
[1] 203360
> system.time(latest_issue_old(d))
   user  system elapsed
  6.395   0.037   6.465
> system.time(latest_issue(d))
   user  system elapsed
  0.025   0.003   0.027
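
A minimal sketch of the distinct-based approach (not the package's exact code; it assumes the covidcast_signal columns geo_value, time_value, and issue):

# Sort so the newest issue comes first within each (geo_value, time_value)
# pair, then keep only the first occurrence of each pair.
latest_issue_sketch <- function(df) {
  df <- dplyr::arrange(df, dplyr::desc(issue))
  dplyr::distinct(df, geo_value, time_value, .keep_all = TRUE)
}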
Fetching the county data took a large portion of the time required to
build the vignette, particularly after the fixes to latest_issue in
b0f7e7b.
Always run your tests before pushing...
Providing the `repo` block with a link pointing to R-packages/covidcast/ lets pkgdown build the correct URLs.
Also modify several vignettes to download only the necessary amount of
data, thus reducing the file size of these CSV files.
ryantibs (Member):
@capnrefsmmat Nice job. Generally lgtm but just a few thoughts for you to consider.

  • This will return a covidcast_signal data frame with scrambled rows. Would it be worth reordering? This doesn't affect any of our downstream uses like plotting or correlations, but may confuse a user if they call latest_issue() directly.

  • As far as I understand, latest_issue() doesn't check the class? So it could in principle be applied to a covidcast_signal_long data frame? And in this case it wouldn't return the right result, since you're masking by geo_value and time_value in the call to dplyr::distinct(). Thus I would either check the class in latest_issue() ... or include data_source and signal as masking variables.
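
One way to make the second point concrete, sketched under the assumption that the long format carries data_source and signal columns:

# Include data_source and signal among the masking variables so a
# covidcast_signal_long data frame with several signals keeps the latest
# issue of each signal separately (the same arrange-by-issue step would
# still come first).
dplyr::distinct(df_long, data_source, signal, geo_value, time_value,
                .keep_all = TRUE)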
