Take advantage of nested prediction cards format to save memory/decrease runtime? #613

Open
nmdefries opened this issue Jan 18, 2023 · 0 comments
Labels: enhancement (New feature or request), evalcast package

nmdefries commented Jan 18, 2023

The evalcast-killcards branch specifically moved to an unnested format because it seemed cleaner and more convenient. However, nesting has some benefits, like using less memory, since duplicate values in the nest_by fields don't have to be stored. It could also simplify some of the scoring logic.

Using nesting just for the forecast scoring might be as inefficient / RAM-exploding as a group_split. Sharing the nesting work across enough subprocesses could make it worthwhile: for example, if the nesting/nest_bying is done before the join with the actuals, or, more generally, if the nested form is shared across all error measure calculations, it might offer some offsetting performance gains.
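A minimal sketch of what sharing one unnest across several error measures could look like (hypothetical data and measure names; the pinball and median-error formulas here are just illustrative, not evalcast's actual error measures):

```r
library(tidyverse)

# Toy data in the same shape as the benchmark below: N forecast tasks,
# each with M quantile predictions nested in a list column.
N <- 100L; M <- 23L
asdf <- tibble(
  k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)

# Unnest once, then compute every error measure from the same long table,
# instead of paying the unnest cost inside each measure.
long <- asdf %>% unnest(pred)
scored <- long %>%
  group_by(k) %>%
  summarize(
    pinball = mean(pmax(q * (y - v), (q - 1) * (y - v))),  # quantile loss
    abs_err = abs(y[1] - v[q == 0.5])                      # error at the median
  )
```

The point of the sketch is only that the `unnest(pred)` call happens once, while the per-measure work stays inside a single `summarize`.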

Comparing speed of nested/unnested approaches for scoring-like calculations (credit: @brookslogan):

Timing some simple operations to see if we'd expect speed gains [using an unnest approach in error measures]:

```r
library(tidyverse)
N <- 3000L
M <- 23L
asdf <- tibble(
  k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)
unnested <- asdf %>% unnest(pred)
microbenchmark::microbenchmark(
  asdf %>% rowwise() %>% summarize(k, e = mean(y - pred$q * pred$v)),
  asdf %>% unnest(pred) %>% mutate(dev = y - q * v) %>% group_by(k) %>% summarize(e = mean(dev)),
  asdf %>% unnest(pred) %>% group_by(k) %>% summarize(e = mean(y - q * v)),
  unnested %>% mutate(dev = y - q * v) %>% group_by(k) %>% summarize(e = mean(dev))
)
#> Unit: milliseconds
#>                                                                                                 expr
#>                                   asdf %>% rowwise() %>% summarize(k, e = mean(y - pred$q * pred$v))
#>  asdf %>% unnest(pred) %>% mutate(dev = y - q * v) %>% group_by(k) %>%      summarize(e = mean(dev))
#>                        asdf %>% unnest(pred) %>% group_by(k) %>% summarize(e = mean(y -      q * v))
#>                    unnested %>% mutate(dev = y - q * v) %>% group_by(k) %>% summarize(e = mean(dev))
#>       min       lq     mean   median       uq       max neval
#>  37.25541 39.74052 41.62194 41.41595 42.82244  57.75427   100
#>  36.90822 38.21231 41.35320 39.29544 41.82904 111.32808   100
#>  39.90252 41.90083 45.60959 44.20522 46.16257 115.62528   100
#>  17.80969 18.36073 19.87239 19.11819 20.29674  36.05479   100
```

Expect these comparisons to vary with N, M, and the complexity of the calculation. But the above makes me pessimistic about unnest unless the unnesting can be shared across eval metrics or the nested form can be avoided altogether, and even then it's not that much faster. I'm not sure the latter is possible, because nesting might save RAM by avoiding repeated values in all the other columns (which is part of why I was suggesting unnesting and evaluating chunks of predictions at a time, rather than all at once). Although, if there are faster versions of unnest or the summarize, then maybe better speed gains could be realized.
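One way to sketch the "unnest and evaluate chunks at a time" idea (hypothetical chunking scheme; `chunk_size` would need tuning against real prediction card sizes):

```r
library(tidyverse)

# Same toy data shape as in the benchmark above.
N <- 3000L; M <- 23L
asdf <- tibble(
  k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)

# Unnest and score one chunk of tasks at a time, so the fully unnested
# N*M-row table never has to exist all at once.
chunk_size <- 500L
scores <- asdf %>%
  mutate(.chunk = (k - 1L) %/% chunk_size) %>%
  group_split(.chunk) %>%
  map(function(chunk) {
    chunk %>%
      unnest(pred) %>%
      group_by(k) %>%
      summarize(e = mean(y - q * v))
  }) %>%
  bind_rows()
```

This trades a single large unnest for several small ones; peak memory drops roughly by a factor of N / chunk_size, at the cost of some per-chunk overhead.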

From our prior discussion, using a nested form isn't a clear winner, partly because the unnesting step is slow. We didn't look at differences in memory usage of different approaches, though.
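As a starting point for the memory question, a hypothetical sketch just measuring the two layouts with `object.size` (note the per-tibble overhead of a list column can be substantial, so nesting isn't automatically smaller; real prediction cards with many duplicated key columns may behave differently):

```r
library(tidyverse)

# Nested layout: key fields appear once per forecast task; the quantile
# rows live in a list column of small tibbles.
N <- 3000L; M <- 23L
nested <- tibble(
  forecaster = "toy", geo_value = "us", k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)

# Unnested layout: the key fields are repeated for each of the M quantiles.
unnested <- nested %>% unnest(pred)

# Compare footprints; object.size is approximate but fine for a first look.
print(object.size(nested), units = "MB")
print(object.size(unnested), units = "MB")
```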

Is this worth pursuing/investigating more? I (@nmdefries) am not familiar with the prior (pre-evalcast-killcards) nested format, so I don't know what's been tried before.
