Take advantage of nested prediction cards format to save memory/decrease runtime? #613

Open
nmdefries opened this issue Jan 18, 2023 · 0 comments
Labels: enhancement (New feature or request), evalcast package

nmdefries commented Jan 18, 2023

The evalcast-killcards branch specifically moved to an unnested format because it seemed cleaner and more convenient. However, nesting has some benefits, like using less memory, since duplicate values in the nest_by fields don't have to be stored. It could also simplify some of the scoring logic.

Using nesting just for the forecast scoring might be as inefficient / RAM-exploding as a group_split. Sharing the nesting work across enough subprocesses could make it worthwhile: for example, if the nesting/nest_bying is done before the join with the actuals, or, more generally, if the nested form is shared across all error measure calculations, it might offer some offsetting performance gains.
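A minimal sketch of what sharing one unnest across several error measures could look like (hypothetical data and measure names; the pinball and median-error formulas here are just illustrative, not evalcast's actual error measures):

```r
library(tidyverse)

# Toy data in the same shape as the benchmark below: N forecast tasks,
# each with M quantile predictions nested in a list column.
N <- 100L; M <- 23L
asdf <- tibble(
  k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)

# Unnest once, then compute every error measure from the same long table,
# instead of paying the unnest cost inside each measure.
long <- asdf %>% unnest(pred)
scored <- long %>%
  group_by(k) %>%
  summarize(
    pinball = mean(pmax(q * (y - v), (q - 1) * (y - v))),  # quantile loss
    abs_err = abs(y[1] - v[q == 0.5])                      # error at the median
  )
```

The point of the sketch is only that the `unnest(pred)` call happens once, while the per-measure work stays inside a single `summarize`.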

Comparing speed of nested/unnested approaches for scoring-like calculations (credit: @brookslogan):

Timing some simple operations to see if we'd expect speed gains [using an unnest approach in error measures]:

```r
library(tidyverse)
N <- 3000L
M <- 23L
asdf <- tibble(
  k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)
unnested <- asdf %>% unnest(pred)
microbenchmark::microbenchmark(
  asdf %>% rowwise() %>% summarize(k, e = mean(y - pred$q * pred$v)),
  asdf %>% unnest(pred) %>% mutate(dev = y - q * v) %>% group_by(k) %>% summarize(e = mean(dev)),
  asdf %>% unnest(pred) %>% group_by(k) %>% summarize(e = mean(y - q * v)),
  unnested %>% mutate(dev = y - q * v) %>% group_by(k) %>% summarize(e = mean(dev))
)
#> Unit: milliseconds
#>                                                                                                 expr
#>                                   asdf %>% rowwise() %>% summarize(k, e = mean(y - pred$q * pred$v))
#>  asdf %>% unnest(pred) %>% mutate(dev = y - q * v) %>% group_by(k) %>%      summarize(e = mean(dev))
#>                        asdf %>% unnest(pred) %>% group_by(k) %>% summarize(e = mean(y -      q * v))
#>                    unnested %>% mutate(dev = y - q * v) %>% group_by(k) %>% summarize(e = mean(dev))
#>       min       lq     mean   median       uq       max neval
#>  37.25541 39.74052 41.62194 41.41595 42.82244  57.75427   100
#>  36.90822 38.21231 41.35320 39.29544 41.82904 111.32808   100
#>  39.90252 41.90083 45.60959 44.20522 46.16257 115.62528   100
#>  17.80969 18.36073 19.87239 19.11819 20.29674  36.05479   100
```

Expect these comparisons to vary with N, M, and the complexity of the calculation. But the above makes me pessimistic about unnest unless the unnesting can be shared across eval metrics or the nested form can be avoided altogether, and even then it's not that much faster. I'm not sure the latter is possible, because nesting might save RAM by avoiding repeated values in all the other columns (which is part of why I was suggesting unnesting and evaluating chunks of predictions at a time, rather than all at once). Although, if there are faster versions of unnest or the summarize, then maybe better speed gains could be realized.
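One way to sketch the "unnest and evaluate chunks at a time" idea (hypothetical chunking scheme; `chunk_size` would need tuning against real prediction card sizes):

```r
library(tidyverse)

# Same toy data shape as in the benchmark above.
N <- 3000L; M <- 23L
asdf <- tibble(
  k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)

# Unnest and score one chunk of tasks at a time, so the fully unnested
# N*M-row table never has to exist all at once.
chunk_size <- 500L
scores <- asdf %>%
  mutate(.chunk = (k - 1L) %/% chunk_size) %>%
  group_split(.chunk) %>%
  map(function(chunk) {
    chunk %>%
      unnest(pred) %>%
      group_by(k) %>%
      summarize(e = mean(y - q * v))
  }) %>%
  bind_rows()
```

This trades a single large unnest for several small ones; peak memory drops roughly by a factor of N / chunk_size, at the cost of some per-chunk overhead.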

From our prior discussion, using a nested form isn't a clear winner, partly because the unnesting step is slow. We didn't look at differences in memory usage of different approaches, though.
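As a starting point for the memory question, a hypothetical sketch just measuring the two layouts with `object.size` (note the per-tibble overhead of a list column can be substantial, so nesting isn't automatically smaller; real prediction cards with many duplicated key columns may behave differently):

```r
library(tidyverse)

# Nested layout: key fields appear once per forecast task; the quantile
# rows live in a list column of small tibbles.
N <- 3000L; M <- 23L
nested <- tibble(
  forecaster = "toy", geo_value = "us", k = seq_len(N), y = rnorm(N),
  pred = map(seq_len(N), ~ tibble(q = (1:M) / (M + 1), v = sort(rnorm(M))))
)

# Unnested layout: the key fields are repeated for each of the M quantiles.
unnested <- nested %>% unnest(pred)

# Compare footprints; object.size is approximate but fine for a first look.
print(object.size(nested), units = "MB")
print(object.size(unnested), units = "MB")
```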

Is this worth pursuing/investigating more? I (@nmdefries) am not familiar with the prior (pre-evalcast-killcards) nested format, so I don't know what's been tried before.
