Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with roll_by family #137

Open
mitchelloharawild opened this issue Jul 26, 2019 · 9 comments
Open

Experiment with roll_by family #137

mitchelloharawild opened this issue Jul 26, 2019 · 9 comments

Comments

@mitchelloharawild
Copy link
Member

slide_by(), stretch_by(), tile_by() which integrates with dplyr's mutate(), summarise(), etc.

@mitchelloharawild
Copy link
Member Author

While the data.frame method for these functions should operate on row indices, it is possible for the tbl_ts method to also support time-aware inputs such as durations (#130).

@earowang
Copy link
Member

The main issue I’m concerning is the mable object once modeling the slided tsibble, model can’t be lazily evaluated.

@earowang
Copy link
Member

I talked to Hadley about this. He thinks it's a potentially good approach, more like a lazily evaluated grouped data frame. He also mentioned the tidymodels bootstrapping approach https://davisvaughan.github.io/strapgod/articles/dplyr-support.html

@mitchelloharawild
Copy link
Member Author

That sounds promising.
There will need to be more work than I had hoped to get mutate(), filter(), etc. to behave as expected - so a new class with dplyr methods will be required.

@mitchelloharawild
Copy link
Member Author

While I think this functionality is useful, I want to wait to see more developments with slurrr (#143) before working on this further.

The original motivating issue with the window "id" being part of the key structure has now been resolved in fabletools by supporting disjoint hierarchies.

Further, if slurrr is the package location going further, this functionality should belong in that package. Maybe @DavisVaughan can consider it's implementation/use, and tsibble can later add a tbl_ts method that supports time aware sliding inputs.

@earowang
Copy link
Member

earowang commented Aug 15, 2019 via email

@DavisVaughan
Copy link

DavisVaughan commented Aug 15, 2019

Oooh I had an idea for this too yesterday. It does use the lazy grouping feature of dplyr, but would require a new data frame subclass.

I'm not quite convinced that it would be worth it over a version of slide() that has an .index argument, which I mention in #143. They would essentially compute the same thing, but the size of the slide() result would be vec_size(.x) and the size of the slide_by result would be vec_size(unique(.index)).

I think it would be pretty hard (not impossible) to get another df subclass that would be able to 1) add and remove virtual groups correctly 2) be able to subclass on top of tsibble or a data frame, and know how to drop back to a pure tsibble after some kind of summary operation where the virtual groups are materialized. I think its much "easier" to let slide() be unaware of dplyr, but still do what you'd want here.

This is just a partial example of how it might look.

> pedestrian_small
# A tsibble: 240 x 6 [1h] <Australia/Melbourne>
# Key:       Sensor [1]
   Sensor         Date_Time           Date        Time Count       ym
   <chr>          <dttm>              <date>     <int> <int>    <mth>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan
 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826 2015 Jan
 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567 2015 Jan
 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264 2015 Jan
 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139 2015 Jan
 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77 2015 Jan
 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44 2015 Jan
 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56 2015 Jan
 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113 2015 Jan
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166 2015 Jan
# … with 230 more rows
> 
> unique(pedestrian_small$ym)
 [1] "2015 Jan" "2015 Feb" "2015 Mar" "2015 Apr" "2015 May" "2015 Jun" "2015 Jul" "2015 Aug" "2015 Sep" "2015 Oct"
[11] "2015 Nov" "2015 Dec" "2016 Jan" "2016 Feb" "2016 Mar" "2016 Apr" "2016 May" "2016 Jun" "2016 Jul" "2016 Aug"
[21] "2016 Sep" "2016 Oct" "2016 Nov" "2016 Dec"
> 
> pedestrian_small_slide_grouped <- slide_by(pedestrian_small, ym, before = 1, after = 1, step = 1)
> pedestrian_small_slide_grouped
# A tibble: 240 x 6
# Groups:   .start, .stop [24]
   Sensor         Date_Time           Date        Time Count       ym
   <chr>          <dttm>              <date>     <int> <int>    <mth>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan
 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826 2015 Jan
 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567 2015 Jan
 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264 2015 Jan
 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139 2015 Jan
 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77 2015 Jan
 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44 2015 Jan
 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56 2015 Jan
 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113 2015 Jan
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166 2015 Jan
# … with 230 more rows
> 
> pedestrian_small_slide_grouped %>%
+   summarise(
+     ym_start = first(ym),
+     ym_end = last(ym),
+     MonthlyCount = mean(Count)
+   )
# A tibble: 24 x 5
# Groups:   .start [23]
   .start     .stop      ym_start   ym_end MonthlyCount
   <date>     <date>        <mth>    <mth>        <dbl>
 1 2015-01-01 2015-02-01 2015 Jan 2015 Feb         227.
 2 2015-01-01 2015-03-01 2015 Jan 2015 Mar         180.
 3 2015-02-01 2015-04-01 2015 Feb 2015 Apr         112.
 4 2015-03-01 2015-05-01 2015 Mar 2015 May         147.
 5 2015-04-01 2015-06-01 2015 Apr 2015 Jun         165.
 6 2015-05-01 2015-07-01 2015 May 2015 Jul         160.
 7 2015-06-01 2015-08-01 2015 Jun 2015 Aug         120.
 8 2015-07-01 2015-09-01 2015 Jul 2015 Sep         130.
 9 2015-08-01 2015-10-01 2015 Aug 2015 Oct         141 
10 2015-09-01 2015-11-01 2015 Sep 2015 Nov         168.
# … with 14 more rows
> 
> pedestrian_small_slide_along <- mutate(
+   pedestrian_small,
+   col = slide_along_impl(Count, ~mean(.x), .index = ym, .ptype = dbl(), .before = 1, .after = 1)
+ )
> 
> pedestrian_small_slide_along
# A tsibble: 240 x 7 [1h] <Australia/Melbourne>
# Key:       Sensor [1]
   Sensor         Date_Time           Date        Time Count       ym   col
   <chr>          <dttm>              <date>     <int> <int>    <mth> <dbl>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan  227.
 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826 2015 Jan  227.
 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567 2015 Jan  227.
 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264 2015 Jan  227.
 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139 2015 Jan  227.
 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77 2015 Jan  227.
 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44 2015 Jan  227.
 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56 2015 Jan  227.
 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113 2015 Jan  227.
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166 2015 Jan  227.
# … with 230 more rows
> 
> # group by ym and slice(1) to get it down to the right number of rows
> pedestrian_small_slide_along %>%
+   group_by(ym) %>%
+   slice(1)
# A tsibble: 24 x 7 [1h] <Australia/Melbourne>
# Key:       Sensor [1]
# Groups:    ym [24]
   Sensor         Date_Time           Date        Time Count       ym   col
   <chr>          <dttm>              <date>     <int> <int>    <mth> <dbl>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan  227.
 2 Birrarung Marr 2015-02-01 00:00:00 2015-02-01     0   178 2015 Feb  180.
 3 Birrarung Marr 2015-03-01 00:00:00 2015-03-01     0    69 2015 Mar  112.
 4 Birrarung Marr 2015-04-01 00:00:00 2015-04-01     0    25 2015 Apr  147.
 5 Birrarung Marr 2015-05-01 00:00:00 2015-05-01     0    13 2015 May  165.
 6 Birrarung Marr 2015-06-01 00:00:00 2015-06-01     0     9 2015 Jun  160.
 7 Birrarung Marr 2015-07-01 00:00:00 2015-07-01     0    21 2015 Jul  120.
 8 Birrarung Marr 2015-08-01 00:00:00 2015-08-01     0    86 2015 Aug  130.
 9 Birrarung Marr 2015-09-01 00:00:00 2015-09-01     0    12 2015 Sep  141 
10 Birrarung Marr 2015-10-01 00:00:00 2015-10-01     0    22 2015 Oct  168.
# … with 14 more rows

@DavisVaughan
Copy link

Another option is a dplyr verb slide-arize() that would internally alter the virtual groups. This would mean there is no need for a subclass, which I'm all for. I'm sure it could be refined a bit to work right with tsibble objects. Notice how much it simplifies the example from the window function vignette

(it breaks the yearmonth class too when binding groups together, but maybe that can be fixed)

set.seed(123)

date <- as.Date(c("2019-01-01", "2019-01-02", "2019-02-01", "2019-02-02", "2019-02-03", "2019-03-01", "2019-04-01", "2019-04-02"))
idx <- yearmonth(date)

tib <- tibble(
  g = c("A", "A", "A", "B", "B", "B", "B", "B"),
  x = rnorm(length(g)),
  date = date,
  idx = idx
)

tib
#> # A tibble: 8 x 4
#>   g           x date            idx
#>   <chr>   <dbl> <date>        <mth>
#> 1 A     -0.560  2019-01-01 2019 Jan
#> 2 A     -0.230  2019-01-02 2019 Jan
#> 3 A      1.56   2019-02-01 2019 Feb
#> 4 B      0.0705 2019-02-02 2019 Feb
#> 5 B      0.129  2019-02-03 2019 Feb
#> 6 B      1.72   2019-03-01 2019 Mar
#> 7 B      0.461  2019-04-01 2019 Apr
#> 8 B     -1.27   2019-04-02 2019 Apr

tib %>%
  slidearize(y = mean(x), .index = idx, .before = 1)
#> # A tibble: 4 x 2
#>        idx      y
#>      <mth>  <dbl>
#> 1 2019 Jan -0.395
#> 2 2019 Feb  0.194
#> 3 2019 Mar  0.868
#> 4 2019 Apr  0.304

tib %>%
  group_by(g) %>%
  slidearize(y = mean(x), .index = idx, .before = 1)
#> # A tibble: 5 x 3
#> # Groups:   g [2]
#>   g     idx             y
#>   <chr> <date>      <dbl>
#> 1 A     2019-01-01 -0.395
#> 2 A     2019-02-01  0.256
#> 3 B     2019-02-01 -0.395
#> 4 B     2019-03-01  0.256
#> 5 B     2019-04-01  0.586

as_tibble(pedestrian) %>%
  mutate(ym = yearmonth(Date)) %>%
  slidearize(Count = mean(Count), .index = ym, .before = 1, .after = 1)
#> # A tibble: 24 x 2
#>          ym Count
#>       <mth> <dbl>
#>  1 2015 Jan  559.
#>  2 2015 Feb  626.
#>  3 2015 Mar  658.
#>  4 2015 Apr  678.
#>  5 2015 May  637.
#>  6 2015 Jun  651.
#>  7 2015 Jul  627.
#>  8 2015 Aug  635.
#>  9 2015 Sep  646.
#> 10 2015 Oct  661.
#> # … with 14 more rows

as_tibble(pedestrian) %>%
  mutate(ym = yearmonth(Date)) %>%
  group_by(Sensor) %>%
  slidearize(Count = mean(Count), .index = ym, .before = 1, .after = 1)
#> # A tibble: 95 x 3
#> # Groups:   Sensor [4]
#>    Sensor         ym         Count
#>    <chr>          <date>     <dbl>
#>  1 Birrarung Marr 2015-01-01  592.
#>  2 Birrarung Marr 2015-02-01  634.
#>  3 Birrarung Marr 2015-03-01  546.
#>  4 Birrarung Marr 2015-04-01  554.
#>  5 Birrarung Marr 2015-05-01  397.
#>  6 Birrarung Marr 2015-06-01  429.
#>  7 Birrarung Marr 2015-07-01  390.
#>  8 Birrarung Marr 2015-08-01  398.
#>  9 Birrarung Marr 2015-09-01  392.
#> 10 Birrarung Marr 2015-10-01  539.
#> # … with 85 more rows

pedestrian %>%
  mutate(ym = yearmonth(Date_Time)) %>%
  tidyr::nest(data = c(-Sensor, -ym)) %>%
  group_by(Sensor) %>%
  mutate(Monthly_MA = slide_dbl(data, ~ mean(.$Count, na.rm = TRUE), .size = 3, .align = "center", .bind = TRUE
  ))
#> # A tibble: 95 x 4
#> # Groups:   Sensor [4]
#>    Sensor               ym data                Monthly_MA
#>    <chr>             <mth> <list>                   <dbl>
#>  1 Birrarung Marr 2015 Jan <tsibble [744 × 4]>        NA 
#>  2 Birrarung Marr 2015 Feb <tsibble [672 × 4]>       634.
#>  3 Birrarung Marr 2015 Mar <tsibble [744 × 4]>       546.
#>  4 Birrarung Marr 2015 Apr <tsibble [720 × 4]>       554.
#>  5 Birrarung Marr 2015 May <tsibble [144 × 4]>       397.
#>  6 Birrarung Marr 2015 Jun <tsibble [720 × 4]>       429.
#>  7 Birrarung Marr 2015 Jul <tsibble [744 × 4]>       390.
#>  8 Birrarung Marr 2015 Aug <tsibble [744 × 4]>       398.
#>  9 Birrarung Marr 2015 Sep <tsibble [720 × 4]>       392.
#> 10 Birrarung Marr 2015 Oct <tsibble [119 × 4]>       539.
#> # … with 85 more rows
make_virtual_group_info <- function(index, rows, before, after, step, sym) {
  index <- vec_slice(index, rows)
  index_split <- vec_split_id(index)
  endpoints <- locate_endpoints(before, after, step, index_split$key) # the only special thing
  new_rows <- purrr::map2(endpoints$starts, endpoints$stops, ~vec_c(!!!vec_slice(index_split$id, seq2(.x, .y))))
  tibble(!!sym := index_split$key, !!expr(.rows) := new_rows)
}

slidearize <- function(.data, ...) {
  UseMethod("slidearize")
}

slidearize.tbl_df <- function(.data, ..., .index, .before = 0L, .after = 0L, .step = 1L) {
  .index_chr <- tidyselect::vars_select(names(.data), !!enquo(.index))
  .index_sym <- sym(.index_chr)
  .index <- dplyr::pull(.data, .index_chr)

  group_info <- make_virtual_group_info(
    .index,
    vec_seq_along(.index),
    before = .before,
    after = .after,
    step = .step,
    sym = .index_sym
  )

  .data <- new_grouped_df(.data, group_info)
  summarise(.data, ...)
}

slidearize.grouped_df <- function(.data, ..., .index, .before = 0L, .after = 0L, .step = 1L) {
  .index_chr <- tidyselect::vars_select(names(.data), !!enquo(.index))
  .index_sym <- sym(.index_chr)
  .index <- dplyr::pull(.data, .index_chr)
  group_info <- dplyr::group_data(.data)
  group_rows <- group_info[[".rows"]]

  group_info[[".rows"]] <- lapply(
    group_rows,
    make_virtual_group_info,
    index = .index,
    before = .before,
    after = .after,
    step = .step,
    sym = .index_sym
  )

  group_info <- tidyr::unnest(group_info, cols = .rows)

  attr(.data, "groups") <- group_info

  summarise(.data, ...)
}

@mitchelloharawild
Copy link
Member Author

If we were to have a family of roll_by() functions I'd expect them to work with mutate(), filter(), etc. To achieve this a subclass would be required unfortunately (upon confirming this, I put the idea on hold for a bit).

While slidarise() is easy to make, it feels limited to me. Almost like a shortcut for data %>% slide_by() %>% summarise().

For instance, a common use of rolling windows is to compare the raw data with a rolling mean.

library(tsibble)
library(dplyr)
library(ggplot2)
tsibbledata::gafa_stock %>% 
  group_by(Symbol) %>% 
  mutate(Close_MA = slide_dbl(Close, mean, .size = 7, .align = "c"))

#> # A tsibble: 5,032 x 9 [!]
#> # Key:       Symbol [4]
#> # Groups:    Symbol [4]
#>    Symbol Date        Open  High   Low Close Adj_Close    Volume Close_MA
#>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>    <dbl>
#>  1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200     NA  
#>  2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900     NA  
#>  3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700     NA  
#>  4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300     77.4
#>  5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400     77.0
#>  6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200     77.1
#>  7 AAPL   2014-01-10  77.1  77.3  75.9  76.1      64.5  76244000     77.4
#>  8 AAPL   2014-01-13  75.7  77.5  75.7  76.5      64.9  94623200     77.7
#>  9 AAPL   2014-01-14  76.9  78.1  76.8  78.1      66.1  83140400     77.6
#> 10 AAPL   2014-01-15  79.1  80.0  78.8  79.6      67.5  97909700     77.9
#> # … with 5,022 more rows

The theoretical equivalent with roll_by would be

library(tsibble)
library(dplyr)
library(ggplot2)
tsibbledata::gafa_stock %>% 
  group_by(Symbol) %>% 
  slide_by(.size = 7, .align = "c") %>% 
  mutate(Close_MA = mean(Close))

Any differences in dimension are padded according to .fill = NA. Of course things like .align are up for improvement (as done in slurrr). If using summarise() then the dimension would be reduced to the number of rolling windows.

earowang added a commit that referenced this issue Sep 11, 2019
earowang added a commit that referenced this issue Sep 11, 2019
@earowang earowang removed this from the v0.9.0 milestone Nov 30, 2019
@mitchelloharawild mitchelloharawild removed their assignment Mar 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants