Clarify documentation for and sanity-check `nafill_buffer` usage #320

brookslogan · 2024-04-18T23:24:24Z

The args list help entries for the canned forecasters, including ones that don't use lags, is:

nafill_buffer: At predict time, recent values of the training data are
          used to create a forecast. However, these can be 'NA' due to,
          e.g., data latency issues. By default, any missing values
          will get filled with less recent data. Setting this value to
          'NULL' will result in 1 extra recent row (beyond those
          required for lag creation) to be used. Note that we require
          at least 'min(lags)' rows of recent data per 'geo_value' to
          create a prediction. For this reason, setting 'nafill_buffer
          < min(lags)' will be treated as _additional_ allowed recent
          data rather than the total amount of recent data to examine.

This confused me, and I think there may be some implementation bug(s).

Requests:

State the default value, Inf, and what it does.
State what "regular" values (non-NULL, not < min(lags)) do.
"Setting this value to 'NULL' will result in 1 extra recent row (beyond those required for lag creation) to be used." ---
- An extra row from where, for what? --- I think I understand now after a while of thinking about the name and looking at the implementation, but things could be clearer.
- Why would this be expected to work so much that it's a special value?
- It looks like "those required for lag creation" corresponds to max_lags + max_horizon, but factors into keep in x <- dplyr::filter(x, forecast_date - time_value <= keep), which
  - if forecast_date is the default max time_value, seems like more than is required for lag creation.
  - if forecast_date has been passed as the actual forecast date, makes it seem that max_horizon should be something around the reporting latency instead, although maybe this is off by one since keep is at minimum min_required + 1 not min_required.
  - (Also, this type of filter calls into question the naming n_recent.)
- Actually, trying to use this, I get

out <- arx_forecaster(
  case_death_rate_subset, "death_rate",
  "death_rate",
  args_list = arx_args_list(
    ahead = 16,
    lags = 7,
    n_training = 30,
    nafill_buffer = NULL
  )
)
#> Error in if (is.finite(nafill_buffer)) arg_is_pos_int(nafill_buffer, allow_null = TRUE) : 
#>   argument is of length zero

("For this reason, setting 'nafill_buffer < min(lags)' will be treated as additional allowed recent data rather than the total amount of recent data to examine" --- still sounds like a footgun to have < min(lags) and >= min(lags) do different things.)
Note/remind that this NA filling is not used in the training process?
For forecasters that don't use lags, provide an explanation that doesn't mention lags?

The text was updated successfully, but these errors were encountered:

brookslogan · 2024-04-18T23:29:54Z

Maybe we don't have to address this, if the related code and documentation is removed as part of #293.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify documentation for and sanity-check `nafill_buffer` usage #320

Clarify documentation for and sanity-check `nafill_buffer` usage #320

brookslogan commented Apr 18, 2024 •

edited

brookslogan commented Apr 18, 2024

Clarify documentation for and sanity-check nafill_buffer usage #320

Clarify documentation for and sanity-check nafill_buffer usage #320

Comments

brookslogan commented Apr 18, 2024 • edited

brookslogan commented Apr 18, 2024

Clarify documentation for and sanity-check `nafill_buffer` usage #320

Clarify documentation for and sanity-check `nafill_buffer` usage #320

brookslogan commented Apr 18, 2024 •

edited