Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify documentation for and sanity-check nafill_buffer usage #320

Open
brookslogan opened this issue Apr 18, 2024 · 1 comment
Open

Clarify documentation for and sanity-check nafill_buffer usage #320

brookslogan opened this issue Apr 18, 2024 · 1 comment

Comments

@brookslogan
Copy link
Contributor

brookslogan commented Apr 18, 2024

The args list help entries for the canned forecasters, including ones that don't use lags, is:

nafill_buffer: At predict time, recent values of the training data are
          used to create a forecast. However, these can be 'NA' due to,
          e.g., data latency issues. By default, any missing values
          will get filled with less recent data. Setting this value to
          'NULL' will result in 1 extra recent row (beyond those
          required for lag creation) to be used. Note that we require
          at least 'min(lags)' rows of recent data per 'geo_value' to
          create a prediction. For this reason, setting 'nafill_buffer
          < min(lags)' will be treated as _additional_ allowed recent
          data rather than the total amount of recent data to examine.

This confused me, and I think there may be some implementation bug(s).

Requests:

  • State the default value, Inf, and what it does.
  • State what "regular" values (non-NULL, not < min(lags)) do.
  • "Setting this value to 'NULL' will result in 1 extra recent row (beyond those required for lag creation) to be used." ---
    • An extra row from where, for what? --- I think I understand now after a while of thinking about the name and looking at the implementation, but things could be clearer.
    • Why would this be expected to work so much that it's a special value?
    • It looks like "those required for lag creation" corresponds to max_lags + max_horizon, but factors into keep in x <- dplyr::filter(x, forecast_date - time_value <= keep), which
      • if forecast_date is the default max time_value, seems like more than is required for lag creation.
      • if forecast_date has been passed as the actual forecast date, makes it seem that max_horizon should be something around the reporting latency instead, although maybe this is off by one since keep is at minimum min_required + 1 not min_required.
      • (Also, this type of filter calls into question the naming n_recent.)
    • Actually, trying to use this, I get
out <- arx_forecaster(
  case_death_rate_subset, "death_rate",
  "death_rate",
  args_list = arx_args_list(
    ahead = 16,
    lags = 7,
    n_training = 30,
    nafill_buffer = NULL
  )
)
#> Error in if (is.finite(nafill_buffer)) arg_is_pos_int(nafill_buffer, allow_null = TRUE) : 
#>   argument is of length zero
  • ("For this reason, setting 'nafill_buffer < min(lags)' will be treated as additional allowed recent data rather than the total amount of recent data to examine" --- still sounds like a footgun to have < min(lags) and >= min(lags) do different things.)
  • Note/remind that this NA filling is not used in the training process?
  • For forecasters that don't use lags, provide an explanation that doesn't mention lags?
@brookslogan
Copy link
Contributor Author

Maybe we don't have to address this, if the related code and documentation is removed as part of #293.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant