Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle numeric NA values #187

Merged
merged 11 commits into from
Mar 22, 2023
Merged

Handle numeric NA values #187

merged 11 commits into from
Mar 22, 2023

Conversation

hneth
Copy link
Collaborator

@hneth hneth commented Mar 22, 2023

This PR improves the handling of NA values in numeric predictors (by adding the option of replacing NA by means) and revises corresponding tests.

@hneth hneth merged commit 8d69cb1 into ndphillips:master Mar 22, 2023
@pa-nathaniel
Copy link
Collaborator

Hi Hans! Just lurking here without fully reading the code and only the description. But is this PR really enabling missing value imputation logic for within FFTrees? If so is it enabled by default or not? For both training and test data or just one?

Sorry if I missed the description in NEWS or readme, but I didn't see it, thanks!

@hneth
Copy link
Collaborator Author

hneth commented Mar 23, 2023

Hi Nathaniel,

good points, of course. Some clarifications to answer your questions:

But is this PR really enabling missing value imputation logic for within FFTrees?

Yes, it’s replacing NA values in numeric predictors by their mean (per predictor), as commonly done in simulation studies. (Other replacement policies could easily be implemented in the same way.)

If so is it enabled by default or not?

Yes, but there are several global constants that allow enabling / disabling functionality for handling NA values in data. The relevant ones here are:

allow_NA_pred (to dis/allow NA values in predictors)

  • replace_NA_num_pred (to dis/en-able replacing NA values in numeric predictors)

Both are currently set to TRUE by default, but generate a bunch of warnings when actually encountering and replacing NA values.

If allow_NA_pred = TRUE, but replace_NA_num_pred = FALSE, NA values in numeric predictors are ignored in FFT-generation (but issue corresponding messages) and classified according to the current choice from fin_NA_options (when reaching a final node). For the competing models, cases with NA values are removed in this case (to avoid crashing).

For both training and test data or just one?

Both training and test data are handled in the same way — and detecting and replacing NA values issues corresponding warnings for both.

Overall, and especially in combination with the default handling of NA values in categorical predictors (as distinct categories), these changes should provide pretty comprehensive options for a variety of cases.

I’d be happy to discuss which of the options should become defaults and which should become user-controlled parameters (of the main FFTrees() function). Also, the new functionality still needs to be tested and documented properly before being ready for release. (If we wanted to be conservative, setting the global NA-related options to FALSE removes all of this immediately and restores the former functionality.)

Looking forward to hear your thoughts,
Hans

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants