Variable roles in tidymodels recipe and workflow... are they respected by rSAFE? #10

Open
jacekkotowski opened this issue Jun 22, 2022 · 3 comments

Comments

jacekkotowski commented Jun 22, 2022

Example (I am playing with the bicycle demand data from Kaggle):

bike_recipe <- recipe(count ~ ., data = bike_training) %>%
  step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
  update_role("datetime", new_role = "id_variable") %>%
  step_rm("atemp")

This creates time features from the datetime index; datetime itself then gets the role "id_variable", so it does not take part in modelling.
I also removed the "atemp" variable altogether (temp and atemp were strongly correlated), so it does not take part in modelling either.
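
One way to check which columns the recipe actually treats as predictors is to prep it and look at the role table. The following is a minimal sketch on hypothetical toy data (`toy`, `toy_recipe`, `role_table` and `predictors` are illustrative names, not part of the attached workflow); it assumes only the recipes package is installed.

```r
library(recipes)

# Hypothetical toy data standing in for bike_training (not the Kaggle data).
toy <- data.frame(
  datetime = as.POSIXct("2011-01-01 00:00:00", tz = "UTC") + 86400 * (0:9),
  temp     = seq(5, 23, length.out = 10),
  atemp    = seq(6, 24, length.out = 10),
  count    = c(16, 40, 32, 13, 1, 1, 2, 3, 8, 14)
)

# Same steps as in the recipe above, applied to the toy data.
toy_recipe <- recipe(count ~ ., data = toy) %>%
  step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
  update_role(datetime, new_role = "id_variable") %>%
  step_rm(atemp)

# summary() on a prepped recipe lists each remaining variable with its role.
role_table <- summary(prep(toy_recipe))
predictors <- role_table$variable[role_table$role == "predictor"]
print(predictors)
```

In this sketch neither `datetime` nor `atemp` should appear among the predictors, while the derived `datetime_*` features should.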

Next I run the explainer:

explainer <- explain_tidymodels(bike_final_fit, data = bike_all %>% select(-count), y = bike_all$count)
safe_extractor <- safe_extraction(explainer)

The SAFE extractor seems to ignore that datetime and atemp are absent from the modelling process and proposes:

 Variable 'datetime' - selected intervals:
	(-Inf, 2011-02-16 23:00:00]
 	(2011-02-16 23:00:00, 2011-06-17 23:00:00]
 	(2011-06-17 23:00:00, 2012-04-15 23:00:00]
 	(2012-04-15 23:00:00, 2012-07-08 23:00:00]
 	(2012-07-08 23:00:00, Inf)
Variable 'season' - selected intervals:
	(-Inf, 3]
 	(3, Inf)
Variable 'holiday' - no transformation suggested.
Variable 'workingday' - no transformation suggested.
Variable 'weather' - selected intervals:
	(-Inf, 1]
 	(1, Inf)
Variable 'temp' - selected intervals:
	(-Inf, 12.3]
 	(12.3, 22.96]
 	(22.96, Inf)
Variable 'atemp' - selected intervals:
	(-Inf, 24.24]
 	(24.24, Inf)
Variable 'humidity' - selected intervals:
	(-Inf, 30]
 	(30, 48]
 	(48, 67]
 	(67, 84]
 	(84, Inf)
Variable 'windspeed' - selected intervals:
	(-Inf, 7.0015]
 	(7.0015, Inf)

How can I tell rSAFE that these two variables (one is the time index, the other is removed when the recipe is baked) do not take part in modelling?
I am attaching my quick and dirty workflow:

timeseries_modelling_xgboost_short.zip

jacekkotowski changed the title from "recipe - variable roles. Are they respected by rSAFE" to "Variable roles in tidymodels recipe and workflow... are they respected by rSAFE?" on Jun 22, 2022
Member

agosiewska commented Jun 22, 2022

I believe it is a matter of how DALEX treats the datasets in the explainer. Could you please prepare a reproducible example and share your session info?

Author

jacekkotowski commented Jun 23, 2022

I attached a rendered HTML file and the Rmd file with my analysis; the session info is at the bottom.
timeseries_modelling_xgboost_short _2922_06_23a.zip

Is it OK to simply ignore, in the output, the variables that did not take part in modelling, and to transform the data using the remaining variables as they are?
Or do these excluded variables affect the break points of the other variables?

My session info:

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=C                     
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] shiny_1.7.1

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3   ellipsis_0.3.2     class_7.3-20       timetk_2.8.0      
  [5] base64enc_0.1-3    fs_1.5.2           rstudioapi_0.13    listenv_0.8.0     
  [9] furrr_0.3.0        farver_2.1.0       dials_0.1.1        DT_0.23           
 [13] prodlim_2019.11.13 fansi_1.0.3        lubridate_1.8.0    codetools_0.2-18  
 [17] splines_4.1.3      R.methodsS3_1.8.1  doParallel_1.0.17  cachem_1.0.6      
 [21] knitr_1.39         polyclip_1.10-0    jsonlite_1.8.0     workflows_0.2.6   
 [25] pROC_1.18.0        R.oo_1.24.0        yardstick_0.0.9    ggforce_0.3.3     
 [29] tune_0.2.0         clipr_0.8.0        compiler_4.1.3     assertthat_0.2.1  
 [33] Matrix_1.4-1       fastmap_1.1.0      cli_3.3.0          later_1.3.0       
 [37] tweenr_1.0.2       htmltools_0.5.2    tools_4.1.3        gtable_0.3.0      
 [41] glue_1.6.2         dplyr_1.0.9        Rcpp_1.0.8.3       jquerylib_0.1.4   
 [45] styler_1.7.0       DiceDesign_1.9     vctrs_0.4.1        iterators_1.0.14  
 [49] parsnip_0.2.1      timeDate_3043.102  gower_1.0.0        xfun_0.31         
 [53] globals_0.15.0     mime_0.12          miniUI_0.1.1.1     lifecycle_1.0.1   
 [57] pacman_0.5.1       future_1.26.1      MASS_7.3-57        zoo_1.8-10        
 [61] scales_1.2.0       ipred_0.9-12       promises_1.2.0.1   parallel_4.1.3    
 [65] yaml_2.3.5         ggplot2_3.3.6      sass_0.4.1         rpart_4.1.16      
 [69] corrplot_0.92      foreach_1.5.2      lhs_1.1.5          hardhat_0.2.0     
 [73] lava_1.6.10        repr_1.1.4         rlang_1.0.2        pkgconfig_2.0.3   
 [77] rsample_0.1.1      evaluate_0.15      lattice_0.20-45    purrr_0.3.4       
 [81] recipes_0.2.0      htmlwidgets_1.5.4  tidyselect_1.1.2   parallelly_1.31.1 
 [85] plyr_1.8.7         magrittr_2.0.3     R6_2.5.1           generics_0.1.2    
 [89] DBI_1.1.2          pillar_1.7.0       withr_2.5.0        xts_0.12.1        
 [93] survival_3.3-1     DALEX_2.4.2        nnet_7.3-17        tibble_3.1.7      
 [97] future.apply_1.9.0 crayon_1.5.1       xgboost_1.6.0.1    utf8_1.2.2        
[101] rmarkdown_2.14     grid_4.1.3         data.table_1.14.2  reprex_2.0.1      
[105] digest_0.6.29      xtable_1.8-4       R.cache_0.15.0     tidyr_1.2.0       
[109] httpuv_1.6.5       R.utils_2.11.0     GPfit_1.0-8        munsell_0.5.0     
[113] finetune_0.2.0     skimr_2.1.4        bslib_0.3.1  

Member

Thank you. By reproducible example I meant a toy example that is simple and fast to run. This .Rmd takes a long time to compute, and when I decreased the number of trees in xgboost to speed the script up, I got an error:

> bike_rf_rs <-
+   bike_rf_wkfl %>%
+     finetune::tune_sim_anneal(
+     resamples = bike_folds,
+    param_info = xgboost_set,
+       metrics = bike_metrics,
+          iter = 30,
+       initial = 10)

>  Generating a set of 10 initial parameter results
√ Initialization complete

Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "NULL"

Anyway, if you pass the data frame with all columns (bike_all) to the DALEX explainer, SAFE will compute transformations for all of them.
However, as long as you don't use interactions in SAFE (I saw in the script that you don't), you can ignore the transformations for columns not used by the model, because they are calculated for each variable independently.

Variable filtering should perhaps be a feature in a future version of SAFE. For now, I would suggest filtering out variables before feeding data into the explainer.
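
A minimal base R sketch of both options, assuming the column names from the SAFE output earlier in this thread. `explainer_vars`, `not_modelled`, `vars_to_read` and `drop_unmodelled()` are hypothetical names introduced only for illustration; they are not part of rSAFE or DALEX.

```r
# Columns of the data given to the explainer, as listed in the SAFE output.
explainer_vars <- c("datetime", "season", "holiday", "workingday", "weather",
                    "temp", "atemp", "humidity", "windspeed")

# Columns the recipe keeps out of the model: the id variable and the removed one.
not_modelled <- c("datetime", "atemp")

# Option 1: keep the explainer as it is and read only these SAFE proposals.
vars_to_read <- setdiff(explainer_vars, not_modelled)
print(vars_to_read)

# Option 2 (hypothetical helper): drop the unmodelled columns from a data
# frame before it is passed to an explainer.
drop_unmodelled <- function(data, excluded) {
  data[, setdiff(names(data), excluded), drop = FALSE]
}

# Tiny illustration on made-up values.
toy <- data.frame(temp = c(9.8, 14.1), atemp = c(12.9, 17.4), humidity = c(81, 60))
print(names(drop_unmodelled(toy, "atemp")))
```

One caveat with option 2: the fitted workflow still applies `step_date()` to `datetime` at prediction time, so removing that column from the explainer data would likely make predictions fail. Ignoring its proposals (option 1) looks like the safer route for that variable.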
