[Bug]: modelReport using RF model fails due to mismatch in levels of categorical variables #36

Open
1 task done
lidefi87 opened this issue Mar 14, 2024 · 1 comment

lidefi87 commented Mar 14, 2024

Describe the bug

Hi there! I am using SDMtune version 1.3.1. When I call the modelReport function with a Random Forest model, I get the same error reported in closed issues #11 and #8:

Error in `predict.randomForest()`:
! Type of predictors in new data do not match that of the training data.

I understand that this was addressed in issue #8 by adding a factors parameter to the modelReport function, through which the levels of the categorical variables included in the model could be provided. However, this parameter is not available in version 1.3.1: I checked both the documentation and the source code for this function, and it is definitely not there.
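For anyone who wants to confirm this on their own installation, here is a minimal sketch of checking whether a function exposes a given argument. The helper name has_arg is hypothetical (it is not part of SDMtune), and the demonstration uses a base R function so that it runs without SDMtune installed.

```r
# Hypothetical helper (not part of SDMtune): TRUE if `fun` has a formal
# argument called `arg`.
has_arg <- function(fun, arg) {
  arg %in% names(formals(fun))
}

# Demonstration on a base R generic, whose formals are `object` and `...`.
has_object  <- has_arg(stats::predict, "object")
has_factors <- has_arg(stats::predict, "factors")

# With SDMtune installed, the same check can be pointed at modelReport:
# has_arg(SDMtune::modelReport, "factors")
```

The result only reflects the version of the package that is currently installed.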

It would be great if I could get some ideas on how to address this issue.

Steps to reproduce the bug

library(SDMtune)

files <- list.files(path = file.path(system.file(package = "dismo"), "ex"),
                    pattern = "grd",
                    full.names = TRUE)

predictors <- terra::rast(files)

# Prepare presence and background locations
p_coords <- virtualSp$presence
bg_coords <- virtualSp$background

# Create SWD object
data <- prepareSWD(species = "Virtual species",
                   p = p_coords,
                   a = bg_coords,
                   env = predictors,
                   categorical = "biome")

# Split presence locations in training (80%) and testing (20%) datasets
datasets <- trainValTest(data,
                         test = 0.2,
                         only_presence = TRUE)
train <- datasets[[1]]
test <- datasets[[2]]

# Train a model
model <- train(method = "RF",
               data = train)

# Produce report
modelReport(model, folder = "test", test = test,
            response_curves = TRUE, only_presence = TRUE, jk = TRUE,
            permut = 2)

Session information

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SDMtune_1.3.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0     terra_1.7-29         xfun_0.39            bslib_0.4.2         
 [5] lattice_0.20-45      colorspace_2.1-0     vctrs_0.6.5          generics_0.1.3      
 [9] htmltools_0.5.5      viridisLite_0.4.2    yaml_2.3.7           utf8_1.2.4          
[13] rlang_1.1.2          pillar_1.9.0         jquerylib_0.1.4      withr_2.5.2         
[17] glue_1.6.2           sp_1.6-0             plyr_1.8.8           lifecycle_1.0.4     
[21] stringr_1.5.0        munsell_0.5.0        gtable_0.3.4         ragg_1.2.4          
[25] rvest_1.0.3          raster_3.6-20        codetools_0.2-18     kableExtra_1.3.4    
[29] evaluate_0.21        labeling_0.4.3       knitr_1.43           fastmap_1.1.1       
[33] fansi_1.0.6          highr_0.10           Rcpp_1.0.11          scales_1.3.0        
[37] cachem_1.0.8         webshot_0.5.4        jsonlite_1.8.4       farver_2.1.1        
[41] systemfonts_1.0.4    textshaping_0.3.6    ggplot2_3.4.4        digest_0.6.31       
[45] stringi_1.7.12       dplyr_1.1.2          dismo_1.3-9          grid_4.2.2          
[49] cli_3.6.2            tools_4.2.2          magrittr_2.0.3       sass_0.4.6          
[53] tibble_3.2.1         randomForest_4.7-1.1 pkgconfig_2.0.3      xml2_1.3.3          
[57] rmarkdown_2.21       svglite_2.1.0        httr_1.4.6           rstudioapi_0.15.0   
[61] plotROC_2.3.0        R6_2.5.1             compiler_4.2.2

Additional information

No response

Reproducible example

  • I have done my best to provide the steps to reproduce the bug
@lidefi87 lidefi87 added the bug Something isn't working label Mar 14, 2024
@lidefi87 lidefi87 changed the title [Bug]: modelReport using RF method fails due to mismatch in levels of categorical variables [Bug]: modelReport using RF model fails due to mismatch in levels of categorical variables Mar 14, 2024

lidefi87 commented Mar 21, 2024

I found a solution to this issue. The cause seems to be a mismatch between the levels of the categorical variables stored in the model object and the levels in the new data frame created to make predictions.
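The mismatch can be illustrated in base R without SDMtune or randomForest. This is only a toy sketch: the data frames train_df and new_df and their values are made up for illustration, and the real objects inside SDMtune may be built differently.

```r
# Toy stand-in for the training data: a categorical predictor with three levels.
train_df <- data.frame(
  bio1  = c(10.5, 12.1, 9.8, 11.3),
  biome = factor(c("1", "2", "7", "2"))
)

# Building the prediction data from a single category keeps only that level.
new_df <- data.frame(
  bio1  = mean(train_df$bio1),
  biome = factor("2")
)

train_levels <- levels(train_df$biome)  # three levels: "1", "2", "7"
new_levels   <- levels(new_df$biome)    # one level: "2"

# A simple check that flags the mismatch before any call to predict().
same_levels <- identical(train_levels, new_levels)
```

Here same_levels is FALSE, which is consistent with the error message above: the new data describes the categorical predictor with fewer levels than the training data did.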

In line 234 of plotResponse.R, the new data frame to be used for predictions is created. This is what the code looks like now:

if (var %in% cont_vars) {
  var_min <- min(model@data@data[var])
  var_max <- max(model@data@data[var])
  data[var] <- seq(var_min, var_max, length.out = n_rows)
} else {
  data[var] <- factor(categ)
}

I added a short loop before the else statement, and now it looks like this:

if (var %in% cont_vars) {
  var_min <- min(model@data@data[var])
  var_max <- max(model@data@data[var])
  data[var] <- seq(var_min, var_max, length.out = n_rows)
  for (c in cat_vars) {
    levels(data[, c]) <- levels(df[, c])
  }
} else {
  data[var] <- factor(categ)
}

This way we keep the same number of levels as the original data used to train the model.
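One detail may be worth double-checking when matching levels this way. In base R, assigning to levels() renames the existing levels by position, whereas rebuilding the factor with an explicit levels argument preserves the values. The sketch below uses made-up level names and is independent of SDMtune. Whether the difference matters in plotResponse.R depends on how the categorical columns are built before this point, which is not shown in the snippet.

```r
# Made-up training levels for illustration.
train_levels <- c("1", "2", "7")

# A one-level factor, similar to what factor(categ) produces.
x <- factor("2")

# Option A: assign the training levels in place.
# The single existing level is renamed to the first training level.
a <- x
levels(a) <- train_levels
value_a <- as.character(a)  # "1": the level set matches, but the value changed

# Option B: rebuild the factor with explicit levels.
b <- factor(as.character(x), levels = train_levels)
value_b <- as.character(b)  # "2": the level set matches and the value is kept
```

Both options give a factor with the full set of training levels, so both would satisfy a check on the number of levels. Only option B is guaranteed to keep the category that was originally selected.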

Now I can produce plots without issues.

PR #37 implements this proposed change.
