
Add argument boot = FALSE? #11

Open
rempsyc opened this issue Dec 11, 2022 · 17 comments · Fixed by #15

Comments

@rempsyc

rempsyc commented Dec 11, 2022

I know that this package is all about bootstrapping. I use your package inside my own package, rempsyc, when calculating effect sizes for contrasts. By default, it thus uses bootstrapping, but I would like to give my users the ability to not bootstrap, should they wish to do so.

Now, I could rely on a completely different package to accomplish this, but it would be simpler on my end to stick to the same package and only change a single parameter, especially since, to my knowledge, few effect size packages explicitly support contrasts the way you do.

Do you think there is any chance you might make this change? Alternatively, is there already a way to request no bootstraps through the ... argument passed to boot?

@dgerlanc
Owner

Thanks for opening an issue!

You can request no bootstrapping by passing R=0; R is the number of bootstrap replicates (this matches the original boot package).

The non-resampled statistic is then available through the t0 element of the result.
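
For example, a minimal sketch of pulling t0 out of a regular call:

library(bootES)

res <- bootES(1:10, R = 1000)
res$t0  # the plug-in estimate computed on the original, non-resampled data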

@rempsyc
Author

rempsyc commented Dec 12, 2022

Oh fantastic! I didn't know about this! Am I doing it right though? I'm getting the error below:

library(bootES)
#> Loading required package: boot
bootES(1:10, R=0)
#> Error in bootES(1:10, R = 0): R must be an integer of length 1 and greater than 0

Created on 2022-12-11 with reprex v2.0.2

@rempsyc
Author

rempsyc commented Dec 12, 2022

I think it comes from lines 20-22 of the bootES function:

r.is.valid = (length(R) == 1) && is.numeric(R) && R > 0
if (!r.is.valid)
  stop("R must be an integer of length 1 and greater than 0")

@dgerlanc
Owner

I am getting the same error; we added that check to prevent users from passing an R less than 0.

I tested boot with R=0 and it appears to work:

library(boot)

boot(1:100, function(x, i) mean(x[i]), R=0)
#> 
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#> 
#> 
#> Call:
#> boot(data = 1:100, statistic = function(x, i) mean(x[i]), R = 0)
#> 
#> 
#> Bootstrap Statistics :

Created on 2022-12-12 with reprex v2.0.2.9000

I believe I should be able to update this check to allow R >= 0. Will that work for you?
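
Something along these lines, perhaps (a sketch of the relaxed check, not the final code):

r.is.valid = (length(R) == 1) && is.numeric(R) && R >= 0
if (!r.is.valid)
  stop("R must be an integer of length 1 and greater than or equal to 0")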

@rempsyc
Author

rempsyc commented Dec 13, 2022

Yes, that would work well, thank you so much!

By the way, is there any argument to handle missing values, like na.rm = TRUE? That's another thing I am currently struggling with in rempsyc: I currently use na.omit on the whole data set, but for large data sets with many variables that excludes almost all of the rows.

library(bootES)
#> Loading required package: boot
bootES(c(1:10, NA), R=1000)
#> Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, : estimated adjustment 'w' is infinite

Created on 2022-12-12 with reprex v2.0.2

@dgerlanc
Owner

Currently, there is no argument to handle missing data.

I'm not sure this can be handled automatically in bootES unless we define bootstrapping methods that incorporate specific NA handling, with the associated assumptions and statistical adjustments. Otherwise, I imagine that silently excluding NAs in the statistic-calculation functions would introduce bias.

I think the safest approach (and what we do now), at least until we think of a principled way to do this in bootES, is to fail when there is missing data, which forces the user to deal with it upstream.
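
For a single vector, that upstream handling can be as simple as this minimal sketch:

x <- c(1:10, NA)
bootES(x[!is.na(x)], R = 1000)  # drop NAs explicitly before bootstrapping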

@rempsyc
Author

rempsyc commented Dec 13, 2022

I see. The new approach I am using to handle this in my nice_contrasts convenience function is:

es.lists <- lapply(groups.contrasts, function(y) {
    lapply(response.list, function(x) {
      bootES::bootES(
        data = stats::na.omit(data[, c(group, x)]),
        R = bootstraps,
        data.col = x,
        group.col = group,
        contrast = y,
        effect.type = effect.type,
        ...
      )
    })
  })

In this way, instead of feeding the whole data set (where na.omit would drop almost every row when many variables have missing data), it feeds a new data set with only the two relevant columns: the group (for contrasts) and the dependent variable of the current loop, thus maximizing the retained data for each DV.

@dgerlanc
Owner

dgerlanc commented Dec 13, 2022

> In this way, instead of feeding the whole data set (where na.omit would drop almost every row when many variables have missing data), it feeds a new data set with only the two relevant columns: the group (for contrasts) and the dependent variable of the current loop, thus maximizing the retained data for each DV.

Seems like the correct approach!

======

A thought: take the subset before calling the inner lapply to avoid running na.omit multiple times:

es.lists <- lapply(groups.contrasts, function(y) {
    data_column_subset = stats::na.omit(data[, c(group, x)]) 
    lapply(response.list, function(x) {
      bootES::bootES(
        data = data_column_subset,
        R = bootstraps,
        data.col = x,
        group.col = group,
        contrast = y,
        effect.type = effect.type,
        ...
      )
    })
  })

@rempsyc
Author

rempsyc commented Dec 13, 2022

Thanks for the suggestion! I believe that would not work, however, because x is the loop variable of the inner lapply over response.list, which is a list of dependent variables (so I do need to run na.omit as many times as there are DVs). That allows me to do something like this:

library(rempsyc)
x <- nice_contrasts(data = mtcars, response = c("mpg", "disp"), group = "cyl")
x
#>   Dependent Variable Comparison df          t            p         d   CI_lower
#> 1                mpg      4 - 8 29   8.904534 8.568209e-10  3.587739  2.7021636
#> 2                mpg      6 - 8 29   3.111825 4.152209e-03  1.440495  0.8006059
#> 3                mpg      4 - 6 29   4.441099 1.194696e-04  2.147244  1.3594089
#> 4               disp      4 - 8 29 -11.920787 1.064054e-12 -4.803022 -5.7803822
#> 5               disp      6 - 8 29  -7.104461 8.117219e-08 -3.288726 -4.3449495
#> 6               disp      4 - 6 29  -3.131986 3.945539e-03 -1.514296 -2.2711031
#>     CI_upper
#> 1  4.5238179
#> 2  1.9758494
#> 3  3.0538567
#> 4 -3.8404060
#> 5 -2.3207227
#> 6 -0.8926851

nice_table(x)
[screenshot of the resulting formatted table]

Created on 2022-12-13 with reprex v2.0.2
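
That said, since the na.omit subset depends on the DV (x) but not on the contrast (y), it could still be hoisted out of the contrast loop. A hypothetical variant of the snippet above:

subsets <- lapply(response.list, function(x) {
  stats::na.omit(data[, c(group, x)])
})
es.lists <- lapply(groups.contrasts, function(y) {
  lapply(seq_along(response.list), function(i) {
    bootES::bootES(
      data = subsets[[i]],
      R = bootstraps,
      data.col = response.list[[i]],
      group.col = group,
      contrast = y,
      effect.type = effect.type,
      ...
    )
  })
})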

@dgerlanc
Owner

Good point, would likely not help out in that case. The output looks great!

@rempsyc
Author

rempsyc commented Mar 14, 2023

> I believe I should be able to update this check to allow R >= 0. Will that work for you?

I can submit a PR for this; would that be helpful? :)

@dgerlanc
Owner

I'll get the PR in shortly!

dgerlanc added a commit that referenced this issue Apr 9, 2023
dgerlanc added a commit that referenced this issue Apr 9, 2023
@rempsyc
Author

rempsyc commented Apr 11, 2023

library(bootES)
#> Loading required package: boot
packageVersion("bootES")
#> [1] '1.3.0'
bootES(1:10, R=0)
#> [1] "All values of t are equal to  NaN \n Cannot calculate confidence intervals"
#> Error in matrix(c(t0, x[["bounds"]], bias, std.error), nrow = 1, dimnames = list(NULL, : length of 'dimnames' [2] not equal to array extent

Created on 2023-04-11 with reprex v2.0.2

We should probably add a test for that.

@dgerlanc
Owner

We should add a test for this!

It is a bit tricky because you will get the same error if you use R=1.

Also, the error is thrown when the object is printed because that is when the confidence intervals are calculated:

"res = bootES(1:10, R=0)"
#> [1] "All values of t are equal to  NaN \n Cannot calculate confidence intervals"

Created on 2023-04-11 with reprex v2.0.2

I will investigate where exactly CIs are being calculated outside of the call to print.bootES; I know part of the calculation happens on printing. Generally, we want to throw a warning/error if CIs can't be calculated, as the goal of the package is to calculate CIs.

I'm OK with not calculating CIs in the R=0 case: a user who passes R=0 most likely does not want CIs, or they would not have passed R=0.
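
A hypothetical testthat sketch of that intent (the expectations are placeholders, not final):

library(testthat)
library(bootES)

test_that("R = 0 returns the plug-in statistic and prints cleanly", {
  res <- bootES(1:10, R = 0)
  expect_length(res$t0, 1)   # t0 should still be available
  expect_output(print(res))  # printing should not error even without CIs
})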

@rempsyc
Author

rempsyc commented Apr 11, 2023

My goal is to allow my users to calculate and report confidence intervals, even when they do not wish to use bootstrapping (re: the table screenshot example I presented a few posts above). So I think that would actually be important 😬

If the CI calculation relies on bootstrapped values, one approach could be to detect how many replicates were requested and, if < 2, use a classical approach to calculating the CI on Cohen's d.
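
For instance, the usual large-sample normal approximation for two independent groups (a rough sketch, not bootES code):

d_classic_ci <- function(d, n1, n2, conf = 0.95) {
  se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
  z <- qnorm(1 - (1 - conf) / 2)
  c(lower = d - z * se, upper = d + z * se)
}
d_classic_ci(0.5, n1 = 30, n2 = 30)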

dgerlanc reopened this Apr 14, 2023
@dgerlanc
Owner

So how would you like this to work?

We don't really want to encourage folks to use classical approaches to CIs. If the number of simulations is < 2, I'd argue you should just run more simulations!

@rempsyc
Author

rempsyc commented Apr 14, 2023

> So how would you like this to work?

The way I see it, quite simply: if R = 0, compute the classic CI; otherwise, compute the bootstrapped CI. You say the bootstrapped CI won't work for R = 1? You could add an intentional error there if you wanted.
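
In pseudocode (classic_ci is a hypothetical helper, not an existing bootES function):

if (R == 0) {
  ci <- classic_ci(t0, conf)  # hypothetical classical-CI helper
} else if (R == 1) {
  stop("at least 2 bootstrap replicates are needed for a bootstrap CI")
} else {
  ci <- boot::boot.ci(boot.out, conf = conf, type = ci.type)
}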

> We don't really want to encourage folks to use classical approaches to CIs. If the number of simulations is < 2, I'd argue you should just run more simulations!

Well, if you don't want to do it on principle, I respect that. On my end, I have to provide what users request. Let me tell you a story: the statistical mentor who first introduced me to bootstrapped CIs recently told me that he has stopped using them, because when he conducted his meta-analysis it was a nightmare to recover the original p values and so on from the bootstrapped confidence intervals, since they did not match. So yes, some people want non-bootstrapped CIs.

But as package maintainers, I think we should offer the choice, while noting in the documentation that this is not recommended and providing a reference.

Thoughts?
