Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootdht could stand a performance boost #44

Closed
erex opened this issue Jan 31, 2020 · 36 comments
Closed

Bootdht could stand a performance boost #44

erex opened this issue Jan 31, 2020 · 36 comments
Assignees
Labels
enhancement mrds problems likely with mrds rather than Distance
Milestone

Comments

@erex
Copy link
Member

erex commented Jan 31, 2020

I know Distance now lives with the CRAN gods. But next release could certainly use some help from parallel processing.

Running 99 replicates of the pretty average savannah sparrow data set (150 detections) takes something like 10-15 minutes. Doing a real bootstrap on this data set would take 2 hours.

The multi-model (hn, hr, Fourier) savannah sparrow bootstrap takes effectively the same amount of time.

@dill dill added the mrds problems likely with mrds rather than Distance label May 13, 2020
@dill dill added this to the 1.0.2 release milestone Jul 13, 2020
@dill
Copy link
Contributor

dill commented Jul 31, 2020

Per @lenthomas suggestion in #75 the bootstrap could be simply parallelised to give a speed-up.

@dill dill modified the milestones: 1.0.3 release, 1.0.4 release Sep 7, 2020
@dill dill self-assigned this Jan 12, 2021
@dill dill closed this as completed in 9450d68 Jun 8, 2021
@dill
Copy link
Contributor

dill commented Jun 8, 2021

This was a bit tricky to implement while still including the progress bar. It doesn't seem possible to do this with the R "recommended" package parallel at the moment (meaning no additional dependencies). The best solution I could find was using foreach.

As of 9450d68 you can now specify cores= and if cores >1 it'll use the doMC backend to run your bootstrap over multiple cores. In future we could do something fancier here and allow user-specified backends (e.g., using snow) but I thought this would cover many use cases (e.g., people with laptops with multiple cores).

A quick test on my laptop:

library(Distance)
data("Savannah_sparrow_1981")
ss81 <- Savannah_sparrow_1981
cf <- convert_units("meter", NULL, "hectare")
it <- ds(ss81, key="hr", formula = ~Region.Label, convert.units = cf)
what <- dht2(it, flatfile = ss81, convert_units = cf,
             strat_formula=~Region.Label,
             stratification = "geographical")
print(what, report = "density")


# bootdht, no parallel
system.time(boo <- bootdht(it, flatfile = ss81))

# with parallel?
system.time(boo <- bootdht(it, flatfile = ss81, cores=3))

Timings, without parallelization:

   user  system elapsed 
400.198   3.572 404.495 

vs. with the parallelization:

   user  system elapsed 
429.654   4.747 150.935 

Please let me know of any successes/failures with this, including with installation/intial setup.

@dill
Copy link
Contributor

dill commented Jun 9, 2021

Re-opening due to issues listed at 9450d68#commitcomment-51931831

@dill dill reopened this Jun 9, 2021
@dill
Copy link
Contributor

dill commented Jun 9, 2021

@erex @lenthomas 2f8334d moves to the doParallel backend, hopefully this works on non-unix systems now.

@erex
Copy link
Member Author

erex commented Jun 9, 2021

No luck I'm afraid.

Getting out our old friends the minkes

library(Distance)
data("minke")
easy <- ds(data=minke, key="hr", truncation=1.5)
easyboot <- bootdht(model=easy, flatfile=minke, nboot=10, cores = 3)
Performing 10 bootstraps
  |                                                                                                |   0%
Error in { : task 1 failed - "object 'models' not found"

I can't make much sense of the interactive debugger so I'm not sure where it falls over, here are the last few lines from the console with debug(bootdht)

Browse[2]> n
debug: cl <- parallel::makeCluster(cores)
Browse[2]> n
debug: doParallel::registerDoParallel()
Browse[2]> n
debug: `%dopar2%` <- foreach::`%dopar%`
Browse[2]> n
debug: boot_ests <- foreach::foreach(i = 1:nboot, .combine = rbind.data.frame) %dopar2% 
    {
        r <- bootit(dat, our_resamples = our_resamples, summary_fun = summary_fun, 
            convert.units = convert.units, pb = list(increment = function(pb) {
                invisible()
            }))
        pb$increment(pb$pb)
        r
    }
Browse[2]> n
Error in { : task 1 failed - "object 'models' not found"

I'm leaving this alone for the rest of the day.

dill pushed a commit that referenced this issue Jun 9, 2021
@dill
Copy link
Contributor

dill commented Jun 9, 2021

I've tried to fix this, which required a bit more faff but see how that goes.

@erex
Copy link
Member Author

erex commented Jun 10, 2021

Success with 10 bootstraps of the minke data set 🎉. A more serious test comes with bootstrapping the Howe camera trap analysis from the case study web page. That analysis produces no results, but does not report an error.

I've stripped out the debris from the Rmarkdown file leaving only the necessaries below

## ---- readin, message=FALSE----------------------------------------------------------------
library(Distance)
data("DuikerCameraTraps")
## ----fit-----------------------------------------------------------------------------------
conversion <- convert_units("meter", NULL, "square kilometer")
trunc.list <- list(left=2, right=15)
mybreaks <- c(seq(2,8,1), 10, 12, 15)
hr0 <- ds(DuikerCameraTraps, transect = "point", key="hr", adjustment = NULL,
          cutpoints = mybreaks, truncation = trunc.list)

## ---- sampfrac-----------------------------------------------------------------------------
viewangle <- 42 # degrees
samfrac <- viewangle / 360
conversion <- convert_units("meter", NULL, "square kilometer")
peak.hr.dens <- dht2(hr0, flatfile=DuikerCameraTraps, strat_formula = ~1,
                     sample_fraction = samfrac, er_est = "P2", convert_units = conversion)
print(peak.hr.dens, report="density")

## ---- bootstrap, results='hide', eval=TRUE-------------------------------------------------
mysummary <- function(ests, fit){
  return(data.frame(Dhat = ests$individuals$D$Estimate))
}
duiker.boot.hr <- bootdht(model=hr0, flatfile=DuikerCameraTraps, resample_transects = TRUE,
                       nboot=99, summary_fun=mysummary, sample_fraction = samfrac,
                       convert.units = conversion, cores = 2)

## ----bootresult, eval=TRUE-----------------------------------------------------------------
print(summary(duiker.boot.hr))

The object created by bootdht fills with NAs and the execution time of the bootstrap is nearly instantaneous. Structure of bootdht result is odd:

> str(duiker.boot.hr)
List of 1
 $ c.NA..NA..NA..NA..NA..NA..NA..NA..NA..NA..NA..NA..NA..NA..NA..: logi [1:99] NA NA NA NA NA NA ...
 - attr(*, "class")= chr "dht_bootstrap"
 - attr(*, "row.names")= int [1:99] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "nboot")= num 99
 - attr(*, "nbootfail")= num 0

Afraid I can't provide any more information.

@dill
Copy link
Contributor

dill commented Jun 10, 2021

Thanks for testing Eric.

This is frustrating as again this works fine on my machine. Will test out on other platforms and see if I can see if there's something else going on here. That seems weird though if that's the case and minke works fine for you.

@erex
Copy link
Member Author

erex commented Jun 10, 2021

Yep, frustrating is right. The duiker code is identical to the code that ran back in July 2020 when the case study was compiled.

I've rerun minkes with 99 reps. It still runs to conclusion, however the summary report:

> summary(easyboot)
Bootstrap results

Boostraps          : 99 
Successes          : 99 
Failures           : 0 

       median     mean       se     lcl      ucl   cv
Nhat 11448.12 13935.86 19646.03 2325.94 46261.87 1.72

0 failures however, when looking at the replicate estimates,

> sum(is.na(easyboot$Nhat))
[1] 1
> length(easyboot$Nhat)
[1] 295

which is a bit odd, because 99 replicates X (2 strata + 1 total) = 297

So all doesn't seem to be completely fine with the minke's either.

@dill
Copy link
Contributor

dill commented Jun 14, 2021

Okay, trying this on bluewhale I get the same error, so there's still a platform-dependent issue here somewhere...

@dill
Copy link
Contributor

dill commented Jun 14, 2021

Some further debugging reveals that this is down to the way the global environment is handled in Windows for doParallel. I found the error was:

[1] "Error in get_truncation(truncation, cutpoints, data) : \n  object 'trunc.list' not found\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in get_truncation(truncation, cutpoints, data): object 'trunc.list' not found>

Looking into work-arounds...

@dill
Copy link
Contributor

dill commented Jun 14, 2021

can you try 6fa2826 @erex, I think that fixes your issue (works on bluewhale)

@erex
Copy link
Member Author

erex commented Jun 14, 2021

sorry, not able to try your fix to bootdht as I'm travelling with a Chromebook not a Windows machine. Won't be sitting in front of my home computer until late in the day 18June

@lenthomas
Copy link
Member

Did a quick test using the Duiker code @erex posted above, using 3 cores and doing 9 bootstraps (on my 4-core machine). Results look sensible. I did notice that the progress bar didn't update until it had finished. Tried to find our more about this but it seems not straightforward to remedy, if it is indeed an issue. I found
https://gist.github.com/kvasilopoulos/d49499ea854541924a8a4cc43a77fed0
and
http://5.9.10.113/66604588/showing-progress-bar-with-doparallel-foreach
in case either are helpful.

@dill
Copy link
Contributor

dill commented Jun 15, 2021 via email

@dill
Copy link
Contributor

dill commented Jun 15, 2021

@lenthomas I've just pushed a change that might fix this problem on Windows (has null effect on my Mac). Let me know how you go with it.

@lenthomas
Copy link
Member

No luck: didn't update until it had finished, and this time I got a warning message also (see below). I suppose it migth be more efficient for you to test fixes on bluewhale than to wait for me/Eric?

> duiker.boot.hr <- bootdht(model=hr0, flatfile=DuikerCameraTraps, resample_transects = TRUE,
+                           nboot=9, summary_fun=mysummary, sample_fraction = samfrac,
+                           convert.units = conversion, cores = 3)
Performing 9 bootstraps
  |===========================================================| 100%
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
  already exporting variable(s): pb

dill pushed a commit that referenced this issue Jun 15, 2021
@dill
Copy link
Contributor

dill commented Jun 15, 2021

Okay so switching the backend over to doSNOW seems to solve this issue (working on my laptop and bluewhale with both progress bar frameworks for me at least). Closing now but please re-open if you have issues.

@dill dill closed this as completed Jun 15, 2021
@lenthomas lenthomas reopened this Jun 16, 2021
@lenthomas
Copy link
Member

The first time I ran it, I got

Error in loadNamespace(name) : there is no package called ‘snow’

I guess you need to list the package in the dependencies? Ditto for doSNOW!

On the plus side, I'm happy to report that after I installed those packages, it ran and gave a nice progress bar that updated as it went along. Yeah!

On the minus, side, after I ran again with 20 reps and 10 cores, and checked the Windows process monitor, I found a whole load of left-over R sessions (see below). I guess you need to clear up the cluster after running - with stopCluster() or some such (I have not used snow so very likely have the suggested solution incorrect).

Screenshot 2021-06-16 190431

dill pushed a commit that referenced this issue Jun 16, 2021
dill pushed a commit that referenced this issue Jun 16, 2021
@dill
Copy link
Contributor

dill commented Jun 16, 2021

Thanks @lenthomas

On dependencies, I had been included these as "suggested" packages and then check with a requireNamespace() in bootdht when cores>1 to avoid installing too many packages on installation. I hadn't added snow but have corrected that.

Thanks for the tip on stopCluster(), this seems to be handled differently on Mac. I'll try that out on bluewhale after dinner.

@lenthomas
Copy link
Member

Great, thanks @dill. One other thing that occurs to me is whether it is possible to set a RNG seed in a consistent and reprodcible way in the new parallel code, for the sake of reproducible research, etc. There is a doRNG package for this in the context of foreach; I expect there are other ways to do it also, probably. If not time to do now, perhaps raise as an issue for the next release?

@dill
Copy link
Contributor

dill commented Jun 16, 2021

Noted and a good idea. I'll add that as a separate issue and see if I have time to work this out when I get back.

Running on bluewhale I saw the nodes shutdown so I think that's working now. I'll close again for now.

@lenthomas
Copy link
Member

Yup, @dill, confirm it works as planned on my desktop machine also. Thanks!

@lenthomas
Copy link
Member

Tried some code

remotes::install_github("https://github.com/DistanceDevelopment/Distance")

library(Distance)
data("minke")
easy <- ds(data=minke, key="hr", truncation=1.5)
easyboot <- bootdht(model=easy, flatfile=minke, nboot=99, cores = 3)
summary(easyboot)
str(easyboot)

First time I ran it I got the following message
mespace(x) : there is no package called ‘snow’
and second time, after installing snow I got
mespace(x) : there is no package called ‘doSNOW’
So I guess these need added to the list of requirements?

@lenthomas lenthomas reopened this Jun 25, 2021
@dill
Copy link
Contributor

dill commented Jun 28, 2021

See comment above but I'd added these packages as suggested since they will not be needed by every user and bootdht will run with cores=1 without them.

FWIW, if you do:

remotes::install_github("https://github.com/DistanceDevelopment/Distance", dependencies=TRUE)

then all of "Depends", "Imports", "LinkingTo" and "Suggests" will be installed for you.

@erex
Copy link
Member Author

erex commented Jun 28, 2021

Continuing the minke saga... Seems that a replicate failure causes problems. See minke example:

library(Distance)
data("minke")
easy <- ds(data=minke, key="hr", truncation=1.5)
easyboot <- bootdht(model=easy, flatfile=minke, nboot=99, cores = 3)

nindex <- seq(1, length(easyboot$Nhat), by=3)
sindex <- seq(2, length(easyboot$Nhat), by=3)
totindex <- seq(3, length(easyboot$Nhat), by=3)

> summary(easyboot)
Bootstrap results

Boostraps          : 99 
Successes          : 98 
Failures           : 1 

       median     mean       se     lcl      ucl   cv
Nhat 11761.19 13000.87 15163.89 2224.63 33617.11 1.29    # of course this summary is not useful as strata are not recognised

> str(easyboot)
List of 1
 $ Nhat: num [1:292] 10663 2556 13220 11761 5989 ...
 
 ** 292 elements seems odd; 99*3=297, 98*3=294, why are there 292 elements?
 > summary(easyboot$Nhat[nindex])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   9263   13138   15739   18780   18823  198211       1 
> summary(easyboot$Nhat[sindex])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1724    7484   11464   12575   14194  123824 
> summary(easyboot$Nhat[totindex])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1226    3206    4048    7648    7415   74386 

Seems there are the wrong number of elements in the resulting Nhat vector. Consequently, trying to "unshuffle" the stratum-specific estimates might go wrong.


Here is a second attempt with 450 replicates:

bob3 <- function(ests, fit) {
  return(list(#data.frame(params=fit$par),
              data.frame(north=ests$individuals$N$Estimate[1]),
              data.frame(south=ests$individuals$N$Estimate[2]),
              data.frame(total=ests$individuals$N$Estimate[3])))
}

easyboot <- bootdht(model=easy, flatfile=minke, nboot=450, cores = 3, 
                    summary_fun = bob3, progress="progress")

tmp <- sapply(easyboot, FUN=sum, simplify=TRUE)
north <- tmp[1:450]
south <- tmp[451:900]
total <- tmp[901:1350]
summary(north)
summary(south)
summary(total)
> summary(north)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2770   10274   12601   18902   15640  796622       4 
> summary(south)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
   617.4   3085.8   3839.7   7783.2   4862.5 611410.3        4 
> summary(total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3991   14057   16525   26690   19915 1408032       7 

I can't understand why the odd number of NAs, the model is a pooled detection function. If there were 4 missing for North, there should be the same number missing for South and Total, I would have expected.

@dill
Copy link
Contributor

dill commented Jun 29, 2021

Is this an issue with the parallelization or the bootstrapping code in general? It sounds like this is a problem when doing geo stratification and some bootstrap replicates have no copies of some strata? I'll look into this further tonmorrow, but that is an eventuality I had not considered. Two options if that's the case: (1) bootdht needs to know about strata or (2) more sophisticated error handling in the summary function.

dill pushed a commit that referenced this issue Jun 30, 2021
@dill
Copy link
Contributor

dill commented Jun 30, 2021

Combination of mrds 2.2.4.9005 and 549cbba should work-around this issue. Running your examples now to test but that might take a while.

@dill
Copy link
Contributor

dill commented Jun 30, 2021

Unfortunately CRAN will reject Distance as-is because doSNOW and snow are "superceeded". Looking into this now...

@dill
Copy link
Contributor

dill commented Jun 30, 2021

Okay, so snow/doSNOW are superceeded by parallel/doParallel (see here), which is what I had originally used sigh.

@dill
Copy link
Contributor

dill commented Jun 30, 2021

Switching to parallel/doParallel done, but this means that that progress bar will not work very well. Restricting to the base progress bar and using the hack found here I can get something working. Unfortunately this leads to other additional output, e.g.,

> boo <- bootdht(it, flatfile = ss81, cores=3, nboot=10, summary_fun=fn)
Performing 10 bootstraps
  |                                                                      |   0%s
tarting worker pid=56057 on localhost:11375 at 17:06:25.410
starting worker pid=56059 on localhost:11375 at 17:06:25.409
starting worker pid=56058 on localhost:11375 at 17:06:25.409
Loading required package: Distance
Loading required package: mrds
This is mrds 2.2.4.9004
Built: R 4.1.0; ; 2021-06-30 12:57:25 UTC; unix

Attaching package: 'Distance'

The following object is masked from 'package:mrds':

    create.bins

loaded Distance and set parent environment
Loading required package: Distance
Loading required package: mrds
This is mrds 2.2.4.9004
Built: R 4.1.0; ; 2021-06-30 12:57:25 UTC; unix

Attaching package: 'Distance'

The following object is masked from 'package:mrds':

    create.bins

loaded Distance and set parent environment
Loading required package: Distance
Loading required package: mrds
This is mrds 2.2.4.9004
Built: R 4.1.0; ; 2021-06-30 12:57:25 UTC; unix

Attaching package: 'Distance'

The following object is masked from 'package:mrds':

    create.bins

loaded Distance and set parent environment
  |======================================================================| 100%
> 

The ticker will also bounce back and forward a bit because it updates from the progress bar object that was passed, so possibly other jobs will have completed and updated before a given one is able to make its update. Very annoying.

@dill
Copy link
Contributor

dill commented Jun 30, 2021

Of course, this doesn't work in Windows.

@dill
Copy link
Contributor

dill commented Jun 30, 2021

Proposal: submit Distance 1.0.3 to CRAN with parallel support but without progress bars (progress bars still work in non-parallel settings). Push progress bars for parallel bootstrapping to 1.0.4.

@lenthomas @erex @LHMarshall: let me know if you have other suggestions. I need to submit mrds first, so I will not submit Distance immediately.

@erex
Copy link
Member Author

erex commented Jun 30, 2021

not that fussed about progress bars, TBH. Usually running bootstrap for reports (i.e. knitting documents) so progress bars are uninformative.

dill pushed a commit that referenced this issue Jun 30, 2021
@dill
Copy link
Contributor

dill commented Jul 1, 2021

thanks @erex, in which case I will close here and reopen a new issue for the progress bar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement mrds problems likely with mrds rather than Distance
Projects
None yet
Development

No branches or pull requests

3 participants