
mrgsim in nested futures/parallel settings #1178

Open
MahmAbdelwahab opened this issue Mar 14, 2024 · 4 comments

Comments

@MahmAbdelwahab

Hello everyone,

I am setting up a large simulation workflow and using an HPC cluster to submit the jobs. The workflow is as follows:

  • Compile the model code in the main R session and call loadso
  • Set up the future plan: the first level is a slurm plan, the second a multisession plan
  • Chunk the dataset generation and call mrgsim on each chunk
  • Collect the results and post-process

With the above steps/workflow I get the following error:

MultisessionFuture (doFuture2-1) failed to receive message results from cluster RichSOCKnode #1 (PID 14433 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 4 globals exported is 396.99 KiB. The three largest globals are ‘modList’ (382.41 KiB of class ‘list’), ‘...future.x_ii’ (7.86 KiB of class ‘list’) and ‘makeEventDataset’ (6.18 KiB of class ‘function’)

Calls: %dofuture% -> doFuture2

If I move the model code into the innermost foreach loop (i.e., compile the model for each chunk), the workflow works fine. It also works when using future_mrgsim_d (with nchunk set to 1, though maybe that is not relevant); a sketch of that workaround follows below.
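For reference, a minimal sketch of that future_mrgsim_d workaround from the mrgsim.parallel package; the model and dosing records here are placeholders (the pk1 model from mrgsolve's internal model library stands in for the real model):

library(mrgsolve)
library(mrgsim.parallel)
library(future)

plan(multisession)

mod <- modlib("pk1")                      # example model from the mrgsolve model library
data <- expand.ev(ID = 1:100, amt = 100)  # placeholder dosing records

out <- future_mrgsim_d(mod, data, nchunk = 1) # nchunk = 1: a single chunk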

Any idea what causes this behavior?

best,

Mahmoud

@kylebaron
Collaborator

kylebaron commented Mar 14, 2024

Hi @MahmAbdelwahab -

A couple of thoughts:

  1. You may already be doing this, but just checking: set the "build" directory to something local:

mod <- mread(..., soloc = "build-dir")

This makes sure the model is built locally rather than in a temporary directory specific to your (main) R session.

  2. Rather than plan(multisession), can you try plan(callr) from the future.callr package? A bit of a long shot, but we sometimes see issues with multisession.

  3. Could you try caching the compiled model and then reading it back for each chunk? I know this is a bit inelegant and probably what you are trying to avoid, but that was the motivation for implementing those features (cache, soloc, etc.).

So if you load and cache the model locally prior to starting the parallel job:

mod <- mread_cache(..., soloc = "build-dir")

Then for each chunk it's the same call, but when you do this on the chunk, it's a quick read from the cache ... no compile:

mod <- mread_cache(..., soloc = "build-dir")
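Putting points 2 and 3 together, here is a minimal sketch; the model name ("mymodel"), build directory, and chunks object are placeholders, and %dofuture% from the doFuture package stands in for whatever parallel map you are using:

library(future)
library(future.callr)
library(doFuture)
library(foreach)
library(mrgsolve)

plan(callr) # each future runs in a fresh callr R session

# build and cache the model once, locally, before launching workers
mod <- mread_cache("mymodel", project = ".", soloc = "build-dir")

# inside each chunk, the same call is a fast read from the cache
results <- foreach(ch = chunks) %dofuture% {
    mod <- mread_cache("mymodel", project = ".", soloc = "build-dir") # quick read, no compile
    mrgsim(mod, data = ch)
}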

Could you let me know what happens? If it's still not working, please email me at my GitHub email address and we can meet on Zoom to look at this.

Kyle

@MahmAbdelwahab
Author

Hello @kylebaron

Setting plan(callr) solved the issue! I also tested loading the cached model in every chunk, and there was no significant difference in computation time.

many thanks for your help!

Best,

Mahmoud

@kylebaron
Collaborator

Thanks for reporting back, @MahmAbdelwahab, and glad it got resolved.

Wondering if you'd be willing to share the relevant parts of your setup? I did this a long time ago with future.batchtools on SGE, but it got unstable on our system. It sounds like your setup is working well apart from the multisession issue.

Kyle

@MahmAbdelwahab
Author

Hello @kylebaron,

Here are the relevant parts of the setup; I will try to post a full example later if needed.

library(future)
library(future.batchtools)
library(future.callr) # provides the callr plan
library(doFuture)
library(doRNG)
library(foreach)
library(mrgsolve)

# setting up the slurm plan
slurm <- future::tweak(future.batchtools::batchtools_slurm,
    template = system.file("templates/slurm-simple.tmpl", package = "batchtools"),
    workers = 2,
    resources = list(
        partition = "general",
        walltime = 60 * 5,
        ncpus = 4
    )
)

nsims <- 1E6 # number of simulated patients/profiles

# chunking the nsims
# function taken from https://cran.r-project.org/web/packages/bhmbasket/bhmbasket.pdf (used internally)
# bhmbasket:::chunkVector
chunkVector <- function(x, n_chunks) {
    if (n_chunks <= 1) {
        chunk_list <- list(x)
    } else {
        chunk_list <- unname(split(x, cut(seq_along(x), n_chunks, labels = FALSE)))
    }
    return(chunk_list)
}
# e.g. chunkVector(1:10, 3) gives list(1:3, 4:7, 8:10)

set.seed(1234) # seed needs to be set outside the foreach call; the value here is arbitrary
plan(list(slurm, callr))
# plan(list(slurm, multisession)) # ran into some issues with loading the model object on the worker node
registerDoFuture()

# mod: the mrgsolve model object compiled/cached in the main session (see discussion above)
chunk_outer <- chunkVector(seq_len(nsims), getDoParWorkers())
sim_results <-
    foreach(k = chunk_outer, .combine = c) %dorng% { # uses the slurm plan
        chunk_inner <- chunkVector(k, getDoParWorkers())
        foreach(j = chunk_inner, .combine = c) %dorng% { # uses the multisession/callr plan
            lapply(j, function(x) {
                sim_chunk <- expand.ev(
                    ID = x,
                    dose = dose, # dose/amt/ii values elided in the original post
                    amt = amt,
                    ii = ii
                )
                mrgsim(mod, sim_chunk) # %>% ... postprocessing elided
            })
        }
    }

Additionally, you can wrap the whole foreach block in future({}) or future_promise({}) and run the code without blocking the main R session. I think it is then possible to send multiple independent nested foreach/simulation setups, but I haven't fully tested that yet.
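A minimal sketch of that non-blocking pattern (the future body here is just a stand-in for the nested foreach block above):

library(future)
plan(multisession)

# launch the simulation block without blocking the main session
sim_future <- future({
    Sys.sleep(5) # stand-in for the nested foreach simulation
    "simulation results"
})

# the main R session stays responsive here
resolved(sim_future)             # check status without blocking
sim_results <- value(sim_future) # block only when the results are needed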

Mahmoud
