
mrgsim in nested futures/parallel settings #1178

Open
MahmAbdelwahab opened this issue Mar 14, 2024 · 4 comments

Comments

@MahmAbdelwahab

Hello everyone,

I am setting up a large simulation workflow and using an HPC cluster to submit the jobs. The workflow is as follows:

  • Compile the model code in the main R session and call loadso
  • Set up the future plan: the first level is a slurm plan, the second a multisession plan
  • Chunk the dataset generation and call mrgsim on each chunk
  • Collect the results and post-process

With the above steps/workflow I get the following error:

MultisessionFuture (doFuture2-1) failed to receive message results from cluster RichSOCKnode #1 (PID 14433 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 4 globals exported is 396.99 KiB. The three largest globals are ‘modList’ (382.41 KiB of class ‘list’), ‘...future.x_ii’ (7.86 KiB of class ‘list’) and ‘makeEventDataset’ (6.18 KiB of class ‘function’)

Calls: %dofuture% -> doFuture2

If I move the model code into the innermost foreach loop (i.e., compile the model for each chunk), the workflow works fine. It also works when using future_mrgsim_d (with nchunk set to 1, though maybe that is not relevant); a sketch of that workaround follows below.
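For reference, a minimal sketch of that future_mrgsim_d workaround from the mrgsim.parallel package; the model and dosing records here are placeholders (the pk1 model from mrgsolve's internal model library stands in for the real model):

library(mrgsolve)
library(mrgsim.parallel)
library(future)

plan(multisession)

mod <- modlib("pk1")                      # example model from the mrgsolve model library
data <- expand.ev(ID = 1:100, amt = 100)  # placeholder dosing records

out <- future_mrgsim_d(mod, data, nchunk = 1) # nchunk = 1: a single chunk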

Any idea what causes this behavior?

best,

Mahmoud

@kylebaron
Collaborator

kylebaron commented Mar 14, 2024

Hi @MahmAbdelwahab -

A couple of thoughts:

  1. You may already be doing this, but just checking: set the "build" directory to something local:

mod <- mread(..., soloc = "build-dir")

This makes sure the model is built locally rather than in a temporary directory specific to your (main) R session.

  2. Rather than plan(multisession), can you try plan(callr) from the future.callr package? A bit of a long shot, but we sometimes see issues with multisession.

  3. Could you try caching the compiled model and then reading it back for each chunk? I know this is a bit inelegant and probably what you are trying to avoid, but that was the motivation for implementing those features (cache, soloc, etc.).

So if you load and cache the model locally prior to starting the parallel job:

mod <- mread_cache(..., soloc = "build-dir")

Then for each chunk it's the same call, but when you do this on the chunk, it's a quick read from the cache ... no compile:

mod <- mread_cache(..., soloc = "build-dir")
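Putting points 2 and 3 together, here is a minimal sketch; the model name ("mymodel"), build directory, and chunks object are placeholders, and %dofuture% from the doFuture package stands in for whatever parallel map you are using:

library(future)
library(future.callr)
library(doFuture)
library(foreach)
library(mrgsolve)

plan(callr) # each future runs in a fresh callr R session

# build and cache the model once, locally, before launching workers
mod <- mread_cache("mymodel", project = ".", soloc = "build-dir")

# inside each chunk, the same call is a fast read from the cache
results <- foreach(ch = chunks) %dofuture% {
    mod <- mread_cache("mymodel", project = ".", soloc = "build-dir") # quick read, no compile
    mrgsim(mod, data = ch)
}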

Could you let me know what happens? If it's still not working, please email me at my GitHub email address and we can meet on Zoom to look at this.

Kyle

@MahmAbdelwahab
Author

Hello @kylebaron

Setting plan(callr) solved the issue! I also tested loading the cached model in every chunk, and there was no significant difference in computation time.

many thanks for your help!

Best,

Mahmoud

@kylebaron
Collaborator

Thanks for reporting back, @MahmAbdelwahab, and glad it got resolved.

Wondering if you'd be willing to share the relevant parts of your setup? I did this a long time ago with future.batchtools on SGE, but it got unstable on our system. It sounds like your setup is working well apart from the multisession issue.

Kyle

@MahmAbdelwahab
Author

Hello @kylebaron,

Here are the relevant parts of the setup; I will try to post a full example later if needed.

library(future)
library(future.batchtools)
library(future.callr) # provides the callr plan
library(doFuture)
library(doRNG)
library(foreach)
library(mrgsolve)

# setting up the slurm plan
slurm <- future::tweak(future.batchtools::batchtools_slurm,
    template = system.file("templates/slurm-simple.tmpl", package = "batchtools"),
    workers = 2,
    resources = list(
        partition = "general",
        walltime = 60 * 5,
        ncpus = 4
    )
)

nsims <- 1E6 # number of simulated patients/profiles

# chunking the nsims
# function taken from https://cran.r-project.org/web/packages/bhmbasket/bhmbasket.pdf (used internally)
# bhmbasket:::chunkVector
chunkVector <- function(x, n_chunks) {
    if (n_chunks <= 1) {
        chunk_list <- list(x)
    } else {
        chunk_list <- unname(split(x, cut(seq_along(x), n_chunks, labels = FALSE)))
    }
    return(chunk_list)
}
# e.g. chunkVector(1:10, 3) gives list(1:3, 4:7, 8:10)

set.seed(1234) # seed needs to be set outside the foreach call; the value here is arbitrary
plan(list(slurm, callr))
# plan(list(slurm, multisession)) # ran into some issues with loading the model object on the worker node
registerDoFuture()

# mod: the mrgsolve model object compiled/cached in the main session (see discussion above)
chunk_outer <- chunkVector(seq_len(nsims), getDoParWorkers())
sim_results <-
    foreach(k = chunk_outer, .combine = c) %dorng% { # uses the slurm plan
        chunk_inner <- chunkVector(k, getDoParWorkers())
        foreach(j = chunk_inner, .combine = c) %dorng% { # uses the multisession/callr plan
            lapply(j, function(x) {
                sim_chunk <- expand.ev(
                    ID = x,
                    dose = dose, # dose/amt/ii values elided in the original post
                    amt = amt,
                    ii = ii
                )
                mrgsim(mod, sim_chunk) # %>% ... postprocessing elided
            })
        }
    }

Additionally, you can wrap the whole foreach block in future({}) or future_promise({}) and run the code without blocking the main R session. I think it is then possible to send multiple independent nested foreach/simulation setups, but I haven't fully tested that yet.
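A minimal sketch of that non-blocking pattern (the future body here is just a stand-in for the nested foreach block above):

library(future)
plan(multisession)

# launch the simulation block without blocking the main session
sim_future <- future({
    Sys.sleep(5) # stand-in for the nested foreach simulation
    "simulation results"
})

# the main R session stays responsive here
resolved(sim_future)             # check status without blocking
sim_results <- value(sim_future) # block only when the results are needed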

Mahmoud
