Add parallel support AddModuleScore #6369

Open
wants to merge 1 commit into base: develop

Conversation

samuel-marsh
Collaborator

@samuel-marsh samuel-marsh commented Aug 31, 2022

Hi Seurat Team,

This PR follows the discussion in the previous PR #6348 and adds parallel processing support to AddModuleScore. My solution uses the future/future.apply packages, so there are no additional dependencies.

Quick single test (I can run a more realistic benchmark with the bench package, but I don't feel it's really necessary): adding 100 scores of 100 genes each to an object with ~47,000 nuclei and ~28,000 features was about 1.7 times faster in parallel with 4 cores than sequentially (429.2 sec vs. 251.9 sec).

library(tidyverse)
library(Seurat)
library(scCustomize)
library(qs)
library(tictoc)
library(future)
library(future.apply)

test <- qread("marsh.qs")

# Extract all gene names from the object
all_genes_marsh <- rownames(test@assays$RNA)

# Create 100 random gene lists of 100 genes each
random_gene_sets_micro <- lapply(seq_len(100), function(i) sample(all_genes_marsh, size = 100))

tic()
test <- AddModuleScore(object = test, features = random_gene_sets_micro)
toc()
429.236 sec elapsed

# restart R

library(tidyverse)
library(Seurat)
library(scCustomize)
library(qs)
library(tictoc)
library(future)
library(future.apply)

plan("multisession", workers = 4)
options(future.globals.maxSize = 3000 * 1024^2)

test <- qread("marsh.qs")

# Extract all gene names from the object
all_genes_marsh <- rownames(test@assays$RNA)

# Create 100 random gene lists of 100 genes each
random_gene_sets_micro <- lapply(seq_len(100), function(i) sample(all_genes_marsh, size = 100))

tic()
test <- AddModuleScore(object = test, features = random_gene_sets_micro)
toc()
251.93 sec elapsed

One thing I did debate, and it's up to you, is whether to add an additional function parameter specifying parallel processing and have the internal function check something like this:

 if (nbrOfWorkers() > 1 && isTRUE(parallel))

The reason is that the gains from parallel processing with future are greatest for this function with large numbers of gene lists. When adding just a single gene list or a couple, it's probably slightly faster to run normally. I left it out of the PR to keep everything the same, but if you think it would be helpful I can easily add it.
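
As a rough illustration of the kind of check described above, here is a minimal sketch. It is not the code in this PR: `score_gene_sets()` and its `parallel` argument are hypothetical names, and `length` stands in for the real per-list scoring step. Only `nbrOfWorkers()`, `future_lapply()`, and base R's `isTRUE()` are existing functions.

```r
library(future)
library(future.apply)

# Hypothetical helper (not part of Seurat): apply `fun` to each gene list,
# going parallel only when the caller opted in AND the plan has > 1 worker.
score_gene_sets <- function(gene_sets, fun, parallel = TRUE) {
  if (nbrOfWorkers() > 1 && isTRUE(parallel)) {
    future_lapply(X = gene_sets, FUN = fun, future.seed = TRUE)
  } else {
    lapply(X = gene_sets, FUN = fun)
  }
}

# Toy stand-in for the scoring step: count the genes in each list.
gene_sets <- list(a = c("Gene1", "Gene2"), b = c("Gene3", "Gene4", "Gene5"))
set_sizes <- score_gene_sets(gene_sets, fun = length, parallel = FALSE)
```

With `parallel = FALSE`, or under a plan with a single worker, the helper falls back to plain `lapply()`, so small jobs avoid the overhead of shipping data to workers.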

Thanks!
Sam

P.S. Tagging the author of the original PR here so he can follow this: @scottgigante

@scottgigante
Contributor

Thanks @samuel-marsh for the quick work! I don't think parallel=TRUE is necessary because a user could always use plan(sequential).
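
For reference, a minimal sketch of the `plan(sequential)` approach, assuming a multi-worker plan was set earlier in the script. The squaring function is only a placeholder for a small scoring job and is not taken from the PR.

```r
library(future)
library(future.apply)

# plan() invisibly returns the previous plan, so it can be restored afterwards.
old_plan <- plan(sequential)

workers_during <- nbrOfWorkers()  # the sequential plan uses a single worker
squares <- future_lapply(1:3, function(i) i^2)  # placeholder for a small job

# Restore whatever plan was active before (e.g. multisession).
plan(old_plan)
```

This keeps the function signature unchanged and leaves the choice of backend entirely to the caller.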

@samuel-marsh
Collaborator Author

Agreed, though some people set the plan once at the top of a script and forget about it. Overall, I also lean towards not adding an extra param.


2 participants