
The nCore option does not speed up enrichment compute #1

Open
benoitballester opened this issue May 3, 2019 · 12 comments
Labels: bug (Something isn't working), invalid (This doesn't seem right)

Comments

@benoitballester
Member

We noticed that using the -nCore option (2 or more) in the enrichment takes longer to compute than using 1 core. The more cores used, the longer it runs (strange).

There is an R script to test this in:
misc/example1.R

library(ReMapEnrich)
demo.dir <- "~/ReMapEnrich_demo"                                    # directory for the demo data
remapCatalog2018hg38 <- downloadRemapCatalog(demo.dir)              # download the ReMap 2018 hg38 catalog
remapCatalog <- bedToGranges(remapCatalog2018hg38)                  # load the catalog as a GRanges object
ENCFF001VCU <- bedToGranges(downloadEncodePeaks("ENCFF001VCU", demo.dir))  # ENCODE query peak set

Enrichment shuffle 6

time1 = Sys.time()
enrichment <- enrichment(ENCFF001VCU, remapCatalog, shuffle=6)
time2 = Sys.time()
difftime(time2, time1, units="auto")
# Time difference of 36.88208 secs

Enrichment shuffle 6 nCores 3

time1 = Sys.time()
enrichment <- enrichment(ENCFF001VCU, remapCatalog, shuffle=6, nCores=3)
time2 = Sys.time()
difftime(time2, time1, units="auto")
# Time difference of 55.98755 secs

Enrichment shuffle 6 nCores 6

time1 = Sys.time()
enrichment <- enrichment(ENCFF001VCU, remapCatalog, shuffle=6, nCores=6)
time2 = Sys.time()
difftime(time2, time1, units="auto")
# Time difference of 1.082652 mins
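
For reference, here is a minimal sketch of how these timings could be collected in one loop. The timeEnrichment helper is hypothetical (not part of ReMapEnrich); it assumes the ENCFF001VCU and remapCatalog objects from the script above, and that nCores = 1 behaves like the default single-core call.

# Hypothetical helper: time enrichment() for several core counts.
timeEnrichment <- function(query, catalog, cores) {
  sapply(cores, function(n) {
    elapsed <- system.time(
      enrichment(query, catalog, shuffle = 6, nCores = n)
    )[["elapsed"]]
    message(sprintf("nCores = %s: %.1f secs", n, elapsed))
    elapsed
  })
}

timeEnrichment(ENCFF001VCU, remapCatalog, cores = c(1, 3, 6))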
@benoitballester benoitballester changed the title The nCore option doe not speed up enrichment compute The nCore option does not speed up enrichment compute May 3, 2019
@benoitballester benoitballester added bug Something isn't working invalid This doesn't seem right labels May 3, 2019
@MartinMestdagh
Collaborator

I will try other methods for the parallelization.

@ZacharieMenetrier
Contributor

It seems to be related to overhead and RAM usage.
Try observing your CPU and RAM usage while running enrichments.
You will see that RAM usage skyrockets very early.
The CPUs also stop multi-tasking very early, leaving all the deserialization to 1 core.
Trying with more than 6 cores on a 16 GB RAM computer will hit swap, and that's when things become even slower.
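
As a quick check of what gets duplicated per worker, something like this (base R only, using the object names from the example script) can be run before launching an enrichment:

# Approximate size of the data each worker would have to receive
format(object.size(remapCatalog), units = "Mb")

# Number of cores available on this machine
parallel::detectCores()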

What's strange is that we didn't see this issue before.
I remember doing lots of benchmarks to test the parallelization and evaluate its performance.
Any clues on what could have changed since then?

@jvanheld
Contributor

jvanheld commented Jul 7, 2019 via email

@benoitballester
Member Author

Here are the latest tests with another query set.

time1 = Sys.time()
enrichment <- enrichment(ENCFF784QFH, remapCatalog, shuffle=6)
time2 = Sys.time()
difftime(time2, time1, units="auto")
# Time difference of 38.29763 secs

time1 = Sys.time()
enrichment <- enrichment(ENCFF784QFH, remapCatalog, shuffle=6, nCores=4)
time2 = Sys.time()
difftime(time2, time1, units="auto")
# Time difference of 1.021091 mins

time1 = Sys.time()
enrichment <- enrichment(ENCFF784QFH, remapCatalog, shuffle=6, nCores=8)
time2 = Sys.time()
difftime(time2, time1, units="auto")
# Time difference of 1.54096 mins

This was done on an iMac i7 with 32 GB of RAM.

@ZacharieMenetrier
Contributor

ZacharieMenetrier commented Jul 10, 2019

After more investigation, I suspect the catalog being passed around to each worker is the main reason parallel computation is slower.
If you test the functions in detail, you will see that the computation of the shuffles is actually faster with more cores; it is the theoretical means that are slower.
The thing is that to compute the theoretical means we need the catalog to do the overlaps, and serializing it takes a lot of time.
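
One rough way to see that serialization cost in isolation (base R only; remapCatalog is the object from the example script):

# Time spent serializing the catalog: roughly what each PSOCK worker
# pays before it can start computing.
system.time(rawCatalog <- serialize(remapCatalog, connection = NULL))
length(rawCatalog) / 1024^2   # size of the serialized catalog in MB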

So Jacques, I think you were right about the growth of ReMap being the reason behind the slower parallel computations.

For now I can't find a nice solution that prevents such massive (but needed) data from being passed around.

Jacques, could you please elaborate on your idea for redesigning the code? I'm not sure I get what you mean.

Edit: After trying with the 2015 catalog, the same issue happens again. I think the explanation of the catalog (still a huge variable, ~200 MB) being passed around is still valid. Maybe the benchmarking was not done seriously enough at the time (my bad).

@jvanheld
Contributor

jvanheld commented Jul 10, 2019 via email

@ZacharieMenetrier
Contributor

I think I need an update to remember the analysis in detail. Maybe we could plan a video call soon?

Retrieving the intersections is what takes the most computational time; let's call this time T.

We do intersections between the query and the catalog (1T).
We then create n shuffled versions of the query.
We do intersections between the shuffles and the catalog (nT).

I think I now understand what Jacques is saying about computing each peak set separately; it would mean doing the following (see the sketch after the list).

We create n shuffled versions of the query.
For each category:

  • We do intersections between the query and the reduced version of the catalog.
  • We do intersections between the shuffles and the reduced version of the catalog.

We then merge the results.
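
A rough sketch of that per-category idea, assuming the catalog is a GRanges whose metadata column id holds the factor name (the column name is an assumption) and using GenomicRanges::countOverlaps for the intersections; the shuffles here are a trivial stand-in for the package's real shuffling step:

library(GenomicRanges)

# Assumption: the 'id' metadata column holds the category (factor) name.
byCategory <- split(remapCatalog, mcols(remapCatalog)$id)

# Stand-in for the real shuffling step: 6 copies of the query.
shuffles <- replicate(6, ENCFF001VCU, simplify = FALSE)

perCategory <- lapply(byCategory, function(cat) {
  list(
    query    = sum(countOverlaps(ENCFF001VCU, cat) > 0),
    shuffles = sapply(shuffles, function(s) sum(countOverlaps(s, cat) > 0))
  )
})
# The per-category results would then be merged into one enrichment table.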

In my opinion this may improve performance for a sequential version of the code, as the intersections would be faster, but it would not necessarily make parallel computing faster.

Parallel computing works well when there is a small number of big tasks, and when those big tasks have a small memory footprint of input/output. The reason is that each worker needs to copy the data into its own thread to be able to work with it. This has nothing to do with peaks being loaded as BED files or RData; rather, variables (pure RAM data) must be passed to each worker to become a variable of the thread. That's why RAM usage increases so much: the whole catalog is copied in RAM for each worker.

For now we have a small number of big tasks (e.g. doing intersections for 6 shuffles) but the input is massive because it needs the whole catalog.

If we try to separate each category of the catalog, we end up with a more lightweight input but with many more small tasks, so still not ideal for parallel computing.

What could still be possible is to chunk the categories (one chunk per core), do parallel computing for each chunk, and then merge the results. However, chunking the categories would still mean passing around some part of the catalog (say, for 6 cores we would need to copy a sixth of the catalog, so a ~33 MB variable).

[figure: chunking]
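
A sketch of that chunking idea with a PSOCK cluster; the id metadata column and the overlap counting are the same assumptions as in the per-category sketch above, and the merge step is left out:

library(parallel)
library(GenomicRanges)

nCores <- 6
categories <- unique(mcols(remapCatalog)$id)   # 'id' column name is an assumption
chunkIds <- split(categories, cut(seq_along(categories), nCores, labels = FALSE))

# Subset once on the master so each worker only receives ~1/nCores of the catalog.
catalogChunks <- lapply(chunkIds, function(ids)
  remapCatalog[mcols(remapCatalog)$id %in% ids])

cl <- makeCluster(nCores)
clusterEvalQ(cl, library(GenomicRanges))
clusterExport(cl, "ENCFF001VCU")   # the query is small, so copying it is cheap

chunkCounts <- parLapply(cl, catalogChunks, function(chunk)
  countOverlaps(ENCFF001VCU, chunk))
stopCluster(cl)
# chunkCounts would then be merged back per category.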

Another solution would be to use fast forking and shared memory with the mclapply function. This would allow workers to access shared memory without having to copy the inputs. This solution would greatly improve performance in my opinion, but sadly it is only possible on Mac and Linux.

[figure: shared memory]
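
A sketch of the fork-based alternative: with mclapply the forked children see remapCatalog through copy-on-write shared memory instead of receiving a serialized copy (on Windows mclapply simply falls back to sequential execution). The identity "shuffle" is again a stand-in for the real shuffling step:

library(parallel)
library(GenomicRanges)

# Fork-based workers (Mac/Linux only) share the catalog copy-on-write.
shuffleCounts <- mclapply(seq_len(6), function(i) {
  shuffledQuery <- ENCFF001VCU                 # stand-in for a real shuffle
  countOverlaps(shuffledQuery, remapCatalog)
}, mc.cores = 6)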

@benoitballester
Member Author

benoitballester commented Jul 11, 2019 via email

@jvanheld
Contributor

jvanheld commented Jul 11, 2019 via email

@benoitballester
Member Author

benoitballester commented Jul 11, 2019 via email

@ZacharieMenetrier
Contributor

From what I understand, we would not split the catalog by chromosomes (in fact there is an option for the shuffles to be done by chromosome or over the whole genome) but by factors (we call them categories in the code, e.g. TAL1, FOXP1, etc.); this would result in a 485-way split. That's why I talked about a lot of smaller tasks.
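
For illustration, that 485-way split would look something like this (again assuming the factor name lives in an id metadata column):

library(GenomicRanges)

byFactor <- split(remapCatalog, mcols(remapCatalog)$id)
length(byFactor)   # ~485 categories, as mentioned above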

Are we talking about the same thing here?

@benoitballester
Member Author

benoitballester commented Jul 11, 2019 via email
