Reduce memory usage of facets #162

Open

natir wants to merge 2 commits into master
Conversation

@natir commented Jun 21, 2021

Hi,

In my laboratory we run facets on many whole-genome human datasets, and on this data facets has a huge memory footprint of approximately 150 GiB.

The purpose of this PR is to try to reduce facets' memory usage. To do this I replace some classic R data.frame objects with tidyverse tibble data structures, and I also use the tidyverse pipe syntax to perform some of the operations on these tibbles.

With all these changes, memory usage is divided by 2.
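To make the kind of change concrete, here is a minimal sketch of the pattern, not code taken from the PR; the column names, threshold, and toy data below are made up for illustration (loosely following the pileup columns facets works with):

library(dplyr)
library(tibble)

# Toy stand-ins for two pileup columns (illustrative names, see above)
df <- data.frame(NOR.DP = c(40, 8, 55), NOR.RD = c(18, 3, 50))
ndepth <- 25

# Classic data.frame style: each step binds another full copy to a name
# df2 <- df[df$NOR.DP >= ndepth, ]
# df2$vafN <- df2$NOR.RD / df2$NOR.DP

# Tibble + pipe style as in this PR: one pipeline, so intermediate results
# are never bound to names and can be garbage-collected sooner
snps <- as_tibble(df) %>%
  filter(NOR.DP >= ndepth) %>%
  mutate(vafN = NOR.RD / NOR.DP,
         het  = as.integer(vafN > 0.25 & vafN < 0.75))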

On my test dataset the results are the same between my PR and version v0.6.1, but I may have missed something.

I'm not an experienced R developer and may have made some mistakes, so if you prefer to just take the idea behind my changes and rewrite them, please do.

Thanks

@veseshan (Collaborator)

Can you give me some breakdown of where this memory explosion occurs? My back-of-the-envelope calculation says:

R:> x = rnorm(12e6) # one locus every 250 bases across 3000 Megabase
R:> format(object.size(x), units="Mb")
[1] "91.6 Mb"

The jointseg data frame has 16 columns, but even that wouldn't translate to 150 GiB of memory use.

Have you tried using readSnpMatrixDT.R in path/facets/extRfns/ to read in the data?
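(For reference, a minimal sketch of the data.table-based reading idea; the file name and column names below are assumptions based on typical snp-pileup output, and readSnpMatrixDT.R in the repository remains the reference implementation.)

library(data.table)

# fread is considerably more memory-friendly than read.csv for a pileup this
# size, and select= avoids ever materialising the unused columns
read_pileup <- function(filename) {
  fread(filename,
        select = c("Chromosome", "Position",
                   "File1R", "File1A", "File2R", "File2A"))
}

# pileup <- read_pileup("tumor_normal.snp_pileup.csv")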

Thanks

@natir (Author) commented Jun 22, 2021

With v0.6.1 the memory peak occurs during file reading; using readSnpMatrixDT.R, as my change does, solves this issue.

But another peak occurs during preProcSample, more specifically in procSnps I assume (some duplication, column creation, the call into the Fortran code, and filtering that is not done in place).

With v0.6.1 plus readSnpMatrixDT.R, memory usage is 85 GiB; my version uses 70 GiB.
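A minimal sketch of how that per-step peak can be measured with base R's gc(); the calls follow the standard facets workflow and the file name is a placeholder, so treat this as illustration rather than code from the PR:

library(facets)

rcmat <- readSnpMatrix("tumor_normal.snp_pileup.csv")  # or the readSnpMatrixDT.R variant

gc(reset = TRUE)            # reset the "max used" counters
xx <- preProcSample(rcmat)  # the step containing procSnps, where the second peak appears
gc()                        # the "max used" column now shows the peak reached in this step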

@veseshan (Collaborator)

Can you tell me how big the pileup matrix is, i.e. how many loci? And how many of them end up in jointseg? Thanks.

@natir (Author) commented Jun 23, 2021

The pileup matrix contains 546,700,164 loci.

To count the jointseg loci I looked at $jointseg in the output produced by procSample; I get 5,583,831 loci in jointseg.
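As a rough cross-check, in the same back-of-the-envelope spirit as above, and assuming each of these loci were carried through with the 16 numeric columns mentioned for jointseg:

R:> 546700164 * 16 * 8 / 2^30  # loci x columns x 8 bytes per double, in GiB
[1] 65.17174

A single table that wide would already be about 65 GiB, so even one duplicated intermediate during processing would put the peak in the range reported in this thread.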

@veseshan (Collaborator)

Given that the whole genome is around 3 gigabases, the pileup seems to have a locus roughly every 6 bases. That is a lot of redundant data, as they will be highly serially correlated. You can DM me if you want to talk about this further.

I will look into how your code can be used to reduce the memory use of procSnps.

Thanks
