Memory consumption #14

Open
mossjacob opened this issue Dec 14, 2021 · 7 comments

Comments

@mossjacob

Hi there,

Thank you for releasing the code for your algorithm. I have a question regarding the memory-efficiency.

My cell-gene matrix is approximately 600,000 cells x 6,000 genes. Assuming 32-bit floats, this should be around 15 GB.
Even so, I am running out of memory with 128 GB of RAM. How can I run this algorithm more memory-efficiently?
The command I'm using so far is:

res <- newWave(sce,X = "~site.Site", K=10, verbose = FALSE, children=6, n_gene_disp = 100, n_gene_par = 100, n_cell_par = 100)

Many thanks
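
For reference, the size estimate above can be checked with a few lines of base R. This is only back-of-the-envelope arithmetic for a dense matrix. Note that R has no native 32-bit float type: a numeric matrix stores 8-byte doubles, while an integer count matrix stores 4-byte integers. The variable names are illustrative.

```r
# Back-of-the-envelope size of a dense cells x genes matrix.
n_cells <- 600000
n_genes <- 6000
n_values <- n_cells * n_genes   # total number of matrix entries

# Bytes per entry for the storage types under discussion.
bytes_per_value <- c(float32 = 4, r_integer = 4, r_double = 8)

# Size in GB (1 GB = 1e9 bytes) for each storage type.
size_gb <- n_values * bytes_per_value / 1e9
print(size_gb)
```

Any working copies made during fitting would add to these figures.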

@fedeago
Owner

fedeago commented Dec 15, 2021

Hi Jacob,

I suggest using an HDF5 file for your cell-gene matrix.

Here: https://www.biorxiv.org/content/10.1101/2021.08.02.453487v1 In Figure 1D you can see how RAM consumption decreases when the same dataset (TENxBrainData) is stored as a DelayedArray backed by HDF5 rather than as a dense matrix. The run time will increase, but I think it is a good trade-off.

You need a SingleCellExperiment with a counts assay that is a DelayedArray.

The usage of newWave is the same, but you can find an example in the section "NewWave on DelayedArray" here: https://fedeago.github.io/SurfingNewWave/articles/vignette.html#newwave-on-delayedarray-1

Another tip: if a separate dispersion parameter for each gene works for you, I suggest these parameters:
res <- newWave(sce,X = "~site.Site", K=10, verbose = FALSE, children=6, n_gene_par = 100, n_cell_par = 100, commondispersion=FALSE)
That is, drop the n_gene_disp parameter and set commondispersion=FALSE. I also suggest using larger mini-batches; in general I set them to 10% of the cells or genes.

I hope this information is useful.

Best regards

Federico
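
A minimal sketch of the setup described above, using a small simulated matrix in place of real data. It assumes the HDF5Array and SingleCellExperiment Bioconductor packages are installed; the toy dimensions, the temporary file path, and the `batch` column are made up for illustration. The mini-batch sizes follow the 10% rule of thumb from the comment, and the newWave call is left commented out because it was not run here.

```r
library(HDF5Array)
library(SingleCellExperiment)

# Toy stand-in for a real count matrix (genes x cells); sizes are illustrative.
set.seed(1)
toy_counts <- matrix(
  rpois(200 * 50, lambda = 2), nrow = 200, ncol = 50,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:50))
)

# Write the matrix to an HDF5 file; the result is an on-disk HDF5Matrix.
h5_path <- tempfile(fileext = ".h5")
counts_h5 <- writeHDF5Array(toy_counts, filepath = h5_path, name = "counts",
                            with.dimnames = TRUE)

# Build a SingleCellExperiment whose counts assay is the HDF5-backed matrix.
sce <- SingleCellExperiment(
  assays = list(counts = counts_h5),
  colData = DataFrame(batch = factor(rep(c("a", "b"), each = 25)))
)

# Mini-batch sizes set to 10% of genes and cells, as suggested in the thread.
n_gene_par <- round(0.1 * nrow(sce))
n_cell_par <- round(0.1 * ncol(sce))

# Untested here; argument names are taken from the calls quoted in this thread.
# res <- newWave(sce, X = "~batch", K = 2, children = 2, verbose = FALSE,
#                n_gene_par = n_gene_par, n_cell_par = n_cell_par,
#                commondispersion = FALSE)
```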

@mossjacob
Author

Hi Federico,

Many thanks for your suggestions. I will try these now.

best wishes

@eltonjrv

eltonjrv commented Feb 2, 2023

Dear Federico,
Following up on this same issue:
I followed your recommendations on creating a DelayedArray class for my SCE assay (14k genes x 271k cells):

My SCE object

A1Bmtx_sce
class: SingleCellExperiment
dim: 13966 271111
metadata(0):
assays(1): counts
rownames(13966): LINC00115 NOC2L ... MT-ND6 MT-CYB
rowData names(0):
colnames(271111): SRR7666705 SRR7666706 ... TTTGTCAAGCTGAACG.1
TTTGTCACAATCGGTT.1
colData names(3): cells batch Biological_Condition

Transforming the "batch" field from colData to a factor

colData(A1Bmtx_sce)$batch <- as.factor(colData(A1Bmtx_sce)$batch)

Converting the assay (counts) to a DelayedArray

library(DelayedArray)
assay(A1Bmtx_sce) = DelayedArray(assay(A1Bmtx_sce))
class(assay(A1Bmtx_sce))
[1] "DelayedMatrix"
attr(,"package")
[1] "DelayedArray"

Running newWave

A1Bzinb <- newWave(A1Bmtx_sce, K=9, X="~batch", children=24, n_gene_par=1500, n_cell_par=30000, verbose=FALSE, commondispersion=FALSE)

I submitted this code as a job script to a large-memory node (768 GB) on my cluster, and it stopped after ~30 min with the SGE exit status #37: "failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit", reaching a maxvmem of 487 GB, although I had requested 720 GB.

Do you have any tips for working around this issue?

Thanks in advance.

@fedeago
Owner

fedeago commented Feb 6, 2023

Hi Elton,

Thank you for your interest in NewWave.

I think that converting it to a DelayedArray object is not enough, because it is still backed by an in-memory object rather than an on-disk one.

You should save it as an HDF5 file and then read it using the DelayedArray backend.

https://bioconductor.org/packages/release/bioc/html/DelayedArray.html

Please let me know if you have any other questions,
Federico
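
The distinction between an in-memory and an on-disk backend can be seen by inspecting the seed of each DelayedArray. A small sketch with toy data, assuming the DelayedArray and HDF5Array packages are installed (the matrix and the temporary file are illustrative):

```r
library(DelayedArray)
library(HDF5Array)

# Toy count matrix held in RAM.
set.seed(1)
m <- matrix(rpois(100 * 20, lambda = 2), nrow = 100, ncol = 20)

# Wrapping an ordinary matrix: the seed is still the in-memory matrix,
# so no memory is saved.
da_mem <- DelayedArray(m)
is.matrix(seed(da_mem))

# Writing to HDF5 first: the seed only points at the file on disk.
da_h5 <- writeHDF5Array(m, filepath = tempfile(fileext = ".h5"), name = "counts")
class(seed(da_h5))

# The values are the same either way; only the storage differs.
identical(as.matrix(da_mem), as.matrix(da_h5))
```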

@eltonjrv

Thanks for your directions, Federico.
And sorry for the late response; I've been caught up in other parallel projects that left this one on standby.

Here's what I've done following your suggestion.

I wrote an HDF5 file:

saveHDF5SummarizedExperiment(A1Bmtx_sce, dir="A1B_h5_se", prefix="A1B", as.sparse=F)

I read it back using HDF5ArraySeed and DelayedArray:

hdf5_seed = HDF5ArraySeed("A1B_h5_se/A1Bassays.h5", name="assay001")
A1Bmtx_sce_h5 = DelayedArray(seed = hdf5_seed)

I then noticed a drastic reduction in object size, which seems correct:

object.size(A1Bmtx_sce)
15214529936 bytes
object.size(A1Bmtx_sce_h5)
2568 bytes

class(A1Bmtx_sce_h5)
[1] "HDF5Matrix"
attr(,"package")
[1] "HDF5Array"

The problem now appears to be that, even though HDF5Matrix and HDF5Array are kinds of DelayedMatrix and DelayedArray, respectively, newWave doesn't seem to recognise them as such:

A1Bzinb <- newWave(A1Bmtx_sce_h5, K=9, X=colData(A1Bmtx_sce)$batch, children=24, n_gene_par=1500, n_cell_par=30000, verbose=FALSE, commondispersion=FALSE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘newWave’ for signature ‘"HDF5Matrix"’

I tried converting the HDF5Matrix object to a DelayedArray with:

A1Bmtx_sce_h5_da = DelayedArray(A1Bmtx_sce_h5)

But the class does not change:

class(A1Bmtx_sce_h5_da)
[1] "HDF5Matrix"
attr(,"package")
[1] "HDF5Array"

Any light you can shed would be much appreciated, as it seems I'm nearly there.
Thanks again,
Elton
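
Two observations may help here. First, HDF5Matrix is a subclass of DelayedMatrix, which fits with DelayedArray() leaving the class unchanged. Second, the earlier comment in this thread asks for a SingleCellExperiment whose counts assay is a DelayedArray, so the dispatch error is consistent with a bare matrix being passed instead of the container. A sketch under that assumption, with toy data and a hypothetical prefix `toy`: the directory written by saveHDF5SummarizedExperiment() can be read back with loadHDF5SummarizedExperiment(), which returns the whole object with HDF5-backed assays.

```r
library(HDF5Array)
library(SingleCellExperiment)

# Toy in-memory object standing in for the real SingleCellExperiment.
set.seed(1)
toy_sce <- SingleCellExperiment(
  assays = list(counts = matrix(rpois(100 * 20, lambda = 2), nrow = 100)),
  colData = DataFrame(batch = factor(rep(c("a", "b"), each = 10)))
)

# Save to a fresh directory, then load the HDF5-backed object back.
h5_dir <- tempfile("toy_h5_se_")
saveHDF5SummarizedExperiment(toy_sce, dir = h5_dir, prefix = "toy")
sce_h5 <- loadHDF5SummarizedExperiment(dir = h5_dir, prefix = "toy")

# The container class is preserved and the assay is a DelayedMatrix.
is(sce_h5, "SingleCellExperiment")
is(counts(sce_h5), "DelayedMatrix")

# Reading the assay file directly gives an HDF5Matrix, itself a DelayedMatrix.
counts_h5 <- HDF5Array(file.path(h5_dir, "toyassays.h5"), name = "assay001")
is(counts_h5, "DelayedMatrix")

# Untested here; pass the container, not the bare matrix:
# fit <- newWave(sce_h5, K = 2, X = "~batch", verbose = FALSE,
#                commondispersion = FALSE)
```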

@eltonjrv

Dear Federico,

Following up on the issue above, I wonder whether there's a possibility/capability for newWave to read HDF5Matrix and HDF5Array classes, which are generated by the HDF5Array package that uses DelayedArray.

Thanks,
Elton

@eltonjrv

Hi Federico,
Sorry for posting several times in a row, but this is just to let you know that I was able to create my own DelayedArray backend package (myH5Array) in order to properly read my h5 file as a DelayedMatrix.
Please see below: I'm getting the same error as before, but this time for an actual DelayedMatrix class rather than an HDF5Matrix one.

library(myH5Array)
myH5seed <- myH5ArraySeed("A1B_h5_se/A1Bassays.h5", name="assay001")
myH5seed
An object of class "myH5ArraySeed"
Slot "filepath":
[1] "/nobackup/fbsev/LeedsOmics/Mihaela-tmp/scAnalyses/04-DE/DelayedArray-run/A1B_h5_se/A1Bassays.h5"

Slot "name":
[1] "assay001"

Slot "dim":
[1] 13966 271111

A1Bmtx_sce_myh5 <- DelayedArray(myH5seed)
object.size(A1Bmtx_sce)
15214529936 bytes
object.size(A1Bmtx_sce_myh5)
1896 bytes
class(A1Bmtx_sce_myh5)
[1] "DelayedMatrix"
attr(,"package")
[1] "DelayedArray"
A1Bzinb <- newWave(A1Bmtx_sce_myh5, K=9, X=colData(A1Bmtx_sce)$batch, children=24, n_gene_par=1500, n_cell_par=30000, verbose=FALSE, commondispersion=FALSE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘newWave’ for signature ‘"DelayedMatrix"’
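
The error message itself comes from S4 method dispatch: a generic raises it when no method is registered for the class of the argument, regardless of how the object is stored. The toy generic below (all names are hypothetical and unrelated to NewWave's internals) reproduces the same kind of failure and shows that wrapping the matrix in the expected container resolves it.

```r
library(methods)

# Hypothetical container class and generic, for illustration only.
setClass("ToyExperiment", representation(assay = "matrix"))
setGeneric("toyFit", function(Y, ...) standardGeneric("toyFit"))

# A method is registered for the container only, not for bare matrices.
setMethod("toyFit", "ToyExperiment", function(Y, ...) dim(Y@assay))

m <- matrix(1:6, nrow = 2)

# Calling the generic on a bare matrix fails at dispatch.
err <- tryCatch(toyFit(m), error = function(e) conditionMessage(e))
print(err)

# Wrapping the matrix in the container makes dispatch succeed.
ok <- toyFit(new("ToyExperiment", assay = m))
print(ok)

# Check whether a method is registered directly for a given class.
existsMethod("toyFit", "matrix")
existsMethod("toyFit", "ToyExperiment")
```

With NewWave loaded, showMethods("newWave") from the methods package lists the signatures that the real generic accepts.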
