
Does LSI require all peaks for optimal results, especially when dealing with ultra-scale datasets? #108

Open
YH-Zheng opened this issue Nov 29, 2023 · 5 comments


@YH-Zheng

Hello, I currently have scATAC data with approximately 3.43 million cells and around 160,000 peaks. When I attempt LSI dimensionality reduction using all peaks, it takes an extremely long time (it had been running for more than a day before I eventually terminated it).

However, when I use the guidance graph to map highly variable genes from RNA onto ATAC, which leaves 15,868 highly variable peaks, LSI takes much less time and model training completes successfully. The final cell type transfer seems to work well, but when I visualize the merged ATAC and RNA embeddings, the cell subtypes are not as well separated as in the downsampled ATAC dataset. I wonder whether this is due to using only highly variable peaks.

As for training with RNA data, my dataset is also large. Currently, I'm employing random downsampling. Do you have any suggestions for handling such ultra-scale datasets?

@Jeff1995
Collaborator

Jeff1995 commented Jan 8, 2024

Hi @YH-Zheng. Thanks for your interest in GLUE! Yes, empirically speaking, LSI does work better when more peaks are included. Selecting highly variable peaks in the LSI step would likely result in lower cell type resolution.

If the number of cells is too large, one solution might be to obtain the loading matrix from subsampled cells and apply it to all cells, which is exactly what we did in the human fetal atlas integration.
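In case it helps, here is a minimal sketch of that subsample-then-project idea using scikit-learn and scipy rather than the actual atlas code; the TF-IDF variant, the component count, and names like `n_sub` are illustrative assumptions, and the details may differ from what scglue's own LSI helper does.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

def tfidf(X, idf):
    """TF-IDF transform: row-normalise counts (TF), scale peaks by IDF, then log1p."""
    X = sp.csr_matrix(X, dtype=np.float64)
    tf = X.multiply(1.0 / (X.sum(axis=1) + 1e-12))    # term frequency per cell
    return sp.csr_matrix((tf @ sp.diags(idf)) * 1e4).log1p()

def fit_lsi_on_subsample(X, n_components=50, n_sub=100_000, seed=0):
    """Fit IDF weights and the SVD loading matrix on a random subsample of cells."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=min(n_sub, X.shape[0]), replace=False)
    idf = X.shape[0] / (1.0 + np.asarray((X > 0).sum(axis=0)).ravel())  # peak document frequency
    svd = TruncatedSVD(n_components=n_components, algorithm="randomized", random_state=seed)
    svd.fit(tfidf(X[idx], idf))
    return idf, svd

def apply_lsi(X, idf, svd, batch=500_000):
    """Project all cells (in chunks) with the loadings learned on the subsample."""
    parts = [svd.transform(tfidf(X[i:i + batch], idf)) for i in range(0, X.shape[0], batch)]
    return np.vstack(parts)
```

The key point is only that the IDF weights and the SVD are fitted once on a manageable subset and then applied to every cell in chunks, so the full 3.43M-cell matrix never has to go through the SVD fit itself.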

@YH-Zheng
Author

YH-Zheng commented Jan 8, 2024

Thanks for your reply @Jeff1995. Do you mean it would be better to use all peaks in the LSI step? Is the subsampling scheme you mentioned similar to the metacell approach, i.e. integrating metacells to obtain cell type labels and then propagating them to the underlying single cells?

@Jeff1995
Collaborator

Jeff1995 commented Jan 8, 2024

I was thinking about randomly subsampling single cells, but using metacells would theoretically be a better choice as there is less information loss.

However, using metacells to obtain the LSI loading matrix can be tricky and needs extra caution: the aggregated ATAC profile of metacells may deviate from the distribution of the underlying single cells, so a loading matrix fitted on metacells could be suboptimal when applied to single cells.

@YH-Zheng
Author

Hi, @Jeff1995
I have read the code for the atlas section. However, it is fairly complex, and I am not entirely clear on how the organ and tissue weights affect the training results. In my dataset, all samples are derived from PBMCs and contain various cell types. Since the ATAC data lacks cell type annotations, it seems difficult for me to compute downsampling ratios based on cell types.

I noticed in the code that you used the downsampled data for the initial training of the GLUE model and treated it as pretraining for the entire dataset. Should I follow a similar approach? Should I downsample initially by a certain ratio, train once, and save the results as pretraining input for the entire dataset?

I also tried other computing frameworks to accelerate the LSI computation, such as the Mars framework. It performs well on small-scale data, but creating the required tensor from atac.X for the subsequent computation proved difficult, so I had to abandon computing LSI over all ATAC peaks.

I have successfully implemented LSI dimensionality reduction using HVG mappings from RNA, but the final results seem somewhat mediocre.

I would appreciate discussing the details further with you. Thanks a lot!

@Jeff1995
Collaborator

Jeff1995 commented Feb 2, 2024

Sorry for the late reply!

Regarding the first problem, the code in our experiment downsampled cells per organ to balance the organ distribution across modalities. You wouldn't need to do that unless you also have highly unbalanced cell types. Simple random downsampling would work.

As for the second problem: yes, I'd recommend pre-training the model on downsampled data if downsampling still retains a decent number of cells (say 10^4 cells), mainly because it saves time (you get a chance to check whether the model alignment is reasonable before fine-tuning on the whole dataset).
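A rough sketch of that workflow, assuming both AnnData objects have already been configured with `scglue.models.configure_dataset` and that `guidance` is the guidance graph as in the standard tutorials; the subsample sizes and output paths below are placeholders, and the fine-tuning step on the full data should follow the atlas scripts rather than this sketch.

```python
import scanpy as sc
import scglue

# Randomly downsample both modalities to a size that trains quickly (sizes are illustrative)
rna_sub = sc.pp.subsample(rna, n_obs=50_000, random_state=0, copy=True)
atac_sub = sc.pp.subsample(atac, n_obs=50_000, random_state=0, copy=True)

# Pre-train on the subsample and save the model
glue_pre = scglue.models.fit_SCGLUE(
    {"rna": rna_sub, "atac": atac_sub}, guidance,
    fit_kws={"directory": "glue_pretrain"},
)
glue_pre.save("glue_pretrain.dill")

# Inspect the alignment on the subsample (e.g. a UMAP of the joint embedding) before
# continuing; if it looks reasonable, use the saved model to initialise training on the
# full dataset, as done in the atlas scripts.
```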
