
Does LSI require all peaks for optimal results, especially when dealing with ultra-scale datasets? #108

Open
YH-Zheng opened this issue Nov 29, 2023 · 5 comments


@YH-Zheng

Hello, I currently have scATAC data with approximately 3.43 million cells and around 160,000 peaks. When I attempt LSI dimensionality reduction using all peaks, it takes an extremely long time (it had been running for more than a day before I eventually terminated it).

However, when I use the guidance graph to map highly variable genes from RNA onto ATAC, which leaves 15,868 highly variable peaks, LSI takes much less time and model training completes successfully. The final cell type transfer seems to work well, but when I visualize the merged ATAC and RNA embeddings, the cell subtypes are not as well separated as in the downsampled ATAC dataset. I wonder whether this is due to using only highly variable peaks.

As for training with RNA data, my dataset is also large. Currently, I'm employing random downsampling. Do you have any suggestions for handling such ultra-scale datasets?

@Jeff1995
Collaborator

Jeff1995 commented Jan 8, 2024

Hi @YH-Zheng. Thanks for your interest in GLUE! Yes, empirically speaking, LSI does work better when more peaks are included. Selecting highly variable peaks in the LSI step would likely result in lower cell type resolution.

If the number of cells is too large, one solution might be to obtain the loading matrix from subsampled cells and apply it to all cells, which is exactly what we did in the human fetal atlas integration.
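In case it helps, here is a minimal sketch of that subsample-then-project idea using scikit-learn and scipy rather than the actual atlas code; the TF-IDF variant, the component count, and names like `n_sub` are illustrative assumptions, and the details may differ from what scglue's own LSI helper does.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

def tfidf(X, idf):
    """TF-IDF transform: row-normalise counts (TF), scale peaks by IDF, then log1p."""
    X = sp.csr_matrix(X, dtype=np.float64)
    tf = X.multiply(1.0 / (X.sum(axis=1) + 1e-12))    # term frequency per cell
    return sp.csr_matrix((tf @ sp.diags(idf)) * 1e4).log1p()

def fit_lsi_on_subsample(X, n_components=50, n_sub=100_000, seed=0):
    """Fit IDF weights and the SVD loading matrix on a random subsample of cells."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=min(n_sub, X.shape[0]), replace=False)
    idf = X.shape[0] / (1.0 + np.asarray((X > 0).sum(axis=0)).ravel())  # peak document frequency
    svd = TruncatedSVD(n_components=n_components, algorithm="randomized", random_state=seed)
    svd.fit(tfidf(X[idx], idf))
    return idf, svd

def apply_lsi(X, idf, svd, batch=500_000):
    """Project all cells (in chunks) with the loadings learned on the subsample."""
    parts = [svd.transform(tfidf(X[i:i + batch], idf)) for i in range(0, X.shape[0], batch)]
    return np.vstack(parts)
```

The key point is only that the IDF weights and the SVD are fitted once on a manageable subset and then applied to every cell in chunks, so the full 3.43M-cell matrix never has to go through the SVD fit itself.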

@YH-Zheng
Author

YH-Zheng commented Jan 8, 2024

Thanks for your reply @Jeff1995. Do you mean it would be better to use all peaks in the LSI step? Is the subsampling scheme you mentioned similar to the metacell approach, i.e. integrating metacells to obtain cell type labels and then propagating them to the underlying single cells?

@Jeff1995
Collaborator

Jeff1995 commented Jan 8, 2024

I was thinking about randomly subsampling single cells, but using metacells would theoretically be a better choice as there is less information loss.

However, using metacells to obtain the LSI loading matrix can be tricky and needs extra caution: the aggregated ATAC profile of metacells may deviate from the distribution of the underlying single cells, so a loading matrix fitted on metacells could be suboptimal when applied to single cells.

@YH-Zheng
Author

Hi, @Jeff1995
I have read the code for the atlas section. However, it is fairly complex, and I am not entirely clear on how the organ and tissue weights affect the training results. In my dataset, all samples are derived from PBMCs and contain various cell types. Since the ATAC data lacks cell type annotations, it seems difficult for me to compute downsampling ratios based on cell types.

I noticed in the code that you used the downsampled data for the initial training of the GLUE model and treated it as pretraining for the entire dataset. Should I follow a similar approach? Should I downsample initially by a certain ratio, train once, and save the results as pretraining input for the entire dataset?

I also tried other computing frameworks to accelerate the LSI computation, such as the Mars framework. It performs well on small-scale data, but creating the required tensor from atac.X for the subsequent computation proved difficult, so I had to abandon computing LSI over all ATAC peaks.

I have successfully implemented LSI dimensionality reduction using HVG mappings from RNA, but the final results seem somewhat mediocre.

I would appreciate discussing the details further with you. Thanks a lot!

@Jeff1995
Collaborator

Jeff1995 commented Feb 2, 2024

Sorry for the late reply!

Regarding the first problem, the code in our experiment downsampled cells per organ to balance the organ distribution across modalities. You wouldn't need to do that unless you also have highly unbalanced cell types. Simple random downsampling would work.

As for the second problem: yes, I'd recommend pre-training the model on downsampled data if downsampling still retains a decent number of cells (say 10^4 cells), mainly because it saves time (you get a chance to check whether the model alignment is reasonable before fine-tuning on the whole dataset).
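A rough sketch of that workflow, assuming both AnnData objects have already been configured with `scglue.models.configure_dataset` and that `guidance` is the guidance graph as in the standard tutorials; the subsample sizes and output paths below are placeholders, and the fine-tuning step on the full data should follow the atlas scripts rather than this sketch.

```python
import scanpy as sc
import scglue

# Randomly downsample both modalities to a size that trains quickly (sizes are illustrative)
rna_sub = sc.pp.subsample(rna, n_obs=50_000, random_state=0, copy=True)
atac_sub = sc.pp.subsample(atac, n_obs=50_000, random_state=0, copy=True)

# Pre-train on the subsample and save the model
glue_pre = scglue.models.fit_SCGLUE(
    {"rna": rna_sub, "atac": atac_sub}, guidance,
    fit_kws={"directory": "glue_pretrain"},
)
glue_pre.save("glue_pretrain.dill")

# Inspect the alignment on the subsample (e.g. a UMAP of the joint embedding) before
# continuing; if it looks reasonable, use the saved model to initialise training on the
# full dataset, as done in the atlas scripts.
```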
