Broader control gene sets #30

kitscorp · 2022-09-16T11:31:05Z

Hey,

congratulations on your paper's acceptance!

You mention in the Discussion that, when using specialized data sets, it might make sense to select matched control genes based on a broad cell atlas in order to increase power. Could you explain how this can be done, e.g. based on the TMS data set? (Several of my colleagues who plan to work with scDRS are using specialized data sets, so I imagine this would be useful to more people.)

Also, do you have recommendations for the choice of this broad data set? I imagine that if I use a highly specialized data set and pair it with a broad data set that does not include the general type of tissue in my specialized data set at all, I might introduce a bias. What do you think? Would merging the data sets be an option, or would that be problematic / insufficient?

martinjzhang · 2022-09-17T01:34:36Z

Hi,

Thank you for the questions. We are actively investigating this direction. Consider the case where we have a target data set (e.g., a T cell data set) and a reference data set (e.g., TMS FACS). We are thinking of two routes forward:

1. Select the control genes based on the mean and variance of genes in the reference data set. This can be easily adapted into the current implementation: after calling scdrs.preprocess, overwrite the mean-variance bin column adata.uns["SCDRS_PARAM"]["GENE_STATS"]["mean_var"] with the mean-variance bins precomputed from the reference data. We are currently assessing whether this method controls type I error and produces consistent results across different reference data sets. We probably want to adjust for cell type proportion when computing the mean-variance bins from the reference data set.
2. Use existing data integration methods to integrate the target data set into the reference data set, then run scDRS on the integrated data set.
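
To make the first route more concrete, here is a minimal sketch of how mean-variance bin labels could be precomputed from per-gene statistics of a reference data set. It is not the scDRS implementation: the helper name reference_mean_var_bins, the rank-based binning, the "meanbin.varbin" label format, and the toy numbers are all assumptions for illustration, and it does not adjust for cell type proportion.

```python
from collections import defaultdict


def reference_mean_var_bins(ref_stats, n_mean_bin=20, n_var_bin=20):
    """Assign each reference gene a 'meanbin.varbin' label by rank.

    ref_stats maps gene name -> (mean, variance) computed on the reference
    data set. Hypothetical helper; the label format is an assumption.
    """
    # First split genes into equal-size bins by their rank in mean expression.
    genes_by_mean = sorted(ref_stats, key=lambda g: ref_stats[g][0])
    n_gene = len(genes_by_mean)
    by_mean_bin = defaultdict(list)
    for rank, gene in enumerate(genes_by_mean):
        by_mean_bin[rank * n_mean_bin // n_gene].append(gene)

    # Then, within each mean bin, split genes by their rank in variance.
    labels = {}
    for mean_bin, genes in by_mean_bin.items():
        genes_by_var = sorted(genes, key=lambda g: ref_stats[g][1])
        for rank, gene in enumerate(genes_by_var):
            var_bin = rank * n_var_bin // len(genes_by_var)
            labels[gene] = f"{mean_bin}.{var_bin}"
    return labels


# Toy reference statistics (made-up numbers): gene -> (mean, variance).
ref_stats = {
    "g1": (1.0, 0.4), "g2": (2.0, 0.1), "g3": (3.0, 0.3), "g4": (4.0, 0.2),
    "g5": (5.0, 1.0), "g6": (6.0, 2.0), "g7": (7.0, 3.0), "g8": (8.0, 4.0),
}
ref_bins = reference_mean_var_bins(ref_stats, n_mean_bin=2, n_var_bin=2)

# Hypothetical application after scdrs.preprocess(adata), assuming GENE_STATS
# is a pandas DataFrame indexed by gene name:
#   gene_stats = adata.uns["SCDRS_PARAM"]["GENE_STATS"]
#   gene_stats["mean_var"] = gene_stats.index.map(ref_bins)
```

Genes in the target data set that are missing from the reference would end up without a bin under this sketch, so they would need to be handled explicitly (for example, by restricting the analysis to shared genes).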

I personally favor the first option. We will release this feature after we are confident about it.

For the broad data set, I think single-cell atlases such as TMS FACS, TMS droplet, TS FACS, and TS droplet are great choices, since they contain most tissues in mouse/human. We probably want to match the sequencing technology and species between the target and reference data sets. Some tissue-specific data sets are also promising. For example, if I want to investigate the subtypes of neurons, a brain cell data set is probably more appropriate than a whole-organism data set.

I can't promise it but we will try to report what we find as soon as possible. Please let me know if you have further questions.

Best,
Martin

martinjzhang added the enhancement (New feature or request) and question (Further information is requested) labels Sep 17, 2022

martinjzhang mentioned this issue Sep 29, 2022

Data sparsity is a critical problem for the inference stability (ex. 10x data) #32

Closed

martinjzhang self-assigned this Sep 30, 2022

martinjzhang mentioned this issue Jun 4, 2023

perform-downstream outputs zero significant cells #64

Closed
