Broader control gene sets #30

Open
kitscorp opened this issue Sep 16, 2022 · 1 comment
Labels
enhancement New feature or request question Further information is requested

Comments

@kitscorp
kitscorp commented Sep 16, 2022

Hey,

congratulations on your paper's acceptance!

You mention in the Discussion that it might make sense when using specialized data sets to select matched control genes based on a broad cell atlas in order to increase power. Could you explain how this can be done, e.g. based on the TMS data set? (Several of my colleagues planning to work with scDRS are using specialized data sets, so I imagine it might be quite useful to more people).

Also, do you have recommendations for the choice of this broad data set? I imagine that if I use a highly specialized data set and pair it with a broad data set that does not include the general type of tissue in my specialized data set at all, I might introduce a bias. What do you think? Would merging the data sets be an option, or would that be problematic / insufficient?

@martinjzhang
Owner

Hi,

Thank you for the questions. We are actively investigating this direction. Consider the case where we have a target data set (e.g., a T cell data set) and a reference data set (e.g., TMS FACS). We are thinking of two routes forward:

  1. Select the control genes based on the mean and variance of genes in the reference data set. This can be easily adapted to the current implementation: after calling scdrs.preprocess, overwrite the mean-variance bin column adata.uns["SCDRS_PARAM"]["GENE_STATS"]["mean_var"] with the mean-variance bins precomputed from the reference data. We are currently assessing whether this method controls type I error and produces consistent results across different reference data sets. We probably want to adjust for cell type proportions when computing the mean-variance bins from the reference data set.

  2. Use existing data integration methods to integrate the target data set into the reference data set, before running scDRS on the integrated data set.
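Route 1 can be illustrated with a small, dependency-free sketch. It is not the scDRS implementation: the helper names (rank_bins, mean_var_bins, bins_for_target), the "meanbin.varbin" label format, and the "unmatched" placeholder for genes missing from the reference are all assumptions made for illustration. It bins genes by their reference mean, then by their reference variance within each mean bin, and maps those labels onto the target gene list.

```python
from typing import Dict, Iterable, List, Tuple


def rank_bins(values: Dict[str, float], n_bins: int) -> Dict[str, int]:
    """Assign each gene to one of n_bins roughly equal-sized bins by rank."""
    ordered = sorted(values, key=lambda g: (values[g], g))  # ties broken by name
    n = len(ordered)
    return {g: (rank * n_bins) // n for rank, g in enumerate(ordered)}


def mean_var_bins(
    ref_stats: Dict[str, Tuple[float, float]],
    n_mean_bin: int = 20,
    n_var_bin: int = 20,
) -> Dict[str, str]:
    """Bin genes by reference mean, then by reference variance within each mean bin.

    ref_stats maps gene -> (mean, variance) computed on the reference data set.
    The "meanbin.varbin" label format is an assumption of this sketch.
    """
    mean_bin = rank_bins({g: mv[0] for g, mv in ref_stats.items()}, n_mean_bin)
    labels: Dict[str, str] = {}
    for b in sorted(set(mean_bin.values())):
        genes = [g for g in ref_stats if mean_bin[g] == b]
        var_bin = rank_bins({g: ref_stats[g][1] for g in genes}, n_var_bin)
        for g in genes:
            labels[g] = f"{b}.{var_bin[g]}"
    return labels


def bins_for_target(
    target_genes: Iterable[str],
    ref_labels: Dict[str, str],
    missing: str = "unmatched",  # hypothetical placeholder; handling is an open choice
) -> List[str]:
    """Look up the reference bin of each target gene, in target order."""
    return [ref_labels.get(g, missing) for g in target_genes]


# Toy reference statistics: gene -> (mean, variance).
ref_stats = {"A": (0.1, 0.5), "B": (0.2, 0.1), "C": (1.0, 2.0), "D": (2.0, 1.0)}
ref_labels = mean_var_bins(ref_stats, n_mean_bin=2, n_var_bin=2)
target_bins = bins_for_target(["B", "C", "E"], ref_labels)

# With scDRS (not run here), the overwrite described in route 1 would look like
# the following, assuming GENE_STATS is a pandas DataFrame indexed by gene name:
#   df_gene = adata.uns["SCDRS_PARAM"]["GENE_STATS"]
#   df_gene["mean_var"] = bins_for_target(df_gene.index, ref_labels)
```

Genes that are absent from the reference all land in a single placeholder bin here, which is unlikely to be a good match in practice; dropping them or keeping their target-derived bins are alternatives worth considering.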

I personally favor the first option. We will release this feature after we are confident about it.

For the broad data set, I think single-cell atlases such as TMS FACS, TMS droplet, TS FACS, and TS droplet are great choices because they contain most tissues in mouse/human. We probably want to match the sequencing technology and species between the target and reference data sets. Some tissue-specific data sets are also promising. For example, to investigate subtypes of neurons, a brain cell data set is probably more appropriate than a whole-organism data set.

I can't promise it, but we will try to report what we find as soon as possible. Please let me know if you have further questions.

Best,
Martin
