Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Largest model coefficients for barely expressed genes? #86

Open
gabprg opened this issue Dec 5, 2017 · 4 comments
Open

Largest model coefficients for barely expressed genes? #86

gabprg opened this issue Dec 5, 2017 · 4 comments

Comments

@gabprg
Copy link

gabprg commented Dec 5, 2017

Hi there!

I had a question regarding the interpretation of the hurdle model coefficients, in particular with respect to genes expressed at low levels. Specifically: after running MAST to obtain differentially expressed genes on single-cell RNA-Seq data, we have noticed that the genes with the largest continuous model coefficients correspond to genes that barely/don't change in expression between conditions, and are also barely expressed. We're unsure how to interpret this: intuitively we would interpret that large coefficients would mean these genes play a role in explaining the differences between groups, but this doesn't make sense if they don't change/almost aren't expressed. Is this an artifact of how the model fitting occurs in MAST, and should one pre-filter the data to only include genes expressed in a larger fraction of cells?

I've attached 2 plots to illustrate what I'm talking about. The first plots the coefficient for the continuous component vs the logFC, and the 2nd plots it vs the number of cells each gene is expressed in.

Any guidance would be appreciated, thanks!

Gabriela

@gfinak
Copy link
Member

gfinak commented Dec 5, 2017

Gabriela, I don't see your plots attached to the post. Can you paste the again, and also include a code snippet showing how you pre-filter (if at all) and fit the model?
Thanks
Greg

@gabprg
Copy link
Author

gabprg commented Dec 5, 2017

Sorry about the plots. They should be attached this time. As for the code, I extract the data from a Seurat object and then submit that to MAST.

Creation of the Seurat object (excludes genes expressed in less than 10 cells):
"
seur = CreateSeuratObject(raw.data = merged.10x, project = "S35", min.cells = 10, min.genes = 100, normalization.method = "LogNormalize", do.scale = TRUE, do.center = TRUE, names.field = 2, names.delim = "_")
"

Running of MAST (condition corresponds to treatment, orig.ident corresponds to sample of origin):
"

seur@meta.data$ngeneson = scale(seur@meta.data$nGene)[,1]

seur@meta.data$condition = as.factor(substr(as.character(seur@meta.data$orig.ident), 1, nchar(as.character(seur@meta.data$orig.ident))-1))

mast.simple.raw = FromMatrix(as.matrix(seur@data), droplevels(seur@meta.data))

mast.simple.zlmFit <- zlm(
formula = formula(~condition + (1|orig.ident) + ngeneson),
sca = mast.simple.raw,
parallel = TRUE,
method = 'glmer',
ebayes = FALSE
)
print("Summarizing results...")
sum.lrt <- MAST::summary(mast.simple.zlmFit, doLRT = T)
"

I have run MAST on subsets of this data without re-performing the filtering, so in certain instances you would have genes expressed in less than 10 cells. But we've seen this in a few different cases so far, which makes me think the exact threshold for including a gene (at least if that low) does not have much of an impact.

Thanks!
Gabriela

coefcvslogfc

coefcvsncellsexpr

@gfinak
Copy link
Member

gfinak commented Dec 5, 2017

I guess I'd be curious about 1. the standard errors around those estimates, 2. What is the expression across conditions, and 3. Whether the effect is still observed if you fit a model with empirical Bayes shrinkage.
Can you show the expression across conditions for some of these?
@amcdavid might have some insights as well.

@gabprg
Copy link
Author

gabprg commented Dec 5, 2017

Yep! Here's a plot with what the expression looks like for the top 9 genes with the largest absolute value for the continuous component coefficient. As for the standard error, do you mean just as in the width of the confidence interval which is also one of the MAST outputs? If so, I also plotted the coefficient vs the CI width for that coefficient.
samplegenesexprbycond
coefcvsciwidthc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants