determining the less trustworthy log2fc values #27

Open
ceesu opened this issue Jul 30, 2021 · 4 comments

Comments

@ceesu

ceesu commented Jul 30, 2021

Hello, thanks very much for your package. I just want to follow up on this point from the vignette:

The large lfc values come from groups where nearly all counts are 0

It seems that, depending on my design, the threshold that separates the "three groups" of log2fc values can be as small as 5. I also got the warnings “encountered non-positive size factor estimates” and “singular gradient” when I was running glm_gp for the fit; I don't know whether they are related. I'm assuming these are still "large lfc values" even though they are < 20. Is there a better way you could recommend to separate out the genes with less trustworthy log2fc values than by looking visually?

@const-ae
Owner

const-ae commented Aug 6, 2021

Hi Cathy,

that is a fair question. If you could provide a reproducible example, I would be happy to discuss the specifics of the issues that you encountered. In the meantime, I will try to give some pointers that are hopefully already useful:

I also got the warnings “encountered non-positive size factor estimates”

This is a warning generated by scran::computeSumFactors. It might suggest that you have a wide range for the number of reads assigned to each cell. Do you do some quality control to remove poor quality cells?

I also got the warnings [...] and “singular gradient” when I was running glm_gp for the fit

That warning is interesting, as I am not sure where it is coming from. Here, I would need a reproducible example to say more.

Is there a better way you could recommend to separate out the genes with less trustworthy log2fc values than by looking visually?

In my opinion, the p-value associated with a log2fc is still the best measure of how credible a certain change is. By default, the p-value is calculated with a likelihood ratio test. However, you might also be interested in an earlier discussion about using the standard error associated with each coefficient fit as an alternative. For more details, see #12.

Best,
Constantin

@ceesu
Author

ceesu commented Oct 4, 2021

Sorry for this late response. I performed some filtering which may have dealt with the errors of “encountered non-positive size factor estimates” and “singular gradient” for now.

However, I am actually thinking about a case such as #22, because my plots show a similar distribution, and in that case the p-value is not always useful as a filter. In that issue it's suggested to do something such as setting all LFC above 15 to Inf. However, I've found that sometimes the threshold as determined by eye is smaller than 15. Do you have any suggestions for how I can discard lfc values from the two extremes of this 'pattern' systematically without looking by eye?

Thanks!

@const-ae
Owner

const-ae commented Oct 5, 2021

Hi Cathy,

thanks for reaching out again and for your feedback :)

However, I am actually thinking about a case such as #22, because my plots show a similar distribution, and in that case the p-value is not always useful as a filter. In that issue it's suggested to do something such as setting all LFC above 15 to Inf

Can you explain a bit more why the p-values are not a good filter? Note that the recommendation to change LFC > 15 to Inf is just for plotting. It uses the trick that ggplot automatically plots values with infinity on the boundary of the plot, which makes the plot look nicer.
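The plotting trick described above can be sketched roughly as follows. This is only an illustration in Python, not R or ggplot code: the function name `clamp_lfc_for_plotting` is hypothetical, and the default threshold of 15 is simply the value mentioned in #22. The clamped values are meant for display only and say nothing about which estimates are trustworthy.

```python
import math


def clamp_lfc_for_plotting(lfc_values, threshold=15.0):
    """Replace log2 fold changes beyond +/- threshold with +/- infinity.

    Hypothetical helper for illustration: the result is intended for
    plotting only, so that extreme values sit on the plot boundary.
    """
    clamped = []
    for value in lfc_values:
        if abs(value) > threshold:
            # Keep the sign of the original estimate.
            clamped.append(math.copysign(math.inf, value))
        else:
            clamped.append(value)
    return clamped


example = clamp_lfc_for_plotting([0.5, -20.0, 16.2, 15.0])
```

A smaller threshold (for example 5) can be passed in when the extreme cluster sits closer to zero, but the threshold still has to be picked by eye.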

However I've found sometimes the threshold as determined by eye is smaller than 15. Do you have any suggestions for how I can discard lfc values from the two extremes of this 'pattern' systematically without looking by eye?

Good question. Unfortunately, not really right now. The cause of the extreme LFC is that the parameter estimation algorithm converges to an extreme value if one of the groups consists of only zeros and the other group has non-zero counts. One option would be to specifically filter for such cases, but that can get quite complicated for more complex models.
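For the simplest case of a design with one grouping variable, the "filter for such cases" option could look something like the sketch below. It is written in plain Python for illustration and is not part of glmGamPoi; the function name `flag_zero_group_genes` and the input layout (a dict of per-gene count lists plus a list of group labels) are assumptions. It flags every gene for which at least one group contains only zeros, which is the situation that drives the estimate to an extreme value. It does not cover designs with several covariates.

```python
def flag_zero_group_genes(counts, groups):
    """Flag genes for which at least one group has only zero counts.

    counts: dict mapping gene name -> list of counts, one per sample
            (hypothetical layout, chosen for illustration)
    groups: list of group labels, one per sample, in the same order
    """
    flagged = {}
    for gene, values in counts.items():
        if len(values) != len(groups):
            raise ValueError(f"{gene}: expected {len(groups)} counts, got {len(values)}")
        # Sum the counts of this gene within each group.
        totals = {}
        for label, value in zip(groups, values):
            totals[label] = totals.get(label, 0) + value
        # Counts are non-negative, so a total of 0 means the group is all zeros.
        flagged[gene] = any(total == 0 for total in totals.values())
    return flagged


groups = ["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"]
counts = {
    "geneA": [0, 0, 0, 5, 8, 3],  # all zeros in ctrl
    "geneB": [4, 6, 5, 7, 9, 6],  # expressed in both groups
    "geneC": [0, 0, 0, 0, 0, 0],  # all zeros everywhere
}
flags = flag_zero_group_genes(counts, groups)
```

Genes flagged this way can then be reported separately instead of being cut off at a visually chosen LFC threshold.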

Best, Constantin

@ceesu
Author

ceesu commented Oct 6, 2021

Thanks for your reply!

Can you explain a bit more why the p-values are not a good filter? Note that the recommendation to change LFC > 15 to Inf is just for plotting. It uses the trick that ggplot automatically plots values with infinity on the boundary of the plot, which makes the plot look nicer.

My thinking is that for some of these genes, because the counts are much smaller in one group, the lfc might not be trustworthy even if the p-value is very small (which I am seeing sometimes). I guess this should be partly dealt with by filtering, but as you mention, it's complicated to perform this filtering in a way that accounts for multiple types of groups.
