-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alpha Diversity indices produce zeros where previously undefined (NaN) #2014
Comments
(had the wrong PR linked, updated) |
Hello all, I recently carefully reviewed the alpha diversity functionality in scikit-bio, and improved some of them. There are several things we need to be cautious about and try to improve:
I noticed this a long time ago. I planned to bring this up after more thorough literature review. However, given the current discussion, I think it is time for me to mention it. |
Should we have a I'm also curious, under what scenarios should we expect a sample to have 0 features? Shouldn't we have an assert statement rather than return 0? If it returns 0, its basically a silent error, since this scenario shouldn't exist in the first place. EDIT: ignore the previous comment -- there are certainly scenarios (aka sterile communities) where there isn't anything present). The first comment still stands though -- treating communities with 0 species and 1 species still seems off, there should at least be a way to distinguish between these two scenarios. |
I think that the ideal solution is to have a graduate student to do literature review of alpha diversity, to summarize what original authors thought and debated, and to implement "native" solutions. In my limited literature review efforts, there were historical debates on this question. For example, should Faith's PD of a one-species community be zero? (Because there is no branch connecting species.) After debates, it was acknowledged that Faith's PD should include the root of the tree no matter what, therefore there is always some branches to calculate with. Such metric-specific discussions and adjustments are everywhere. We should ideally address each of them specifically. However that costs human power. At present, one graduate student in my lab is spending some time on it. However, this amount of time is limited. If such investment of human power is not feasible, I think that a workaround is to force all alpha diversity metrics of empty communities to be zero. Re: a |
I appreciate the discussion here, can I ask what the particular issue with Re: biom, what about numpy masked arrays, or dense arrays? Those will still exist, and I don't think it necessarily follows that an empty community has a richness or evenness of zero (wouldn't an empty community be maximally even while also being maximally depleted?). Again, fan of Re: |
I'd like to suggest removing biom from this discussion. The transformation the model underneath operates on is inconsequential to the correctness and interpretation of the metrics. |
@wasade Agreed. This can go to another discussion (metrics dealing with only positive numbers). I will leave the @ebolyen For the base. Shannon defaults to |
@ebolyen I think part of the issue here is that a 0-member community isn't necessarily ill-considered, since there could be real biological scenarios that do not have any measurements (i.e. qPCR targeting v4 in sterile placenta samples). Thus, throwing I proposed to replace
|
I don't think using another (non- We're set on the QIIME 2 side for now as we've moved the old implementations over to Here's an example of how the change could impact QIIME 2 plugins: In [1]: import pandas as pd
In [2]: import numpy as np
## previous behavior
# imagine these are vectors of alpha diversity scores (or an alpha diversity score and some metadata column)
In [5]: df1 = pd.DataFrame([(1,1), (2, np.nan), (3,3), (4,4), (np.nan,5), (6,6)])
In [6]: df1
Out[6]:
0 1
0 1.0 1.0
1 2.0 NaN
2 3.0 3.0
3 4.0 4.0
4 NaN 5.0
5 6.0 6.0
# some call that could be made in a QIIME 2 action
In [7]: df1.dropna().corr()
Out[7]:
0 1
0 1.0 1.0
1 1.0 1.0
## new behavior
In [8]: df2 = pd.DataFrame([(1,1), (2, -1), (3,3), (4,4), (-1,5), (6,6)])
In [9]: df2
Out[9]:
0 1
0 1 1
1 2 -1
2 3 3
3 4 4
4 -1 5
5 6 6
# the same call now results in different output, so we can't integrate this without advance warning.
# unrelated, but i would consider this to be invalid output
In [10]: df2.dropna().corr()
Out[10]:
0 1
0 1.000000 0.315754
1 0.315754 1.000000 Keep us in the loop on how you decide to move forward with these methods. Also, from what I can tell this change wasn't mentioned in the release notes or announced ahead of time anywhere. As a downstream dependency we really need to know about changes like this to be able to rely on scikit-bio. Was that just an oversight? (Totally understandable if so - it happens.) |
Hi @gregcaporaso I think the point regarding inputs into to downstream regression are appropriate. I'm in favor of reverting back to the original Yes, this was missed during review and should have been flagged earlier. Thanks for catching @ebolyen @gregcaporaso |
Hi @gregcaporaso @ebolyen Thanks for pointing out this issue. Fully agreed. I wanted to check one thing with you: Should the NaN behavior apply to Shannon and Pielou only, or any alpha diversity metric? For example, the observed features of an empty sample would be zero. Do you think it should be NaN as well? In the previous code before any editing, there wasn't a universal mechanism to enforce zero or NaN. With this discussion, we can confirm whether a universal mechanism or specific ones are more appropriate. Thanks! |
Hi all, the edits I previously added to handle empty samples are in PR #1894 . Look at line 887 and below. It appears that the issue is not that straightforward. Prior to the edit, Shannon and multiple other matrics involves division by the sum of counts. When the sample is empty, this results in a zero division error. Therefore, the previous code could not deal with empty samples. This should be fixed. The question is how. Now we have four options: A. 0 |
The vegan manual says:
|
I think the zero division error depends on the incoming number. A numpy-tagged float (i.e. the overloaded division operator) will warn and return We have a number of tests to confirm the NaN behavior (which is how we noticed). I suspect coercing plain lists into an ndarray first would mostly prevent case I think returning |
@ebolyen Thanks for the additional input! I agree with this. All, I just did some investigation into the current and previous code. I am proposing the following solution:
This is based on the observation that most of the currently implemented metrics involves fraction or zero division. Specifically, they include:
Only a few metrics don't suffer from this and a 0 value is meaningful:
We could choose to return zero for these few metrics. However, a difficulty is that these metrics are somehow related to the fraction-based metrics. For example, species richness and Shannon can be uniformized into Hill number; Faith' PD can be generalized into abundance-weighted phylogenetic diversity. If we make them special, we will get different results from the specialized and generalized metrics.
I am going to work on this right away. Please let me know if you have other thoughts. |
I like this line of reasoning, but just to point out: I'm still fine with |
For information, here are some test results.
>>> entropy([1])
0.0
>>> entropy([0])
nan
>>> entropy([])
0.0 Some popular R packages: single <- c(1)
zero <- c(0)
empty <- numeric(0) Species richness (a.k.a. sobs):
Shannon index (note: base in all three packages is
Gini-Simpson (a.k.a. simpson):
|
#1894 added short-circuits to produce
0
where previously the formula would produceNaN
.This is currently impacting QIIME 2's diversity measures and represents a breaking change from our perspective. For the moment, we're going to copy the effected indices prior to these updates into q2-diversity-lib, but we don't want this to be a long-term solution.
What was the rationale for this particular change?
Reviewing Pielou's
J'
for instance, there doesn't appear to be a definition for zero-feature samples, so the formula producingNaN
seems appropriate as it isn't defined (this also lets the user know that they shouldn't interpret that sample). Shannon has a similar reasoning.cc @gregcaporaso
The text was updated successfully, but these errors were encountered: