Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alpha Diversity indices produce zeros where previously undefined (NaN) #2014

Open
ebolyen opened this issue Apr 30, 2024 · 18 comments
Open

Alpha Diversity indices produce zeros where previously undefined (NaN) #2014

ebolyen opened this issue Apr 30, 2024 · 18 comments

Comments

@ebolyen
Copy link
Contributor

ebolyen commented Apr 30, 2024

#1894 added short-circuits to produce 0 where previously the formula would produce NaN.

This is currently impacting QIIME 2's diversity measures and represents a breaking change from our perspective. For the moment, we're going to copy the effected indices prior to these updates into q2-diversity-lib, but we don't want this to be a long-term solution.

What was the rationale for this particular change?

Reviewing Pielou's J' for instance, there doesn't appear to be a definition for zero-feature samples, so the formula producing NaN seems appropriate as it isn't defined (this also lets the user know that they shouldn't interpret that sample). Shannon has a similar reasoning.

cc @gregcaporaso

@ebolyen
Copy link
Contributor Author

ebolyen commented Apr 30, 2024

(had the wrong PR linked, updated)

@mortonjt
Copy link
Collaborator

mortonjt commented May 1, 2024

Thanks @ebolyen I'm looking at the code more carefully and I'm coming to the same conclusion. If 0 is returned for samples with zero features, that would imply that samples with zero features have the same alpha diversity as samples with one feature (which is weird).

@qiyunzhu thoughts?

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 1, 2024

Hello all, I recently carefully reviewed the alpha diversity functionality in scikit-bio, and improved some of them. There are several things we need to be cautious about and try to improve:

  1. Empty community: It should return 0. The code should force that. Some metrics already have this implemented (I did them). Though not all of them. This is a dilemma that many original authors have also explored (e.g., as @mortonjt noticed, a sample with one species may have the same alpha diversity as a sample with zero species).

  2. It has been planned that we let all alpha diversity metrics to operate on sparse matrices provided by BIOM. Once this is implemented, some of the challenges discussed here will be gone. That is, instead of a vector with some zeros, you will start with a vector with only positive numbers. A community without any species will be an empty numpy array.

  3. Pielou's J (a.k.a., Shannon's equitability, EH): There is a logical problem in the current implementation: By definition, J is the ratio of observed Shannon Index vs. maximum Shannon Index allowed by the observed number of species. However, in scikit-bio, Shannon index is calculated by a base of 2, whereas Pielou's J is calculated by a base on e. This is problematic. I investigated literature and other software packages, as well as those of derivatives of Shannon (e.g., Pielou's J, and complex metrics like Hill number), and am under impression that most (if not all) calculates Shannon index using a base of e. This even includes SciPy's entropy. Therefore I would suggest that we replace 2 with e in scikit-bio.

I noticed this a long time ago. I planned to bring this up after more thorough literature review. However, given the current discussion, I think it is time for me to mention it.

@mortonjt
Copy link
Collaborator

mortonjt commented May 1, 2024

Should we have a base parameter that enables users to easily swap bases without thinking too much? There are certainly scenarios where base 2 is preferable (it is often easier to do mental arithmetic on base 2 or 10, instead of e=2.71828...).

I'm also curious, under what scenarios should we expect a sample to have 0 features? Shouldn't we have an assert statement rather than return 0? If it returns 0, its basically a silent error, since this scenario shouldn't exist in the first place.

EDIT: ignore the previous comment -- there are certainly scenarios (aka sterile communities) where there isn't anything present). The first comment still stands though -- treating communities with 0 species and 1 species still seems off, there should at least be a way to distinguish between these two scenarios.

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 1, 2024

I think that the ideal solution is to have a graduate student to do literature review of alpha diversity, to summarize what original authors thought and debated, and to implement "native" solutions.

In my limited literature review efforts, there were historical debates on this question. For example, should Faith's PD of a one-species community be zero? (Because there is no branch connecting species.) After debates, it was acknowledged that Faith's PD should include the root of the tree no matter what, therefore there is always some branches to calculate with.

Such metric-specific discussions and adjustments are everywhere. We should ideally address each of them specifically. However that costs human power. At present, one graduate student in my lab is spending some time on it. However, this amount of time is limited.

If such investment of human power is not feasible, I think that a workaround is to force all alpha diversity metrics of empty communities to be zero.

Re: a base parameter: scikit-bio's shannon has this parameter, which defaults to 2. Other relevant metrics (such as Pielou's J) doesn't have it. It is a good idea to add it to all metrics that are derived from Shannon.

@ebolyen
Copy link
Contributor Author

ebolyen commented May 1, 2024

I appreciate the discussion here, can I ask what the particular issue with NaN was? It's a good value for undefined, and is trivially converted to zero by code that doesn't care about the distinction. For our purposes, we do care that it's undefined, as the interpretation gets really fuzzy (and it is better that our users ask us "what the heck is NaN?" than to boldly proceed with zero).

Re: biom, what about numpy masked arrays, or dense arrays? Those will still exist, and I don't think it necessarily follows that an empty community has a richness or evenness of zero (wouldn't an empty community be maximally even while also being maximally depleted?). Again, fan of NaN here as it renders all future calculations on it as ill-considered.

Re: base, no argument there, although its worth mentioning that there is a difference in what Shannon means to information theory and what it means to ecologists. We've traditionally always gone with original information theory and so chose bits over nats. And it happens that J' is already in nats under skbio, so it does at least match in that respect.

@wasade
Copy link
Collaborator

wasade commented May 1, 2024

I'd like to suggest removing biom from this discussion. The transformation the model underneath operates on is inconsequential to the correctness and interpretation of the metrics.

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 1, 2024

@wasade Agreed. This can go to another discussion (metrics dealing with only positive numbers). I will leave the NaN vs 0 vs -1 discussion to @mortonjt

@ebolyen For the base. Shannon defaults to 2 and Pielou defaults to e in skbio at present. I think that 2 is more often used in the information theory than in ecology. The original Shannon (1948) paper said any base works, although in theorem 2 the equation is just log. Multiple alpha diversity metrics developed in the history, Rényi entropy, Hill number, phylogenetic Hill number, default to e. In particular, Hill number is a generalization of species richness, Shannon index, Simpson index, etc. It is good to have in scikit-bio. Calculations of those metrics require exponential operations (exp). Although 2 doesn't break the calculation, e will make more sense to the general and ecology audience.

@mortonjt
Copy link
Collaborator

mortonjt commented May 1, 2024

@ebolyen I think part of the issue here is that a 0-member community isn't necessarily ill-considered, since there could be real biological scenarios that do not have any measurements (i.e. qPCR targeting v4 in sterile placenta samples). Thus, throwing np.nan may give the user the impression that there is an error, when in fact there is no error.

I proposed to replace np.nan with -1. This has the following implications

  1. This will replicate the previous handling since -1 vs 0 will differentiate between 0-member and 1-member communities respectively.
  2. -1 is still a numerical value, which implies that the value isn't due to a numerical error. 0-member communities can happen in practice (not necessarily with sequencing technologies, but other tech such as qPCR on sterile samples) -- so flagging these samples rather than throwing "errors" would be ideal. Here -1 can act as both a numerical and a flag, without having the implication of a numerical error / missing value connotation associated with np.nan. This should be treated more of a semantic change to facilitate user interpretability rather than a major change in practical usage.

@gregcaporaso
Copy link
Contributor

I don't think using another (non-np.nan) numeric value is a good solution as it has the potential to result in silent "data integrity" errors downstream if users and other consumers of the data (e.g., regression models) don't have the required context that -1 should be treated as a flag for certain metrics. A lot of the relevant libraries that consume this information already have good support for handling np.nan values in mathematical operations.

We're set on the QIIME 2 side for now as we've moved the old implementations over to q2-diversity-lib to retain the old behavior as this is a backward incompatible change (see example below) and we do our best to not introduce those without advance notice. It's also critical that we update dependencies in this cycle as we're rapidly approaching end of life on Python 3.8, which is the version of Python that QIIME 2 2024.2 depends on (2024.5 will go to Python 3.9 as some plugins have that pinned - 2024.10 will go to Python 3.10).

Here's an example of how the change could impact QIIME 2 plugins:

In [1]: import pandas as pd

In [2]: import numpy as np

## previous behavior
# imagine these are vectors of alpha diversity scores (or an alpha diversity score and some metadata column)
In [5]: df1 = pd.DataFrame([(1,1), (2, np.nan), (3,3), (4,4), (np.nan,5), (6,6)])

In [6]: df1
Out[6]:
     0    1
0  1.0  1.0
1  2.0  NaN
2  3.0  3.0
3  4.0  4.0
4  NaN  5.0
5  6.0  6.0

# some call that could be made in a QIIME 2 action
In [7]: df1.dropna().corr()
Out[7]:
     0    1
0  1.0  1.0
1  1.0  1.0

## new behavior
In [8]: df2 = pd.DataFrame([(1,1), (2, -1), (3,3), (4,4), (-1,5), (6,6)])

In [9]: df2
Out[9]:
   0  1
0  1  1
1  2 -1
2  3  3
3  4  4
4 -1  5
5  6  6

# the same call now results in different output, so we can't integrate this without advance warning. 
# unrelated, but i would consider this to be invalid output
In [10]: df2.dropna().corr()
Out[10]:
          0         1
0  1.000000  0.315754
1  0.315754  1.000000

Keep us in the loop on how you decide to move forward with these methods.

Also, from what I can tell this change wasn't mentioned in the release notes or announced ahead of time anywhere. As a downstream dependency we really need to know about changes like this to be able to rely on scikit-bio. Was that just an oversight? (Totally understandable if so - it happens.)

@mortonjt
Copy link
Collaborator

mortonjt commented May 2, 2024

Hi @gregcaporaso I think the point regarding inputs into to downstream regression are appropriate. I'm in favor of reverting back to the original np.nan implementation, since it appears to cover the most edge cases.

Yes, this was missed during review and should have been flagged earlier. Thanks for catching @ebolyen @gregcaporaso

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 2, 2024

Hi @gregcaporaso @ebolyen Thanks for pointing out this issue. Fully agreed. I wanted to check one thing with you: Should the NaN behavior apply to Shannon and Pielou only, or any alpha diversity metric? For example, the observed features of an empty sample would be zero. Do you think it should be NaN as well?

In the previous code before any editing, there wasn't a universal mechanism to enforce zero or NaN. With this discussion, we can confirm whether a universal mechanism or specific ones are more appropriate.

Thanks!

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 2, 2024

Hi all, the edits I previously added to handle empty samples are in PR #1894 . Look at line 887 and below. It appears that the issue is not that straightforward. Prior to the edit, Shannon and multiple other matrics involves division by the sum of counts. When the sample is empty, this results in a zero division error. Therefore, the previous code could not deal with empty samples. This should be fixed. The question is how. Now we have four options:

A. 0
B. -1
C. NaN
D. Raise an error (I guess that's what previous skbio did)
E. metric-specific

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 2, 2024

The vegan manual says:

None of these diversity indices is usable for empty sampling units without any species, but some of the indices can give a numeric value. Filtering out these cases is left for the user.

In principle, you cannot study species composition without species and you should remove empty sites from community data.

@ebolyen
Copy link
Contributor Author

ebolyen commented May 2, 2024

I think the zero division error depends on the incoming number. A numpy-tagged float (i.e. the overloaded division operator) will warn and return NaN, but plain Python with 0.0/0.0 will raise an exception.

We have a number of tests to confirm the NaN behavior (which is how we noticed). I suspect coercing plain lists into an ndarray first would mostly prevent case D, but for any of our calling code we've always seen C (since we start with pandas series generally).

I think returning NaN on empty communities is reasonable, and am neutral on S_obs/observed_otus. That's the only index that comes to mind as reasonably defined as 0. As noted in Vegan, interpreting empty communities is not an ideal use of diversity measures (so I'm good with NaN if consistency is more important).

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 2, 2024

@ebolyen Thanks for the additional input! I agree with this.

All, I just did some investigation into the current and previous code. I am proposing the following solution:

  1. Uniformly return NaN for empty samples for all alpha diversity metrics.

This is based on the observation that most of the currently implemented metrics involves fraction or zero division. Specifically, they include:

  • berger_parker_d, brillouin_d, dominance, enspie, esty_ci, goods_coverage, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, pielou_e, robbins, shannon, simpson, simpson_e, strong

Only a few metrics don't suffer from this and a 0 value is meaningful:

  • doubles, sobs (i.e., observed features), singles, osd, faith_pd

We could choose to return zero for these few metrics. However, a difficulty is that these metrics are somehow related to the fraction-based metrics. For example, species richness and Shannon can be uniformized into Hill number; Faith' PD can be generalized into abundance-weighted phylogenetic diversity. If we make them special, we will get different results from the specialized and generalized metrics.

  1. The NaN should be achieved through a short circuit, rather than from zero division, which will generate a ValueError (Python case) or a RuntimeWarning (NumPy case).

  2. The helper function _validate_counts_vector can be modified to strip off all zero values. I hope that it can also automatically return NaN for the outer function, but it is not possible with Python. Therefore, each alpha diversity metric will check whether the return value of this helper function is empty, and returns NaN in that case.

I am going to work on this right away. Please let me know if you have other thoughts.

@ebolyen
Copy link
Contributor Author

ebolyen commented May 2, 2024

For example, species richness and Shannon can be uniformized into Hill number; Faith' PD can be generalized into abundance-weighted phylogenetic diversity. If we make them special, we will get different results from the specialized and generalized metrics.

I like this line of reasoning, but just to point out: 0 * NaN would produce NaN, so I think the generalized and specialized metrics would still both produce NaN on empty communities (even if standardized by a zero-producing measure).

I'm still fine with 0 becoming NaN for empty communities, but wanted to note the above.

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 4, 2024

For information, here are some test results.

scipy.stats.entropy (default base is e):

>>> entropy([1])
0.0
>>> entropy([0])
nan
>>> entropy([])
0.0

Some popular R packages:

single <- c(1)
zero <- c(0)
empty <- numeric(0)

Species richness (a.k.a. sobs):

vegan adiv entropart
zero 0 error N/A
empty 0 error N/A

Shannon index (note: base in all three packages is e):

vegan adiv entropart
single 0 0 0
zero 0 error N/A
empty 0 error N/A

Gini-Simpson (a.k.a. simpson):

vegan adiv entropart
single 0 0 0
zero 1 error N/A
empty 1 error N/A

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants