Alpha Diversity indices produce zeros where previously undefined (NaN) #2014

ebolyen · 2024-04-30T18:54:54Z

#1894 added short-circuits to produce 0 where previously the formula would produce NaN.

This is currently impacting QIIME 2's diversity measures and represents a breaking change from our perspective. For the moment, we're going to copy the effected indices prior to these updates into q2-diversity-lib, but we don't want this to be a long-term solution.

What was the rationale for this particular change?

Reviewing Pielou's J' for instance, there doesn't appear to be a definition for zero-feature samples, so the formula producing NaN seems appropriate as it isn't defined (this also lets the user know that they shouldn't interpret that sample). Shannon has a similar reasoning.

cc @gregcaporaso

The text was updated successfully, but these errors were encountered:

ebolyen · 2024-04-30T19:10:28Z

(had the wrong PR linked, updated)

mortonjt · 2024-05-01T15:30:25Z

Thanks @ebolyen I'm looking at the code more carefully and I'm coming to the same conclusion. If 0 is returned for samples with zero features, that would imply that samples with zero features have the same alpha diversity as samples with one feature (which is weird).

@qiyunzhu thoughts?

qiyunzhu · 2024-05-01T16:16:53Z

Hello all, I recently carefully reviewed the alpha diversity functionality in scikit-bio, and improved some of them. There are several things we need to be cautious about and try to improve:

Empty community: It should return 0. The code should force that. Some metrics already have this implemented (I did them). Though not all of them. This is a dilemma that many original authors have also explored (e.g., as @mortonjt noticed, a sample with one species may have the same alpha diversity as a sample with zero species).
It has been planned that we let all alpha diversity metrics to operate on sparse matrices provided by BIOM. Once this is implemented, some of the challenges discussed here will be gone. That is, instead of a vector with some zeros, you will start with a vector with only positive numbers. A community without any species will be an empty numpy array.
Pielou's J (a.k.a., Shannon's equitability, E_H): There is a logical problem in the current implementation: By definition, J is the ratio of observed Shannon Index vs. maximum Shannon Index allowed by the observed number of species. However, in scikit-bio, Shannon index is calculated by a base of 2, whereas Pielou's J is calculated by a base on e. This is problematic. I investigated literature and other software packages, as well as those of derivatives of Shannon (e.g., Pielou's J, and complex metrics like Hill number), and am under impression that most (if not all) calculates Shannon index using a base of e. This even includes SciPy's entropy. Therefore I would suggest that we replace 2 with e in scikit-bio.

I noticed this a long time ago. I planned to bring this up after more thorough literature review. However, given the current discussion, I think it is time for me to mention it.

mortonjt · 2024-05-01T16:26:08Z

Should we have a base parameter that enables users to easily swap bases without thinking too much? There are certainly scenarios where base 2 is preferable (it is often easier to do mental arithmetic on base 2 or 10, instead of e=2.71828...).

I'm also curious, under what scenarios should we expect a sample to have 0 features? Shouldn't we have an assert statement rather than return 0? If it returns 0, its basically a silent error, since this scenario shouldn't exist in the first place.

EDIT: ignore the previous comment -- there are certainly scenarios (aka sterile communities) where there isn't anything present). The first comment still stands though -- treating communities with 0 species and 1 species still seems off, there should at least be a way to distinguish between these two scenarios.

qiyunzhu · 2024-05-01T16:34:44Z

I think that the ideal solution is to have a graduate student to do literature review of alpha diversity, to summarize what original authors thought and debated, and to implement "native" solutions.

In my limited literature review efforts, there were historical debates on this question. For example, should Faith's PD of a one-species community be zero? (Because there is no branch connecting species.) After debates, it was acknowledged that Faith's PD should include the root of the tree no matter what, therefore there is always some branches to calculate with.

Such metric-specific discussions and adjustments are everywhere. We should ideally address each of them specifically. However that costs human power. At present, one graduate student in my lab is spending some time on it. However, this amount of time is limited.

If such investment of human power is not feasible, I think that a workaround is to force all alpha diversity metrics of empty communities to be zero.

Re: a base parameter: scikit-bio's shannon has this parameter, which defaults to 2. Other relevant metrics (such as Pielou's J) doesn't have it. It is a good idea to add it to all metrics that are derived from Shannon.

ebolyen · 2024-05-01T19:19:10Z

I appreciate the discussion here, can I ask what the particular issue with NaN was? It's a good value for undefined, and is trivially converted to zero by code that doesn't care about the distinction. For our purposes, we do care that it's undefined, as the interpretation gets really fuzzy (and it is better that our users ask us "what the heck is NaN?" than to boldly proceed with zero).

Re: biom, what about numpy masked arrays, or dense arrays? Those will still exist, and I don't think it necessarily follows that an empty community has a richness or evenness of zero (wouldn't an empty community be maximally even while also being maximally depleted?). Again, fan of NaN here as it renders all future calculations on it as ill-considered.

Re: base, no argument there, although its worth mentioning that there is a difference in what Shannon means to information theory and what it means to ecologists. We've traditionally always gone with original information theory and so chose bits over nats. And it happens that J' is already in nats under skbio, so it does at least match in that respect.

wasade · 2024-05-01T19:28:08Z

I'd like to suggest removing biom from this discussion. The transformation the model underneath operates on is inconsequential to the correctness and interpretation of the metrics.

qiyunzhu · 2024-05-01T19:41:08Z

@wasade Agreed. This can go to another discussion (metrics dealing with only positive numbers). I will leave the NaN vs 0 vs -1 discussion to @mortonjt

@ebolyen For the base. Shannon defaults to 2 and Pielou defaults to e in skbio at present. I think that 2 is more often used in the information theory than in ecology. The original Shannon (1948) paper said any base works, although in theorem 2 the equation is just log. Multiple alpha diversity metrics developed in the history, Rényi entropy, Hill number, phylogenetic Hill number, default to e. In particular, Hill number is a generalization of species richness, Shannon index, Simpson index, etc. It is good to have in scikit-bio. Calculations of those metrics require exponential operations (exp). Although 2 doesn't break the calculation, e will make more sense to the general and ecology audience.

mortonjt · 2024-05-01T19:45:20Z

@ebolyen I think part of the issue here is that a 0-member community isn't necessarily ill-considered, since there could be real biological scenarios that do not have any measurements (i.e. qPCR targeting v4 in sterile placenta samples). Thus, throwing np.nan may give the user the impression that there is an error, when in fact there is no error.

I proposed to replace np.nan with -1. This has the following implications

This will replicate the previous handling since -1 vs 0 will differentiate between 0-member and 1-member communities respectively.
-1 is still a numerical value, which implies that the value isn't due to a numerical error. 0-member communities can happen in practice (not necessarily with sequencing technologies, but other tech such as qPCR on sterile samples) -- so flagging these samples rather than throwing "errors" would be ideal. Here -1 can act as both a numerical and a flag, without having the implication of a numerical error / missing value connotation associated with np.nan. This should be treated more of a semantic change to facilitate user interpretability rather than a major change in practical usage.

gregcaporaso · 2024-05-02T13:56:58Z

I don't think using another (non-np.nan) numeric value is a good solution as it has the potential to result in silent "data integrity" errors downstream if users and other consumers of the data (e.g., regression models) don't have the required context that -1 should be treated as a flag for certain metrics. A lot of the relevant libraries that consume this information already have good support for handling np.nan values in mathematical operations.

We're set on the QIIME 2 side for now as we've moved the old implementations over to q2-diversity-lib to retain the old behavior as this is a backward incompatible change (see example below) and we do our best to not introduce those without advance notice. It's also critical that we update dependencies in this cycle as we're rapidly approaching end of life on Python 3.8, which is the version of Python that QIIME 2 2024.2 depends on (2024.5 will go to Python 3.9 as some plugins have that pinned - 2024.10 will go to Python 3.10).

Here's an example of how the change could impact QIIME 2 plugins:

In [1]: import pandas as pd

In [2]: import numpy as np

## previous behavior
# imagine these are vectors of alpha diversity scores (or an alpha diversity score and some metadata column)
In [5]: df1 = pd.DataFrame([(1,1), (2, np.nan), (3,3), (4,4), (np.nan,5), (6,6)])

In [6]: df1
Out[6]:
     0    1
0  1.0  1.0
1  2.0  NaN
2  3.0  3.0
3  4.0  4.0
4  NaN  5.0
5  6.0  6.0

# some call that could be made in a QIIME 2 action
In [7]: df1.dropna().corr()
Out[7]:
     0    1
0  1.0  1.0
1  1.0  1.0

## new behavior
In [8]: df2 = pd.DataFrame([(1,1), (2, -1), (3,3), (4,4), (-1,5), (6,6)])

In [9]: df2
Out[9]:
   0  1
0  1  1
1  2 -1
2  3  3
3  4  4
4 -1  5
5  6  6

# the same call now results in different output, so we can't integrate this without advance warning. 
# unrelated, but i would consider this to be invalid output
In [10]: df2.dropna().corr()
Out[10]:
          0         1
0  1.000000  0.315754
1  0.315754  1.000000

Keep us in the loop on how you decide to move forward with these methods.

Also, from what I can tell this change wasn't mentioned in the release notes or announced ahead of time anywhere. As a downstream dependency we really need to know about changes like this to be able to rely on scikit-bio. Was that just an oversight? (Totally understandable if so - it happens.)

mortonjt · 2024-05-02T14:08:20Z

Hi @gregcaporaso I think the point regarding inputs into to downstream regression are appropriate. I'm in favor of reverting back to the original np.nan implementation, since it appears to cover the most edge cases.

Yes, this was missed during review and should have been flagged earlier. Thanks for catching @ebolyen @gregcaporaso

qiyunzhu · 2024-05-02T14:41:43Z

Hi @gregcaporaso @ebolyen Thanks for pointing out this issue. Fully agreed. I wanted to check one thing with you: Should the NaN behavior apply to Shannon and Pielou only, or any alpha diversity metric? For example, the observed features of an empty sample would be zero. Do you think it should be NaN as well?

In the previous code before any editing, there wasn't a universal mechanism to enforce zero or NaN. With this discussion, we can confirm whether a universal mechanism or specific ones are more appropriate.

Thanks!

qiyunzhu · 2024-05-02T16:21:23Z

Hi all, the edits I previously added to handle empty samples are in PR #1894 . Look at line 887 and below. It appears that the issue is not that straightforward. Prior to the edit, Shannon and multiple other matrics involves division by the sum of counts. When the sample is empty, this results in a zero division error. Therefore, the previous code could not deal with empty samples. This should be fixed. The question is how. Now we have four options:

A. 0
B. -1
C. NaN
D. Raise an error (I guess that's what previous skbio did)
E. metric-specific

qiyunzhu · 2024-05-02T16:27:11Z

The vegan manual says:

None of these diversity indices is usable for empty sampling units without any species, but some of the indices can give a numeric value. Filtering out these cases is left for the user.

In principle, you cannot study species composition without species and you should remove empty sites from community data.

ebolyen · 2024-05-02T17:04:16Z

I think the zero division error depends on the incoming number. A numpy-tagged float (i.e. the overloaded division operator) will warn and return NaN, but plain Python with 0.0/0.0 will raise an exception.

We have a number of tests to confirm the NaN behavior (which is how we noticed). I suspect coercing plain lists into an ndarray first would mostly prevent case D, but for any of our calling code we've always seen C (since we start with pandas series generally).

I think returning NaN on empty communities is reasonable, and am neutral on S_obs/observed_otus. That's the only index that comes to mind as reasonably defined as 0. As noted in Vegan, interpreting empty communities is not an ideal use of diversity measures (so I'm good with NaN if consistency is more important).

qiyunzhu · 2024-05-02T18:00:32Z

@ebolyen Thanks for the additional input! I agree with this.

All, I just did some investigation into the current and previous code. I am proposing the following solution:

Uniformly return NaN for empty samples for all alpha diversity metrics.

This is based on the observation that most of the currently implemented metrics involves fraction or zero division. Specifically, they include:

berger_parker_d, brillouin_d, dominance, enspie, esty_ci, goods_coverage, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, pielou_e, robbins, shannon, simpson, simpson_e, strong

Only a few metrics don't suffer from this and a 0 value is meaningful:

doubles, sobs (i.e., observed features), singles, osd, faith_pd

We could choose to return zero for these few metrics. However, a difficulty is that these metrics are somehow related to the fraction-based metrics. For example, species richness and Shannon can be uniformized into Hill number; Faith' PD can be generalized into abundance-weighted phylogenetic diversity. If we make them special, we will get different results from the specialized and generalized metrics.

The NaN should be achieved through a short circuit, rather than from zero division, which will generate a ValueError (Python case) or a RuntimeWarning (NumPy case).
The helper function _validate_counts_vector can be modified to strip off all zero values. I hope that it can also automatically return NaN for the outer function, but it is not possible with Python. Therefore, each alpha diversity metric will check whether the return value of this helper function is empty, and returns NaN in that case.

I am going to work on this right away. Please let me know if you have other thoughts.

ebolyen · 2024-05-02T18:57:59Z

For example, species richness and Shannon can be uniformized into Hill number; Faith' PD can be generalized into abundance-weighted phylogenetic diversity. If we make them special, we will get different results from the specialized and generalized metrics.

I like this line of reasoning, but just to point out: 0 * NaN would produce NaN, so I think the generalized and specialized metrics would still both produce NaN on empty communities (even if standardized by a zero-producing measure).

I'm still fine with 0 becoming NaN for empty communities, but wanted to note the above.

qiyunzhu · 2024-05-04T16:03:45Z

For information, here are some test results.

scipy.stats.entropy (default base is e):

>>> entropy([1])
0.0
>>> entropy([0])
nan
>>> entropy([])
0.0

Some popular R packages:

single <- c(1)
zero <- c(0)
empty <- numeric(0)

Species richness (a.k.a. sobs):

	vegan	adiv	entropart
zero	0	error	N/A
empty	0	error	N/A

Shannon index (note: base in all three packages is e):

	adiv	entropart
single	0	0
zero	error	N/A
empty	error	N/A

Gini-Simpson (a.k.a. simpson):

	vegan	adiv	entropart
single	0	0	0
zero	1	error	N/A
empty	1	error	N/A

ebolyen mentioned this issue Apr 30, 2024

LANG: python 3.9 update (amplicon) qiime2/distributions#229

Closed

qiyunzhu mentioned this issue May 12, 2024

WIP Improving alpha diversity calculation #2024

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alpha Diversity indices produce zeros where previously undefined (NaN) #2014

Alpha Diversity indices produce zeros where previously undefined (NaN) #2014

ebolyen commented Apr 30, 2024 •

edited

ebolyen commented Apr 30, 2024

mortonjt commented May 1, 2024

qiyunzhu commented May 1, 2024

mortonjt commented May 1, 2024 •

edited

qiyunzhu commented May 1, 2024 •

edited

ebolyen commented May 1, 2024

wasade commented May 1, 2024

qiyunzhu commented May 1, 2024

mortonjt commented May 1, 2024

gregcaporaso commented May 2, 2024

mortonjt commented May 2, 2024

qiyunzhu commented May 2, 2024

qiyunzhu commented May 2, 2024 •

edited

qiyunzhu commented May 2, 2024

ebolyen commented May 2, 2024

qiyunzhu commented May 2, 2024

ebolyen commented May 2, 2024

qiyunzhu commented May 4, 2024 •

edited

Alpha Diversity indices produce zeros where previously undefined (NaN) #2014

Alpha Diversity indices produce zeros where previously undefined (NaN) #2014

Comments

ebolyen commented Apr 30, 2024 • edited

ebolyen commented Apr 30, 2024

mortonjt commented May 1, 2024

qiyunzhu commented May 1, 2024

mortonjt commented May 1, 2024 • edited

qiyunzhu commented May 1, 2024 • edited

ebolyen commented May 1, 2024

wasade commented May 1, 2024

qiyunzhu commented May 1, 2024

mortonjt commented May 1, 2024

gregcaporaso commented May 2, 2024

mortonjt commented May 2, 2024

qiyunzhu commented May 2, 2024

qiyunzhu commented May 2, 2024 • edited

qiyunzhu commented May 2, 2024

ebolyen commented May 2, 2024

qiyunzhu commented May 2, 2024

ebolyen commented May 2, 2024

qiyunzhu commented May 4, 2024 • edited

ebolyen commented Apr 30, 2024 •

edited

mortonjt commented May 1, 2024 •

edited

qiyunzhu commented May 1, 2024 •

edited

qiyunzhu commented May 2, 2024 •

edited

qiyunzhu commented May 4, 2024 •

edited