ENH: `stats._xp_mean`, an array API compatible `mean` with `weights` and `nan_policy` #20743

mdhaber · 2024-05-18T20:03:14Z

Reference issue

Toward gh-20544

What does this implement/fix?

This PR adds _xp_mean, an array-API compatible function that combines the features of np.mean, np.average, and np.nanmean in an interface that fits with scipy.stats. This will be needed for making functions like pmean, hmean, and gmean array-API compatible.
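For context, the three NumPy functions named above each cover only part of the feature set. The snippet below is a small NumPy-only illustration of that gap; it does not show `_xp_mean` itself, and the sample values are made up.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])
w = np.array([1.0, 1.0, 1.0, 2.0])
finite = ~np.isnan(x)

# np.mean: no `weights` parameter, and NaN propagates to the result.
m = np.mean(x)

# np.average: accepts `weights`, but has no option to omit NaNs,
# so the NaN is dropped by hand here.
a = np.average(x[finite], weights=w[finite])

# np.nanmean: omits NaNs, but has no `weights` parameter.
n = np.nanmean(x)
```

A single function with both `weights` and `nan_policy` removes the need to pick among these or to pre-filter the data manually.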

Additional information

~~Potential reviewers: would you be willing to write some unit tests with hypothesis? For such a fundamental function, it's particularly important that it works flawlessly!~~

If it doesn't sound too crazy, I'd suggest that this and similar var and std functions be added publicly to scipy.stats because they provide functionality that does not exist with the array API (e.g. weights, which has been explicitly rejected, and nan_policy, which has not been standardized and may not follow SciPy's convention). Even considering NumPy alone, it would be useful to have a single function that has all the functionality of mean, average, and nanmean in an interface consistent with the rest of scipy.stats.

Not pursuing these things right now. Let's just get this in so we can finish the other mean functions.

scipy/_lib/_array_api.py

scipy/_lib/tests/test_array_api.py

[skip ci]

mdhaber · 2024-05-18T23:33:21Z

scipy/stats/tests/test_axis_nan_policy.py

+    (xp_mean_1samp, tuple(), dict(), 1, 1, False, lambda x: (x,)),
+    (xp_mean_2samp, tuple(), dict(), 2, 1, True, lambda x: (x,)),


Most scipy.stats functions use the _axis_nan_policy decorator to implement nan_policy, keepdims, and tuple axis. I've implemented all these features natively for improved performance (e.g. nan_policy='omit' would otherwise loop over each slice), and the function still passes all the tests, which are quite stringent. So if you don't want to write tests with hypothesis, I'm still pretty comfortable with this.
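To make the performance remark concrete, here is a NumPy-only sketch contrasting a slice-by-slice `omit` with a masked, fully vectorized one. Both function names are hypothetical and neither is the code in this PR; the sketch only assumes a 2-D input reduced along the last axis.

```python
import numpy as np

def mean_omit_loop(x):
    # Slice-by-slice: drop the NaNs from each row, then average what is left.
    return np.array([row[~np.isnan(row)].mean() for row in x])

def mean_omit_vectorized(x):
    # Masked: zero out NaNs and divide by the per-row count of valid entries.
    valid = ~np.isnan(x)
    total = np.where(valid, x, 0.0).sum(axis=-1)
    return total / valid.sum(axis=-1)

x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])
loop_result = mean_omit_loop(x)
vec_result = mean_omit_vectorized(x)
```

The two agree on this input; the masked version avoids the Python-level loop, which is the kind of saving the comment refers to.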

mdhaber · 2024-05-18T23:43:33Z

scipy/_lib/_array_api.py

+
+    if weights is not None and x.shape != weights.shape:
+        try:
+            x, weights = xp.broadcast_arrays(x, weights)


A few thoughts about broadcasting:

Technically x = [1, 2, 3] is broadcastable with weights = [2], and it can be interpreted as giving all observations a weight of 2.

Technically, x = [1] is broadcastable with weights = [1, 2, 3]: now we have x being broadcast to the shape of weights rather than the (more natural) other way around.

Technically, x = [] is broadcastable with weights = [1]: weights gets broadcast to shape (0,), and the weighted mean is NaN.

It's clearly simpler to just accept these sorts of things, but since they're not useful, one could argue that we shouldn't. I'd propose that we just accept them, but if there are strong opinions about not accepting them, LMK.
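The three cases above can be checked directly with NumPy's broadcasting. This is only a demonstration of the shapes involved, using `np.broadcast_arrays` rather than whatever helper the final code calls.

```python
import numpy as np

# Case 1: a length-1 weights array is stretched to match x.
x1, w1 = np.broadcast_arrays(np.array([1.0, 2.0, 3.0]), np.array([2.0]))

# Case 2: here x is the array that gets stretched.
x2, w2 = np.broadcast_arrays(np.array([1.0]), np.array([1.0, 2.0, 3.0]))

# Case 3: an empty x pulls weights down to shape (0,).
x3, w3 = np.broadcast_arrays(np.array([]), np.array([1.0]))

shapes = [(x1.shape, w1.shape), (x2.shape, w2.shape), (x3.shape, w3.shape)]
```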

rgommers · 2024-05-19T08:46:25Z

scipy/_lib/_array_api.py

@@ -475,3 +476,155 @@ def xp_sign(x, xp=None):
    sign = xp.where(x < 0, -one, sign)
    sign = xp.where(x == 0, 0*one, sign)
    return sign
+
+
+def xp_add_reduced_axes(res, axis, initial_shape, *, xp=None):


Could you add a note on why this is needed? Is it temporary, why can't xp.add be used, etc.?

Type annotations and consistency with other functions in this file would be useful too (at least if you expect this function to stay around for a while).

res should preferably be positional-only.

Perhaps a better name would have been xp_replace_reduced_axes or xp_keepdims: it adds back axes that have been reduced away. However, when there are other comments to respond to, I'll just move the logic back into xp_mean, since I'm not sure if it will be used elsewhere. It can be factored out again as needed. Although the comment wasn't about xp_mean, I can make the first argument of xp_mean positional-only.
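As an illustration of what "adding back axes that have been reduced away" means, here is a NumPy-only sketch. The name `restore_reduced_axes` and its signature are hypothetical, chosen for this example; the `/` marks the first argument as positional-only, as requested in the review.

```python
import numpy as np

def restore_reduced_axes(res, /, axis, initial_shape):
    """Hypothetical helper: reshape `res` as if the reduction used keepdims=True."""
    ndim = len(initial_shape)
    if axis is None:
        axes = range(ndim)
    else:
        axes = (axis,) if isinstance(axis, int) else axis
    axes = {ax % ndim for ax in axes}  # normalize negative axes
    final_shape = tuple(1 if i in axes else n
                        for i, n in enumerate(initial_shape))
    return np.reshape(res, final_shape)

x = np.ones((2, 3, 4))
res = restore_reduced_axes(x.sum(axis=(0, 2)), (0, 2), x.shape)
expected = x.sum(axis=(0, 2), keepdims=True)
```

On NumPy the built-in `keepdims=True` gives the same result; a helper like this only matters when the reduction in use cannot keep the axes itself.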

fancidev · 2024-05-19T22:11:39Z

Why was weights explicitly rejected for the Array API? Would you by chance have a link or something for the discussion back then?

lucascolley · 2024-05-19T22:38:36Z

Why was weights explicitly rejected for the Array API? Would you by chance have a link or something for the discussion back then?

data-apis/array-api#366

fancidev · 2024-05-19T22:59:23Z

Thanks for the link @lucascolley .

To align with the naming convention of hmean, pmean, and gmean, would it be more appropriate to call the function amean (a for arithmetic)?


lucascolley · 2024-06-10T07:25:28Z

scipy/stats/tests/test_axis_nan_policy.py

@@ -406,7 +422,7 @@ def unpacker(res):
            res = hypotest(*data1d, *args, nan_policy=nan_policy, **kwds)
        res_1db = unpacker(res)

-        assert_equal(res_1db, res_1da)
+        assert_allclose(res_1db, res_1da, 1e-15)


minor: can we pass the tol as a kwarg?

Sure. If there are other things to change, I can commit that then.

Suggested change

assert_allclose(res_1db, res_1da, 1e-15)

assert_allclose(res_1db, res_1da, rtol=1e-15)
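The two spellings behave identically because the third positional parameter of `numpy.testing.assert_allclose` is `rtol`; the keyword form just makes that visible to the reader. A quick check, with made-up values:

```python
import inspect
from numpy.testing import assert_allclose

# Look up the name of the third parameter rather than relying on memory.
params = list(inspect.signature(assert_allclose).parameters)
third_param = params[2]

a, b = 1.0, 1.0 + 5e-16
assert_allclose(a, b, 1e-15)       # positional: easy to misread as atol
assert_allclose(a, b, rtol=1e-15)  # keyword: intent is explicit
```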

lucascolley

all seems pretty reasonable!

lucascolley · 2024-06-10T18:08:29Z

scipy/stats/tests/test_stats.py

+        # Check for warning if omitting NaNs causes empty slice
+        message = 'After omitting NaNs...'
+        with pytest.warns(RuntimeWarning, match=message):
+            res = _xp_mean(x * np.nan,  nan_policy='omit')


Suggested change

res = _xp_mean(x * np.nan,  nan_policy='omit')

res = _xp_mean(x * np.nan, nan_policy='omit')

lucascolley · 2024-06-10T18:14:05Z

scipy/stats/tests/test_stats.py

+        # it's really a `SmallSampleWarning`, but not sure
+        # where it will be imported from yet
+        message = 'One or more sample arguments is too small...'
+        with pytest.warns(SmallSampleWarning, match=message):


for my understanding, can you explain this comment?

Comment will be removed. I had this all in _lib so I couldn't import SmallSampleWarning.

lucascolley · 2024-06-10T18:14:49Z

scipy/_lib/_util.py

@@ -707,7 +707,7 @@ def _nan_allsame(a, axis, keepdims=False):
    return ((a0 == a) | np.isnan(a)).all(axis=axis, keepdims=keepdims)


-def _contains_nan(a, nan_policy='propagate', policies=None, *, xp=None):
+def _contains_nan(a, nan_policy='propagate', policies=None, *, xp_ok=False, xp=None):


can you briefly explain the intended semantics of xp_ok?

Temporarily, while _axis_nan_policy does not handle non-NumPy arrays, other functions that call _contains_nan want it to raise an error if nan_policy='omit' and xp is not np.

_xp_mean supports nan_policy='omit' natively, so setting this to True prevents the error from being raised.

In this name, xp was intended to imply xp other than NumPy. Other possibilities include xp_omit_ok and non_numpy_omit_ok. Or we could take another perspective for naming the variable... maybe consider the name to indicate whether the calling function implements nan_policy='omit' itself or whether this function should raise when xp is not NumPy and nan_policy='omit'. Feel free to suggest a preferred name.

Since this function is private, the need for the argument is temporary, and the argument will probably only ever be needed by a handful of functions, I'll probably just change it. Another possibility is to eliminate the argument and just try/except the error as needed.

Maybe I should just try/except

I think the argument sounds fine. xp_omit_okay is probably the clearest name IMO. In any case, it would be nice to add a short docstring to explain. But feel free to just explain to reviewers each time if you would rather not add a docstring.
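A rough sketch of the semantics described in this thread, for readers who want to see the flag in action. Everything here is hypothetical: the function name, the error message, and the use of `xp is not np` as the "non-NumPy" test are assumptions, and the stand-in namespace is built from NumPy functions purely for the demo.

```python
import types
import numpy as np

def contains_nan_sketch(a, nan_policy="propagate", *, xp_omit_okay=False, xp=np):
    """Hypothetical stand-in for the private helper discussed above."""
    # Callers that do not implement 'omit' natively want an error for
    # non-NumPy namespaces; callers that do can pass xp_omit_okay=True.
    if nan_policy == "omit" and xp is not np and not xp_omit_okay:
        raise ValueError("nan_policy='omit' is not supported for this array type.")
    return bool(xp.any(xp.isnan(a)))

# Stand-in for a non-NumPy namespace (assumption for the demo only).
other_xp = types.SimpleNamespace(any=np.any, isnan=np.isnan)
a = np.array([1.0, np.nan])

try:
    contains_nan_sketch(a, nan_policy="omit", xp=other_xp)
    raised = False
except ValueError:
    raised = True

allowed = contains_nan_sketch(a, nan_policy="omit", xp_omit_okay=True, xp=other_xp)
```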

lucascolley · 2024-06-10T18:20:49Z

scipy/stats/_stats_py.py

+        arrays will be broadcasted before performing the calculation. See
+        Notes for details.
+    keepdims : boolean, optional
+        If this is set to True, the axes which are reduced are left


Suggested change

If this is set to True, the axes which are reduced are left

If this is set to ``True``, the axes which are reduced are left

Well, that would be my preference. In the past, other reviewers have criticized double backticks for short literals, so I have become inconsistent as I go back and forth between that advice and my natural tendency (code should render in monospaced font, and True is code).

Note that I was able to get numpy/numpydoc#525 merged last week, and I opened pydata/pydata-sphinx-theme#1852, but I did not handle this aspect of the issue. I'll go ahead and open another issue along these lines in the numpydoc repo.

sure, I'm certainly +1 for the record

lucascolley · 2024-06-10T18:25:04Z

scipy/stats/_stats_py.py

+
+    # convert integers to the default float of the array library
+    if not xp.isdtype(x.dtype, 'real floating'):
+        dtype = xp.asarray(1.).dtype


I do think we should add an xp_default_float helper at some point. Now may be a good time for it.

Yeah I'm adding something like that in an upcoming PR.

@lucascolley Please see xp_broadcast_promote in gh-20935.
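The idiom in the diff, `xp.asarray(1.).dtype`, asks the array library what dtype it gives a Python float. Below is a NumPy-only sketch of the same idea; `to_default_float` is a hypothetical name, and `np.issubdtype` stands in for the `xp.isdtype(..., 'real floating')` check so the example runs on older NumPy versions too.

```python
import numpy as np

def to_default_float(x):
    """Hypothetical helper: promote non-floating input to the default float."""
    x = np.asarray(x)
    if np.issubdtype(x.dtype, np.floating):
        return x  # floating input keeps its precision
    default = np.asarray(1.0).dtype  # float64 for NumPy
    return x.astype(default)

from_int = to_default_float([1, 2, 3])
from_f32 = to_default_float(np.array([1.0, 2.0], dtype=np.float32))
```

The point of querying the library instead of hard-coding `float64` is that other array libraries may default to a different precision.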

lucascolley · 2024-06-10T18:26:19Z

scipy/stats/_stats_py.py

+               else too_small_nd_not_omit)
+    if xp_size(x) == 0:
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")


Which warning is this catching?

Whatever xp.mean with an empty argument wants to emit. It is not consistent among array libraries. This makes it consistent.

Cool. Let's leave this comment unresolved for anyone looking this PR up from the allowed filter list.
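The pattern discussed here, silencing whatever the backend emits for an empty mean and then issuing one predictable warning, can be sketched with NumPy as follows. `quiet_empty_mean` is a hypothetical name, and `RuntimeWarning` is used as a stand-in for the warning class in the diff.

```python
import warnings
import numpy as np

def quiet_empty_mean(x):
    """Hypothetical sketch: consistent warning behavior for empty input."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        with warnings.catch_warnings():
            # Swallow the backend-specific warnings for an empty reduction.
            warnings.simplefilter("ignore")
            res = np.mean(x)
        # Emit a single, library-independent warning instead.
        warnings.warn("One or more sample arguments is too small.",
                      RuntimeWarning, stacklevel=2)
        return res
    return np.mean(x)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_result = quiet_empty_mean([])
```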

lucascolley · 2024-06-10T18:29:02Z

scipy/stats/_stats_py.py

+    message = (too_small_1d_omit if (x.ndim == 1 or axis is None)
+               else too_small_nd_omit)
+    if contains_nan and nan_policy == 'omit':
+        i = xp.isnan(x)


minor: i is not a very readable var name

I suppose I learned my ABCs before learning to read other words, so I find it readable : )
But feel free to suggest a preferred name if the meaning of the variable will be difficult to interpret in the context of this if block. i_nan? nan_mask?

I definitely prefer nan_mask!

Changed just for you as a thank you for reviewing : )

j-bowhay

A few minor comments but otherwise looks good

j-bowhay · 2024-06-10T20:00:32Z

scipy/stats/_stats_py.py

+
+    Parameters
+    ----------
+    x : real floating array


Perhaps misleading given the casting of integers?

Suggested change

x : real floating array

x : real array

j-bowhay · 2024-06-10T20:02:23Z

scipy/stats/_stats_py.py

+    .. math::
+
+        \bar{x}_w = \frac{ \sum_{i=0}^{n-1} w_i x_i }
+                         { \sum_{i=0}^{n-1} i w_i }


typo?

Suggested change

{ \sum_{i=0}^{n-1} i w_i }

{ \sum_{i=0}^{n-1} w_i }
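A quick numeric check shows why the stray `i` matters: with it, the denominator is no longer the sum of the weights, so the result stops being a weighted mean. The values below are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([3.0, 1.0, 1.0])
i = np.arange(x.size)

correct = np.sum(w * x) / np.sum(w)        # denominator: sum of w_i
with_typo = np.sum(w * x) / np.sum(i * w)  # denominator: sum of i * w_i
reference = np.average(x, weights=w)
```

Here `correct` matches `np.average`, while `with_typo` does not.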

j-bowhay · 2024-06-10T20:05:36Z

scipy/stats/_stats_py.py

+            warnings.warn(message, SmallSampleWarning, stacklevel=2)
+        return res
+
+    # avoid circular import


left over comment from when this lived in _util?

Suggested change

# avoid circular import

j-bowhay · 2024-06-10T20:08:43Z

scipy/stats/_stats_py.py

+               else too_small_nd_omit)
+    if contains_nan and nan_policy == 'omit':
+        i = xp.isnan(x)
+        i = (i | xp.isnan(weights)) if weights is not None else i


optional

Suggested change

i = (i | xp.isnan(weights)) if weights is not None else i

if weights is not None:
    i |= xp.isnan(weights)

mdhaber · 2024-06-11T01:07:35Z

Responses to comments committed. Thanks @lucascolley @j-bowhay!

When this is in, would either of you like to tackle conversion of one or more of the other mean functions? I'd be happy to review.

ENH: add xp_mean for mean with weights and nan_policy

13d919c

mdhaber added the scipy.stats, enhancement, and array types labels May 18, 2024

github-actions bot added the scipy._lib label May 18, 2024

mdhaber commented May 18, 2024


mdhaber added 3 commits May 18, 2024 17:23

Apply suggestions from code review

f547a6b

[skip ci]

MAINT: xp_mean: remaining revisions

dbe161d

TST: xp_mean: strengthen tests

7352500

mdhaber marked this pull request as ready for review May 18, 2024 23:28

mdhaber commented May 18, 2024


rgommers reviewed May 19, 2024


mdhaber mentioned this pull request May 19, 2024

ENH: stats: add array API-support #20544

Open

74 tasks

lucascolley changed the title ~~ENH: xp_mean: an array-API compatible mean with weights and nan_policy~~ ENH: xp_mean, an array API compatible mean with weights and nan_policy Jun 2, 2024

mdhaber added 5 commits June 3, 2024 23:49

Merge remote-tracking branch 'upstream/main' into xp_mean

76dc796

MAINT: xp_mean: move keepdims logic into functions; make first-arg positional-only

c4dc66c

TST: xp_mean: fix failing test about too small warning

52d7d11

MAINT: xp_mean: use _broadcast_arrays instead of xp.broadcast_arrays

714694c

MAINT: stats._xp_mean: move _xp_mean

f942854

mdhaber changed the title ~~ENH: xp_mean, an array API compatible mean with weights and nan_policy~~ ENH: stats._xp_mean, an array API compatible mean with weights and nan_policy Jun 9, 2024

mdhaber added 2 commits June 9, 2024 15:16

Merge remote-tracking branch 'upstream/main' into xp_mean

4750599

MAINT: stats._xp_mean: match _axis_nan_policy behavior

eb529f6

mdhaber requested review from lucascolley and j-bowhay June 9, 2024 23:27

lucascolley reviewed Jun 10, 2024


j-bowhay requested changes Jun 10, 2024


MAINT: stats._xp_mean: edits per review

9c95767

