RFC: SciPy array types & libraries support #18286

Open
rgommers opened this issue Apr 12, 2023 · 58 comments
Labels

  • array types: Items related to array API support and input array validation (see gh-18286)
  • enhancement: A new feature or improvement
  • RFC: Request for Comments; typically used to gather feedback for a substantial change proposal
  • SciPEP: SciPy Enhancement Proposal

Comments

@rgommers
Member

This is a long read and a proposal with quite a large scope and a decent amount of backwards-compatibility impact. I hope it'll make SciPy's behavior much more consistent and predictable, as well as yield significant performance gains. I'll post this on the mailing list too. I'd suggest discussing the big picture first; I'll ask for process suggestions on the mailing list too in case we'd like to split up the discussion, open a doc/proposal PR for detailed review, or anything like that.

The basic design principle I propose aiming for in all of SciPy for array libraries is: container type in == container type out.

Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with np.asarray.
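To make the intent concrete, here is an illustration of the proposed behavior (this is what the proposal aims for, not what SciPy does today; welch is just used as an example function):

import numpy as np
from scipy import signal

f, Pxx = signal.welch(np.zeros(1024))    # ndarray in -> ndarray out (unchanged)
f2, Pxx2 = signal.welch([0.0] * 1024)    # Python sequence in -> coerced, ndarray out

# Under this proposal, a CPU torch.Tensor input would come back as a torch.Tensor
# (today it is coerced to an ndarray):
# import torch
# f3, Pxx3 = signal.welch(torch.zeros(1024))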

The scope of this proposal is: how to treat array inputs. This includes different kinds of NumPy arrays and ndarrays with particular dtypes, as well as array/tensor objects coming from other libraries.

Out of scope for this proposal are (a) dataframe libraries, and (b) implementation of a dispatch mechanism for when non-numpy array types hit C/C++/Cython/Fortran code inside SciPy. Both of these topics are touched upon in the Appendix section.

I'll dive straight into the design here; for context on why we'd want/need this design or what has been done and discussed before, see the Context and Problems & opportunities sections further down.

array/tensor types support

Array types and how to treat them:

| Input type | Return type | When | Behavior | Notes |
| --- | --- | --- | --- | --- |
| numpy.ndarray | numpy.ndarray | always | | |
| torch.Tensor (CPU) | torch.Tensor | always | | |
| torch.Tensor (GPU) | torch.Tensor | when possible | raise TypeError otherwise | for pure Python implementations for now; see text below for more details |
| cupy.ndarray | cupy.ndarray | when possible | raise TypeError otherwise | |
| has __array_namespace__ | same as input type | when possible | raise TypeError otherwise | |
| array with object dtype | n/a | | raise TypeError | |
| numpy.matrix | n/a | for NumPy >= 2.0 (or always?) | raise TypeError | |
| numpy.ma.MaskedArray | numpy.ma.MaskedArray | only for stats.mstats | raise TypeError otherwise | future nan_policy/mstats plans are unaffected by this |
| numpy.ndarray subclasses | same subclass type | always | | so always use np.asanyarray checks |
| torch.Tensor subclasses | same subclass type | | same as for torch.Tensor on CPU/GPU | |
| PyData/Sparse arrays | same sparse array type | once it has array API support | raise TypeError otherwise | TypeError is already the current state; no auto-densification |
| dask.Array | dask.Array | once it has array API support | raise TypeError otherwise (?) | bc-breaking, but mixing with the current conversion to numpy array may not be healthy |
| jax.numpy.Array | jax.numpy.Array | once added to array-api-compat | | same as PyTorch for CPU and GPU |

When a non-NumPy array type sees compiled code in SciPy (which tends to use the NumPy C API), we have a couple of options:

  1. dispatch back to the other library (PyTorch, CuPy, etc.).
  2. convert to a NumPy array when possible (i.e., on CPU via the buffer protocol, DLPack, or __array__), use the compiled code in question, then convert back.
  3. don't support this at all

I'll note that (1) is the long-term goal; how to implement this is out of scope for this proposal - for more on that, see the Appendix section. For now we choose to do (2) when possible, and (3) otherwise. Switching from that approach to (1) in the future will be backwards-compatible.
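As a rough sketch of what option (2) could look like (the helper name here is made up, not actual SciPy internals):

import numpy as np

def _delegate_to_compiled(compiled_func, x, xp):
    try:
        # CPU arrays convert via __array__, the buffer protocol, or DLPack
        x_np = np.asarray(x)
    except (TypeError, RuntimeError) as exc:
        # e.g. CuPy arrays and CUDA torch.Tensors refuse implicit host transfer
        raise TypeError(f"{type(x)} input is not supported by this code path") from exc
    out_np = compiled_func(x_np)   # existing NumPy-based compiled implementation
    return xp.asarray(out_np)      # container type in == container type out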

A note on numpy.matrix: the only place this is needed is scipy.sparse; it can be vendored there, and for NumPy >= 2.0 instances of the vendored matrix code can be returned. That allows deprecating it in NumPy. We need to support scipy.sparse.*_matrix long-term for backwards compatibility (they're too widely used to deprecate), however for new code we have sparse.*_array instances and PyData/Sparse.

Regarding array API support: when it's present in an array library, SciPy will require it to be complete (v2022.12), including complex number support and the linalg and fft submodules - supporting partial implementations without linalg support in particular seems unnecessary.

For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). A pro of doing that would be that everything works, and there may still be performance gains. A con is that it silently does device transfers, usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max). Which means both addressing the dispatch (uarray & co) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have that, raising exceptions is the way to go.

Development and introduction strategy

The following strategy is meant to allow implementing support in a gradual way, and have control over when to default to a change in behavior - which necessarily is going to have some backwards compatibility impact.

  • develop behind an experimental flag (an environment variable and a global setting starting with an underscore)
  • copy the scikit-learn approach, using array-api-compat as an optional dependency for now
  • switch to the new behavior of "input type in == input type out" by default once a submodule has complete support
    • at that point, vendor array-api-compat
    • may be a major release if it can all be done in one release cycle. otherwise do submodule by submodule
  • explicit support can and should be tested in CI for: PyTorch CPU tensors, numpy.array_api for compliance testing in APIs that start supporting array API compliant input, and perhaps pandas Series/DataFrame. GPU CI may materialize at some point, but let's not count on or plan for that.
  • implement new type coercion machinery centrally in scipy._lib and depend on that from other submodules. That means we'll have a single replacement for np.asarray/np.asanyarray in one central location.
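A minimal sketch of what that central replacement could look like (the name and details are illustrative only; in practice array-api-compat would be consulted to recognize torch/CuPy arrays):

import numpy as np

def _scipy_asarray(x):
    # hypothetical single replacement for np.asarray/np.asanyarray
    if isinstance(x, np.ma.MaskedArray):
        raise TypeError("masked arrays are not supported")
    if isinstance(x, np.matrix):
        raise TypeError("numpy.matrix is not supported")
    if isinstance(x, np.ndarray) and x.dtype == object:
        raise TypeError("object arrays are not supported")
    if hasattr(x, "__array_namespace__"):
        return x  # array API compatible arrays pass through unchanged
    # Python sequences, scalars and other unknown types: coerce to ndarray,
    # preserving well-behaved ndarray subclasses
    return np.asanyarray(x)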

Context

  • For NumPy 2.0 we should have full array API standard support in the main numpy namespace
  • The numpy.array_api module is useful for testing that code is actually using only standard APIs, because it's a minimal implementation. It's not going to be used as a layer for production usage though, it's for portability testing only.
  • The array-api-compat library provides the compatibility we need right now between NumPy, CuPy and PyTorch. Other libraries can be added there too. This bridges any gaps between the standard and real-world implementations. The package is light-weight and can be depended on - but is also designed to be vendored to avoid a new runtime dependency.
  • scikit-learn has had experimental array API support for a little while, and has now settled on array-api-compat as the way to go. This looks quite a bit nicer than the previous approach with numpy.array_api, and the work to support PyTorch that way has been quite well-received.
  • As for the myriad of open NEPs on interoperability topics:
    • NEP 30 (__duckarray__), NEP 31 (uarray) and NEP 37 (__array_module__) are all effectively dead - I'll propose rejecting these.
    • NEP 47 (array API support via numpy.array_api) has been implemented; however, I'm going to propose marking it superseded via a new NEP for the main numpy namespace
    • __array_function__ and __array_ufunc__ are fully implemented and will continue to exist and be supported in NumPy. We won't support those mechanisms in SciPy though, since we are coercing unknown input types to ndarray and error out if that fails. The exception here is ufuncs in scipy.special, which happen to work already because we're reusing the numpy ufunc machinery there. We can probably leave that as is, since it's not problematic.
  • NumPy array subclasses are a long-standing problem, with masked arrays and np.matrix instances not being Liskov-substitutable (i.e., not drop-in replacements that only extend behavior without changing it), which makes it difficult to accept them. We can explicitly start rejecting those with clear exceptions; that will make regular subclasses a lot more useful.
  • Missing data support is a broad and tricky subject:
    • numpy.ma has tons of issues and isn't well-maintained. There's a full rewrite floating around that is ~90% complete and that with some luck will make it into NumPy 2.0. However, it hasn't seen movement for several years, and that work is not currently planned. numpy.ma in its current form should be considered legacy.
    • scipy.stats.mstats is the only module that specifically supports masked arrays.
    • scipy.stats has a nan_policy keyword in many functions that is well-maintained at this point, and has a documented specification. That is probably not applicable to other submodules though.
    • Support for missing data in other places like interpolate and spatial.KDTree (see gh-18230) may make sense, but ideally that'd use solid numpy support (e.g., via a null-aware dtype) and that does not exist.
    • Dataframe libraries have much better missing-data support, but using non-numpy-backed dataframes at all is a difficult story.
    • For the purposes of this proposal, I'd prefer leaving it at "reject numpy.ma.MaskedArray instances".

Problems & opportunities

The motivation for all the effort on interoperability is twofold: the current state of SciPy's behavior causes issues, and there's an opportunity/desire to gain a lot of performance by using other libraries (e.g., PyTorch, CuPy, JAX).

Problems include:

  • pandas input with non-numpy dtypes being converted to object arrays and then support being very hit-and-miss; see for example gh-18254
  • general pandas dataframe/series support being patchy; see for example gh-9616 and gh-12164
  • masked arrays giving incorrect results because the mask is silently dropped; see, e.g., gh-1682 and gh-12898
  • sparse array support being patchy; agreement that auto-densification needs to be avoided, but that means current behavior is inconsistent (xref gh-4239)
  • object arrays, including use of Decimal and other random things that people stuff into object arrays on purpose or (more likely) by accident, being handled very inconsistently.

Opportunities include:

  • Large performance gains. See for example mdhaber/scipy#63 (10x-50x for stats bootstrapping functionality with CuPy) and scikit-learn#25956, which shows how PyTorch improves over NumPy (2x-7x on CPU, 60x-100x on GPU)
  • Interest in support for Dask, which we haven't really been able to do much with; see for example gh-10204, gh-8810 and gh-10087. Xarray, which builds on top of Dask, is in the same boat (see, e.g., gh-14824)

References

Appendix

Note: these topics are included for reference/context because they are related, but they are out of scope for this proposal. Please avoid diving into these (I suggest pinging me directly first in case you do see a reason to discuss these topics).

dataframe support

For dataframe library support, the situation is a little trickier. We have to think about pandas Series and DataFrame instances with numpy and non-numpy dtypes, the presence of nullable integers, and other dataframe types which may be completely backed by Apache Arrow, another non-numpy library (e.g., cuDF & CuPy), or have implemented things completely within their own library and may or may not have any NumPy compatibility.

This would be one option, which is somewhat similar to what scikit-learn does (except, it converts nullable integers to float64 with nans):

| Input type | Return type | When | Behavior | Notes |
| --- | --- | --- | --- | --- |
| pandas.Series | pandas.Series | if dtype is a numpy dtype | raise TypeError otherwise | non-NumPy dtypes get coerced to object arrays otherwise; avoid that |
| pandas.DataFrame | pandas.DataFrame | if dtype is a numpy dtype | raise TypeError otherwise | |
| has __dataframe__ | same as input type | | | needs a small extension to the dataframe interchange protocol to reconstruct the input type |
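A hedged sketch of the pandas.Series rule in the table above (not actual SciPy code; the helper name is made up):

import numpy as np
import pandas as pd

def _series_to_ndarray(s: pd.Series) -> np.ndarray:
    # extension dtypes (nullable Int64, ArrowDtype, ...) are not numpy dtypes
    if not isinstance(s.dtype, np.dtype):
        raise TypeError(f"pandas extension dtype {s.dtype!r} is not supported")
    return s.to_numpy()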

There's a lot to learn from the effort scikit-learn went through to support pandas dataframes better. See, e.g., the scikit-learn 1.2 release highlights showing the set_output feature to request pandas dataframe as the return type.

Note: I'd like to not work this out in lots of detail here, because it will require time and that should not block progress on array library support. I just want to put it on the radar, because we do need to deal with it at some point; current treatment of dataframes is quite patchy.

Dispatching mechanism

For compiled code, other array types (whether CPU, GPU or distributed) are likely not going to work at all; the SciPy code is written for the NumPy C API. It's not impossible that some Cython code will work with other array types if those support the buffer protocol and the Cython code uses memoryviews - but that's uncommon (won't work at all on GPU, and PyTorch doesn't support the buffer protocol on CPU either).

There has been a lot of discussion on how this should work. The leading candidate is uarray, which we already use in scipy.fft (as do matching fft APIs in CuPy and PyFFTW) and which has other PRs pending in both SciPy and CuPy. However, there is also resistance to that because it involves a lot of complexity - perhaps too much. So significant work is needed to simplify it, or to switch to another mechanism. This is important work that has to be done, but I'd prefer not to mix it with this proposal.

Whatever the mechanism, it should work transparently such that scipy.xxx.yyy(x) where x is a non-numpy array should dispatch to the library which implements the array type of x.
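For reference, this is roughly what the existing uarray-based mechanism in scipy.fft already looks like from the user's side (requires CuPy and a GPU; shown only to illustrate what "transparent" dispatching means):

import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

x = cp.random.standard_normal(1024)
with scipy.fft.set_backend(cufft):
    y = scipy.fft.fft(x)   # executed by CuPy's FFT implementation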

We have a uarray label in the issue tracker. See gh-14353 for the tracker with completed and pending implementations. For more context and real-world examples, see:

@rgommers added the enhancement and SciPEP labels Apr 12, 2023
@tylerjereddy
Contributor

I read it over, sounds exciting. I have to admit that evaluating API designs on this scale is not one of my strengths. I'll briefly mention that the US National Labs will probably quietly test out their own array API compatible technologies behind the scenes as well, if this moves forward, to see if performance gains are observed on bleeding-edge supercomputer hardware. One question I have related to this--for nodes with multiple devices (like 4 x A100, which is pretty common these days), do we leave the use of multiple devices up to the GPU library under the hood? Do we explicitly not encourage using kwargs for controlling concurrency that is specific to, say, GPU libs (i.e., too much risk of optional kwarg API bloat)?

You did briefly mention testing a few different CPU-based scenarios by default. I'm curious if we'd use a pytest decorator/magic to auto-expand eligible tests and/or if it might be helpful to have, e.g., a testing env variable so that we can request particular "backends" for testing (for example, it could be useful to first iteratively develop/test CPU-only, then see what kind of mess arises when you switch to testing with GPU locally/on a cluster node, etc.). That's perhaps more of a detail.

One thing that may be more general re: testing is how we might handle bug reports that are effectively isolated to a subset of pass-throughs--only library X with hardware Y. I suppose the array API is supposed to prevent that from happening, but there is also reality... I guess maybe occasional shims and regression tests that only run when the conditions are right, assuming that we can't just consider these upstream issues most of the time. I will say that I don't see any way for this overall improvement to make our testing situation any simpler, unfortunately/obviously.

Maybe after the first few PRs go in, it will be easier for those of us who aren't familiar with every corner of those NEPs to start contributing since we might be able to understand the changes needed as basic engineering tasks.

I wonder if spatial will unfortunately be one of the least-benefiting modules--some of our most-used functionality relies heavily on carefully-written compiled backends, like Qhull, the C++ stuff driving KDTree, and the C++ backing much of the distance calculation code. I believe there are a few "Python/NumPy land" corners at least.

@WarrenWeckesser
Member

What is the policy for handling mixed inputs when a function accepts more than one argument? This is probably covered in one of the array API documents or discussions, but I think we need it stated explicitly here, too. I expect the most common use-case is mixing $SOME_OTHER_ARRAY_TYPE and standard numpy arrays (e.g. multiplying a sparse matrix or sparse array by a numpy array, convolving some other array type with a kernel that is a numpy array, etc.).

@asmeurer

asmeurer commented Apr 13, 2023

What is the policy for handling mixed inputs when a function accepts more than one argument? This is probably covered in one of the array API documents or discussions, but I think we need it stated explicitly here, too. I expect the most common use-case is mixing $SOME_OTHER_ARRAY_TYPE and standard numpy arrays (e.g. multiplying a sparse matrix or sparse array by a numpy array, convolving some other array type with a kernel that is a numpy array, etc.).

The array API spec says:

Non-goals for the API standard include:
...

  • Making it possible to mix multiple array libraries in function calls.

    Most array libraries do not know about other libraries, and the functions they implement may try to convert “foreign” input, or raise an exception. This behaviour is hard to specify; ensuring only a single array type is used is best left to the end user.

https://data-apis.org/array-api/latest/purpose_and_scope.html#out-of-scope

See also the interchange section about DLPack https://data-apis.org/array-api/latest/design_topics/data_interchange.html

@AnirudhDagar
Member

Thanks Ralf, for the detailed RFC; after reading it a couple of times, I have a few implementation-specific comments/questions/clarifications (some arising from the past work on the Gravitational waves demo).

It seems array-api-compat library will be key.

This bridges any gaps between the standard and real-world implementations.

With my understanding, I'd try to expand the above for everyone with a real-world example taken from scipy.signal and PyTorch (note: PyTorch is just an example, it may apply to any other array/tensor lib for some other case). Specifically, the lines below are used in scipy.signal.welch:

if not same_data:
    y = np.asarray(y)
    outdtype = np.result_type(x, y, np.complex64)
else:
    outdtype = np.result_type(x, np.complex64)

Here np.result_type is used; consider the scenario where scipy.signal.welch is to be made Array API compatible with, say, torch tensors (which we tried in this prototype demo). That would require the behaviour of torch.result_type to follow the Array API spec. But the reality is that torch.result_type differs from how result_type is defined in the array API: as of today, PyTorch only allows two tensors, whereas the array API specifies an arbitrary number of arguments, each being either an array or a dtype.

Now I'd expect at some point, PyTorch will update their implementation to support array API for these functions, which are still left, and there is probably effort around it already (see this result_type PR pytorch/pytorch#61168 ).

But for now, array-api-compat is a brilliant intermediary to solve such problems. See how array-api-compat fixes this specific compatibility issue below by wrapping around the current torch.result_type signature and behaviour:

https://github.com/data-apis/array-api-compat/blob/9ef7f72fa91ddc4561148f906d386d08e65def7d/array_api_compat/torch/_aliases.py#L103
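To make that concrete, here's a rough sketch of the kind of bridging involved (illustrative only, not the actual array-api-compat code):

import functools
import torch

def result_type(*arrays_and_dtypes):
    if not arrays_and_dtypes:
        raise ValueError("at least one array or dtype is required")
    # Represent bare dtypes as empty tensors so everything can go through
    # torch.result_type, then reduce pairwise, re-wrapping each intermediate
    # dtype in an empty tensor for the next step.
    def as_tensor(x):
        return torch.empty(0, dtype=x) if isinstance(x, torch.dtype) else x
    def pair(a, b):
        return torch.empty(0, dtype=torch.result_type(a, b))
    return functools.reduce(pair, map(as_tensor, arrays_and_dtypes)).dtype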


Questions:

  1. Are we considering cases in SciPy which at first might not seem portable to the Array API due to the usage of NumPy methods that are outside its scope, but where with slight refactoring (using an alternate NumPy method and making sure there is no performance regression in the default behaviour) we might bring that particular SciPy method/class closer to the Array API Spec, or just closer to other array/tensor APIs? One close (though not exact) example is MAINT: Replace np.rollaxis with np.moveaxis #14491. So basically, would it be okay to slightly refactor the current SciPy implementation of a function from some NumPy method to some other NumPy method for the sake of achieving interoperability through this work? Of course, BC is still maintained.

  2. array-api-compat adds an extension on top of the Array API Spec. I'd like to know what happens to cases (if any; we will only find more such cases during this work) that are still outside array-api-compat. For example, NumPy and another array/tensor library might have the same signature and behaviour for a particular function. If such an np.<some_method> (which is still out of the scope of the Array API) is used within a particular SciPy method/class, will that SciPy method still be marked as incompatible with this work and be completely ignored for now, even though it would be technically possible to keep interoperability in such cases? One example is np.iscomplex or cupy.iscomplex (see the use in https://github.com/scipy/scipy/blob/main/scipy/signal/_signaltools.py#L4546). I guess there will be other uses of such functions, which might block the compatibility of a whole SciPy module/method just because that one NumPy method is still outside of the Array API, even though it has an equivalent in other array/tensor libraries.

  3. Is there already an identified list of targeted submodules or methods in SciPy that can be made to work with array-api-compat and that we will focus on initially? What if a submodule is largely not compatible with Array API (and needs extra infra like dispatching with uarray etc.), but maybe a few methods/classes inside the submodule are compatible? Do we make only those methods/classes compatible in the largely incompatible submodule or completely ignore the whole submodule in such a case?

Apologies if my questions are a bit too specific about implementation details and particular examples; if this RFC is about discussing the high-level design, feel free to ignore them.

Overall, excited that all of this work is coming together, and the current state looks pretty solid to me to start baking this functionality into the "array consumer libraries" (scipy, sklearn, etc.). Thanks for setting up the foundation tools like array-api-compat!

@rgommers
Member Author

Thanks for the feedback so far! Let me try to answer the high-level questions first.

for nodes with multiple devices (like 4 x A100, which is pretty common these days), do we leave the use of multiple devices up to the GPU library under the hood? Do we explicitly not encourage using kwargs for controlling concurrency that is specific to, say, GPU libs

Yes, we should not carry anything specific to the execution model within SciPy code. That's a key design principle of the array API standard. You could use a multithreaded CPU library, a GPU library, or something even fancier like what you have here. The code should be unchanged. There's quite a bit of variation between different distributed libraries in particular, and that's out of scope to deal with I'd say.

I'm curious if we'd use a pytest decorator/magic to auto-expand eligible tests

Yes, that's what I'd like to aim for. This has worked pretty well for the array API test suite; a single switch allows you to select any compatible libraries. I imagine we'd have that as a decorator for relevant tests, or some such thing.
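Something along these lines perhaps - a rough sketch; the ARRAY_API_BACKENDS variable and fixture name are made up here, not actual SciPy test infrastructure:

import importlib
import os

import pytest

_backends = os.environ.get("ARRAY_API_BACKENDS", "numpy").split(",")

@pytest.fixture(params=_backends)
def xp(request):
    # yields each requested array namespace, e.g. numpy, torch, cupy
    return importlib.import_module(request.param)

def test_roundtrip(xp):
    x = xp.asarray([1.0, 2.0, 3.0])
    assert x.shape == (3,)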

I wonder if spatial will unfortunately be one of the least-benefiting modules

Probably. I imagine modules like stats and signal to be prime candidates early on.

What is the policy for handling mixed inputs when a function accepts more than one argument? This is probably covered in one of the array API documents or discussions, but I think we need it stated explicitly here, too.

Good point, I'll add it to the issue description. @asmeurer gave a good answer already. I'd add that numpy arrays + sequences will work unchanged; non-numpy arrays + sequences should error; and in general it's a bad idea to mix array objects coming from different libraries. Rather get a clear exception there and have the user convert one kind of array to the other, to get the behavior they intended. That's how all libraries except NumPy work, and what NumPy does is an endless source of bugs and user confusion.

I expect the most common use-case is mixing $SOME_OTHER_ARRAY_TYPE and standard numpy arrays (e.g. multiplying a sparse matrix or sparse array by a numpy array, convolving some other array type with a kernel that is a numpy array, etc.).

The sparse-dense mix is the only one that in principle makes sense. I think that could be allowed in the future, but I'm not exactly sure under what circumstances. Right now all sparse array types forbid auto-densification, so where you can mix sparse-dense is quite restricted. PyTorch has the best story here, because it natively provides both dense and sparse tensors - but even there it's still a work in progress.

@rgommers
Member Author

So basically, would it be okay to slightly refactor the current SciPy implementation of a function using some NumPy method to some other NumPy method for the sake of achieving interoperability through this work?

Yes indeed, that is expected in many cases I'd say.

2. If such np.<some_method> (which is still out of the scope of Array API) is used within a particular SciPy method/class; will that SciPy method still be marked as incompatible with this work and be completely ignored for now?

I'd say "it depends". If it's simple to reimplement np.xxx with other functions that are part of the standard, then that can be done (special-casing numpy arrays for performance even if that makes a big difference). If it's not simple to do that, then we have a gap. If it's a big gap (hopefully won't happen often) then we should propose it for inclusion in the standard, and once that proposal is accepted implement it in array-api-compat. If it's niche functionality, then we have to deal with that on a case-by-case basis.
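For the np.iscomplex example from your question, a reimplementation with standard-only operations could look like this (a sketch assuming the v2022.12 isdtype/imag APIs, not actual SciPy code):

def iscomplex(x, xp):
    # True where the imaginary part is nonzero; all-False for real dtypes
    if xp.isdtype(x.dtype, "complex floating"):
        return xp.imag(x) != 0
    return xp.zeros(x.shape, dtype=xp.bool)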

3. Is there already an identified list of targeted submodules or methods in SciPy that can be made to work with array-api-compat and that we will focus on initially? What if a submodule is largely not compatible with Array API (and needs extra infra like dispatching with uarray etc.), but maybe a few methods/classes inside the submodule are compatible? Do we make only those methods/classes compatible in the largely incompatible submodule or completely ignore the whole submodule in such a case?

I'd prefer for coherent sets of functionality to be upgraded at once. To stay with Tyler's example of spatial: if 90% of functionality will only work with NumPy, then there's not much point using the array API standard. In those cases we can still use the central infrastructure I proposed for input validation though (e.g., reject object and masked arrays) to gain consistency - there's nothing blocking that.

@seberg
Contributor

seberg commented Apr 14, 2023

That's how all libraries except NumPy work, and what NumPy does is an endless source of bugs and user confusion.

Thanks for the proposal! I doubt its urgent to figure out (see end), but I need to push back a little on promotion/mixing story here.

The big issues are really not about mixing/promotion. They are about NumPy thinking that:

  • any arbitrary object that doesn't shout "no" should be promoted to a NumPy array. (Even nested lists/sequences that contain objects which do shout "no". That last part could be something to allow changing even 🤔)
  • we/libraries often call np.asarray(), which coerces and doesn't give proper promotion a chance to begin with (but you would stop doing this here!). (Proper promotion would mean e.g. JAX decides, not NumPy. The __duckarray__ proposal would actually fix it, and maybe that could even still be relevant in some form as a simple improvement for __array_function__. But that is not important here.)

Unlike NumPy, most array implementers do not pretend that it's OK to promote almost everything to their array type.

But promotion/mixing is a different thing. Many do promote: JAX (as far as I can tell), dask.array, sparse arrays, masked arrays, quantities all promote/mix great with NumPy arrays. Subclasses also mix with their superclass, and if they don't customize the superclass namespace, they are actually supported in the current state.

sparse/dense

I admit sparse/dense may be trickier. But I can see that for many things it either makes perfect sense or can be trusted to raise a good error. Some ops densify, some ops don't, and those where it's not clear raise.
The main way this could end up problematic is if subtle code changes might somehow end up changing the result type. I doubt that is the case, but I don't know. My feeling is that this is something that sparse arrays need to figure out.

Is supporting limited promotion important enough to be needed? I don't know. I tend to think the answer is yes, and I somewhat suspect e.g. Dask users may well be annoyed if it's missing (I am assuming Dask has a good reason to support it very well, but trying to gauge it better. Would post the result on the array-api issue if there is a clearer one).

Overall my angle/answer would be: Fortunately, it should not be a big mess to add simple promotion/mixing support here. (I am confident about that, but I think it's important to clarify since it is not obvious.)
So probably it need not be an urgent thing to figure this out. But it is an interesting question that I am sure will come up again and I expect it may well prove important enough sooner or later.

@rgommers
Member Author

Thanks for your thoughts Sebastian!

(Even nested lists/sequences that contain objects which do shout "no". That last part could be something to allow changing even 🤔)

yes please :) I think our NumPy 2.0 plate is pretty full already, so I'm hesitant to invest a lot of time now - but if there are easy wins there to avoid promote-to-object-array issues, let's try to take them.

Proper promotion would mean e.g. JAX decides, not NumPy

This is not really "promotion" in any proper sense. JAX and NumPy arrays are two different types of arrays that are unrelated by class hierarchy. There happens to be an ordering though, through __array_priority__. I have to admit I'm not certain how future-proof that is. Best practice is clearly to explicitly convert to the desired kind here, mixing them is not great. But it may work. What my proposal is trying to say is to reject things that aren't related, rather than blindly stuffing them into np.asarray. It should be more precise in what relations are allowed/supported - I had not thought of __array_priority__.

I've worked on quite a few PyTorch issues with the same problem. PyTorch tensors and NumPy arrays are unrelated, so things like x + t working but t + x erroring out (or vice versa) is quite confusing. It should just error.

But promotion/mixing is a different thing. Many do promote: JAX (as far as I can tell), dask.array, sparse arrays, masked arrays, quantities all promote/mix great with NumPy arrays.

Again, only when there is a clear relation. scipy.sparse and pydata/sparse mostly error out when mixed with ndarray. Masked arrays do work because they're a subclass (modulo silently dropping the mask of course). Some quantities objects work well with numpy because they're a subclass (see, e.g. AstroPy's Quantity) or because they use things like __array_function__. Those cases will continue to work. In the absence of that well-defined relationship, things will not work. So you cannot and should not mix Quantity with JAX/PyTorch/CuPy/sparse.

Overall my angle/answer would be: Fortunately, it should not be a big mess to add simple promotion/mixing support here.

If this is well-defined, I agree. We just need to be very explicit what the mechanisms are, and they should be introspectable. I want to avoid "just try it and see what rolls out, YMMV".

@seberg
Contributor

seberg commented Apr 14, 2023

Collapsing everything into details, because... it doesn't matter much for the RFC, but I can't leave some of these things unanswered.

(We could get hung up and say I should use "implicit conversion" like the C standard, making the argument about implicit/explicit conversion – but Julia uses "promotion" the way I do as far as I understand, so I will change my habit when someone asks me to or points out why it's very wrong.)

but if there's easy wins there to avoid promote-to-object-array issues, let's try and take them.

This is not about promote-to-object-array issues. It's about JAX saying it's higher priority than NumPy, but NumPy ignoring that when you write numpy_arr + [jax_array]. Maybe also about a way to write np.asarray(jax_array) that raises an error because JAX says it's the higher priority.

This is not really "promotion" in any proper sense

Aha? JAX and NumPy arrays are two different types of arrays that are unrelated by class hierarchy. Just as: int and float are two different types of numbers that are unrelated by class hierarchy.

Unless you also say 3 + 3.0 = 6.0 is not promotion, in that case, I agree.

So yes, there is an (abstract) base class (array and number). There is a clear hierarchy everyone agrees on and expects (JAX > NumPy, Dask > NumPy, and float > int). And neither of these conversions when mixing the two is perfectly "safe" in some sense, but implicit int -> float conversions are still universally useful, and numpy -> dask.array/JAX/masked array implicit conversions also seem useful (otherwise Dask would not do it, no?).

Sure, "array" is a bit more complicated than "number".

There happens to be an ordering though, through __array_priority__

__array_priority__ is more of a remnant. The real thing is __array_ufunc__ which can just be None to tell NumPy: "No". It isn't relevant here.

I've worked on quite a few PyTorch issues with the same problem. PyTorch tensors and NumPy arrays are unrelated, so things like x + t working but t + x erroring out (or vice versa) is quite confusing. It should just error.

Fine, it's confusing and probably has things that need improving. But also __array_ufunc__ = None should guarantee you an error. Yeah, it seems like it doesn't end up with a great error, but you could probably improve on that in your __add__ or by implementing a dummy __array_ufunc__.

But I don't even think that matters. If a type is comfortable with implementing __array_ufunc__ such as Dask (or even just the operators like JAX), then that type can mix with NumPy arrays fine in practice (not with np.<func>, but we know we won't use those here).

And there is no problem with allowing such a type to signal that it wants to do just that. That still allows Torch to signal that it doesn't want to mix!

So is it annoyingly hard to tell NumPy "no"? Yes, and maybe that needs to be improved. But frankly, I think it is misleading to use those difficulties as an argument that mixing is generally pretty much doomed to fail.

We have many examples of successful implementations that mix fine. Like dask, masked arrays, or quantities...

What my proposal is trying to say is to reject things that aren't related, rather than blindly stuffing them into np.asarray

But because you stop stuffing things blindly into np.asarray() you suddenly can start reasoning about their relation. The problem is that right now we break any relation by explicit conversion via np.asarray().

Take np.asarray() out of the equation, and Dask and NumPy already have a well defined relation and mix great!

Masked arrays do work because they're a subclass

No, I would be very surprised if Allan's masked array is a subclass and I am sure it has better behavior than np.ma and mixes great with NumPy arrays.

The important point is that it wants to work. That can just as well be by implementing __array_ufunc__ (or even only operators). Allowing interop doesn't mean that np.func(my_object) needs to work (although mylib.func(numpy_array) must).

If this is well-defined, I agree. We just need to be very explicit what the mechanisms are, and they should be introspectable. I want to avoid "just try it and see what rolls out, YMMV".

This is a reason, unlike the "it's a mess" argument, IMO. We do have quite a bit of prior art, since every other protocol defines a mechanism. It still needs care of course (there are small variations after all).
For me the main point is conservative guidance, similar to and in addition to NEP 18, though.

But at that point it's: We still need to figure that out because we don't want to get it wrong. And not "it's a bad idea".

@rgommers
Member Author

But at that point it's: We still need to figure that out because we don't want to get it wrong. And not "it's a bad idea".

Maybe we're not disagreeing all that much. I said briefly "it's a bad idea" because (a) it's apparently so complex that after all these years we still haven't figured it out, and (b) the gains are in many cases just a few characters saved on explicit conversions in the user's code. So my personal assessment is that this isn't worth the effort. Others may disagree of course, and spend that effort.

I think the appropriate place for discussing this topic further is either on the numpy repo or on pydata/duck-array-discussion, which was/is an effort specifically to sort out this hierarchy.

It doesn't matter much for the RFC

I agree. We're not closing any doors in this RFC, it can in principle be enabled in a future extension.

@peterbell10
Member

peterbell10 commented Apr 20, 2023

A few assorted questions/comments:

  1. Would it make sense to introduce the asarray replacement first and eagerly warn about conversions whose behavior would change under this RFC?
  2. For testing, is the idea that you run the test suite multiple times with different ARRAY_API_TEST_MODULE flags? My first thought was this should be a pytest fixture where xp is an explicit argument to the test. This would allow multiple versions to be tested in a single pytest call. The behavior of this fixture could also be configurable via environment flags.
  3. Is there a policy for introducing new compiled code? Do we need to preserve the historical array-API-compatible version, since otherwise it would be a bc-breaking change for non-CPU arrays?
  4. I agree that mixing array libraries shouldn't be a priority, however I wonder how multiple array types from the same library would be handled, assuming they use the same namespace?

@rgommers
Member Author

Thanks @peterbell10, all good questions.

  1. Yes, perhaps. As long as there's a way to opt in to the new behaviour easily to make those warnings go away (maybe with an environment variable?). It'd be annoying to have to change all your code away from, say, PyTorch CPU tensors as input by explicit conversion to NumPy arrays, and then back once the new behavior arrives. It depends a bit on timing - release cycles are pretty slow. Will have to think about this one a bit more.

  2. That was the idea. I do like fancier things too - don't know enough about fixtures, but I assume explicit xp is good practice. I do have in mind to run the pytorch tests only in 1-2 separate CI jobs, and not double the test time everywhere.

  3. Great point. This policy should be written down as part of this RFC (and then in the contributor guide). And yes, I do think that is "keep the pure Python path".

  4. We'll have to play with that. We may just simply retrieve xp from the first array input, and then let that library deal with rejecting mixed arrays. In general when you get two unknown array types that don't have a class hierarchy, it's not really possible to know whether they work together or not. Once you have a fixed library like pytorch, you know how to check, but for unknown libraries I'm not sure. This is the trickiest question to answer though, it'll require a bit of prototyping.
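Roughly like this, I'd think (a sketch; as far as I know array-api-compat's array_namespace helper already raises a TypeError when the inputs come from different array libraries):

from array_api_compat import array_namespace

def _some_scipy_func(a, b):
    xp = array_namespace(a, b)   # TypeError if a and b are from different libraries
    return xp.sqrt(a) + b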

@tylerjereddy
Contributor

Let me get a bit more specific now. I started studying the demo work here: https://quansight-labs.github.io/array-api-demo/GW_Demo_Array_API.html

scipy.signal.welch seemed to be a well-behaved target for prototyping based on that so I tried doing that from scratch on a feature branch locally, progressively allowing more welch() tests to accept CuPy arrays when an env variable is set.

Even for this "well-behaved"/reasonable target, I ran into problems pretty early on. For example, both in the original feature branch and in my branch, there doesn't seem to be an elegant solution for handling numpy.lib.stride_tricks.as_strided. The docs for that function don't even recommend using it, and yet CuPy (and apparently torch, going by the Quansight example) does provide it, outside of the array API standard proper.

So, I guess my first real-world experience makes me wonder what our policy on special casing in these scenarios will be--ideally, I'd like to just remove the usage of as_strided() and substitute with some other approach that doesn't require conditionals on the exact array type/namespace. While this is a rather specific blocker, if I encounter something like this even for a "well behaved" case, I can imagine quite a few headaches for the cases that are not well behaved.

Anyway, maybe I'll get an idea of what we want to do in general based on reactions to this specific case. I assume we could move much faster if we just accept out-of-array-API common shims by checking the array types like Anirudh and Ralf did here: https://github.com/AnirudhDagar/scipy/blob/array-api-demo/scipy/signal/spectral.py#L2007 in a sort of "perfection is the enemy of done" approach, though this is obviously not quite perfect.

Something like this is clearly out of scope for array-api-compat. If there are going to be a few examples of non-API standard but tricky to work around things, perhaps the recommended work-arounds/shims might be collected somewhere (or maybe scikit-learn has already done a replacement of this nature more recently).

@ilayn added the RFC label Apr 25, 2023
@rgommers
Member Author

Thanks @tylerjereddy, that's a very good question. I think that indeed this is something that will come up regularly, and the specific cases at hand may determine how fast things can be implemented and whether the end result is more/less maintainable or more/less performant.

Every case will be a bit different, and in some a rewrite will help, in some it may lead to an extra code path or some such thing. Let's have a look at this case - this is the code in question:

    # Created strided array of data segments
    if nperseg == 1 and noverlap == 0:
        result = x[..., np.newaxis]
    else:
        # https://stackoverflow.com/a/5568169
        step = nperseg - noverlap
        shape = x.shape[:-1]+((x.shape[-1]-noverlap)//step, nperseg)
        strides = x.strides[:-1]+(step*x.strides[-1], x.strides[-1])
        result = np.lib.stride_tricks.as_strided(x, shape=shape,
                                                 strides=strides)

The as_strided usage is there to save memory. It's actually pretty difficult to understand what the code does exactly though, so it's good for performance but not great for maintainability. If you check what the code does, you find that it constructs a result array that is typically ~2x larger than the input array, sometimes more. In case of the example in the welch docstring, it's a (100_000,) shaped input, and an intermediate array of (194, 1024) - so 2x larger. However, the very next line in that function is:

result = detrend_func(result)

So now result is a full copy anyway, because the detrending is out-of-place. After that there are more function calls with similar overwriting of result. Hence, the actual peak memory used is very similar to the case where you wouldn't use as_strided but replace it with

result2 = np.empty(shape, dtype=x.dtype)
for ii in range(shape[0]):
    result2[ii, :] = x[ii*step:(ii*step + nperseg)]

Now, to complicate matters a little, when we use empty, we also want/need to add device support to support GPUs:

# This is temporary, once all libraries have device support, this check is no longer needed
if hasattr(x, 'device'):
    result = xp.empty(shape, dtype=x.dtype, device=x.device)
else:
    result = xp.empty(shape, dtype=x.dtype)

for ii in range(shape[0]):
    result[ii, :] = x[ii*step:(ii*step + nperseg)]

At that point, we can say we're happy with more readability at the cost of some performance (TBD if the for-loop matters; my guess is it will). Or we just keep the special-casing for numpy using as_strided; then we do have two code paths, but no performance loss and also easier-to-understand code.

If there are going to be a few examples of non-API standard but tricky to work around things, perhaps the recommended work-arounds/shims might be collected somewhere

Agreed, that will naturally emerge I think.

@asmeurer

# This is temporary, once all libraries have device support, this check is no longer needed
if hasattr(x, 'device'):

FWIW, array-api-compat has a device() helper function for exactly this.
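For example (shown with NumPy here; the same lines should work unchanged for torch or CuPy arrays):

import numpy as np
from array_api_compat import array_namespace, device

x = np.zeros((4, 8))
xp = array_namespace(x)
result = xp.empty(x.shape, dtype=x.dtype, device=device(x))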

@AnirudhDagar
Member

Thanks @tylerjereddy, as Ralf said, I'm pretty sure these will have to be dealt with case by case, and hence in my previous comment above (Question 1) I wanted to know exactly what our policy on such cases will be. And I guess the answer is again: deal with them case by case. It would be very hard to know all such cases before starting to work on the transition.

Regarding the demo: at the time of its development, our goal was to get a prototype version using the Array API out (which is not perfect). The way it's done in my feature branch (special-casing) is not ideal, and in such cases it would come down to refactoring the code to achieve the same thing using Array-API-compliant methods, as shown above by Ralf.

@tylerjereddy there is this other demo that was developed by @aktech, and might be of interest too: https://labs.quansight.org/blog/making-gpus-accessible-to-pydata-ecosystem-via-array-api

If there are going to be a few examples of non-API standard but tricky to work around things, perhaps the recommended work-arounds/shims might be collected somewhere

This is a good idea, not only for SciPy: I expect similar issues/problems will arise with other array-consuming libraries (e.g. sklearn), and having a place to collect all such examples would make development more streamlined and consistent across the libraries, with the same references to fall back on later. But again, this depends on how many such examples we will encounter.

@tylerjereddy
Contributor

It sounds like we propose to require libraries to have a complete FFT API implementation--curiously, avoiding the usage of np.fft like this on my branch fixes a test failure:

diff --git a/scipy/signal/_spectral_py.py b/scipy/signal/_spectral_py.py
index befe9d3bf..ea1f6eb27 100644
--- a/scipy/signal/_spectral_py.py
+++ b/scipy/signal/_spectral_py.py
@@ -1986,10 +1986,10 @@ def _fft_helper(x, win, detrend_func, nperseg, noverlap, nfft, sides):
 
     # Perform the fft. Acts on last axis by default. Zero-pads automatically
     if sides == 'twosided':
-        func = xp.fft.fft
+        func = sp_fft.fft
     else:
         result = result.real
-        func = xp.fft.rfft
+        func = sp_fft.rfft
     result = func(result, n=nfft)

ARR_TST_BACKEND=numpy python dev.py test -j 10 -t scipy/signal/tests/test_spectral.py

======================================================================================================================================================================================= FAILURES =======================================================================================================================================================================================
___________________________________________________________________________________________________________________________________________________________________________ TestSTFT.test_roundtrip_scaling ____________________________________________________________________________________________________________________________________________________________________________
[gw4] linux -- Python 3.10.6 /home/tyler/python_310_scipy_dev_work/bin/python
scipy/signal/tests/test_spectral.py:1614: in test_roundtrip_scaling
    assert_allclose(Zp[:129, :], Zp0)
        X          = array([   0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,
          0.+0.j,    0.+0.j,    0.+0.j,    0....+0.j,
          0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,
          0.+0.j,    0.+0.j,    0.+0.j])
        Zp         = array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         0.00000000e+00+0.00000000e+00j, .......,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         2.04124145e-01-5.01096650e-03j]])
        Zp0        = array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         0.00000000e+00+0.00000000e+00j, .......,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
        -2.04124145e-01+0.00000000e+00j]])
        Zp1        = array([[-7.25194643e-16+5.66558315e-17j, -7.25194643e-16+5.66558315e-17j,
        -7.25194643e-16+5.66558315e-17j, .......,
        -7.25194643e-16-5.66558315e-17j, -7.25194643e-16-5.66558315e-17j,
         2.04124145e-01-5.01096650e-03j]])
        Zs         = array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         0.00000000e+00+0.00000000e+00j, .......,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
        -1.56250000e-02+0.00000000e+00j]])
        power_x    = 2.0
        psd_Zp     = array([2., 2., 2., 2., 2., 2., 2., 2., 2.])
        self       = <scipy.signal.tests.test_spectral.TestSTFT object at 0x7f0e72850b20>
        x          = array([ 2.,  0., -2., ...,  0., -2.,  0.])
        x1         = array([ 2.-7.07631847e-34j,  0.+0.00000000e+00j, -2.+1.51143410e-32j, ...,
        0.+0.00000000e+00j, -2.+7.36976637e-17j,  0.+0.00000000e+00j])
/usr/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
E   AssertionError: 
E   Not equal to tolerance rtol=1e-07, atol=0
E   
E   Mismatched elements: 496 / 1161 (42.7%)
E   Max absolute difference: 8.88611995e-16
E   Max relative difference: 15.1575437
E    x: array([[ 0.000000e+00+0.000000e+00j,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, ...,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, -2.041241e-01+0.000000e+00j],...
E    y: array([[ 0.000000e+00+0.000000e+00j,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, ...,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, -2.041241e-01+0.000000e+00j],...
        args       = (<function assert_allclose.<locals>.compare at 0x7f0e727f1120>, array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+...,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
        -2.04124145e-01+0.00000000e+00j]]))
        func       = <function assert_array_compare at 0x7f0f195de200>
        kwds       = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=1e-07, atol=0', 'verbose': True}
        self       = <contextlib._GeneratorContextManager object at 0x7f0f195e5de0>


Also, for stuff like assert_allclose() do we want to assume that both arguments have been copied back to the host? It looks like CuPy has its own assert_allclose() as well, though we probably can't rely on that for libs in general. There is a to_device() adapter in the compat library, but not to_host() I don't think. For CuPy there is stuff like .get() for an explicit pull-back to host, but it is intentionally designed not to copy DtoH automatically/implicitly and will error out if you try.

@tylerjereddy
Contributor

tylerjereddy commented Apr 27, 2023

Brief follow-up: that's reproducible on main as well if you swap in the NumPy fft variants on those two lines. Does it merit its own issue, or is it nothing particularly surprising for the FFT gurus? I don't see anything searching for that test on our tracker.

@rgommers
Member Author

Brief follow-up: that's reproducible on main as well if you swap in the NumPy fft variants on those two lines. Does it merit its own issue, or is it nothing particularly surprising for the FFT gurus?

There will be small differences in precision, because the scipy.fft implementation is C++ code that uses SIMD (and hence is faster), while the numpy.fft code is plain C. We'd like to update NumPy to use the same C++ code as SciPy, but that's work. If there's a test failure somewhere, that is business as usual and unrelated to this RFC I think.

@rgommers
Member Author

Also, for stuff like assert_allclose() do we want to assume that both arguments have been copied back to the host? It looks like CuPy has its own assert_allclose() as well, though we probably can't rely on that for libs in general. There is a to_device() adapter in the compat library, but not to_host() I don't think.

You should be able to use to_device('cpu') instead of to_host I'd think. Other than that: testing for specific libraries, especially GPU, requires a bit more thought. We do not require testing functions to be present within the array API standard, so I imagine we'll be using something like our own assert_allclose that switches explicitly to CuPy's assert_allclose when the input is a CuPy array.
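A rough sketch of what such a helper could look like (the name and details are made up):

import numpy as np

def xp_assert_allclose(actual, desired, rtol=1e-7, atol=0):
    if type(actual).__module__.startswith("cupy"):
        from cupy.testing import assert_allclose as cp_assert_allclose
        cp_assert_allclose(actual, desired, rtol=rtol, atol=atol)
    else:
        np.testing.assert_allclose(np.asarray(actual), np.asarray(desired),
                                   rtol=rtol, atol=atol)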

@sturlamolden
Contributor

container type in == container type out

I think this will clutter the SciPy code with endless container type conversions in any exported function or method. It will contribute to code rot and also make things slower.

Basically I think users should be responsible for their own container types.

@tylerjereddy
Contributor

The diff for my draft branch for enabling CuPy/NumPy swapping for welch() is over at tylerjereddy#70. I suspect the main thing of interest might be the testing shims to enforce container type in == container type out and to allow swapping between NumPy/CuPy in the tests with an env var. I found no evidence of performance improvements for simple uses with CuPy, but finding the right problem (and problem size) may be best left to the domain experts.

One thing I found myself wanting was a universal/portable to_device(x, "cpu") for testing purposes as noted above. It sounds like it isn't clear what we'll do upstream for that just yet--array-api-compat temporarily merged my shim, but because this changes the array namespace when moving to host (NumPy) with CuPy but not for, e.g., torch (which has genuine CPU and GPU tensors), things are a bit murky...

@rgommers
Member Author

rgommers commented May 3, 2023

I think this will clutter the SciPy code with endless container type conversions in any exported function or method. It will contribute to code rot and also make things slower.

I don't think that will be the case. It's possible there will be a few more type checks, but no more (or even fewer) type conversions. A few examples:

  • numpy.ndarray: unchanged, no conversions now or in the future
  • numpy.ndarray subclasses: now converted to ndarray, afterwards kept unchanged
  • cupy.ndarray and torch.Tensor (gpu): now not accepted, in the future accepted without conversion where possible
  • torch.Tensor (cpu): now converted to numpy.ndarray, in the future (a) unchanged where we have pure Python code and (b) converted once internally and then back in case we hit compiled code.
    • I'll note that it will be ensured that this happens only once for case (b), and not as to-from conversions in multiple places when one SciPy function internally calls another SciPy function

I found no evidence of performance improvements for simple uses with CuPy, but finding the right problem (and problem size) may be best left to the domain experts.

Thanks for sharing your progress on trying this out Tyler! Regarding this performance expectation: typically CuPy will be slower than NumPy for small arrays (due to overhead of data transfer and starting a CUDA kernel) and there will be a cross-over point beyond which CuPy gets faster. Typically that's in the 1e4 to 1e5 elements range. And beyond 1e6 or 1e7 you'll see a roughly constant speedup, the magnitude of which will depend on the details but typically in the 5x-50x range.

@tylerjereddy
Copy link
Contributor

For this small script on that branch:

import sys
import time
import numpy as np
import cupy as cp
from scipy.signal import welch


def main(namespace):
    size = 70_000_000
    if namespace == "numpy":
        x = np.zeros(size)
    elif namespace == "cupy":
        x = cp.zeros(size)
    x[0] = 1
    x[8] = 1
    f, p = welch(x, nperseg=8)
    print("f:", f)
    print("p:", p)


if __name__ == "__main__":
    start = time.perf_counter()
    main(namespace=sys.argv[1])
    end = time.perf_counter()
    print("Elapsed Time (s):", end - start)
  • 7e6 points:
    • NumPy: 1.42 s
    • CuPy: 12.52 s
  • 7e7 points:
    • NumPy: 13.84 s
    • CuPy: 126.54 s

I don't know that we expect that particular incantation to get faster though. Could do some profiling or check the device copy activity with nsight-systems I suppose, but the array type pass-through/preservation tests do pass. I don't know that we're guaranteed to avoid performance issues with devices, depending on the algorithm/need to do a genuine amount of work on the device.

Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz vs NVIDIA GeForce GTX 1080 Ti

tylerjereddy added a commit to tylerjereddy/scipy that referenced this issue Mar 27, 2024
* related to scipygh-18286

- this only drafts array API support for `welch()`, not all of `signal`, and I think we may not have been keen on partial submodule support; nonetheless, initial feedback may be helpful before I expand more broadly (if that's the desired approach)
- this has been adjusted a bit from the initial draft that was used for the `welch()` benchmarks in the array API conference paper, because decisions and internal array API infrastructure changed in the last few weeks, so should double check that those performance improvements remain
- there are some pretty tricky shims in a few places, that are no doubt going to require some discussion

I did check locally that these pass:
- `SCIPY_DEVICE=cuda python dev.py test -b numpy -b pytorch -b cupy -- -k "TestWelch"`
- `SCIPY_DEVICE=cpu python dev.py test -b numpy -b pytorch -b cupy -- -k "TestWelch"`
- `python dev.py test -j 32`

[skip cirrus] [skip circle]
@mdhaber
Copy link
Contributor

mdhaber commented Mar 30, 2024

As I'm working on stats, I'd like to revisit this:

For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). A pro of doing that would be that everything works, and there may still be performance gains. A con is that it silently does device transfers, usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max). Which means both addressing the dispatch (uarray & co) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have that, raising exceptions is the way to go.

There are a lot of statistical tests (e.g. pearsonr, gh-20284) for which almost all the work can be done using any array API backend, but right now the p-value calculation needs to be done with NumPy. There is a well-defined and funded plan to make the p-value calculations happen on any array API backend (i.e. new distribution infrastructure). In the meantime, for arrays large enough for alternative processing units to make sense, there is a performance advantage to doing the statistic calculation on the native processing unit and p-value calculations on the CPU, even considering the overhead of device transfers.

I'd suggest the addition of an _lib._array_api._asnumpy(x, xp=None) function that performs the appropriate conversion of x to a NumPy array. Of course, it would "only" work on the most common array libraries that we're testing with (NumPy, CuPy, PyTorch, and JAX) and those for which np.asarray(x) just works, but this should cover enough cases to make it useful. Besides performance improvements, this gives us the ability to test the functions on backends that would not be supported otherwise so there are fewer roadblocks when we can perform the whole function with the original backend.
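
For illustration, a rough sketch of what such a helper might look like; the name follows the suggestion above, while the attribute checks and branches are assumptions about the common backends rather than a final design:

import numpy as np

def _asnumpy(x, xp=None):
    # Sketch only: convert the most common array types to a NumPy array on the host.
    if isinstance(x, np.ndarray):
        return x
    if hasattr(x, "get"):     # e.g. cupy.ndarray: explicit device-to-host copy
        return np.asarray(x.get())
    if hasattr(x, "cpu"):     # e.g. torch.Tensor: move to CPU first if needed
        return np.asarray(x.cpu())
    return np.asarray(x)      # JAX arrays and anything else np.asarray understands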

@sturlamolden
Copy link
Contributor

sturlamolden commented Mar 30, 2024

This will be the death of SciPy.

We used NumPy because it used to be scipy.core. One array library is OK.

People are not going to learn a multitude of array libraries just to write generic code that works on all of them. Contributions to SciPy will dwindle. This is the transition from active development to abandonware. This is how software dies.

Future development will be tracing bugs originating from ever-changing array libraries and keeping up with the moving targets. SciPy will die as a mess that nobody will be bothered to maintain.

@lucascolley
Copy link
Member

People are not going to learn a multitude of array libraries just to write generic code that works on all of them.

One does not need to learn a multitude of array libraries. One only needs to write to the array API standard specification. Thanks to NEP 56, this is a subset of the NumPy API.

@rgommers
Copy link
Member Author

[...] In the meantime, for arrays large enough for alternative processing units to make sense, there is a performance advantage to doing the statistic calculation on the native processing unit and p-value calculations on the CPU, even considering the overhead of device transfers.

That seems fine to me to implement like that then.

I'd suggest the addition of an _lib._array_api._asnumpy(x, xp=None) function that performs the appropriate conversion of x to a NumPy array.

Yes we need a function like that. We should add it as a compat function from_dlpack, because the new device and copy keywords that from_dlpack recently gained (for v2023.12) will do exactly this.

So actually I think the right thing to do here is to update array-api-compat's from_dlpack function (xref data-apis/array-api-compat#127), and update our vendored copy of array-api-compat to pull in the new functionality.
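
Roughly, once that's in place the conversion could then be spelled as below; this is a sketch that assumes a from_dlpack new enough to implement the v2023.12 device/copy keywords, with x being any DLPack-capable array:

import numpy as np

# Request a host (CPU) copy of `x`, regardless of which library/device it lives on.
x_np = np.from_dlpack(x, device="cpu", copy=True)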

@mdhaber
Copy link
Contributor

mdhaber commented May 15, 2024

Some questions have come up recently w.r.t. dtype conversions:

  1. Some scipy.stats functions silently promote to float64 without advertising that this is their behavior. For instance, pmean states that integers will be promoted to float64 before performing the calculation, but in fact it uses np.float_power, which also promotes lower-precision floats to float64. I have assumed that it's OK to consider promotion of lower-precision floats to float64 undesirable behavior and for these functions to begin respecting standard dtype promotion rules as we convert to array API. Is that correct?

  2. I'm not sure if there are any, but there may be some functions that explicitly state that they will promote to a certain dtype before performing calculations. For these, shall we raise deprecation warnings when these functions receive lower-precision float types, stating that we will no longer be performing these conversions at the end of the deprecation period? (Users can respond by performing the desired conversions themselves.)

  3. Some scipy.stats functions, such as pmean, have a dtype parameter. Shall we deprecate these and let users be responsible for performing dtype conversions?

  4. Many scipy.stats functions have accepted integral dtypes in the past, but the array API does not define type promotion between integers and floats, and many array API functions (e.g. mean) used by stats functions aren't required to support integer dtypes. What should we do when converting functions that rely on these behaviors? If the alternative is deprecating and removing support for integer types, I think the easiest and most convenient for both developers and users would be to just promote xp integer dtypes to the default float type of the library (e.g. float64 for NumPy, float32 for Torch) and continue to support this behavior.
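
For concreteness on question 4, a hedged sketch of what such a promotion could look like using array API inspection utilities (the helper name is hypothetical and for illustration only):

import array_api_compat

def _promote_int_to_default_float(x, xp=None):
    # Sketch only: promote integer inputs to the namespace's default real floating
    # dtype (float64 for NumPy, float32 for PyTorch); leave floating inputs alone.
    xp = array_api_compat.array_namespace(x) if xp is None else xp
    if xp.isdtype(x.dtype, "integral"):
        x = xp.astype(x, xp.asarray(0.0).dtype)  # default float dtype of `xp`
    return x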

@sturlamolden
Copy link
Contributor

scipy.stats might need higher precision than float32 internally, e.g. for numerical stability. Users of scipy.stats should not care what numerical precision it uses internally.

@mdhaber
Copy link
Contributor

mdhaber commented May 16, 2024

I've seen speed and memory arguments made for using lower-precision types internally, so I'm not sure we can make a prescription for everyone. We can add notes to the documentation of functions that may suffer unexpectedly high precision loss with lower precision types, though. And users are of course always welcome to use float64 for calculations by specifying it manually. I just don't want to take away the option to use float32 by hard-coding a type.

@jakevdp
Copy link
Member

jakevdp commented May 16, 2024

scipy.stats might need higher precision than float32 internally, e.g. for numerical stability. Users of scipy.stats should not care what numerical precision it uses internally.

Some hardware (e.g. some types of GPU) is very inefficient when it comes to float64 operations, and some hardware (e.g. some types of TPU) does not support float64 at all. So if you're talking about array API implementations where users might be using accelerator-backed arrays, I think users will very much care what precision is used internally.

@steppi
Copy link
Contributor

steppi commented May 16, 2024

scipy.stats might need higher precision than float32 internally, e.g. for numerical stability. Users of scipy.stats should not care what numerical precision it uses internally.

In a lot of cases, such numerical stability issues can be handled by writing dtype specific implementations of functions. This is the route I'm planning to take for special, which will carry over to stats.

@rgommers
Copy link
Member Author

I'd say that type promotion is not strictly related to the introduction of support for other array types - having SciPy functions behave uniformly and preserving lower-precision input dtypes makes perfect sense in a numpy-only world as well. We just happen to have a lot of historical inconsistencies here.

There are two related questions here:

  1. What's the desired design?
  2. Is it possible to move to that design without too much disruption?

Putting it behind the SCIPY_ARRAY_API switch may be a useful introduction strategy, so it helps with (2). But I'm not sure we can always get there, or that it always pays off to do so. E.g., we have functionality in C/C++/etc. that's double-only, so at that point all speed/memory/accuracy discussions are moot; the only question is whether or not to downcast before returning a result to the user.

There are other considerations as well, like not wanting to inflate binary size too much by templating over many dtypes. There are only a few functions involving compiled code that can afford to do that.

scipy.stats might need higher precision than float32 internally, e.g. for numerical stability. Users of scipy.stats should not care what numerical precision it uses internally.

The last bit here is very much a valid point. The array API design also doesn't care about what happens internally, the type promotion rules are for the expected output dtype. Even JAX and PyTorch will do upcasts internally (e.g., accumulators for float16 reductions are often float32 or float64) when precision requirements can't otherwise be met.

My thoughts on @mdhaber's questions 1-4:

  1. [..] I have assumed that it's OK to consider promotion of lower-precision floats to float64 undesirable behavior and for these functions to begin respecting standard dtype promotion rules as we convert to array API. Is that correct?

Not necessarily. I think we should only do this if we have float32 implementations that make sense. And float16 probably almost never makes sense for stats. Large-scale changes that are not improving speed/memory but just downcast at the end are probably not ideal. We can consider it, but then I think it may need a separate feature flag and no deprecation warnings.

  2. [...] For these, shall we raise deprecation warnings when these functions receive lower-precision float types, stating that we will no longer be performing these conversions at the end of the deprecation period? (Users can respond by performing the desired conversions themselves.)

That seems like churn for no real reason. The end result is the same for the end user, with slightly more verbose code for them and a change they have to make.

  3. Some scipy.stats functions, such as pmean, have a dtype parameter. Shall we deprecate these and let users be responsible for performing dtype conversions?

Only if it makes the stats design as a whole more consistent and makes sense from a numpy-only point of view as well.

  4. Many scipy.stats functions have accepted integral dtypes in the past, but the array API does not define type promotion between integers and floats, and many array API functions (e.g. mean) used by stats functions aren't required to support integer dtypes. What should we do when converting functions that rely on these behaviors? If the alternative is deprecating and removing support for integer types, I think the easiest and most convenient for both developers and users would be to just promote xp integer dtypes to the default float type of the library (e.g. float64 for NumPy, float32 for Torch) and continue to support this behavior.

That's the most tricky question perhaps, and also the only one that's directly array API standard related. I'd lean toward leaving the behavior for NumPy arrays unchanged for backwards compat reasons, but not accepting integer dtypes for other array types.

@mdhaber
Copy link
Contributor

mdhaber commented May 16, 2024

I'd say that type promotion is not strictly related to the introduction of support for other array types

It's not, but I think it's going to come up a lot with array API conversions. I've planned for a while to go through and look at dtypes once axis/nan_policy support was more consistent. Working on array API conversions gives an excuse for finally doing that, and it forces the issue because 1) the xp_asserts check dtype by default and 2) other array backends follow more regular promotion rules than pre-NEP 50 scalars.

What's the desired design? Is it possible to move to that design without too much disruption?

I'm proposing that the desired design is for stats reducing functions to typically follow the precedent set by e.g. np.var and other reducing functions: use the user-provided type throughout the calculation by default.

import numpy as np
x = np.asarray([30000., 30010., 30020.], dtype=np.float16)
np.var(x)  # inf - expected behavior, unless we change dtypes or work in log-space
# similar for other array backends

Besides following precedent, this gives users control over the dtype used rather than imposing one on them. We could add a dtype keyword for user convenience in the future if there is enough of a push for it.

Is that the appropriate behavior for xp.var?
By extension, would similar behavior be appropriate for, say, scipy.stats.skew and functions that rely on it?


As a more detailed example of issue 1, let's look at stats.ttest_rel. In main, ttest_rel performs an undocumented conversion to np.float64 near the beginning of the function.

d = (a - b).astype(np.float64)

Alternatively, here is a bare-bones implementation of a one-sided ttest_rel that respects the dtype of the input.

import numpy as np
from scipy import stats, special

def ttest_rel_respect_dtype(a, b):
    dtype = np.result_type(a, b)
    n = np.asarray(a.shape[0], dtype=dtype)
    df = np.asarray(n - 1, dtype=dtype)
    d = a - b
    v = np.var(d, ddof=1)
    dm = np.mean(d)
    t = dm/np.sqrt(v/n)
    p = special.stdtr(df, t)
    return t, p

At least in some cases, if we pass float32 inputs into the two functions, the results match to the best precision we can hope for from float32.

rng = np.random.default_rng()

x = rng.standard_normal(size=100)
y = rng.standard_normal(size=100)

dtype = np.float32
x, y = np.asarray(x, dtype=dtype), np.asarray(y, dtype=dtype)
ref = stats.ttest_rel(x, y, alternative='less')
t, p = ttest_rel_respect_dtype(x, y)
eps = np.finfo(t.dtype).eps
np.testing.assert_allclose(t, ref.statistic, rtol=eps)
np.testing.assert_allclose(p, ref.pvalue, rtol=eps)

(If we use float16, the test still passes, but the p-value is a float32 because special.stdtr does not seem to be implemented for float16. I've also tried scaling the inputs by various factors, and still this very strict test passes. If it were needed, I think it would be justifiable to multiply the test rtol by 10 or even 100.)

I think this is a case in which we could have a solid float32 implementation. While we're going through and converting this function to be array API compatible, it would be a convenient time to eliminate the undocumented conversion to float64 and use the container dtype throughout. This would extend the principle of "container type in == container type out" to "dtype in == dtype out" and enable the benefits associated with use of lower-precision types (speed, memory, ability to use hardware that doesn't like float64).

Do we do that, something else (e.g. perform variance calculation in log-space, add a dtype argument to specify the intermediate dtype, etc.), or do we have to preserve the old behavior (forcing promotion to float64 at the beginning and returning float64)?

ttest_ind is another good example where this question comes up, but for a different reason. Given float32 inputs in SciPy 1.12.0

  • 1D inputs (or ND input with axis=None) result in float64 output with pre-NEP 50 promotion rules
  • ND input results in float32 statistic and p-values

SciPy 1.13 produces float64 p-values even with float32 statistics because p-values started to be generated by scipy.stats.t instead of directly from special functions. This hasn't been reported.

@sturlamolden
Copy link
Contributor

sturlamolden commented May 17, 2024

Integers should not be promoted to float32 in stats because a float32 does not have the numerical precision to represent all int32 values precisely. The mantissa is too small. That is why float64 is needed. We still have the problem that not all int64 can be represented by a float64, but the problem only exists for very large ones.

It is common in statistics that datasets consist of integers but computations have to be done in floating point. Then we want a floating point mantissa that allows us to represent all values in the dataset precisely. A float32 will often be sufficient, but sometimes it will not. A float64 will nearly always suffice, except for astronomically large integers.
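
As a quick illustration of the mantissa limit: 2**24 + 1 is an ordinary int32 value, but float32 cannot represent it exactly, while float64 can.

>>> import numpy as np
>>> np.float32(np.int32(2**24 + 1))  # 16777217 rounds to the nearest float32
16777216.0
>>> np.float64(np.int32(2**24 + 1))  # exactly representable in float64
16777217.0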

@sturlamolden
Copy link
Contributor

sturlamolden commented May 17, 2024

There is no such thing as a solid float32 implementation when a covariance matrix is close to singular (which is not uncommon in statistics). I feel there is a general lack of appreciation for one of the most fundamental aspects of scientific computing, i.e. that correctness always beats performance.

If you want statistics to run very fast, you can e.g. use Cholesky or LU instead of QR or SVD for inverting covariance matrices and fitting least-squares models. But there is a reason we don’t do that. And if you are sacrificing that kind of speed for numerical stability, there is no good reason to throw that stability away by introducing float32 and obtaining large rounding and truncation errors, which is often catastrophic in the case of an ill-conditioned covariance.

@rgommers
Copy link
Member Author

Working on array API conversions gives an excuse for finally doing that, and it forces the issue because 1) the xp_asserts check dtype by default and 2) other array backends follow more regular promotion rules than pre-NEP 50 scalars.

(2) isn't that relevant I think, since pre-NEP 50 behavior is going away. We should use the array API and NumPy 2.0 rules. (1) is useful indeed, and should be a consistency improvement - but not necessarily an excuse for large-scale breaking changes.

Is that the appropriate behavior for xp.var?

Not necessarily; that's implementation-defined. Your example is a little misleading with respect to the point you're trying to make, because the end result also doesn't fit in np.float16. However, if the end result fits in float16 then it may be desirable to return that. var doesn't, but sum for example does:

>>> f16_max = np.finfo(np.float16).max
>>> f16_max
65500.0
>>> f16_max + np.array(25.0, dtype=np.float16)
inf
>>> x = np.array([25., f16_max, -25.], dtype=np.float16)
>>> np.sum(x)   # final result fits in float16, but a naive implementation would overflow
65500.0
>>> np.std(x)  # this does overflow
inf
>>> np.std(x.astype(np.float32)).astype(np.float16)  # but the end result fits
30880.0

The returned end result should have dtype float16, but what happens in between is completely up to the implementation. In the example above, sum works but std doesn't, even though both could work, or both could overflow for a naive implementation. More work was put into sum so that it doesn't overflow there. It's an accuracy/speed/effort trade-off.
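
As a hedged illustration of that "upcast internally, return the input dtype" pattern (a sketch only, not a proposal for how np.std or any SciPy function should actually be implemented):

import numpy as np

def std_preserving_dtype(x):
    # Sketch: accumulate in float64 for accuracy, then return the caller's dtype,
    # so "dtype in == dtype out" holds even when naive intermediates would overflow.
    return np.std(np.asarray(x, dtype=np.float64)).astype(x.dtype)

With the float16 example above, std_preserving_dtype(x) returns 30880.0 as float16 rather than inf.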

I'm proposing that the desired design is for stats reducing functions to typically follow the precedent set by e.g. np.var and other reducing functions: use the user-provided type throughout the calculation by default.
[...]
I think this is a case in which we could have a solid float32 implementation. While we're going through and converting this function to be array API compatible, it would be a convenient time to eliminate the undocumented conversion to float64 and use the container dtype throughout. This would extend the principle of "container type in == container type out" to "dtype in == dtype out" and enable the benefits associated with use of lower-precision types (speed, memory, ability to use hardware that doesn't like float64).

I think this is a potentially huge amount of work, with more breakage than adding array API support. It's also going to be very hard (impossible?) to achieve consistently, since stats as a whole relies on a ton of compiled code and we're not going to be adding float32 implementations of every algorithm. And while float32 versions of sets of functionality in SciPy could make sense, float16 almost certainly doesn't. So it'd be good to keep this as a separate proposal in a new issue.

Integers should not be promoted to float32 in stats because a float32 does not have the numerical precision to represent all int32 values precisely

I agree - but don't think anyone said this? NumPy will promote to float64:

>>> (np.array([3, 4], dtype=np.int32) + 1.5).dtype
dtype('float64')
>>> (np.array([3, 4], dtype=np.int32) + np.float16(1.5)).dtype
dtype('float64')

I feel there is a general lack of appreciation for one of the most fundamental aspects of scientific computing, i.e. that correctness always beats performance. [...]

I somewhat agree with the sentiment. There are types of algorithms that make sense to provide in lower precision variants (float32 in particular), because accuracy is still acceptable and performance really matters. FFT routines are a good example. For many element-wise functions (e.g., in special) this could also be true. But when performance starts to matter less and the amount of work is higher (many stats functions), it's no longer worth it.

@jakevdp
Copy link
Member

jakevdp commented May 17, 2024

I feel there is a general lack of appreciation for one of the most fundamental aspects of scientific computing, i.e. that correctness always beats performance.

I agree this is fundamental in some areas of scientific computing, but there are areas of computational science where it is not. As an example, look at the adoption of bfloat16 for the case of deep neural nets: we literally take a float32 and chop off the last 16 bits of mantissa, resulting in much faster computation—and surprisingly, in many cases this truncation actually improves model performance, as it acts as a sort of implicit regularization.

I wouldn't claim the same exercise would work in every context, but it's an example where what's fundamental in one domain is not fundamental in another. Does SciPy intend to support these kinds of use cases? The answer may be no – but if the answer is no, let's say that explicitly, rather than pretending such domains don't exist!
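
For concreteness, the bit-chopping can be mimicked in NumPy by masking out the low 16 bits of a float32's representation; this is only an illustration of the idea, since real bfloat16 conversion typically uses round-to-nearest rather than truncation:

>>> import numpy as np
>>> x = np.array([3.14159265], dtype=np.float32)
>>> (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)
array([3.140625], dtype=float32)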

@mdhaber
Copy link
Contributor

mdhaber commented May 17, 2024

I've changed the var example. It wasn't meant to mislead about the behavior of var; it was just the wrong example to make the point. The NEP 50 stuff has come up at the same time. I'm just explaining why dtypes are coming up as we convert to the array API even though they're not only relevant to the array API.

I'm fine leaving compiled code working with float64 internally and returning the dtype of the input array. That's not going to be running on TPUs, etc., anyway. I'm talking about the code we're converting to use the array API.

I proposed what I did as the typical behavior in part because it's easier than tip-toeing around existing promotions to float64 that have no clear rationale and don't attempt to return the same dtype as the input. Is there an alternative proposal?

I won't post any more about this here since it may not be strictly related to array API conversion and this seems to be a sticking point. I'll open a separate issue if needed to get to the desired behavior for these stats conversions.

@ilayn
Copy link
Member

ilayn commented May 17, 2024

Just for the special suite: the dtype downcasting is not very helpful in a sizable chunk of special, because the very reason those functions exist is that a precise calculation is needed; say, expm1 or log1p is not really too meaningful in float32 because it will underflow quickly. There are more sophisticated examples I saw while translating F77 code, but I hope these make the point.

On paper it is great to have them bitsize-agnostic; however, many (though not all) will have diminishing returns, if not confuse the user with underflows. A zebra-patterned API, with some functions supporting float32 and some not, might actually be less desirable. And because it will be array API compatible, we can dispatch to the relevant array-generating library, and if not, fall back to SciPy float64 with NumPy conversion.

@rgommers
Copy link
Member Author

it's an example where what's fundamental in one domain is not fundamental in another. Does SciPy intend to support these kinds of use cases?

Agreed - and I'd say yes, we do intend to support such use cases assuming they are judged to fit in some submodule well enough. It's all context/domain-specific.

I'm fine leaving compiled code working with float64 internally and returning the dtype of the input array. That's not going to be running on TPUs, etc., anyway. I'm talking about the code we're converting to use the array API.

Not quite true I think - there's a lot of compiled code we have that will dispatch back to CuPy/JAX/PyTorch when they have matching functions. And that set may change/grow over time.

I proposed what I did as the typical behavior in part because it's easier than tip-toeing around existing promotions to float64 that have no clear rationale and don't attempt to return the same dtype as the input. Is there an alternative proposal?

The obvious alternative is "status quo" with careful changes where we don't like the current behavior. It will really depend per module or per set of functionality. There is significant demand for float32 in particular for some topics, like spatial (distance functions, KDTree).

It's hard to say more though. I think "always return float64" can be reasonable, "dtype in == dtype out" can be reasonable. We just have to avoid making things too zebra-like (nice term @ilayn:)) within a single module.

scipy.stats is one of the least obvious places to me to add more lower-precision support, so it'd be helpful to have a separate issue for that with more concrete details. I can imagine it makes sense for some subset of functionality but not for others. E.g., for distributions we have tons of issues about precision so it may not make sense to support anything but float64, while for hypothesis tests that's less the case plus they can be really slow so float32 versions may be quite useful.

@steppi
Copy link
Contributor

steppi commented May 17, 2024

Just for the special suite: the dtype downcasting is not very helpful in a sizable chunk of special, because the very reason those functions exist is that a precise calculation is needed; say, expm1 or log1p is not really too meaningful in float32 because it will underflow quickly. There are more sophisticated examples I saw while translating F77 code, but I hope these make the point.

I would also be opposed to a zebra like pattern, but just want to point out that catastrophic cancellation becomes an issue for the naive implementations of expm1 and log1p long before underflow occurs, so 32 bit versions of these functions can still be useful. I haven't personally had a reason to go to lower precision for special functions, but @izaid has mentioned his interest in lower precision versions, so I think there are use cases.

I also want to mention that it should be relatively straightforward to create lower-precision versions of special functions. I think the main step would be refitting minimax polynomial approximations to the desired precision; and it's actually easier to find good minimax approximations at lower precision because one can evaluate more exhaustively over floating point values. I'm not saying we would go to such lengths, but it's actually feasible to create correctly rounded implementations of mathematical functions in 32-bit, such as here, something which cannot be done for 64-bit math. Beyond refitting minimax polynomials, I think only small adjustments would be needed.
