Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ndarray should derive from collections.abc.Sequence? #2776

Open
ghost opened this issue Dec 1, 2012 · 56 comments
Open

ndarray should derive from collections.abc.Sequence? #2776

ghost opened this issue Dec 1, 2012 · 56 comments

Comments

@ghost
Copy link

ghost commented Dec 1, 2012

@juliantaylor raised this in a pandas issue.

The example from the ticket:

import numpy as np
import random
random.sample(np.array([1,2,3]),1)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user1/py33/lib/python3.3/random.py", line 298, in sample
    raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
TypeError: Population must be a sequence or set.  For dicts, use list(d).

This occurs on 3.3 with 1.7.0rc1.dev-3a52aa0, and on 3.2 with 1.6.2.
2.7 is unaffected of course.

The relavent code from cpython/Lib/random.py:297

from collections.abc import Set as _Set, Sequence as _Sequence
def sample(self, population, k):
# ...
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")

I couldn't grep another location in the stdlib with a similar test, but lib2to3
did show an assumed equivalence:

lib2to3/fixes/fix_operator.py
5:operator.isSequenceType(obj)   -> isinstance(obj, collections.Sequence)

in 2.7

In [6]: operator.isSequenceType(np.array([1]))
Out[6]: True

but in 3.3/3.2

>>> isinstance(np.array([1]), collections.Sequence)
False
@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

Python 3.x simply performs a more strict check in random.sample than Python 2.x. On 2.x numpy is also not a Sequence subclass (ndarray has no index, count or __reversed__ methods). So I think you can regard this as either mis-use of random.sample or a backwards compatibility break by Python 3.x.

@njsmith
Copy link
Member

njsmith commented Dec 2, 2012

Fair enough, but that said, we could easily make ndarray a Sequence,
since all of those methods make sense and would be straightforward to
implement. In fact just inheriting from it would be enough, since Sequence
provides all the missing methods as mixins.
On 2 Dec 2012 13:30, "Ralf Gommers" notifications@github.com wrote:

Python 3.x simply performs a more strict check in random.sample than
Python 2.x. On 2.x numpy is also not a Sequence subclass (ndarray has no
index, count or reversed methods). So I think you can regard this as
either mis-use of random.sample or a backwards compatibility break by
Python 3.x.


Reply to this email directly or view it on GitHubhttps://github.com//issues/2776#issuecomment-10929601.

@ghost
Copy link
Author

ghost commented Dec 2, 2012

I would have said the same thing, if it weren't for the 2to3 example which makes this more subtle.
index and count are not needed I believe:Link,
All the abstract methods for Sequence are implemented already, The others have default implementations.

The problem is that MutableSequence is actually more correct, and insert is not implemented.

@seberg
Copy link
Member

seberg commented Dec 2, 2012

don't forget 0-d arrays though, I am not certain what happens with them right now, but they are not really a Sequence are they?

@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

Hmm, Sequence is meant for immutable objects, some of the methods of MutableSequence (extend, pop) don't make sense. So yes, we could add those 3 methods but it doesn't feel quite right.

@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

It looks to me like there's really no reason for random.sample to require a Sequence instance.

@ghost
Copy link
Author

ghost commented Dec 2, 2012

__setitem__ is implemented, but not __delitem__, so maybe SomewhatMutableSequence?

It makes sense of you interpret the interface as "a sequence should have all these, but some may
have terrible O()", which is why they offer naive implementations of some methods.
Surely it's not the semantics of insert , pop that are unclear in the context of an ndarray.
I'm not sure what the right thing is concerning iteration over a 0-d array, however.
One could argue that offering an __iter__ method that just raises a TypeError rather then
StopIteration is a violation of duck typing anyway.

edit: Though I'm sure that was not a careless decision.

@juliantaylor
Copy link
Contributor

ndarray could just be a Sequence, that does not mean it must be immutable.

Btw. the CPython list implementation does also not support efficient insert, pop and extend either but its still a MutableSequence

@njsmith
Copy link
Member

njsmith commented Dec 2, 2012

ndarray cannot support in-place insert/pop/extend (ndarray's have a fixed size), so while it is a mutable sequence, it simply is not a Python MutableSequence (and never will be). It can and should support the Sequence interface though.

It would be nice if random.sample didn't check this, but on a closer look, it does have a plausible reason -- it has a number of different implementations for different types of input arguments, so it needs some way to distinguish between them. It can't just start indexing and hope for the best. Maybe we could file a bug and try and convince them to fall back on the sequence implementation by default for unrecognized types, but the earliest that could help is 3.4...

The point about 0-d arrays is a good one, though; 0-d arrays don't support the Sequence interface (they aren't even Iterable). But for Python purposes, it isn't too terrible if they lie about being Sequences and then wait for the actual access to raise an error -- duck typing means you can always fail if you want :-). It's probably possible somehow to make isinstance(a, Sequence) succeed for multidimensional arrays and fail for 0-d arrays; if we can make that happen then cool. But even if we can't the best thing to do is still probably to make ndarray's into Sequences.

@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

Note that MaskedArray has a count method already, so adding one in ndarray which does something different will break that.

0-D arrays are better thought of as scalars with a few handy methods (at least that's how I think of them); besides not being iterable they're also not indexable which is even more weird if you think of them as arrays. So making 0-D arrays inconsistent in yet another way is not a big issue imho.

@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

@njsmith where do you see multiple implementations? After the isinstance(Sequence) check I only see len(population) and then a conversion to a list. http://hg.python.org/cpython/file/22d891a2d533/Lib/random.py

@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

Pandas Series and DataFrame types also have incompatible count methods, and an index attribute.

@njsmith
Copy link
Member

njsmith commented Dec 2, 2012

@rgommers: Hmm, you're right, I got misled by the error message, and thinking that it also accepted integers as a shorthand for range(), which it doesn't. Even so, they do want to define different behaviour for sets, sequences, and mappings. Maybe we can convince them that they should switch it to

if isinstance(population, _Set):
    population = tuple(population)
if isinstance(population, _Mapping):
    raise Blarrrgh()
# Otherwise assume that we have a sequence and hope

That's also a good point about the existing ndarray subclasses. It doesn't look like there's any way to say that ndarray is a Sequence but its subclasses aren't :-(. So which option is least bad, given that some of these subclasses cannot satisfy the Sequence interface without breaking compatibility?

  • Deprecate the incompatible usages in ndarray subclasses, and eventually remove them and replace them with Sequence-compatible versions. This seems doable for the count methods, but changing the name of Series.index would be hugely disruptive for the pandas folks. (DataFrame isn't a subclass of ndarray so technically it isn't relevant, except I guess the Series and DataFrame should be kept in sync.) I guess we can ask @wesm what he thinks but...
  • Go ahead and declare ndarray and its subclasses to satisfy the Sequence definition, and accept that for some ndarray subclasses this will be a lie. Only on the rarely used parts of the Sequence interface, though, and Python types are usually lies anyway...
  • Re-jigger our inheritance hierarchy to separate functionality and abstract-base-class nonsense. Do
class multidim_ndarray(ndarray, Sequence):
  pass

and make multi-dim arrays instances of this class instead. Subclasses aren't affected because they continue to inherit from ndarray, not multidim_ndarray. Of course, a single ndarray object can transition between 0-d and multidimensional via .resize()...

  • Accept that ndarray will never be a Sequence.

[The isSequenceType thing is a bit of a distraction. That's an old function that predates the existence of abstract base classes (which were added in 2.6), and doesn't even try to nail down the detailed interface required of sequences -- it just checks that your type (1) defines a __getitem__, (2) is not a built-in dict. Obviously this will give the wrong answer in many situations (e.g. anything that acts like a dict but isn't one!). So if one actually does want a sequence type, then isinstance(obj, Sequence) is going to do a better job, and 2to3 is doing the right thing. But it creates a problem for numpy...]

@seberg
Copy link
Member

seberg commented Dec 2, 2012

Maybe it would be possible to convince the python folks to create a new class like SequenceBase that is even below Sequence and does not guarantee .index or .count, but only .__len__ and __getitem__ or such? Its all nice that Sequence has something like index, but it seems a bit weird to force it onto things like numpy by seemingly making it how sequence like things should be duck typed. Are the python folks aware of that this is somewhat problematic?

@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

I like the proposal of @seberg; if Python devs disagree I'd go for the second bullet of @njsmith. Missing option is to just just say that ndarrays don't satisfy the Sequence interface. Not optimal, but better than bullets 1 and 3 imho.

@njsmith
Copy link
Member

njsmith commented Dec 2, 2012

[Whoops, the "missing option" was there as option 4, just somehow the markdown parser decided to fold it into the previous bullet in a confusing and unreadable way. I've edited the comment to fix the formatting.]

@rkern
Copy link
Member

rkern commented Dec 2, 2012

Half of the types that are registered with Sequence, i.e. buffer and xrange, don't have these methods either. It's not clear to me that these are required methods of the interface so much as convenience methods for those who are using collections.Sequence as a base class/mixin.

@njsmith
Copy link
Member

njsmith commented Dec 2, 2012

@rkern: Good catch. So maybe the solution is just to add a call to Sequence.register(np.ndarray) somewhere. (This would also be a workaround for the original reporter.)

We should probably implement __reversed__ at some point as well...

@rgommers
Copy link
Member

rgommers commented Dec 2, 2012

@rkern you're right, this is mentioned as an open issue in the PEP: http://www.python.org/dev/peps/pep-3119/#sequences. Odd that PEPs with status Final can even have open issues.

@garrison
Copy link

I think the title of this bug is a bit misleading, as the bug does not exist only under python3. Sure, random.sample(numpy_array) works in python2, but isinstance(np.array([1]), collections.Sequence) should return True in any python >= 2.6.

@will133
Copy link

will133 commented Nov 18, 2013

I've just encountered this bug in Python 2.7 using the autopep8 module. By default, it converted some of the operator.isSequenceType() calls into isinstance(x, collections.Sequence). The test would become False when I pass in a numpy.ndarray. This can be a very sneaky bug.

@michel4j
Copy link

Just encountered it as well with Python 2.7, using the python-pillow module. Image.point(lut, mode) calls isinstance(lut, collections.Sequence), the previous version used operator.isSequenceType()

@charris
Copy link
Member

charris commented Mar 25, 2014

Now might be a good time to revisit this since the numpy numeric scalar types were registered (#4547),

@vstinner
Copy link
Contributor

vstinner commented Sep 4, 2015

So maybe the solution is just to add a call to Sequence.register(np.ndarray) somewhere.

Yep, that's a good compromise.

@mitar
Copy link

mitar commented Sep 20, 2017

Yes, please simply add Sequence.register(np.ndarray) somewhere.

@seberg
Copy link
Member

seberg commented Jul 6, 2020

@jsbueno my problem is that I don't really see what additional, or in between definition would actually be helpful for users of ndarray. The best I can think of is a Collection which has count() and index(), but is that useful? Anything else would be an ABC for things that Python itself has little or no concept of.

I think SymPy actually got it more right. It iterates through all elements of its matrices, which at least makes it a Collection.
Now, I doubt we can do much about that, and I am not even sure that the SymPy iteration of all elements is super useful (and intuitive), but at least iteration of all elements is consistent with __contains__. Note that this also means that len(Matrix) is the number of elements, and not Matrix.shape[0]!

With the risk of repeating a lot from above, aside from 1-D arrays, what are numpy arrays?:

  • Container: of elements ✔️
  • Sized + Iterable of subarrays (if not 1-D) ❓
  • Reversible: we could just implement that, no worries there. ❓
  • count() and index(): can be implemented for elements (:heavy_check_mark:)
  • Sequence: Mismatch between subarray iterable and element container ❌

So even some of the most fundamental properties clash. NumPy could be a Container which knows how to do .index() and .count(), i.e. a Sequence but without the Iterable part. While it is independently an Iterable, but of subarrays.
And if that seems like a confusing mess, then I agree, but I think that it is by design. The only true solution would be to either go the SymPy path or just not be an Iterable to begin with. (we cannot go the SymPy path, and I doubt deprecating __iter__ has a chance.)

Personally, my expectation is that 1-D arrays aside, array-likes are simply very different beasts compared to Python Collections. When considering the iteration behaviour, you would need a MultidimensionalCollection to specifically signal the mismatch between __contains__ and __iter__ (but is that useful?).

When looking beyond what is currently defined by Sequence, I will restate that I think that the ElementwiseCollection (operators are elementwise operators rather than container operators, e.g. +) is the most defining characteristic of numpy arrays and all array-likes in general (see array programming). It is also a concept completely alien to – and sometimes at odds with – Python itself, though.

@seberg
Copy link
Member

seberg commented Jul 6, 2020

The only thing would be to mark one dimensional arrays, and only one dimensional arrays as sequences, since they do not have the mismatch of subarray vs. element. At which point, yes, __eq__ is of course not defined for them, and __nonzero__ is not defined similar to typical python sequences.

@jsbueno
Copy link

jsbueno commented Jul 7, 2020

Thank you for the response, and I apologise again for jumping in the 8 year long wagon here. With your comment, couple hours after the last e-mail exchange, and chatting with another friend in the middle, I concur most of these things are better left as they are. Sometime in the future Python can opt to have a more formal definition of Sequence than "whatever collections.abc.Sequence implements now".

I will just add, after reading your comments above, that I think that the characteristics you listed as "what makes a Python Sequence" is lacking the most important feature that makes ndarrays resemble sequences like lists and tuples for me: having a contiguous index-space that can address all individual elements. But I don't think formalizing an abc for that would be of any practical value, either in coding or in static-type hinting.

@NeilGirdhar
Copy link
Contributor

NeilGirdhar commented Jul 7, 2020

@seberg That's a great synopsis.

This issue seems to be about using ndarray in contexts that expect Sequence or Iterable or Container. A simple approach would be to have members on ndarray that expose cheap views that promise and provide the appropriate interface and respond to isinstance checks. For example:

class ndarray(Generic[T]):
    def as_container(self) -> Container[T]:
        if self.ndim == 0:
            raise ValueError
        return ContainerView(self)  # correctly answers __len__, __iter__ etc.
    def as_subarray_iterable(self) -> Iterable[np.ndarray[T]]:
        if self.ndim <= 1:
            raise ValueError
        return SubarrayIterableView(self)
    def as_scalar_sequence(self) -> Sequence[T]:
        if self.ndim != 1:
            raise ValueError
        return ScalarView(self)
    def as_subarray_sequence(self) -> Sequence[np.ndarray[T]]:
        if self.ndim <= 1:
            raise ValueError
        return SubarraySequenceView(self)  # this view has to reinterpret __contains__ to do the expected thing.

Instead of ndarray promising to be everything to everyone, the user asks for she needs, and if ndarray can provide it, it does so in the cheapest way possible. If it can't, it raises an exception. This simplifies user code by moving the ndim check the user should be doing (especially when using type annotations) into ndarray.

@eric-wieser
Copy link
Member

Should that last annotation be Sequence and not Iterable?

@NeilGirdhar
Copy link
Contributor

@eric-wieser Yup! Thanks. What are your thoughts?

@eric-wieser
Copy link
Member

eric-wieser commented Jul 7, 2020

Well, as_subarray_sequence is practically list(arr) :)

@NeilGirdhar
Copy link
Contributor

@eric-wieser Yeah, I thought it would be cheaper to provide a view, but I have no idea.

@eric-wieser
Copy link
Member

Well, list(arr) just produces len(arr) views, which you'd end up producing anyway if I you iterated.

@seberg
Copy link
Member

seberg commented Jul 7, 2020

I still think we are focusing too much about what can be done and not enough on what the problems are at this time. In particular, all of the methods you give above are very easy to implement if you know that you have an ndarray-like (I disagree that 0-D arrays are not containers). So they would only be useful if there was a standardized ABC for them, and in that case it would also be sufficient to define that basic indexing is numpy compatible and maybe include the .flat property.

The original issue (random.sample stopped working) seems fairly irrelevant due to passed time. Yes, its mildly annoying, but possibly it is even for the better, since the user may either expect the subarrays or the elements to be chosen.

I am sure we do break some duck-typing code. Some problems probably occur with serialization (I don't have examples at hand). And many of such code will have no issue with using isinstance checks on ABCs, but hate to check for np.ndarray specifically. I do not see how adding methods to ndarray would help with that, we would need a new ABC, likely with little more than the .ndim property and possibly enshrining the nested-sequence style iteration.

Methods such as the above may be reasonable as a consumer protocol to work with any array-like, but is that the problem we are trying to solve :)? They seem not like things typical Python sequences would want to expose.

@NeilGirdhar
Copy link
Contributor

NeilGirdhar commented Jul 7, 2020

@eric-wieser

You're right of course, but you may not iterate over the whole sequence. You might only pick out a few elements.

@seberg

I still think we are focusing too much about what can be done and not enough on what the problems are at this time

I agree with you. What kind of problems are you imagining? I'm imagining that when numpy 1.10 comes out with types, I will sometimes want to use a one-dimensional numpy array as a sequence. If I want to do that currently, I need to:

  • check that it's one-dimensional, and
  • call cast to tell mypy that it's actually a sequence.

That's why I want to provide a method to automatically do that. I hate big interfaces too, but it seems to me like these kinds of methods or bare functions are going to be more and more prevalent as type annotations catch on. What do you think?

(I disagree that 0-D arrays are not containers).

I have no idea, but currently you are raising on __len__ for these, so it seems that they don't work like containers. I think it would be helpful for mypy to report an error if you pass a 0-D array to a function that accepts a container. It won't catch if you make 0-D arrays containers.

we would need a new ABC, likely with little more than the .ndim property and possibly enshrining the nested-sequence style iteration.

I didn't want to add that into my suggestion, but I think that's where you're headed anyway. I'm an avid user of the wonderfully-designed JAX library. I imagine that in the future, numpy.ndarray and jax.numpy.ndarray (which has subclasses) will both inherit from some abstract NDArray of some sort. You could have a lot more than ndim. Ideally, it would be NDArray(Generic[T]) at least, and maybe event have the shape or the number of dimensions too. It could have __eq__ returning NDArray[np.bool_]. You probably know better than me :)

A few years ago, I searched for this issue in order to suggest that numpy.array should inherit from collections.Sequence, but now I find the arguments (especially yours!!) in this thread very convincing. Numpy arrays aren't really sequences, and shoehorning them seems like it will cause more harm than good. Why not just let them be their own thing, and force users to explicitly request the interface they want?

And many of such code will have no issue with using isinstance checks on ABCs,

Now that you mention that, maybe all of my proposed methods should have returned views. That way, they can correctly answer isinstance checks.

Methods such as the above may be reasonable as a consumer protocol to work with any array-like, but is that the problem we are trying to solve :)? They seem not like things typical Python sequences would want to expose.

I definitely agree that the answer to this depends on the problems we are trying to solve. Having drunk the type annotation kool aid, I'm interested in writing succinct numpy code that passes mypy without littering code with # type: ignore. Which problems do you have in mind?

@seberg
Copy link
Member

seberg commented Jul 8, 2020

Well, type hints and interop with other array-like objects are a good motivation probably. I might suggest opening a new issue or mailing list thread. Right now, I am not sure what best to think about here, typing is forming, so maybe that will end up clarifying some things.

@Kentzo
Copy link

Kentzo commented Dec 28, 2022

Without being an instance collection.abc.Sequence ndarray cannot be used with Structural Pattern Matching:

match [[1, 2], [3, 4]]:
    case [a, b]:
        print(a, b)

match np.array([[1, 2], [3, 4]]):
    case [a, b]:
        print(a, b)

@NeilGirdhar
Copy link
Contributor

NeilGirdhar commented Dec 28, 2022

Without being an instance collection.abc.Sequence ndarray cannot be used with Structural Pattern Matching:

Right, but ndarray isn't a sequence:

match np.zeros(()):
    case []:
        pass   # Can't be iterated.

@Kentzo
Copy link

Kentzo commented Dec 28, 2022

Not sure what you tried to demonstrate here.

Structural pattern matches seem to only care about unpacking, IIRC ndarray can be unpacked (a, b = np.array([1, 2]), a, b, *rest = np.array([1,2,3]) etc) so it should work in this context but it doesn't.

@shoyer
Copy link
Member

shoyer commented Dec 28, 2022

The concensus that has emerged in this discussion is that np.ndarray should not derive from Sequence, because it has an incompatible API.

I don't know how to fix pattern matching to support NumPy arrays, but it would need to use a different approach.

@Kentzo
Copy link

Kentzo commented Dec 28, 2022

@alexchandel
Copy link

There's nothing wrong with only 1D arrays implementing Sequence. No one contests that 1D arrays meet all the implied semantics of Sequence. And plenty of typeclass instances are defined only for particular type specializations (in this case, a particular shape type).

This issue is still relevant with NumPy's recent typing.

@NeilGirdhar
Copy link
Contributor

NeilGirdhar commented Apr 12, 2023

There's nothing wrong with only 1D arrays implementing Sequence.

Can you think of any example of the value of an object affects its parent types? Have you read the thread linked from the comment above yours?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests