New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: genfromtxt now ignores names=('column1','column2')
when unpack=True
#20855
Comments
Background and identificationTo be clear, setting Looking at #16650 and #4341 this seems intended to copy loadtxt. The reason it's a bit unfortunate is that when dynamically creating and reading files, one would hope that the arrays keep their dtype name field. What I would expect is that the list of arrays should have the dtype include the name, so that they can be indexed by name (at a minimum). starting at line 2252:
Previously the last line would return a structured array instead of a list of arrays without names set. However, setting unpack to false does correct this and it does match np.loadtxt. It's just that, on top of being breaking and doing the opposite of previous behavior (as noted in the PR), it doesn't preserve the names. Proposed solutionHowever, it is possible to return 1D arrays whose dtypes are named and so preserve the names (pseudocode follows)
where <correct_dtype_with_name> needs to include the given field name and otherwise preserve the dtype. Probably, there's a better way of doing this actually, since we already have this info in CommentsNow I don't quite understand why it's returned as a list instead of a structured array, as I can't imagine the benefits of a scenario where numpy returns lists instead of efficient arrays. As far as I know, whether the output is a list of arrays or an array of arrays (2+D array), the resulting uses would behave the same. Parity with loadfromtxt doesn't seem to be a good enough reason to break the usual numpy paradigm of operations generally returning numpy arrays unless you're getting a single scalar output. However, even if parity is to be kept, I would expect the names to stick around so that after the unpacking, , the array can be repacked using the names. It's possible to fix the bug with dtype=None while also still returning a structured array, is it not? Perhaps I am wrong and the way I am thinking about it in terms of rows and columns is the problem, as there is a transpose step. |
Hmm, looking closer I doubt we should reconsider this change. We can't preserve names on dtypes with unpacking (We can, but then the array returned becomes annoying to work with, since it is not a "numeric" array, but still structured one – and it would be a change in behaviour).
and you can work with |
That said, just because right now I think this is a valid bug fix, that doesn't mean that we can't pull it and/or do a transition via |
The only counter I have is that the behavior of The behavior sort of mimics pandas Series vs. DataFrame, and it would be odd if Pandas data frame columns suddenly became lists when unpacking it instead of Series, just like it's odd that a structured array loses its structured dtype name when you unpack it. It would make more sense to ensure that structured arrays behave properly as 1D arrays instead of avoiding them when the user explicitly tries to create them. Is there a good reason to avoid them that we could start some issues on and start fixing? (You said they are annoying to work with). Finally, the previous fix to genfromtxt was already quite breaking compared to previous behavior, and wasn't complained about, so maybe that's not such a huge worry. (Seems like it's not majorly relied upon in either way). I do agree that a change like making it a dict does make sense, but would be very disruptive, so would need futurewarnings and more discussion. I'm not in favor of it to be honest. I just want the structured dtype names to not be lost whether I am reading writing from files or passing into numpy functions. Currently, this is the one place where the name is just shed without even a warning. Perhaps, at the very least, we can raise a warning when genfromtxt is used with unpack=True and names is not None, saying that the resulting output will not be a structured array. |
Marking as close and will do so soon unless someone else chimes in, I think the situation is the following:
I agree that this change should maybe have went through a If you/anyone would like to argue for reverting the change, we can do that however. |
Could we create an issue to track adding a If either is not wanted, however, I am fine with closing. Do not want a revert as that is more broken. Finally, At the least, a documentation update is warranted about the silent dropping, if nothing else is to be done. |
This change means that genfromtxt now ignores
names=('column1','column2')
whenunpack=True
is passed. Instead of giving a named/rec array as before, a list of arrays with no names is given. Is this the correct behavior? It means that passing the names explicitly into genfromtxt does nothing and the output is not an array of arrays/columns, which was very useful. Lists can't be indexed the same way. Can I be guaranteed that the returned list is in the right order so I can manually assign the names? Should I open an issue about this?The example I am thinking of is, say we have a simple csv or space delimited document with 4 columns:
val1 val2 valignore1 valignore2
val3 val4 valignore3 valignore4
val5 val6 valignore5 valignore6
I read this in using
np.genfromtxt(data.dat, unpack=True, names=('column1', 'column2'), usecols=(0,1))
, the result is a list and the names are ignored. Previously, the names would be attached and the result would be a two column array that can be indexed with the given names. The list obviously can't be.Now we get [array(val1, val3, val5), array(val1, val3, val5)], previously it would be an array of arrays with names (used to be called recarray I think but maybe that name has changed).
Originally posted by @emirkmo in #16650 (comment)
The text was updated successfully, but these errors were encountered: