BUG: genfromtxt now ignores `names=('column1','column2')` when `unpack=True` #20855

seberg · 2022-01-19T15:12:34Z

This change means that genfromtxt now ignores names=('column1','column2') when unpack=True is passed. Instead of giving a named/rec array as before, a list of arrays with no names is given. Is this the correct behavior? It means that passing the names explicitly into genfromtxt does nothing and the output is not an array of arrays/columns, which was very useful. Lists can't be indexed the same way. Can I be guaranteed that the returned list is in the right order so I can manually assign the names? Should I open an issue about this?

The example I am thinking of is, say we have a simple csv or space delimited document with 4 columns:

val1 val2 valignore1 valignore2
val3 val4 valignore3 valignore4
val5 val6 valignore5 valignore6

I read this in using np.genfromtxt(data.dat, unpack=True, names=('column1', 'column2'), usecols=(0,1)), the result is a list and the names are ignored. Previously, the names would be attached and the result would be a two column array that can be indexed with the given names. The list obviously can't be.

Now we get [array(val1, val3, val5), array(val1, val3, val5)], previously it would be an array of arrays with names (used to be called recarray I think but maybe that name has changed).

Originally posted by @emirkmo in #16650 (comment)

The text was updated successfully, but these errors were encountered:

emirkmo · 2022-01-19T18:23:23Z

Background and identification

To be clear, setting unpack=False gives the correct output as a structured array, it's just opposite from what it did before when unpack is True.

Looking at #16650 and #4341 this seems intended to copy loadtxt. The reason it's a bit unfortunate is that when dynamically creating and reading files, one would hope that the arrays keep their dtype name field.

What I would expect is that the list of arrays should have the dtype include the name, so that they can be indexed by name (at a minimum).
The change that leads to this behavior is from:

https://github.com/AndrewEckart/numpy/blob/6b81583634c29ed9dd93d2bd654ae27862a68187/numpy/lib/npyio.py,

starting at line 2252:

    if unpack:
        if names is None:
            return output.T
        elif len(names) == 1:
            # squeeze single-name dtypes too
            return output[names[0]]
        else:
            # For structured arrays with multiple fields,
            # return an array for each field.
            return [output[field] for field in names]
    return output

Previously the last line would return a structured array instead of a list of arrays without names set. However, setting unpack to false does correct this and it does match np.loadtxt. It's just that, on top of being breaking and doing the opposite of previous behavior (as noted in the PR), it doesn't preserve the names.

Proposed solution

However, it is possible to return 1D arrays whose dtypes are named and so preserve the names (pseudocode follows)

[output[field].view(dtype=<correct_dtype_with_name>) for field in names]

where <correct_dtype_with_name> needs to include the given field name and otherwise preserve the dtype. Probably, there's a better way of doing this actually, since we already have this info in output (it's a structured array), we simply want to extract a 1D column without losing the info of the name.

Comments

Now I don't quite understand why it's returned as a list instead of a structured array, as I can't imagine the benefits of a scenario where numpy returns lists instead of efficient arrays. As far as I know, whether the output is a list of arrays or an array of arrays (2+D array), the resulting uses would behave the same. Parity with loadfromtxt doesn't seem to be a good enough reason to break the usual numpy paradigm of operations generally returning numpy arrays unless you're getting a single scalar output. However, even if parity is to be kept, I would expect the names to stick around so that after the unpacking, , the array can be repacked using the names.

It's possible to fix the bug with dtype=None while also still returning a structured array, is it not? Perhaps I am wrong and the way I am thinking about it in terms of rows and columns is the problem, as there is a transpose step.

seberg · 2022-01-20T20:02:44Z

Hmm, looking closer I doubt we should reconsider this change. We can't preserve names on dtypes with unpacking (We can, but then the array returned becomes annoying to work with, since it is not a "numeric" array, but still structured one – and it would be a change in behaviour).
dtypes simply don't have names unless they are "structured dtypes" with multiple columns ("multiple" can be a single one, but such a dtype does not work the same). What would make sense to me if we returned a dict or similar. But that is a too big change and would require some new API like unpack=dict or something :).

unpack=True has a clear meaning that it returns a list of arrays that can be directly unpacked into values, such as:

col1, col2, col2 = np.genfromtxt(filename)

and you can work with col1 without first indexing with the name. The branch that changed broke that contract, while the default unpack=False always did what you want.

seberg · 2022-01-20T20:05:07Z

That said, just because right now I think this is a valid bug fix, that doesn't mean that we can't pull it and/or do a transition via FutureWarning if this is causing problems downstream.

emirkmo · 2022-01-27T10:45:50Z

The only counter I have is that the behavior of unpack=True would not change UNLESS you explicitly gave names (or maybe they were detected in the file), which you expect to be associated with a structured dtype. And if you are taking the effort of explicitly giving names, you want those names to be preserved. To be honest, I don't understand why the output arrays cannot be structured ones if names have been given, as that allows easily combining them into a single structured array in the future with different names (multiple structured dtypes). I haven't found major differences working with them over plain numpy arrays (which don't have to be numeric anyway), but I am sure they exist.

The behavior sort of mimics pandas Series vs. DataFrame, and it would be odd if Pandas data frame columns suddenly became lists when unpacking it instead of Series, just like it's odd that a structured array loses its structured dtype name when you unpack it. It would make more sense to ensure that structured arrays behave properly as 1D arrays instead of avoiding them when the user explicitly tries to create them. Is there a good reason to avoid them that we could start some issues on and start fixing? (You said they are annoying to work with).

Finally, the previous fix to genfromtxt was already quite breaking compared to previous behavior, and wasn't complained about, so maybe that's not such a huge worry. (Seems like it's not majorly relied upon in either way).

I do agree that a change like making it a dict does make sense, but would be very disruptive, so would need futurewarnings and more discussion. I'm not in favor of it to be honest. I just want the structured dtype names to not be lost whether I am reading writing from files or passing into numpy functions. Currently, this is the one place where the name is just shed without even a warning. Perhaps, at the very least, we can raise a warning when genfromtxt is used with unpack=True and names is not None, saying that the resulting output will not be a structured array.

seberg · 2022-02-01T19:35:37Z

Marking as close and will do so soon unless someone else chimes in, I think the situation is the following:

The old behaviour ignored the unpack=True request by the user.
unpack=True might be slightly awkward in the situation (as header is auto-detected), but I am unconvinced that it is helpful to ignore it sometimes. And there is no alternative.

I agree that this change should maybe have went through a FutureWarning, but at this point I am not sure anyone would be better off with such a transition? Anyone who runs into the issue is expected to run into some kind of error that is fixable by unpack=True.

If you/anyone would like to argue for reverting the change, we can do that however.

emirkmo · 2022-02-03T10:23:57Z

Could we create an issue to track adding a warning to genfromtxt reading when names is not None but unpack=True? (About header names being ignored). Or can I create an issue to ask for the arrays inside the list to be structured 1D ones that respect the names (dtype with a name)? Either is preferable to the silent dropping.

If either is not wanted, however, I am fine with closing. Do not want a revert as that is more broken.

Finally, At the least, a documentation update is warranted about the silent dropping, if nothing else is to be done.

seberg added the 06 - Regression label Jan 19, 2022

seberg added this to the 1.22.2 release milestone Jan 19, 2022

charris modified the milestones: 1.22.2 release, 1.22.3 release Jan 30, 2022

seberg added the 57 - Close? Issues which may be closable unless discussion continued label Feb 1, 2022

seberg removed this from the 1.22.3 release milestone Feb 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: genfromtxt now ignores `names=('column1','column2')` when `unpack=True` #20855

BUG: genfromtxt now ignores `names=('column1','column2')` when `unpack=True` #20855

seberg commented Jan 19, 2022

emirkmo commented Jan 19, 2022 •

edited

seberg commented Jan 20, 2022

seberg commented Jan 20, 2022

emirkmo commented Jan 27, 2022 •

edited

seberg commented Feb 1, 2022

emirkmo commented Feb 3, 2022 •

edited

BUG: genfromtxt now ignores names=('column1','column2') when unpack=True #20855

BUG: genfromtxt now ignores names=('column1','column2') when unpack=True #20855

Comments

seberg commented Jan 19, 2022

emirkmo commented Jan 19, 2022 • edited

Background and identification

Proposed solution

Comments

seberg commented Jan 20, 2022

seberg commented Jan 20, 2022

emirkmo commented Jan 27, 2022 • edited

seberg commented Feb 1, 2022

emirkmo commented Feb 3, 2022 • edited

BUG: genfromtxt now ignores `names=('column1','column2')` when `unpack=True` #20855

BUG: genfromtxt now ignores `names=('column1','column2')` when `unpack=True` #20855

emirkmo commented Jan 19, 2022 •

edited

emirkmo commented Jan 27, 2022 •

edited

emirkmo commented Feb 3, 2022 •

edited